
Human4DiT: 360-degree Human Video Generation with 4D Diffusion Transformer

Published 27 May 2024 in cs.CV (arXiv:2405.17405v2)

Abstract: We present a novel approach for generating 360-degree high-quality, spatio-temporally coherent human videos from a single image. Our framework combines the strengths of diffusion transformers for capturing global correlations across viewpoints and time, and CNNs for accurate condition injection. The core is a hierarchical 4D transformer architecture that factorizes self-attention across views, time steps, and spatial dimensions, enabling efficient modeling of the 4D space. Precise conditioning is achieved by injecting human identity, camera parameters, and temporal signals into the respective transformers. To train this model, we collect a multi-dimensional dataset spanning images, videos, multi-view data, and limited 4D footage, along with a tailored multi-dimensional training strategy. Our approach overcomes the limitations of previous methods based on generative adversarial networks or vanilla diffusion models, which struggle with complex motions, viewpoint changes, and generalization. Through extensive experiments, we demonstrate our method's ability to synthesize 360-degree realistic, coherent human motion videos, paving the way for advanced multimedia applications in areas such as virtual reality and animation.

References (50)
  1. All are worth words: A vit backbone for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22669–22679, 2023.
  2. Person image synthesis via denoising diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5968–5976, 2023.
  3. Bedlam: A synthetic dataset of bodies exhibiting detailed lifelike animated motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8726–8737, 2023.
  4. Dna-rendering: A diverse neural actor repository for high-fidelity human-centric rendering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19982–19993, 2023.
  5. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
  6. Cameractrl: Enabling camera control for text-to-video generation, 2024.
  7. Animate anyone: Consistent and controllable image-to-video synthesis for character animation. arXiv preprint arXiv:2311.17117, 2023.
  8. Learning high fidelity depths of dressed humans by watching social media dance videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12753–12762, 2021.
  9. Human-art: A versatile human-centric dataset bridging natural and artificial scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 618–629, 2023.
  10. Dreampose: Fashion video synthesis with stable diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22680–22690, 2023.
  11. Same: Skeleton-agnostic motion embedding for character animation. In SIGGRAPH Asia 2023 Conference Papers, pages 1–11, 2023.
  12. Motion-x: A large-scale 3d expressive whole-body human motion dataset. Advances in Neural Information Processing Systems, 36, 2024.
  13. Zero-1-to-3: Zero-shot one image to 3d object. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9298–9309, 2023a.
  14. Syncdreamer: Generating multiview-consistent images from a single-view image. arXiv preprint arXiv:2309.03453, 2023b.
  15. Wonder3d: Single image to 3d using cross-domain diffusion. arXiv preprint arXiv:2310.15008, 2023.
  16. Smpl: A skinned multi-person linear model. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pages 851–866. 2023.
  17. Vdt: General-purpose video diffusion transformers via mask modeling. In The Twelfth International Conference on Learning Representations, 2023.
  18. Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048, 2024.
  19. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  20. OpenAI. Video generation models as world simulators. https://openai.com/index/video-generation-models-as-world-simulators/, 2024. Accessed: 2024-05-19.
  21. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
  22. Improving language understanding by generative pre-training. 2018.
  23. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  24. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  25. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015.
  26. Zero123++: a single image to consistent multi-view diffusion base model. arXiv preprint arXiv:2310.15110, 2023a.
  27. Mvdream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512, 2023b.
  28. Deformable gans for pose-based human image generation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3408–3416, 2018.
  29. Appearance and pose-conditioned human image generation using deformable gans. IEEE transactions on pattern analysis and machine intelligence, 43(4):1156–1171, 2019a.
  30. First order motion model for image animation. Advances in neural information processing systems, 32, 2019b.
  31. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7262–7272, 2021.
  32. A good image generator is what you need for high-resolution video synthesis. arXiv preprint arXiv:2104.15069, 2021.
  33. Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pages 10347–10357. PMLR, 2021.
  34. Aist dance video database: Multi-genre, multi-dancer, and multi-camera database for dance information processing. In Proceedings of the 20th International Society for Music Information Retrieval Conference, ISMIR 2019, Delft, Netherlands, 2019.
  35. Twindom. Twindom 3d avatar dataset, 2022.
  36. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  37. Disco: Disentangled control for referring human dance generation in real world. arXiv e-prints, pages arXiv–2307, 2023.
  38. One-shot free-view neural talking-head synthesis for video conferencing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10039–10049, 2021.
  39. G3an: Disentangling appearance and motion for video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5264–5273, 2020.
  40. Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in neural information processing systems, 34:12077–12090, 2021.
  41. Magicanimate: Temporally consistent human image animation using diffusion model. arXiv preprint arXiv:2311.16498, 2023.
  42. Direct-a-video: Customized video generation with user-directed camera movement and object motion. arXiv preprint arXiv:2402.03162, 2024.
  43. Generating holistic 3d human motion from speech. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 469–480, 2023.
  44. Function4d: Real-time human volumetric capture from very sparse consumer rgbd sensors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR2021), 2021.
  45. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In Proceedings of the IEEE/CVF international conference on computer vision, pages 558–567, 2021.
  46. Closet: Modeling clothed humans on continuous surface with explicit template decomposition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 501–511, 2023a.
  47. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023b.
  48. Fast training of diffusion models with masked transformers. arXiv preprint arXiv:2306.09305, 2023.
  49. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6881–6890, 2021.
  50. Champ: Controllable and consistent human image animation with 3d parametric guidance. arXiv preprint arXiv:2403.14781, 2024.

Summary

  • The paper introduces a transformer-based diffusion approach that generates high-fidelity 360-degree human videos from a single image with arbitrary camera views.
  • It employs a cascaded 4D transformer architecture that factorizes attention across spatial, temporal, and view dimensions, enhancing multi-view coherence.
  • Experimental results demonstrate significant improvements in PSNR, SSIM, and FVD, validating its effectiveness for free-view video synthesis.

4D Diffusion Transformer for Free-View Human Video Generation

Introduction

Human4DiT proposes a unified, transformer-based diffusion framework for generating high-fidelity, spatio-temporally coherent human videos from a single reference image, under arbitrary camera viewpoints and with support for complex articulated motions. The method combines the precise local condition injection of UNets with the global relational modeling capacity of transformers, introducing a computationally efficient 4D diffusion transformer architecture that factorizes attention across spatial, temporal, and view axes. The system is trained on a multi-modal, multi-dimensional dataset and supports synthesis of monocular, multi-view, 3D static, and unconstrained free-view human videos, advancing human-centric video generation in multi-view controllability, coherence, and visual fidelity.

Architecture and Method

The core component is a cascaded 4D diffusion transformer that processes the latent spatio-temporal-view sequence in the denoising diffusion paradigm. Input consists of a reference image, dynamic SMPL body parameters over time, and camera pose parameters. UNet modules preprocess these signals into pixel-aligned feature embeddings, which are tokenized and passed to the transformer stack. The model factorizes global attention via distinct transformer blocks:

  • 2D Image Transformer: Attends over the H × W per-frame spatial tokens for local context.
  • View Transformer: Models cross-view correlations by grouping spatial tokens across viewpoints, with explicit injection of camera pose features.
  • Temporal Transformer: Aggregates long-range temporal dependencies by modeling tokens across time for every spatial-view combination.

These blocks are sequentially applied and repeated in depth, forming a cascaded 4D transformer block. This factorization decomposes the O(N^4) attention complexity, enabling scalability with respect to the spatial, temporal, and view axes (see Figure 1); a minimal sketch of the factorized block follows the figure.

Figure 1: Pipeline schematic of Human4DiT, showing the cascaded architecture with explicit spatial, view, and temporal transformer components, and multimodal control injection.
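The factorized pattern can be sketched in a few lines of PyTorch. The block below is illustrative only: the latent layout (B, V, T, S, D), the module name `Factorized4DBlock`, the `cam_emb`/`time_emb` arguments, and the placement of norms and residuals are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class Factorized4DBlock(nn.Module):
    """One cascaded block: spatial -> view -> temporal self-attention.

    Illustrative sketch only; widths, norm placement, and MLPs are assumptions.
    Latent layout: (batch B, views V, frames T, spatial tokens S = H*W, dim D).
    """

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.view_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, x, cam_emb=None, time_emb=None):
        B, V, T, S, D = x.shape

        # 2D image transformer: attention over the S spatial tokens of each frame.
        h = self.norm_s(x.reshape(B * V * T, S, D))
        x = x + self.spatial_attn(h, h, h)[0].reshape(B, V, T, S, D)

        # View transformer: attention across the V viewpoints for each
        # (frame, spatial token), with camera-pose embeddings injected first.
        h = x.permute(0, 2, 3, 1, 4).reshape(B * T * S, V, D)
        if cam_emb is not None:                      # cam_emb: (V, D)
            h = h + cam_emb.unsqueeze(0)
        h = self.norm_v(h)
        out = self.view_attn(h, h, h)[0].reshape(B, T, S, V, D)
        x = x + out.permute(0, 3, 1, 2, 4)

        # Temporal transformer: attention across the T frames for each
        # (view, spatial token), with frame-index embeddings injected first.
        h = x.permute(0, 1, 3, 2, 4).reshape(B * V * S, T, D)
        if time_emb is not None:                     # time_emb: (T, D)
            h = h + time_emb.unsqueeze(0)
        h = self.norm_t(h)
        out = self.temporal_attn(h, h, h)[0].reshape(B, V, S, T, D)
        x = x + out.permute(0, 1, 3, 2, 4)
        return x
```

Stacking several such blocks in depth yields the cascaded 4D transformer; because each attention operates along a single axis, its cost is quadratic only in that axis's length rather than in the full V·T·S token count.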

Conditioning modules inject human identity embeddings (via UNet and CLIP), frame-level SMPL normal maps for articulated pose, temporally encoded frame indices, and view-wise positional encodings derived from camera extrinsics. All control signals are mapped to the transformer’s latent token embedding space.
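As one possible realization of the camera and temporal conditioning, the helpers below map camera extrinsics and frame indices into the token embedding space. Both the MLP encoder and the sinusoidal scheme are hypothetical stand-ins for the encodings the paper actually uses.

```python
import math
import torch
import torch.nn as nn

def frame_index_embedding(indices: torch.Tensor, dim: int) -> torch.Tensor:
    """Sinusoidal encoding of frame indices (assumed scheme, in the style of
    standard diffusion timestep embeddings). Expects an even `dim`."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = indices.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)  # (T, dim)

class CameraPoseEmbed(nn.Module):
    """Hypothetical MLP mapping flattened 3x4 camera extrinsics to the
    transformer's token embedding space."""

    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(12, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, extrinsics: torch.Tensor) -> torch.Tensor:  # (V, 3, 4)
        return self.mlp(extrinsics.flatten(1))                    # (V, dim)
```

In this sketch the outputs would be supplied as `cam_emb` and `time_emb` to the factorized block above, while the identity features from the reference image (UNet pixel-aligned features plus CLIP tokens) would enter as additional context tokens, following the conditioning described in the text.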

Dataset and Training Paradigm

Human4DiT is trained on a curated, large-scale multi-modal dataset comprising:

  • Images for 2D identity feature learning
  • Monocular, multi-view, and 3D static videos
  • A limited set of fully-annotated 4D dynamic scans (view, time, pose)

Training utilizes a modality-aware mixed strategy: the respective transformer blocks are activated according to the available modalities, maximizing supervision for each subdimension (spatial, temporal, view). This allows efficient utilization of partial datasets and robust fusion across modalities for generalizable 4D motion generation.
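A plausible way to realize this mixed strategy is to gate which sub-transformers a batch exercises according to its modality. The training step below is a sketch under assumed interfaces (a `model` accepting an `active_blocks` argument and a `diffusion` helper for noising), not the paper's actual training code.

```python
import torch
import torch.nn.functional as F

# Hypothetical mapping from dataset modality to the sub-transformers it can supervise.
ACTIVE_BLOCKS = {
    "image":     {"spatial"},                      # 2D identity/appearance learning
    "monocular": {"spatial", "temporal"},          # single-view video
    "multiview": {"spatial", "view"},              # multi-view / 3D static data
    "4d":        {"spatial", "view", "temporal"},  # full 4D dynamic scans
}

def training_step(model, diffusion, optimizer, batch):
    """One denoising-loss step that only activates the blocks the batch supervises."""
    active = ACTIVE_BLOCKS[batch["modality"]]
    x0 = batch["latents"]                          # (B, V, T, S, D); V=1 or T=1 for partial data
    t = torch.randint(0, diffusion.num_steps, (x0.shape[0],), device=x0.device)
    noise = torch.randn_like(x0)
    xt = diffusion.add_noise(x0, noise, t)
    pred = model(xt, t, conditions=batch["conditions"], active_blocks=active)
    loss = F.mse_loss(pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```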

Spatial-Temporal Sampling for Free-View Generation

To overcome the inherent window-size/memory bottleneck of 4D transformer blocks, free-view long motion synthesis is implemented via a two-strategy spatio-temporal sampling procedure:

  1. Temporal-First Pass: Generates temporally extended monocular sequences with minimal view span, establishing long-term coherence.
  2. View-Windowed Pass: Refines multi-view consistency by processing larger spatial-view windows in shorter temporal fragments.

Predictions from the two strategies are merged within the denoising loop, balancing global view consistency against detailed temporal alignment, so that extended 360-degree free-viewpoint sequences can be generated without attending over a prohibitively large 4D token volume; a sketch of the merge appears below.
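One way such a merge could look inside a single denoising step is sketched here; the window representation (pairs of view/time slices) and the blend weight `alpha` are illustrative assumptions rather than the paper's exact procedure.

```python
import torch

def merged_denoise_prediction(model, xt, t, cond, temporal_windows, view_windows, alpha=0.5):
    """Blend noise predictions from a temporal-first pass (few views, long clips)
    and a view-windowed pass (many views, short clips) over the (B, V, T, S, D)
    latent. Sketch only; window lists hold (view_slice, time_slice) pairs."""
    pred = torch.zeros_like(xt)
    weight = torch.zeros_like(xt)

    # Pass 1: temporal-first windows establish long-range temporal coherence.
    for vs, ts in temporal_windows:
        pred[:, vs, ts] += alpha * model(xt[:, vs, ts], t, cond)
        weight[:, vs, ts] += alpha

    # Pass 2: view windows enforce cross-view (360-degree) consistency.
    for vs, ts in view_windows:
        pred[:, vs, ts] += (1.0 - alpha) * model(xt[:, vs, ts], t, cond)
        weight[:, vs, ts] += 1.0 - alpha

    # Normalize where the passes overlap; the result feeds the usual denoising update.
    return pred / weight.clamp_min(1e-6)
```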

Experimental Results

Comprehensive evaluations benchmark Human4DiT against state-of-the-art video generation baselines, including diffusion-based approaches (Champ, MagicAnimate, AnimateAnyone, Disco). The results demonstrate:

  • Substantial gains in reconstruction metrics: On monocular video synthesis, Human4DiT achieves PSNR 26.12, SSIM 0.888, LPIPS 0.116, and FVD 237.4, surpassing all baselines (maximum improvements of roughly +3.2 PSNR, +0.06 SSIM, −0.06 LPIPS, and −122 FVD).
  • Multi-view and 4D Consistency: The model consistently delivers higher cross-view and 4D dynamic realism (e.g., Free-view PSNR 25.02, SSIM 0.947, LPIPS 0.062, FVD 234.8).
  • Ablation validates the necessity of view transformer blocks: Disabling the view transformers leads to significant degradation on all multi-view metrics.

    Figure 2: Qualitative comparison for monocular video generation highlighting enhanced temporal and spatial coherence in Human4DiT outputs.

Qualitative analyses further show that generated videos are largely free of typical artifacts (jitter, identity drift, multi-face errors), preserving high-fidelity human detail and articulated motion across both the temporal and viewpoint axes.

Discussion and Implications

Human4DiT establishes that transformer-based diffusion models, when armed with computationally efficient factorized attention and multimodal conditioning, can overcome the limitations of UNet-centric architectures in global coherence and viewpoint control for human video generation. The explicit modeling of high-dimensional spatial-temporal-view correlations yields strong generalization to challenging tasks (e.g., arbitrarily long, 360-degree, identity-preserving motion synthesis from a single image), expanding applicability in virtual reality, digital humans, and content authoring.

The methodology advances the field in two major aspects: (1) efficient global modeling of articulated, view-dependent human motion without explicit 3D reconstruction, and (2) a scalable, modality-agnostic training regimen capable of leveraging heterogeneous, partially annotated datasets. However, the absence of explicit 4D geometric representations implies some failure modes in subtle structure and occlusion reasoning (notably for fingers and accessories).

Conclusion

Human4DiT (2405.17405) demonstrates a rigorously engineered transformer-based diffusion approach for free-view, temporally consistent, and identity-preserving human video synthesis. Through its cascaded 4D transformer architecture and modality-adaptive learning paradigm, the system delivers empirically superior results across diverse generation scenarios. The framework’s design principles and observed limitations inform future research on integrating explicit 3D scene awareness and finer structure modeling, with potential implications across volumetric video generation, digital avatars, and dynamic scene synthesis.
