
Human4DiT: 360-degree Human Video Generation with 4D Diffusion Transformer

Published 27 May 2024 in cs.CV (arXiv:2405.17405v2)

Abstract: We present a novel approach for generating 360-degree high-quality, spatio-temporally coherent human videos from a single image. Our framework combines the strengths of diffusion transformers for capturing global correlations across viewpoints and time, and CNNs for accurate condition injection. The core is a hierarchical 4D transformer architecture that factorizes self-attention across views, time steps, and spatial dimensions, enabling efficient modeling of the 4D space. Precise conditioning is achieved by injecting human identity, camera parameters, and temporal signals into the respective transformers. To train this model, we collect a multi-dimensional dataset spanning images, videos, multi-view data, and limited 4D footage, along with a tailored multi-dimensional training strategy. Our approach overcomes the limitations of previous methods based on generative adversarial networks or vanilla diffusion models, which struggle with complex motions, viewpoint changes, and generalization. Through extensive experiments, we demonstrate our method's ability to synthesize 360-degree realistic, coherent human motion videos, paving the way for advanced multimedia applications in areas such as virtual reality and animation.

References (50)
  1. All are worth words: A vit backbone for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22669–22679, 2023.
  2. Person image synthesis via denoising diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5968–5976, 2023.
  3. Bedlam: A synthetic dataset of bodies exhibiting detailed lifelike animated motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8726–8737, 2023.
  4. Dna-rendering: A diverse neural actor repository for high-fidelity human-centric rendering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19982–19993, 2023.
  5. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
  6. Cameractrl: Enabling camera control for text-to-video generation, 2024.
  7. Animate anyone: Consistent and controllable image-to-video synthesis for character animation. arXiv preprint arXiv:2311.17117, 2023.
  8. Learning high fidelity depths of dressed humans by watching social media dance videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12753–12762, 2021.
  9. Human-art: A versatile human-centric dataset bridging natural and artificial scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 618–629, 2023.
  10. Dreampose: Fashion video synthesis with stable diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22680–22690, 2023.
  11. Same: Skeleton-agnostic motion embedding for character animation. In SIGGRAPH Asia 2023 Conference Papers, pages 1–11, 2023.
  12. Motion-x: A large-scale 3d expressive whole-body human motion dataset. Advances in Neural Information Processing Systems, 36, 2024.
  13. Zero-1-to-3: Zero-shot one image to 3d object. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9298–9309, 2023a.
  14. Syncdreamer: Generating multiview-consistent images from a single-view image. arXiv preprint arXiv:2309.03453, 2023b.
  15. Wonder3d: Single image to 3d using cross-domain diffusion. arXiv preprint arXiv:2310.15008, 2023.
  16. Smpl: A skinned multi-person linear model. In Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pages 851–866. 2023.
  17. Vdt: General-purpose video diffusion transformers via mask modeling. In The Twelfth International Conference on Learning Representations, 2023.
  18. Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048, 2024.
  19. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  20. OpenAI. Video generation models as world simulators. https://openai.com/index/video-generation-models-as-world-simulators/, 2024. Accessed: 2024-05-19.
  21. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
  22. Improving language understanding by generative pre-training. 2018.
  23. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  24. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  25. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015.
  26. Zero123++: a single image to consistent multi-view diffusion base model. arXiv preprint arXiv:2310.15110, 2023a.
  27. Mvdream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512, 2023b.
  28. Deformable gans for pose-based human image generation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3408–3416, 2018.
  29. Appearance and pose-conditioned human image generation using deformable gans. IEEE transactions on pattern analysis and machine intelligence, 43(4):1156–1171, 2019a.
  30. First order motion model for image animation. Advances in neural information processing systems, 32, 2019b.
  31. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7262–7272, 2021.
  32. A good image generator is what you need for high-resolution video synthesis. arXiv preprint arXiv:2104.15069, 2021.
  33. Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pages 10347–10357. PMLR, 2021.
  34. Aist dance video database: Multi-genre, multi-dancer, and multi-camera database for dance information processing. In Proceedings of the 20th International Society for Music Information Retrieval Conference, ISMIR 2019, Delft, Netherlands, 2019.
  35. Twindom. Twindom 3d avatar dataset, 2022.
  36. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  37. Disco: Disentangled control for referring human dance generation in real world. arXiv e-prints, pages arXiv–2307, 2023.
  38. One-shot free-view neural talking-head synthesis for video conferencing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10039–10049, 2021.
  39. G3an: Disentangling appearance and motion for video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5264–5273, 2020.
  40. Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in neural information processing systems, 34:12077–12090, 2021.
  41. Magicanimate: Temporally consistent human image animation using diffusion model. arXiv preprint arXiv:2311.16498, 2023.
  42. Direct-a-video: Customized video generation with user-directed camera movement and object motion. arXiv preprint arXiv:2402.03162, 2024.
  43. Generating holistic 3d human motion from speech. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 469–480, 2023.
  44. Function4d: Real-time human volumetric capture from very sparse consumer rgbd sensors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR2021), 2021.
  45. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In Proceedings of the IEEE/CVF international conference on computer vision, pages 558–567, 2021.
  46. Closet: Modeling clothed humans on continuous surface with explicit template decomposition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 501–511, 2023a.
  47. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023b.
  48. Fast training of diffusion models with masked transformers. arXiv preprint arXiv:2306.09305, 2023.
  49. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6881–6890, 2021.
  50. Champ: Controllable and consistent human image animation with 3d parametric guidance. arXiv preprint arXiv:2403.14781, 2024.

Summary

  • The paper introduces a transformer-based diffusion approach that generates high-fidelity 360-degree human videos from a single image with arbitrary camera views.
  • It employs a cascaded 4D transformer architecture that factorizes attention across spatial, temporal, and view dimensions, enhancing multi-view coherence.
  • Experimental results demonstrate significant improvements in PSNR, SSIM, and FVD, validating its effectiveness for free-view video synthesis.

4D Diffusion Transformer for Free-View Human Video Generation

Introduction

Human4DiT proposes a unified, transformer-based diffusion framework for generating high-fidelity, spatio-temporally coherent human videos from a single reference image, under arbitrary camera viewpoints and with support for complex articulated motions. The method combines the precise local condition injection of UNets with the global relational modeling capacity of transformers, introducing a computationally efficient 4D diffusion transformer architecture that factorizes attention across spatial, temporal, and view axes. The system is trained on a multi-modal, multi-dimensional dataset and supports synthesis of monocular, multi-view, 3D static, and unconstrained free-view human videos, advancing human-centric video generation in multi-view controllability, coherence, and visual fidelity.

Architecture and Method

The core component is a cascaded 4D diffusion transformer that processes the latent spatio-temporal-view sequence in the denoising diffusion paradigm. Input consists of a reference image, dynamic SMPL body parameters over time, and camera pose parameters. UNet modules preprocess these signals into pixel-aligned feature embeddings, which are tokenized and passed to the transformer stack. The model factorizes global attention via distinct transformer blocks:

  • 2D Image Transformer: Attends over the H × W per-frame spatial tokens for local context.
  • View Transformer: Models cross-view correlations by grouping spatial tokens across viewpoints, with explicit injection of camera pose features.
  • Temporal Transformer: Aggregates long-range temporal dependencies by modeling tokens across time for every spatial-view combination.

These blocks are sequentially applied and repeated in depth, forming a cascaded 4D transformer block. This factorization decomposes the O(N^4) attention complexity, enabling scalability with respect to the spatial, temporal, and view axes (see Figure 1); a minimal sketch of the factorized block follows the figure.

Figure 1: Pipeline schematic of Human4DiT, showing the cascaded architecture with explicit spatial, view, and temporal transformer components, and multimodal control injection.
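The factorized pattern can be sketched in a few lines of PyTorch. The block below is illustrative only: the latent layout (B, V, T, S, D), the module name `Factorized4DBlock`, the `cam_emb`/`time_emb` arguments, and the placement of norms and residuals are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class Factorized4DBlock(nn.Module):
    """One cascaded block: spatial -> view -> temporal self-attention.

    Illustrative sketch only; widths, norm placement, and MLPs are assumptions.
    Latent layout: (batch B, views V, frames T, spatial tokens S = H*W, dim D).
    """

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.view_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, x, cam_emb=None, time_emb=None):
        B, V, T, S, D = x.shape

        # 2D image transformer: attention over the S spatial tokens of each frame.
        h = self.norm_s(x.reshape(B * V * T, S, D))
        x = x + self.spatial_attn(h, h, h)[0].reshape(B, V, T, S, D)

        # View transformer: attention across the V viewpoints for each
        # (frame, spatial token), with camera-pose embeddings injected first.
        h = x.permute(0, 2, 3, 1, 4).reshape(B * T * S, V, D)
        if cam_emb is not None:                      # cam_emb: (V, D)
            h = h + cam_emb.unsqueeze(0)
        h = self.norm_v(h)
        out = self.view_attn(h, h, h)[0].reshape(B, T, S, V, D)
        x = x + out.permute(0, 3, 1, 2, 4)

        # Temporal transformer: attention across the T frames for each
        # (view, spatial token), with frame-index embeddings injected first.
        h = x.permute(0, 1, 3, 2, 4).reshape(B * V * S, T, D)
        if time_emb is not None:                     # time_emb: (T, D)
            h = h + time_emb.unsqueeze(0)
        h = self.norm_t(h)
        out = self.temporal_attn(h, h, h)[0].reshape(B, V, S, T, D)
        x = x + out.permute(0, 1, 3, 2, 4)
        return x
```

Stacking several such blocks in depth yields the cascaded 4D transformer; because each attention operates along a single axis, its cost is quadratic only in that axis's length rather than in the full V·T·S token count.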

Conditioning modules inject human identity embeddings (via UNet and CLIP), frame-level SMPL normal maps for articulated pose, temporally encoded frame indices, and view-wise positional encodings derived from camera extrinsics. All control signals are mapped to the transformer’s latent token embedding space.
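As one possible realization of the camera and temporal conditioning, the helpers below map camera extrinsics and frame indices into the token embedding space. Both the MLP encoder and the sinusoidal scheme are hypothetical stand-ins for the encodings the paper actually uses.

```python
import math
import torch
import torch.nn as nn

def frame_index_embedding(indices: torch.Tensor, dim: int) -> torch.Tensor:
    """Sinusoidal encoding of frame indices (assumed scheme, in the style of
    standard diffusion timestep embeddings). Expects an even `dim`."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = indices.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)  # (T, dim)

class CameraPoseEmbed(nn.Module):
    """Hypothetical MLP mapping flattened 3x4 camera extrinsics to the
    transformer's token embedding space."""

    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(12, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, extrinsics: torch.Tensor) -> torch.Tensor:  # (V, 3, 4)
        return self.mlp(extrinsics.flatten(1))                    # (V, dim)
```

In this sketch the outputs would be supplied as `cam_emb` and `time_emb` to the factorized block above, while the identity features from the reference image (UNet pixel-aligned features plus CLIP tokens) would enter as additional context tokens, following the conditioning described in the text.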

Dataset and Training Paradigm

Human4DiT is trained on a curated, large-scale multi-modal dataset comprising:

  • Images for 2D identity feature learning
  • Monocular, multi-view, and 3D static videos
  • A limited set of fully-annotated 4D dynamic scans (view, time, pose)

Training utilizes a modality-aware mixed strategy: the respective transformer blocks are activated according to the available modalities, maximizing supervision for each subdimension (spatial, temporal, view). This allows efficient utilization of partial datasets and robust fusion across modalities for generalizable 4D motion generation.
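A plausible way to realize this mixed strategy is to gate which sub-transformers a batch exercises according to its modality. The training step below is a sketch under assumed interfaces (a `model` accepting an `active_blocks` argument and a `diffusion` helper for noising), not the paper's actual training code.

```python
import torch
import torch.nn.functional as F

# Hypothetical mapping from dataset modality to the sub-transformers it can supervise.
ACTIVE_BLOCKS = {
    "image":     {"spatial"},                      # 2D identity/appearance learning
    "monocular": {"spatial", "temporal"},          # single-view video
    "multiview": {"spatial", "view"},              # multi-view / 3D static data
    "4d":        {"spatial", "view", "temporal"},  # full 4D dynamic scans
}

def training_step(model, diffusion, optimizer, batch):
    """One denoising-loss step that only activates the blocks the batch supervises."""
    active = ACTIVE_BLOCKS[batch["modality"]]
    x0 = batch["latents"]                          # (B, V, T, S, D); V=1 or T=1 for partial data
    t = torch.randint(0, diffusion.num_steps, (x0.shape[0],), device=x0.device)
    noise = torch.randn_like(x0)
    xt = diffusion.add_noise(x0, noise, t)
    pred = model(xt, t, conditions=batch["conditions"], active_blocks=active)
    loss = F.mse_loss(pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```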

Spatial-Temporal Sampling for Free-View Generation

To overcome the inherent window-size/memory bottleneck of 4D transformer blocks, free-view long motion synthesis is implemented via a two-strategy spatio-temporal sampling procedure:

  1. Temporal-First Pass: Generates temporally extended monocular sequences with minimal view span, establishing long-term coherence.
  2. View-Windowed Pass: Refines multi-view consistency by processing larger spatial-view windows in shorter temporal fragments.

Predictions from the two strategies are merged within the denoising loop, balancing global view consistency against detailed temporal alignment, so that extended 360-degree free-viewpoint sequences can be generated without attending over a prohibitively large 4D token volume; a sketch of the merge appears below.
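One way such a merge could look inside a single denoising step is sketched here; the window representation (pairs of view/time slices) and the blend weight `alpha` are illustrative assumptions rather than the paper's exact procedure.

```python
import torch

def merged_denoise_prediction(model, xt, t, cond, temporal_windows, view_windows, alpha=0.5):
    """Blend noise predictions from a temporal-first pass (few views, long clips)
    and a view-windowed pass (many views, short clips) over the (B, V, T, S, D)
    latent. Sketch only; window lists hold (view_slice, time_slice) pairs."""
    pred = torch.zeros_like(xt)
    weight = torch.zeros_like(xt)

    # Pass 1: temporal-first windows establish long-range temporal coherence.
    for vs, ts in temporal_windows:
        pred[:, vs, ts] += alpha * model(xt[:, vs, ts], t, cond)
        weight[:, vs, ts] += alpha

    # Pass 2: view windows enforce cross-view (360-degree) consistency.
    for vs, ts in view_windows:
        pred[:, vs, ts] += (1.0 - alpha) * model(xt[:, vs, ts], t, cond)
        weight[:, vs, ts] += 1.0 - alpha

    # Normalize where the passes overlap; the result feeds the usual denoising update.
    return pred / weight.clamp_min(1e-6)
```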

Experimental Results

Comprehensive evaluations benchmark Human4DiT against state-of-the-art video generation baselines, including diffusion-based approaches (Champ, MagicAnimate, AnimateAnyone, Disco). The results demonstrate:

  • Substantial gains in reconstruction metrics: On monocular video synthesis, Human4DiT achieves PSNR 26.12, SSIM 0.888, LPIPS 0.116, and FVD 237.4, surpassing all baselines (maximum improvements of roughly +3.2 PSNR, +0.06 SSIM, −0.06 LPIPS, and −122 FVD).
  • Multi-view and 4D Consistency: The model consistently delivers higher cross-view and 4D dynamic realism (e.g., Free-view PSNR 25.02, SSIM 0.947, LPIPS 0.062, FVD 234.8).
  • Ablation validates the necessity of view transformer blocks: Disabling the view transformers leads to significant degradation on all multi-view metrics.

    Figure 2: Qualitative comparison for monocular video generation highlighting enhanced temporal and spatial coherence in Human4DiT outputs.

Qualitative analyses further show that generated videos are largely free of typical artifacts (jitter, identity drift, multi-face errors), preserving high-fidelity human detail and articulated motion across both the temporal and viewpoint axes.

Discussion and Implications

Human4DiT establishes that transformer-based diffusion models, when armed with computationally efficient factorized attention and multimodal conditioning, can overcome the limitations of UNet-centric architectures in global coherence and viewpoint control for human video generation. The explicit modeling of high-dimensional spatial-temporal-view correlations yields strong generalization to challenging tasks (e.g., arbitrarily long, 360-degree, identity-preserving motion synthesis from a single image), expanding applicability in virtual reality, digital humans, and content authoring.

The methodology advances the field in two major aspects: (1) efficient global modeling of articulated, view-dependent human motion without explicit 3D reconstruction, and (2) a scalable, modality-agnostic training regimen capable of leveraging heterogeneous, partially annotated datasets. However, the absence of explicit 4D geometric representations implies some failure modes in subtle structure and occlusion reasoning (notably for fingers and accessories).

Conclusion

Human4DiT (2405.17405) demonstrates a rigorously engineered transformer-based diffusion approach for free-view, temporally consistent, and identity-preserving human video synthesis. Through its cascaded 4D transformer architecture and modality-adaptive learning paradigm, the system delivers empirically superior results across diverse generation scenarios. The framework’s design principles and observed limitations inform future research on integrating explicit 3D scene awareness and finer structure modeling, with potential implications across volumetric video generation, digital avatars, and dynamic scene synthesis.
