
Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation

Published 21 Feb 2024 in cs.CV (arXiv:2402.13729v4)

Abstract: Generating high-quality videos with realistic content is a challenging task due to the high dimensionality and complexity of video data. Several recent diffusion-based methods have shown competitive performance by compressing videos into a lower-dimensional latent space using a conventional video autoencoder architecture. However, such methods, which employ standard frame-wise 2D or 3D convolutions, fail to fully exploit the spatio-temporal nature of videos. To address this issue, we propose a novel hybrid video diffusion model, called HVDM, which captures spatio-temporal dependencies more effectively. HVDM is trained with a hybrid video autoencoder that extracts a disentangled representation of the video: (i) global context information captured by a 2D projected latent, (ii) local volume information captured by 3D convolutions with wavelet decomposition, and (iii) frequency information for improving the video reconstruction. Based on this disentangled representation, our hybrid autoencoder provides a more comprehensive video latent, enriching the generated videos with fine structures and details. Experiments on video generation benchmarks (UCF101, SkyTimelapse, and TaiChi) demonstrate that the proposed approach achieves state-of-the-art video generation quality and supports a wide range of video applications (e.g., long video generation, image-to-video, and video dynamics control).
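To make the two representations named in the abstract concrete, here is a minimal NumPy sketch of (i) a 2D projection of a video volume (averaging along each axis, a common way to obtain triplane-style global context) and (ii) a single-level 3D Haar wavelet decomposition that separates low- and high-frequency local volume information. This is an illustrative assumption about the general techniques, not the authors' HVDM implementation; the function names and the choice of mean-pooling for the projection are hypothetical.

```python
import numpy as np

def triplane_project(video):
    """Project a video volume (T, H, W) onto three 2D planes by averaging
    along each axis -- a simple stand-in for a 2D projected latent."""
    return (video.mean(axis=0),   # H-W plane (temporal average)
            video.mean(axis=1),   # T-W plane
            video.mean(axis=2))   # T-H plane

def haar3d(video):
    """One level of an orthonormal 3D Haar wavelet transform: split the
    volume into 8 sub-bands (low/high frequency along each of T, H, W).
    Assumes even extents along every axis."""
    def split(x, axis):
        a = np.take(x, np.arange(0, x.shape[axis], 2), axis=axis)
        b = np.take(x, np.arange(1, x.shape[axis], 2), axis=axis)
        return (a + b) / np.sqrt(2), (a - b) / np.sqrt(2)

    lo_t, hi_t = split(video, 0)
    bands = []
    for t_band in (lo_t, hi_t):
        lo_h, hi_h = split(t_band, 1)
        for h_band in (lo_h, hi_h):
            lo_w, hi_w = split(h_band, 2)
            bands.extend([lo_w, hi_w])
    return bands  # [LLL, LLH, LHL, LHH, HLL, HLH, HHL, HHH]
```

Because the Haar transform here is orthonormal, the total energy of the eight sub-bands equals that of the input volume, so no information is lost before the convolutional encoder processes each band.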
