
Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis

Published 22 Feb 2024 in cs.CV and cs.AI (arXiv:2402.14797v1)

Abstract: Contemporary models for generating images show remarkable quality and versatility. Swayed by these advantages, the research community repurposes them to generate videos. Since video content is highly redundant, we argue that naively bringing advances of image models to the video generation domain reduces motion fidelity, visual quality and impairs scalability. In this work, we build Snap Video, a video-first model that systematically addresses these challenges. To do that, we first extend the EDM framework to take into account spatially and temporally redundant pixels and naturally support video generation. Second, we show that a U-Net - a workhorse behind image generation - scales poorly when generating videos, requiring significant computational overhead. Hence, we propose a new transformer-based architecture that trains 3.31 times faster than U-Nets (and is ~4.5 times faster at inference). This allows us to efficiently train a text-to-video model with billions of parameters for the first time, reach state-of-the-art results on a number of benchmarks, and generate videos with substantially higher quality, temporal consistency, and motion complexity. The user studies showed that our model was favored by a large margin over the most recent methods. See our website at https://snap-research.github.io/snapvideo/.

Summary

  • The paper presents a video-first, transformer-based architecture that explicitly accounts for spatial and temporal redundancy in video synthesis.
  • It performs joint spatiotemporal computation on a compressed 1D latent representation, reducing computational overhead while improving motion fidelity.
  • The model achieves state-of-the-art performance on benchmarks like UCF101 and MSR-VTT, demonstrating superior text alignment and visual realism.

Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis

This essay analyzes the methodology and contributions of "Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis," offering an in-depth look at its technical approach. Snap Video introduces a novel approach to text-to-video generation that addresses shortcomings of current techniques, particularly limited temporal consistency and motion complexity.

Methodology and Technical Contributions

The paper identifies intrinsic limitations in adapting image generation models directly to video synthesis, notably their failure to account for the spatial and temporal redundancy of video content. Snap Video addresses these by adopting a transformer-based architecture in place of the traditional U-Net. The methodology extends the EDM framework to accommodate video-specific requirements, managing spatial and temporal dimensions in a unified manner and thereby naturally supporting video generation.
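
To make the EDM connection concrete, the sketch below shows the standard EDM preconditioning coefficients (Karras et al., 2022) with a single `input_scale` knob standing in for the paper's resolution- and duration-dependent rescaling of the noise level. The exact rescaling rule, the function names, and the `sigma_data` default are illustrative assumptions, not the paper's implementation.

```python
import torch

def edm_preconditioning(sigma: torch.Tensor,
                        sigma_data: float = 0.5,
                        input_scale: float = 1.0):
    """EDM preconditioning coefficients (Karras et al., 2022).

    `input_scale` is a hypothetical knob standing in for Snap Video's
    rescaling: treating spatially/temporally redundant pixels as a lower
    effective resolution shifts the effective noise level.
    """
    sigma = sigma * input_scale
    c_skip = sigma_data**2 / (sigma**2 + sigma_data**2)
    c_out = sigma * sigma_data / (sigma**2 + sigma_data**2).sqrt()
    c_in = 1.0 / (sigma**2 + sigma_data**2).sqrt()
    c_noise = sigma.log() / 4.0
    return c_skip, c_out, c_in, c_noise

def denoise(model, x_noisy, sigma, input_scale=1.0):
    """Wrap a raw network F into the EDM denoiser D."""
    c_skip, c_out, c_in, c_noise = edm_preconditioning(sigma, input_scale=input_scale)
    return c_skip * x_noisy + c_out * model(c_in * x_noisy, c_noise)
```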

Transformer-Based Architecture

Snap Video's use of transformers is significant: the architecture processes spatial and temporal data more efficiently than U-Nets. The spatiotemporal transformers operate on a compressed 1D latent representation for joint spatiotemporal computation, which significantly reduces computational overhead and improves scalability. This design allows Snap Video to handle video-specific challenges such as motion fidelity and visual quality efficiently (see Figure 1).

Figure 1: Analysis of Signal-to-Noise Ratio (SNR), demonstrating the impact of scale-adjusted noise application in video frames.
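
The following PyTorch sketch illustrates the general read-compute-write pattern behind such compressed-latent transformers (in the spirit of FIT-style architectures): a long sequence of spatiotemporal patch tokens is compressed into a short learnable latent sequence, the bulk of attention runs on that short sequence, and the result is written back. All module names, sizes, and the block layout are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class LatentTransformerBlock(nn.Module):
    """One read-compute-write block over a compressed 1D latent sequence."""

    def __init__(self, dim=512, num_latents=256, heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.read = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.compute = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                       batch_first=True, norm_first=True),
            num_layers=4)
        self.write = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, patch_tokens):
        # patch_tokens: (B, N, dim), where N covers all spatiotemporal patches.
        b = patch_tokens.shape[0]
        z = self.latents.unsqueeze(0).expand(b, -1, -1)
        # Read: compress the long patch sequence into a few latent tokens.
        z = z + self.read(z, patch_tokens, patch_tokens)[0]
        # Compute: the bulk of computation happens on the short sequence,
        # decoupling cost from video length and resolution.
        z = self.compute(z)
        # Write: redistribute information back to the patch tokens.
        patch_tokens = patch_tokens + self.write(patch_tokens, z, z)[0]
        return patch_tokens
```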

The transition from U-Nets to transformers yields faster training and inference, as evidenced by comparisons on internal datasets in which Snap Video outperforms baseline architectures in both speed and generative quality.

Performance Evaluation

Snap Video's effectiveness is highlighted through state-of-the-art performance metrics compared to existing models. On benchmarks such as UCF101 and MSR-VTT, Snap Video demonstrates superior performance in terms of Inception Score (IS) and Fréchet Video Distance (FVD). Additionally, user studies confirm the model's superiority in photorealism, text alignment, and motion rendering compared to models such as Gen-2, Pika, and Floor33 (see Figure 2).

Figure 2: Qualitative comparison results showing the temporal coherence achieved by Snap Video over existing methods.
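
For context, FVD is the Fréchet distance between Gaussians fitted to features of real and generated videos, with features extracted by a pretrained video network (commonly I3D). Below is a minimal sketch of the distance computation, assuming feature extraction has already been done; it is the standard formula, not the paper's evaluation code.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to two feature sets,
    each of shape (num_videos, feature_dim)."""
    mu_r, mu_f = feats_real.mean(0), feats_fake.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    # Matrix square root of the covariance product; take the real part
    # to drop small imaginary components from numerical error.
    covmean = linalg.sqrtm(cov_r @ cov_f).real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```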

Evaluation on standard datasets indicates the model's ability to produce dynamic and coherent motion, addressing issues of flickering and artifact generation that are prevalent in other models. Moreover, Snap Video displays better text-video alignment due to the robust integration of text embeddings.
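
Text-video alignment is commonly quantified with a CLIP-similarity-style score: embed the prompt and sampled frames with CLIP and average the cosine similarities. The sketch below illustrates that generic protocol with the public openai/clip-vit-base-patch32 checkpoint; it is a common proxy, not necessarily the exact metric used in the paper.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def clip_text_video_score(frames, prompt: str) -> float:
    """Average CLIP text-frame cosine similarity for one generated video.
    `frames` is a list of PIL images sampled from the video."""
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = processor(text=[prompt], images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return float((img @ txt.T).mean())
```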

Practical Implications and Future Considerations

The development of Snap Video has broad implications for the future of video synthesis. By demonstrating enhanced efficiency and scalability, the model provides a strong foundation for further advancements in video generation. It shows potential for application in diverse areas such as content creation, animation, and virtual reality, where dynamic video content is essential.

Additionally, Snap Video's architecture allows for future exploration into higher-resolution video synthesis, potentially expanding its applicability to even more complex visual tasks. The successful implementation of joint spatiotemporal modeling opens new avenues for integrating similar approaches in other generative models.

Conclusion

Snap Video sets a new benchmark in text-to-video synthesis by effectively addressing the limitations of prior image-model adaptations. It demonstrates the significant benefits of transformer-based architectures in processing spatiotemporal data for video generation. Future work should focus on expanding the model's capabilities towards higher resolutions and incorporating real-time adaptability, making it an integral part of advanced multimedia applications.
