Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Abstract: We present Stable Video Diffusion, a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation. Recently, latent diffusion models trained for 2D image synthesis have been turned into generative video models by inserting temporal layers and finetuning them on small, high-quality video datasets. However, training methods in the literature vary widely, and the field has yet to agree on a unified strategy for curating video data. In this paper, we identify and evaluate three distinct stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning. Furthermore, we demonstrate the necessity of a well-curated pretraining dataset for generating high-quality videos and present a systematic curation process, including captioning and filtering strategies, to train a strong base model. We then explore the impact of finetuning our base model on high-quality data and train a text-to-video model that is competitive with closed-source video generation. We also show that our base model provides a powerful motion representation for downstream tasks such as image-to-video generation, and that it adapts to camera-motion-specific LoRA modules. Finally, we demonstrate that our model provides a strong multi-view 3D prior and can serve as a base for finetuning a multi-view diffusion model that jointly generates multiple views of objects in a feedforward fashion, outperforming image-based methods at a fraction of their compute budget. We release code and model weights at https://github.com/Stability-AI/generative-models.
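The curation process mentioned above includes filtering low-motion clips out of the pretraining set. As a rough, hypothetical illustration only (the paper filters using dense optical flow magnitudes; here a mean absolute frame difference stands in as a much simpler motion proxy, and all names below are invented for this sketch):

```python
def motion_score(frames):
    """Mean absolute pixel difference between consecutive frames.
    A crude stand-in for the optical-flow-based motion measure
    used in the paper's curation pipeline."""
    total, count = 0.0, 0
    for prev, cur in zip(frames, frames[1:]):
        for p, c in zip(prev, cur):
            total += abs(c - p)
            count += 1
    return total / count if count else 0.0

def filter_static_clips(clips, threshold=1.0):
    """Keep only clips whose average motion exceeds the threshold,
    discarding near-static videos before pretraining."""
    return [clip for clip in clips if motion_score(clip) > threshold]

# Toy example: each clip is a list of flattened grayscale frames.
static_clip = [[10, 10, 10]] * 4                       # no change across frames
moving_clip = [[0, 0, 0], [5, 5, 5], [10, 10, 10]]     # steady brightness ramp
kept = filter_static_clips([static_clip, moving_clip], threshold=1.0)
```

In a real pipeline the per-pixel difference would be replaced by an optical-flow estimate (e.g. Farnebäck flow) averaged over the clip, and the threshold would be tuned on held-out data; the structure of the filter stays the same.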