Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Abstract: We present Stable Video Diffusion, a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation. Recently, latent diffusion models trained for 2D image synthesis have been turned into generative video models by inserting temporal layers and finetuning them on small, high-quality video datasets. However, training methods in the literature vary widely, and the field has yet to agree on a unified strategy for curating video data. In this paper, we identify and evaluate three distinct stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning. Furthermore, we demonstrate the necessity of a well-curated pretraining dataset for generating high-quality videos and present a systematic curation process, including captioning and filtering strategies, to train a strong base model. We then explore the impact of finetuning our base model on high-quality data and train a text-to-video model that is competitive with closed-source video generation. We also show that our base model provides a powerful motion representation for downstream tasks such as image-to-video generation, and that it adapts to camera-motion-specific LoRA modules. Finally, we demonstrate that our model provides a strong multi-view 3D prior and can serve as a base for finetuning a multi-view diffusion model that jointly generates multiple views of objects in a feedforward fashion, outperforming image-based methods at a fraction of their compute budget. We release code and model weights at https://github.com/Stability-AI/generative-models.
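The curation process mentioned above includes filtering low-motion clips out of the pretraining set. As a rough, hypothetical illustration only (the paper filters using dense optical flow magnitudes; here a mean absolute frame difference stands in as a much simpler motion proxy, and all names below are invented for this sketch):

```python
def motion_score(frames):
    """Mean absolute pixel difference between consecutive frames.
    A crude stand-in for the optical-flow-based motion measure
    used in the paper's curation pipeline."""
    total, count = 0.0, 0
    for prev, cur in zip(frames, frames[1:]):
        for p, c in zip(prev, cur):
            total += abs(c - p)
            count += 1
    return total / count if count else 0.0

def filter_static_clips(clips, threshold=1.0):
    """Keep only clips whose average motion exceeds the threshold,
    discarding near-static videos before pretraining."""
    return [clip for clip in clips if motion_score(clip) > threshold]

# Toy example: each clip is a list of flattened grayscale frames.
static_clip = [[10, 10, 10]] * 4                       # no change across frames
moving_clip = [[0, 0, 0], [5, 5, 5], [10, 10, 10]]     # steady brightness ramp
kept = filter_static_clips([static_clip, moving_clip], threshold=1.0)
```

In a real pipeline the per-pixel difference would be replaced by an optical-flow estimate (e.g. Farnebäck flow) averaged over the clip, and the threshold would be tuned on held-out data; the structure of the filter stays the same.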