Movie Gen: A Cast of Media Foundation Models

Published 17 Oct 2024 in cs.CV, cs.AI, cs.LG, and eess.IV (arXiv:2410.13720v2)

Abstract: We present Movie Gen, a cast of foundation models that generates high-quality, 1080p HD videos with different aspect ratios and synchronized audio. We also show additional capabilities such as precise instruction-based video editing and generation of personalized videos based on a user's image. Our models set a new state-of-the-art on multiple tasks: text-to-video synthesis, video personalization, video editing, video-to-audio generation, and text-to-audio generation. Our largest video generation model is a 30B parameter transformer trained with a maximum context length of 73K video tokens, corresponding to a generated video of 16 seconds at 16 frames-per-second. We show multiple technical innovations and simplifications on the architecture, latent spaces, training objectives and recipes, data curation, evaluation protocols, parallelization techniques, and inference optimizations that allow us to reap the benefits of scaling pre-training data, model size, and training compute for training large scale media generation models. We hope this paper helps the research community to accelerate progress and innovation in media generation models. All videos from this paper are available at https://go.fb.me/MovieGenResearchVideos.
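The context-length figure in the abstract can be sanity-checked with simple arithmetic; note that the tokens-per-frame count below is a derived estimate, not a number stated by the paper:

```python
# Sanity check of the abstract's numbers: 73K video tokens for a
# 16-second clip at 16 frames per second.
fps = 16
seconds = 16
context_tokens = 73_000  # "73K video tokens" (approximate)

frames = fps * seconds                       # 256 frames in the clip
tokens_per_frame = context_tokens / frames   # derived: ~285 tokens/frame

print(frames)                  # 256
print(round(tokens_per_frame)) # 285
```

Since the model operates in a compressed latent space, the ~285 tokens per frame would correspond to spatial latent patches rather than raw pixels.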

Citations (43)

Summary

  • The paper introduces a 30B parameter transformer with a 73K-token maximum context length, setting new benchmarks for text-to-video synthesis and editing.
  • It details methods for video personalization by preserving identity features from facial inputs, enabling customized video generation from textual prompts.
  • Additionally, the paper presents innovative techniques for video editing and cinematic audio generation using diffusion transformers and flow-matching objectives.

Overview of "Movie Gen: A Cast of Media Foundation Models"

The paper "Movie Gen: A Cast of Media Foundation Models" introduces a comprehensive suite of foundation models designed to generate high-quality 1080p HD videos with synchronized audio, showcasing capabilities such as text-to-video synthesis, video personalization, and precise video editing. These models represent the state-of-the-art across multiple tasks, effectively setting new benchmarks for media generation.

Key Contributions

  1. Model Architecture and Training:
    • The core of Movie Gen's architecture is a 30B parameter transformer model trained with a maximum context length of 73K video tokens, equivalent to generating 16 seconds of video at 16 FPS.
    • The paper outlines several technical innovations in architecture design, data curation, training protocols, and inference optimizations. These enhancements let the model reap the benefits of scaling pre-training data, model size, and compute.
  2. Text-to-Video Generation:
    • Movie Gen Video, the largest model in the suite, excels in text-to-image and text-to-video generation, supporting multiple aspect ratios and resolutions. This model is pretrained on a vast dataset comprising both videos and images.
    • The training process involves stages for scaling resolution and refining the model with high-quality video datasets to improve the motion and aesthetic quality of outputs.
  3. Video Personalization:
    • The Personalized Movie Gen Video model is capable of generating videos featuring specific individuals based on facial input, preserving identity while adhering to text prompts.
    • The model is trained with a blend of paired and cross-paired data and utilizes a vision encoder to capture identity features from reference images.
  4. Video Editing:
    • Movie Gen Edit demonstrates state-of-the-art performance in video editing by employing innovative training techniques without relying on supervised video editing data.
    • Key to its success is a multi-stage training process that begins with image editing and proceeds to more complex tasks like synthetic multi-frame video editing and backtranslation.
  5. Audio Generation:
    • Movie Gen Audio, a 13B parameter model, generates high-quality cinematic soundtracks, with sound effects and music aligned to the video input.
    • It employs a novel combination of flow-matching training objectives and diffusion transformers, alongside audio codecs, to support long-form video audio generation.
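The flow-matching objective mentioned above can be illustrated with a minimal NumPy sketch. This is a generic conditional flow-matching (rectified-flow) loss, not the paper's actual implementation; the toy model, batch shapes, and linear interpolation path are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(model, x1, rng):
    """One flow-matching training loss (straight-path / rectified-flow form).

    x1 is a batch of clean latents (stand-ins for media latents here).
    For the linear path x_t = (1 - t) * x0 + t * x1, the target velocity
    is simply x1 - x0, and the network regresses onto it.
    """
    x0 = rng.standard_normal(x1.shape)       # noise sample
    t = rng.uniform(size=(x1.shape[0], 1))   # per-sample timestep in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1            # point on the straight path
    v_target = x1 - x0                       # velocity of that path
    v_pred = model(x_t, t)                   # network's predicted velocity
    return np.mean((v_pred - v_target) ** 2)

# Toy "model": predicts zero velocity everywhere, so the loss is
# just the mean squared norm of the target velocities.
zero_model = lambda x_t, t: np.zeros_like(x_t)
batch = rng.standard_normal((8, 4))
loss = flow_matching_loss(zero_model, batch, rng)
print(loss > 0)  # True
```

In practice the `model` would be the diffusion transformer operating on audio or video latents, with the timestep `t` injected via conditioning rather than passed as a raw array.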

Implications and Future Directions

The Movie Gen models have extended the boundaries of generative AI for media and offer promising implications for industries ranging from entertainment to personalized content creation.

  • Scalability and Efficiency: The methodologies demonstrated for scaling models imply that larger architectures can be efficiently managed and trained across extensive datasets, paving the way for further enhancements in media generation quality and diversity.
  • Benchmarking and Open Research: The release of comprehensive benchmarks like Movie Gen Video Bench and Movie Gen Audio Bench aims to standardize evaluation metrics, ensuring robust comparisons in future research.
  • Applications and Ethical Considerations: As these models approach real-world deployment, there are significant considerations for ethical usage, including bias, misuse, and the sociocultural impacts of media content generated by AI.

Overall, this paper marks a substantial advancement in the domain of media generation, providing a cornerstone for continued research and application in generative AI. It underscores the potential and challenges of scaling AI capabilities in video and audio synthesis, offering both technical and conceptual insights into building the next generation of generative models.
