
Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance

Published 1 Jun 2023 in cs.CV (arXiv:2306.00943v1)

Abstract: Creating a vivid video from the event or scenario in our imagination is a truly fascinating experience. Recent advancements in text-to-video synthesis have unveiled the potential to achieve this with prompts only. While text is convenient in conveying the overall scene context, it may be insufficient to control precisely. In this paper, we explore customized video generation by utilizing text as context description and motion structure (e.g. frame-wise depth) as concrete guidance. Our method, dubbed Make-Your-Video, involves joint-conditional video generation using a Latent Diffusion Model that is pre-trained for still image synthesis and then promoted for video generation with the introduction of temporal modules. This two-stage learning scheme not only reduces the computing resources required, but also improves the performance by transferring the rich concepts available in image datasets solely into video generation. Moreover, we use a simple yet effective causal attention mask strategy to enable longer video synthesis, which mitigates the potential quality degradation effectively. Experimental results show the superiority of our method over existing baselines, particularly in terms of temporal coherence and fidelity to users' guidance. In addition, our model enables several intriguing applications that demonstrate potential for practical usage.

Citations (59)

Summary

  • The paper presents a novel two-stage learning process that integrates textual descriptions and structural cues for precise video generation.
  • It adapts a pre-trained latent diffusion model for video synthesis, achieving superior temporal coherence and fidelity compared to existing methods.
  • Quantitative evaluations demonstrate significant improvements over baselines, paving the way for advanced applications like dynamic scene modeling and video re-rendering.

Overview of Customized Video Generation

Research in AI-driven video generation has taken a leap forward with the use of text prompts to guide video synthesis. Although text can capture a scene's overall context, it often falls short of providing precise control over the video content. Recognizing this, the paper introduces a method called "Make-Your-Video" that combines textual descriptions with structural guidance, such as frame-wise depth maps, to produce precisely controlled, customized videos.
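The core idea of joint conditioning can be illustrated with a minimal sketch. The shapes and the channel-concatenation scheme below are illustrative assumptions, not the paper's exact architecture: a common way to inject dense structural guidance (like per-frame depth) into a latent diffusion model is to concatenate it with the noisy latent along the channel axis, while the text prompt is supplied separately as an embedding for cross-attention.

```python
import numpy as np

# Hypothetical dimensions, for illustration only:
# batch, frames, latent channels, height, width.
B, T, C, H, W = 2, 16, 4, 32, 32

latents = np.random.randn(B, T, C, H, W).astype(np.float32)   # noisy video latents
depth = np.random.randn(B, T, 1, H, W).astype(np.float32)     # one depth map per frame

# Channel-wise concatenation: the denoiser sees the structural guidance
# aligned with the latent at every spatial location of every frame.
denoiser_input = np.concatenate([latents, depth], axis=2)
assert denoiser_input.shape == (B, T, C + 1, H, W)
```

The text condition is not shown here; in practice it would be encoded (e.g. by a text encoder) and attended to inside the denoiser rather than concatenated spatially.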

The Innovation

The method adapts a Latent Diffusion Model (LDM), originally pre-trained for still-image synthesis, to video generation. In a two-stage learning process, the researchers first train spatial modules on image datasets rich in visual concepts, then add temporal modules for video-specific coherence. A key challenge was to ensure the generated videos were not only high quality but also temporally coherent over longer durations. The solution is a causal attention mask strategy that enables longer video synthesis while mitigating quality degradation, allowing the model to generate extended sequences that remain faithful to the user's guidance.
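The causal masking idea can be sketched as follows. This is a generic illustration of causal (lower-triangular) attention over frames, under the assumption that each frame attends only to itself and earlier frames; it is not the paper's exact implementation.

```python
import numpy as np

def causal_mask(num_frames: int) -> np.ndarray:
    """Boolean mask where frame i may attend only to frames j <= i."""
    return np.tril(np.ones((num_frames, num_frames), dtype=bool))

def masked_softmax(scores: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Softmax over the last axis, with masked-out positions forced to zero weight."""
    scores = np.where(mask, scores, -np.inf)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

T = 4
mask = causal_mask(T)
scores = np.random.randn(T, T)            # raw frame-to-frame attention scores
attn = masked_softmax(scores, mask)

# Each row is a valid distribution, and no frame attends to future frames.
assert np.allclose(attn.sum(axis=-1), 1.0)
assert np.all(attn[~mask] == 0)
```

Restricting attention this way keeps the temporal context consistent when the model is rolled out beyond its training length, which is what allows longer sequences without the quality collapse that unrestricted attention can cause.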

Model Performance

The method outperforms existing baselines in both temporal coherence and fidelity to user guidance, as shown through quantitative evaluations on established benchmarks. These results indicate that combining textual and structural guidance gives users notably finer-grained control over the video generation process than text alone.

Practical Applications and Future Implications

The flexibility of the Make-Your-Video model opens up numerous practical applications, from transforming real-life scene setups into photorealistic videos to dynamic 3D scene modeling and video re-rendering. It also shows promise in practical scenarios beyond the reach of other existing text-to-video techniques. While the current model has limitations, such as a lack of precise control over visual appearance and a dependence on frame-wise depth guidance, it marks a significant step toward efficient, controllable video generation that aligns with user intentions.

In conclusion, the "Make-Your-Video" model sets a new standard for AI-generated videos that are not only visually impressive but also align closely with human creativity and control. This method is a stride towards bridging the gap between imagining a scene and bringing it to life through video.
