- The paper introduces MotionDirector, a dual-path architecture that uses separate LoRAs for spatial and temporal features to decouple motion from appearance.
- It employs an appearance-debiased temporal loss to enhance motion fidelity while preserving diverse visual styles, and it outperforms prior methods such as Tune-A-Video.
- Experimental results using human and automatic metrics confirm its superior motion customization and adaptability across various video scenarios.
MotionDirector: Motion Customization of Text-to-Video Diffusion Models
The paper "MotionDirector: Motion Customization of Text-to-Video Diffusion Models" explores the growing area of customizing motion in video generation with diffusion models. The research addresses a significant gap in the text-to-video generation domain: customizing dynamic motion without compromising appearance diversity, a setting that had remained largely unexplored.
Technical Contributions
The primary contribution of this research is the introduction of MotionDirector, a dual-path architecture that decouples the learning of motion and appearance in video diffusion models. Unlike traditional methods that often bind motion to the limited appearances seen during training, this architecture applies Low-Rank Adaptations (LoRAs) separately to the spatial and temporal transformer layers. Through the integration of a novel appearance-debiased temporal loss, MotionDirector addresses the entanglement between motion and appearance, enabling the learned motion to generalize across varying appearances.
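The low-rank update at the heart of each path can be sketched as follows. This is a minimal NumPy illustration of a generic LoRA-adapted linear layer, not the paper's implementation; the rank, scale, and initialization values are illustrative assumptions.

```python
import numpy as np

class LoRALinear:
    """A frozen weight W plus a trainable low-rank update B @ A (sketch).

    The rank and scale below are illustrative defaults, not values from
    the MotionDirector paper.
    """
    def __init__(self, weight, rank=4, scale=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = weight                                        # frozen pretrained weight, shape (out, in)
        self.A = rng.normal(0, 0.02, (rank, weight.shape[1]))  # trainable down-projection
        self.B = np.zeros((weight.shape[0], rank))             # trainable up-projection, zero-initialized
        self.scale = scale

    def __call__(self, x):
        # y = x W^T + scale * x A^T B^T; with B zero-initialized, the
        # adapted layer starts out identical to the pretrained one.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

W = np.random.default_rng(1).normal(size=(8, 16))
layer = LoRALinear(W)
x = np.ones((2, 16))
# Before any training, the output matches the frozen layer exactly.
assert np.allclose(layer(x), x @ W.T)
```

Because only A and B are trained, separate LoRA sets for the spatial and temporal layers can be swapped in and out independently, which is what makes the dual-path decoupling cheap to train and deploy.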
The dual-path method is pivotal. It employs LoRAs in a spatial path to capture appearance from single frames and in a temporal path to capture motion from multiple frames. Notably, the temporal path reuses the spatial LoRAs, ensuring consistency in appearance across frames and thereby preventing the conflation of motion with the specific appearances seen during training. The appearance-debiased temporal loss further refines this by emphasizing motion learning while minimizing appearance-related biases.
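The debiasing idea can be sketched as follows: each frame's noise target is combined with that of a randomly chosen anchor frame, so the loss emphasizes frame-to-frame (motion) differences over the appearance signal shared across frames. The combination below and the default beta are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def appearance_debiased_mse(pred_noise, true_noise, beta=0.5, anchor_idx=0):
    """Sketch of an appearance-debiased temporal objective.

    pred_noise, true_noise: arrays of shape (frames, ...) holding the
    model's predicted noise and the ground-truth noise per frame. Each
    frame is debiased against an anchor frame to suppress the appearance
    component shared across frames. The weighting
    sqrt(beta**2 + 1) * eps - beta * eps_anchor and the default beta
    are illustrative assumptions.
    """
    def debias(eps):
        anchor = eps[anchor_idx:anchor_idx + 1]  # anchor frame, broadcast over frames
        return np.sqrt(beta**2 + 1.0) * eps - beta * anchor

    diff = debias(pred_noise) - debias(true_noise)
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
eps = rng.normal(size=(4, 3, 3))          # 4 frames of toy noise
assert appearance_debiased_mse(eps, eps) == 0.0   # perfect prediction -> zero loss
```

Subtracting a shared anchor component leaves residuals dominated by what changes between frames, which is precisely the motion signal the temporal LoRAs are meant to capture.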
Experimental Insights
The experimental evaluation on two benchmarks confirms the efficacy of MotionDirector in achieving superior motion customization capabilities:
- Performance on Diverse Baselines: Built on two foundation models, ModelScope and ZeroScope, MotionDirector consistently outperforms the baseline models and other adaptation techniques, notably Tune-A-Video, while maintaining a high degree of appearance diversity and motion fidelity.
- Versatility of LoRAs: With modest computational resources, MotionDirector efficiently trains LoRAs on single and multiple videos. This adaptability confirms its practical utility in varied scenarios, allowing for rapid retraining and deployment.
- Human and Automatic Evaluations: Human preference metrics indicate a strong preference for MotionDirector, particularly for motion fidelity and appearance diversity, and automatic evaluations support these findings with metrics such as CLIP score and PickScore.
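Of the automatic metrics above, the CLIP score is essentially a text-video alignment measure: the mean cosine similarity between a text embedding and per-frame image embeddings. The sketch below shows only that aggregation step; real pipelines obtain the embeddings from a pretrained CLIP model, and the vectors in the usage example are placeholders.

```python
import numpy as np

def clip_style_score(text_emb, frame_embs):
    """Sketch of a CLIP-style text-video alignment score.

    Returns the mean cosine similarity between one text embedding,
    shape (d,), and per-frame image embeddings, shape (frames, d).
    In practice both come from a pretrained CLIP model; here they are
    just arrays.
    """
    def normalize(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    sims = normalize(frame_embs) @ normalize(text_emb)  # cosine per frame
    return float(sims.mean())

# Toy usage: one frame aligned with the text, one orthogonal to it.
text = np.array([1.0, 0.0, 0.0])
frames = np.array([[2.0, 0.0, 0.0],
                   [0.0, 3.0, 0.0]])
print(clip_style_score(text, frames))  # (1 + 0) / 2 = 0.5
```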
Implications and Future Directions
The implications of MotionDirector span both theoretical and practical domains. Theoretically, it advances diffusion model-based video generation by highlighting the importance of separating motion learning from appearance learning. Practically, this approach enables more flexible and user-controlled video generation, responding to personalized needs for specific motion concepts without constraining creativity to predefined visual templates.
Future work may explore extending this model to more complex motion scenarios involving multiple interacting objects or subjects. Additionally, further refinement of the appearance-debiasing process may enable transferring more sophisticated motions across diverse contexts without additional computational cost.
In conclusion, MotionDirector represents a significant step forward in customizing video motion generation, achieving a delicate balance between appearance diversity and motion specificity, and paving the way for more nuanced and adaptable video generation systems in research and industry applications.