Collaborative Video Diffusion: Consistent Multi-video Generation with Camera Control
Abstract: Research on video generation has recently made tremendous progress, enabling high-quality videos to be generated from text prompts or images. Adding control to the video generation process is an important goal moving forward, and recent approaches that condition video generation models on camera trajectories make strides towards it. Yet, it remains challenging to generate a video of the same scene from multiple different camera trajectories. Solutions to this multi-video generation problem could enable large-scale 3D scene generation with editable camera trajectories, among other applications. We introduce collaborative video diffusion (CVD) as an important step towards this vision. The CVD framework includes a novel cross-video synchronization module that promotes consistency between corresponding frames of the same video rendered from different camera poses using an epipolar attention mechanism. Trained on top of a state-of-the-art camera-control module for video generation, CVD generates multiple videos rendered from different camera trajectories with significantly better consistency than baselines, as shown in extensive experiments. Project page: https://collaborativevideodiffusion.github.io/.
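The abstract names the core mechanism, a cross-video synchronization module based on epipolar attention, without spelling it out. As a rough, hedged illustration only (not the authors' implementation), the PyTorch sketch below shows one way attention between corresponding frames of two jointly generated videos could be biased toward epipolar lines derived from the relative camera pose. The module name `EpipolarCrossVideoAttention`, the feature dimension, and the Gaussian band width `sigma` are all assumptions made for this example.

```python
# Illustrative sketch (not the authors' code): cross-video attention in which
# queries from one video's frame attend to the corresponding frame of another
# video, with logits biased toward pixels near the query's epipolar line.
import torch
import torch.nn as nn


def fundamental_matrix(K: torch.Tensor, R: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """F = K^-T [t]_x R K^-1 for the relative pose (R, t) mapping view A to view B."""
    tx = torch.zeros(3, 3, dtype=t.dtype, device=t.device)
    tx[0, 1], tx[0, 2] = -t[2], t[1]
    tx[1, 0], tx[1, 2] = t[2], -t[0]
    tx[2, 0], tx[2, 1] = -t[1], t[0]
    K_inv = torch.linalg.inv(K)
    return K_inv.transpose(-1, -2) @ tx @ R @ K_inv


class EpipolarCrossVideoAttention(nn.Module):
    """Single-head cross-attention from frame A features to frame B features,
    weighted by each key pixel's distance to the query's epipolar line."""

    def __init__(self, feat_dim: int = 64, sigma: float = 2.0):
        super().__init__()
        self.to_q = nn.Linear(feat_dim, feat_dim)
        self.to_k = nn.Linear(feat_dim, feat_dim)
        self.to_v = nn.Linear(feat_dim, feat_dim)
        self.sigma = sigma  # soft width of the epipolar band, in pixels

    def forward(self, feat_a, feat_b, Fmat, H, W):
        # feat_a, feat_b: (H*W, C) per-pixel features of two corresponding frames
        # (row-major over an H x W grid); Fmat: (3, 3) fundamental matrix A -> B.
        dev, dt = feat_a.device, feat_a.dtype
        ys, xs = torch.meshgrid(
            torch.arange(H, device=dev, dtype=dt),
            torch.arange(W, device=dev, dtype=dt),
            indexing="ij",
        )
        pix = torch.stack(
            [xs.flatten(), ys.flatten(), torch.ones(H * W, device=dev, dtype=dt)], dim=-1
        )  # (H*W, 3) homogeneous pixel coordinates

        lines = pix @ Fmat.T  # epipolar line l = F x_a for every query pixel
        dist = (lines @ pix.T).abs() / lines[:, :2].norm(dim=-1, keepdim=True).clamp_min(1e-8)
        bias = -(dist ** 2) / (2 * self.sigma ** 2)  # Gaussian log-weight per pixel pair

        q, k, v = self.to_q(feat_a), self.to_k(feat_b), self.to_v(feat_b)
        logits = q @ k.T / q.shape[-1] ** 0.5 + bias  # (H*W, H*W)
        return torch.softmax(logits, dim=-1) @ v


# Toy usage on random features with identity intrinsics and a small translation.
if __name__ == "__main__":
    H, W, C = 16, 16, 64
    K, R = torch.eye(3), torch.eye(3)
    t = torch.tensor([0.1, 0.0, 0.0])
    Fmat = fundamental_matrix(K, R, t)
    attn = EpipolarCrossVideoAttention(feat_dim=C)
    out = attn(torch.randn(H * W, C), torch.randn(H * W, C), Fmat, H, W)
    print(out.shape)  # torch.Size([256, 64])
```

In a full pipeline, a layer of this kind would sit inside the video diffusion U-Net's attention blocks and be applied to each pair of corresponding frames across the jointly generated videos; here it is shown standalone on random features purely to make the epipolar biasing concrete.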