Collaborative Video Diffusion: Consistent Multi-video Generation with Camera Control
Abstract: Research on video generation has recently made tremendous progress, enabling high-quality videos to be generated from text prompts or images. Adding control to the video generation process is an important goal moving forward, and recent approaches that condition video generation models on camera trajectories make strides towards it. Yet, it remains challenging to generate a video of the same scene from multiple different camera trajectories. Solutions to this multi-video generation problem could enable large-scale 3D scene generation with editable camera trajectories, among other applications. We introduce collaborative video diffusion (CVD) as an important step towards this vision. The CVD framework includes a novel cross-video synchronization module that promotes consistency between corresponding frames of the same video rendered from different camera poses using an epipolar attention mechanism. Trained on top of a state-of-the-art camera-control module for video generation, CVD generates multiple videos rendered from different camera trajectories with significantly better consistency than baselines, as shown in extensive experiments. Project page: https://collaborativevideodiffusion.github.io/.
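The abstract names the core mechanism, a cross-video synchronization module based on epipolar attention, without spelling it out. As a rough, hedged illustration only (not the authors' implementation), the PyTorch sketch below shows one way attention between corresponding frames of two jointly generated videos could be biased toward epipolar lines derived from the relative camera pose. The module name `EpipolarCrossVideoAttention`, the feature dimension, and the Gaussian band width `sigma` are all assumptions made for this example.

```python
# Illustrative sketch (not the authors' code): cross-video attention in which
# queries from one video's frame attend to the corresponding frame of another
# video, with logits biased toward pixels near the query's epipolar line.
import torch
import torch.nn as nn


def fundamental_matrix(K: torch.Tensor, R: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """F = K^-T [t]_x R K^-1 for the relative pose (R, t) mapping view A to view B."""
    tx = torch.zeros(3, 3, dtype=t.dtype, device=t.device)
    tx[0, 1], tx[0, 2] = -t[2], t[1]
    tx[1, 0], tx[1, 2] = t[2], -t[0]
    tx[2, 0], tx[2, 1] = -t[1], t[0]
    K_inv = torch.linalg.inv(K)
    return K_inv.transpose(-1, -2) @ tx @ R @ K_inv


class EpipolarCrossVideoAttention(nn.Module):
    """Single-head cross-attention from frame A features to frame B features,
    weighted by each key pixel's distance to the query's epipolar line."""

    def __init__(self, feat_dim: int = 64, sigma: float = 2.0):
        super().__init__()
        self.to_q = nn.Linear(feat_dim, feat_dim)
        self.to_k = nn.Linear(feat_dim, feat_dim)
        self.to_v = nn.Linear(feat_dim, feat_dim)
        self.sigma = sigma  # soft width of the epipolar band, in pixels

    def forward(self, feat_a, feat_b, Fmat, H, W):
        # feat_a, feat_b: (H*W, C) per-pixel features of two corresponding frames
        # (row-major over an H x W grid); Fmat: (3, 3) fundamental matrix A -> B.
        dev, dt = feat_a.device, feat_a.dtype
        ys, xs = torch.meshgrid(
            torch.arange(H, device=dev, dtype=dt),
            torch.arange(W, device=dev, dtype=dt),
            indexing="ij",
        )
        pix = torch.stack(
            [xs.flatten(), ys.flatten(), torch.ones(H * W, device=dev, dtype=dt)], dim=-1
        )  # (H*W, 3) homogeneous pixel coordinates

        lines = pix @ Fmat.T  # epipolar line l = F x_a for every query pixel
        dist = (lines @ pix.T).abs() / lines[:, :2].norm(dim=-1, keepdim=True).clamp_min(1e-8)
        bias = -(dist ** 2) / (2 * self.sigma ** 2)  # Gaussian log-weight per pixel pair

        q, k, v = self.to_q(feat_a), self.to_k(feat_b), self.to_v(feat_b)
        logits = q @ k.T / q.shape[-1] ** 0.5 + bias  # (H*W, H*W)
        return torch.softmax(logits, dim=-1) @ v


# Toy usage on random features with identity intrinsics and a small translation.
if __name__ == "__main__":
    H, W, C = 16, 16, 64
    K, R = torch.eye(3), torch.eye(3)
    t = torch.tensor([0.1, 0.0, 0.0])
    Fmat = fundamental_matrix(K, R, t)
    attn = EpipolarCrossVideoAttention(feat_dim=C)
    out = attn(torch.randn(H * W, C), torch.randn(H * W, C), Fmat, H, W)
    print(out.shape)  # torch.Size([256, 64])
```

In a full pipeline, a layer of this kind would sit inside the video diffusion U-Net's attention blocks and be applied to each pair of corresponding frames across the jointly generated videos; here it is shown standalone on random features purely to make the epipolar biasing concrete.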