FIFO-Diffusion: Generating Infinite Videos from Text without Training
Abstract: We propose a novel inference technique based on a pretrained diffusion model for text-conditional video generation. Our approach, called FIFO-Diffusion, is conceptually capable of generating infinitely long videos without additional training. This is achieved by iteratively performing diagonal denoising, which simultaneously processes a series of consecutive frames with increasing noise levels in a queue; our method dequeues a fully denoised frame at the head while enqueuing a new random noise frame at the tail. However, diagonal denoising is a double-edged sword: the frames near the tail can take advantage of cleaner frames through forward reference, but this strategy induces a discrepancy between training and inference. Hence, we introduce latent partitioning to reduce the training-inference gap and lookahead denoising to leverage the benefit of forward referencing. In practice, given a baseline model, FIFO-Diffusion consumes a constant amount of memory regardless of the target video length, and it is well suited to parallel inference on multiple GPUs. We demonstrate promising results and the effectiveness of the proposed methods on existing text-to-video generation baselines. Generated video examples and source code are available at our project page.
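The queue mechanics described in the abstract can be illustrated with a minimal toy sketch. It is not the authors' implementation: `Frame`, `toy_denoise`, and `fifo_generate` are hypothetical names, each latent frame is reduced to a single scalar, the denoiser is a stand-in for one step of a pretrained video diffusion model, and the initial queue contents are an assumption made only so that noise levels increase from head to tail. Latent partitioning and lookahead denoising are not modeled.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class Frame:
    latent: float   # toy scalar standing in for a latent video frame
    level: int      # remaining noise level; 0 means fully denoised
    steps: int = 0  # denoising steps this frame has received in the queue


def toy_denoise(frame):
    # Hypothetical stand-in for one denoising step of a pretrained model:
    # it shrinks the latent and lowers the noise level by one.
    frame.latent *= (frame.level - 1) / frame.level
    frame.level -= 1
    frame.steps += 1


def fifo_generate(num_frames, queue_len, seed=0):
    rng = random.Random(seed)
    # Assumed initialization: noise levels 1..queue_len from head to tail.
    queue = deque(Frame(rng.gauss(0.0, 1.0), level)
                  for level in range(1, queue_len + 1))
    peak_len = len(queue)
    outputs = []
    while len(outputs) < num_frames:
        # Diagonal denoising: every frame in the queue advances one step,
        # each from its own (different) noise level.
        for frame in queue:
            toy_denoise(frame)
        # The head has reached level 0, so it is dequeued as output.
        outputs.append(queue.popleft())
        # A fresh random-noise frame is enqueued at the tail.
        queue.append(Frame(rng.gauss(0.0, 1.0), queue_len))
        peak_len = max(peak_len, len(queue))
    return outputs, peak_len


frames, peak_len = fifo_generate(num_frames=10, queue_len=4)
```

In this sketch the queue never grows beyond `queue_len`, whatever value `num_frames` takes, which mirrors the constant-memory property claimed above. Every frame enqueued as fresh noise receives exactly `queue_len` denoising steps before it leaves the queue.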