SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation
Abstract: Human beings are endowed with a complementary learning system, which bridges the slow learning of general world dynamics with the fast storage of episodic memory from new experiences. Previous video generation models, however, primarily focus on slow learning by pre-training on vast amounts of data, overlooking the fast learning phase crucial for episodic memory storage. This oversight leads to inconsistencies across temporally distant frames when generating longer videos, as these frames fall beyond the model's context window. To this end, we introduce SlowFast-VGen, a novel dual-speed learning system for action-driven long video generation. Our approach incorporates a masked conditional video diffusion model for the slow learning of world dynamics, alongside an inference-time fast learning strategy based on a temporal LoRA module. Specifically, the fast learning process updates its temporal LoRA parameters based on local inputs and outputs, thereby efficiently storing episodic memory in its parameters. We further propose a slow-fast learning loop algorithm that seamlessly integrates the inner fast learning loop into the outer slow learning loop, enabling the recall of prior multi-episode experiences for context-aware skill learning. To facilitate the slow learning of an approximate world model, we collect a large-scale dataset of 200k videos with language action annotations, covering a wide range of scenarios. Extensive experiments show that SlowFast-VGen outperforms baselines across various metrics for action-driven video generation, achieving an FVD score of 514 compared to 782, and maintaining consistency in longer videos, with an average of 0.37 scene cuts versus 0.89. The slow-fast learning loop algorithm also significantly improves performance on long-horizon planning tasks. Project Website: https://slowfast-vgen.github.io
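The core fast-learning idea in the abstract, updating only low-rank LoRA parameters on local inputs and outputs while the slow-learned weights stay frozen, can be sketched in miniature. The following is a hypothetical linear-model illustration in NumPy; the names `W`, `A`, `B`, `forward`, and `fast_update`, the shapes, and the linear dynamics are all illustrative assumptions, not the paper's actual diffusion architecture or temporal LoRA design:

```python
import numpy as np

# Hypothetical sketch: a frozen, slow-learned weight W plus a trainable
# low-rank (LoRA-style) adapter B @ A that is fit at inference time to a
# local (input, output) episode. All names and shapes are assumptions.

rng = np.random.default_rng(0)
d, r, n = 8, 2, 4                      # feature dim, LoRA rank, episode size

W = rng.normal(size=(d, d))            # frozen slow-learned weights
A = rng.normal(size=(r, d)) * 0.1      # LoRA down-projection (trainable)
B = np.zeros((d, r))                   # LoRA up-projection (zero init)

def forward(x, A, B):
    # Effective weight is W + B @ A; only A and B store episodic memory.
    return x @ (W + B @ A).T

def fast_update(x, y, A, B, lr=0.05, steps=200):
    """Fit only the LoRA parameters to a local (input, output) episode."""
    for _ in range(steps):
        err = forward(x, A, B) - y       # prediction error on the episode
        g = err.T @ x                    # gradient w.r.t. effective weight
        gB, gA = g @ A.T, B.T @ g        # chain rule through B @ A
        B, A = B - lr * gB, A - lr * gA  # W itself is never touched
    return A, B

# A local episode whose dynamics deviate slightly from the slow model.
x = rng.normal(size=(n, d))
y = x @ (W + rng.normal(size=(d, d)) * 0.1).T

loss_before = np.mean((forward(x, A, B) - y) ** 2)
A, B = fast_update(x, y, A, B)
loss_after = np.mean((forward(x, A, B) - y) ** 2)
```

After `fast_update`, the episode-specific error drops while `W` is untouched, mirroring how episodic memory can be stored in a small set of LoRA parameters without overwriting the slowly learned general dynamics.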