Long-Horizon Video Generation for Robotics

Determine training and inference techniques that enable diffusion and flow-matching video generation models, used as embodied world models in robotics, to produce minutes-long videos with sustained temporal coherence and physical consistency. Such techniques must avoid the artifacts introduced by stitching multiple short clips and overcome the current limit of only a few seconds of generation.

Background

To serve as effective world models for robotic manipulation and planning, video models must forecast over task durations that are often minutes long. However, state-of-the-art systems currently generate only short clips (e.g., 8–10 seconds), which is insufficient for informed decision-making in robotics.
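
As a concrete point of reference, the sketch below shows how a single short clip is typically sampled from a flow-matching model by Euler-integrating a learned velocity field from noise to data. VelocityNet, the tensor shapes, and the step count are illustrative assumptions for this sketch, not any particular system's architecture; real systems use large spatiotemporal transformers with text and robot-state conditioning.

    import torch

    class VelocityNet(torch.nn.Module):
        # Hypothetical stand-in for a learned velocity field v_theta(x_t, t);
        # real systems use large spatiotemporal diffusion/flow transformers.
        def __init__(self, channels=8):
            super().__init__()
            self.net = torch.nn.Conv3d(channels, channels, kernel_size=3, padding=1)

        def forward(self, x, t):
            # A real model would also condition on t, language goals, robot state.
            return self.net(x)

    @torch.no_grad()
    def sample_clip(model, frames=16, channels=8, height=32, width=32, steps=50):
        # Euler-integrate dx/dt = v_theta(x, t) from noise (t=0) to data (t=1).
        x = torch.randn(1, channels, frames, height, width)  # noise for ONE short clip
        dt = 1.0 / steps
        for i in range(steps):
            t = torch.full((1,), i * dt)
            x = x + dt * model(x, t)
        return x  # latent clip; after decoding, only a few seconds of video

    clip = sample_clip(VelocityNet())

Note that the clip length is fixed at sampling time by the frames dimension of the noise tensor, which is why a single forward rollout cannot simply be run for minutes.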

Existing pipelines produce longer sequences by stitching multiple short clips, but this often introduces artifacts that degrade temporal coherence and physical realism; a sketch of this failure mode follows below. Despite advances in compression, memory mechanisms, and hierarchical generation, reliably scaling video models to longer horizons remains unresolved for robotics applications.
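
To make the failure mode concrete, here is a minimal sketch of the stitching approach under assumed conventions: each new clip is sampled with its leading frames held to the tail of the previous clip (a replacement-style conditioning heuristic for rectified flow, an assumption of this sketch), and clips are concatenated along time. All names are illustrative. The point is that each clip only sees a short overlap window as "memory", so small per-clip errors in appearance and dynamics compound over a minutes-long rollout, and any residual mismatch at the seams appears as stitching artifacts.

    import torch

    @torch.no_grad()
    def extend_by_stitching(model, first_clip, num_clips, overlap=4, steps=50):
        # Naive long-horizon rollout: condition each new clip on the last
        # `overlap` frames of the previous one, then concatenate along time.
        # Tensors are (batch, channels, frames, height, width).
        clips = [first_clip]
        dt = 1.0 / steps
        for _ in range(num_clips - 1):
            context = clips[-1][:, :, -overlap:]  # the model's only "memory"
            eps = torch.randn_like(context)
            x = torch.randn_like(clips[-1])
            for i in range(steps):
                t_val = i * dt
                # Replacement-style conditioning (assumption): hold the overlap
                # region on the noise-to-context interpolant of rectified flow.
                x[:, :, :overlap] = (1.0 - t_val) * eps + t_val * context
                t = torch.full((1,), t_val)
                x = x + dt * model(x, t)
            x[:, :, :overlap] = context  # snap overlap frames back to context
            clips.append(x)
        # Drop duplicated overlap frames when concatenating along time.
        return torch.cat([clips[0]] + [c[:, :, overlap:] for c in clips[1:]], dim=2)

    # Usage with the VelocityNet sketch above:
    # model = VelocityNet()
    # video = extend_by_stitching(model, sample_clip(model), num_clips=8)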

References

While state-of-the-art video models excel at short-duration generation, scaling these models to longer horizons for robotics tasks remains an open challenge.

Video Generation Models in Robotics: Applications, Research Challenges, Future Directions (arXiv:2601.07823, Mei et al., 12 Jan 2026), Subsection 7.8 (Long Video Generation)