Stability of long-horizon video generation in autoregressive diffusion models

Develop training and architectural techniques that improve the stability of long-horizon video generation in autoregressive diffusion models, including the distilled causal-attention version of StereoWorld, so that visual quality does not noticeably degrade as sequence length grows, in both stereo and monocular settings.

Background

StereoWorld is distilled from a bidirectional-attention diffusion model into a causal, autoregressive generator, following Self-Forcing-style distillation, to enable long-horizon stereo video synthesis with a KV-cache. This yields a significant speed-up (from 0.49 FPS to approximately 5 FPS) and removes the 49-frame generation limit.
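To make the mechanism concrete, below is a minimal sketch of frame-level causal attention with a KV-cache. It is an illustration, not the StereoWorld implementation: the class and function names are hypothetical, and a real model interleaves this with conditioning signals and a few-step distilled denoiser.

```python
import torch
import torch.nn.functional as F

class KVCache:
    """Accumulates keys/values of past frames so each new frame is generated
    without re-running attention over the full history."""
    def __init__(self):
        self.k = None
        self.v = None

    def append(self, k_new, v_new):
        # Grow the cached history along the sequence dimension.
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=1)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=1)
        return self.k, self.v

def causal_attention_step(x_t, w_q, w_k, w_v, cache):
    """One frame-level step: the current frame's tokens attend to themselves
    and to all cached (earlier) frames, never to future frames."""
    q = x_t @ w_q                                   # (B, tokens_per_frame, D)
    k, v = cache.append(x_t @ w_k, x_t @ w_v)       # history grows each frame
    return F.scaled_dot_product_attention(q, k, v)  # queries only see the past

# Rollout sketch: each frame costs one cache append, so generation is
# autoregressive and not capped by a fixed attention window.
B, tokens, D = 1, 16, 64
w_q, w_k, w_v = (torch.randn(D, D) for _ in range(3))
cache = KVCache()
for frame_idx in range(100):             # horizons beyond 49 frames are fine
    x_t = torch.randn(B, tokens, D)      # stand-in for the frame's latent tokens
    out = causal_attention_step(x_t, w_q, w_k, w_v, cache)
```

In Self-Forcing-style frame-causal models, attention is typically bidirectional within a frame (or chunk) and causal across frames; that split is what makes KV-caching valid, since cached keys and values never need to be revisited.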

Despite these gains, the authors observe quality degradation as the generated video length increases. They note that this issue is also present in prior work (e.g., Self-Forcing), and explicitly identify stabilizing long-horizon generation as an unresolved challenge affecting both monocular and stereo video synthesis.
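One way to make this degradation measurable is to score each generated frame against a reference and watch the curve over frame index: a stable rollout stays flat, while a drifting one declines. The helper below is a hypothetical diagnostic, not from the paper; `metric_fn` stands in for any per-frame scorer such as PSNR or LPIPS.

```python
import torch

def per_frame_quality(generated, reference, metric_fn):
    """Score frame t of a rollout against frame t of a reference clip.

    generated, reference: (B, T, C, H, W) video tensors.
    metric_fn: any per-frame scorer (e.g. PSNR or LPIPS), assumed available.
    A downward trend over t in the returned list is the long-horizon
    degradation described above.
    """
    T = generated.shape[1]
    return [float(metric_fn(generated[:, t], reference[:, t])) for t in range(T)]
```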

References

"Improving the stability of long-horizon video generation therefore remains an open challenge shared by both monocular and stereo video synthesis."

Stereo World Model: Camera-Guided Stereo Video Generation (arXiv:2603.17375, Sun et al., 18 Mar 2026), Supplementary Material, Section "Long Video Distillation".