Long-duration video generation

Develop methods for generating long-duration videos that remain temporally consistent across frames over extended horizons, including when only a short conditioning history is available, a setting in which model performance currently degrades.

Background

Mei et al. introduce C-Cubed, an uncertainty quantification method for action-conditioned controllable video diffusion models. Although the approach provides calibrated, dense confidence estimates, the authors note that the temporal consistency of these estimates, and of the generated video content, can degrade over longer horizons when the model is conditioned on limited history.

In the limitations section, the authors state explicitly that long-duration video generation remains an open research problem. The motivation is a practical issue observed in their setting: given a shorter historical context as input, the method may lose track of uncertain video patches over time. Sustaining temporal coherence over extended sequences therefore requires further advances.

References

Long-duration video generation remains an open research problem, which will be explored in future work.

World Models That Know When They Don't Know: Controllable Video Generation with Calibrated Uncertainty (2512.05927 - Mei et al., 5 Dec 2025) in Section 7 (Limitations and Future Work), Long-Duration Video Generation