Bound KV Cache While Preserving Large Effective Context in Autoregressive Video Generation

Develop methods for autoregressive video diffusion models that maintain a large effective attention context window while strictly bounding the per-layer key–value (KV) cache size, so that long-range coherence is preserved without unbounded memory growth during minute-scale video generation.

Background

Autoregressive video generation caches key–value pairs from previously generated blocks to provide long-range context, but the KV cache grows linearly with video length. Truncation or sliding windows cap memory but degrade long-range coherence; retaining all history leads to out-of-memory failures for minute-scale videos. Balancing these competing demands is presented as a critical open issue motivating the paper’s design.
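The trade-off described above can be made concrete with a minimal sliding-window sketch. The class below is illustrative only (it stores placeholder block labels rather than real key/value tensors, and `SlidingWindowKVCache` and `max_blocks` are names invented here): memory is strictly bounded, but every evicted block is permanently lost to attention, which is exactly the long-range coherence cost of truncation.

```python
from collections import deque


class SlidingWindowKVCache:
    """Toy per-layer KV cache with a hard size bound.

    Each entry stands in for the (key, value) tensors of one generated
    block. `max_blocks` caps memory, but evicted blocks can never be
    attended to again, illustrating the coherence/memory trade-off.
    """

    def __init__(self, max_blocks: int):
        self.max_blocks = max_blocks
        self.blocks = deque()  # oldest block at the left

    def append(self, kv_block):
        self.blocks.append(kv_block)
        if len(self.blocks) > self.max_blocks:
            self.blocks.popleft()  # truncation: long-range context dropped

    def context(self):
        return list(self.blocks)


cache = SlidingWindowKVCache(max_blocks=3)
for step in range(5):
    cache.append(f"kv_block_{step}")

print(cache.context())  # only the 3 most recent blocks survive
```

Retaining all history instead would make `blocks` grow linearly with video length, which is the out-of-memory failure mode for minute-scale generation.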

The paper proposes PackForcing, which addresses this tension through a three-partition KV cache and learned compression, but it explicitly frames the general challenge of simultaneously preserving a large effective context and bounding KV cache size as an open problem.
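To illustrate how a partitioned cache can bound memory without discarding old context outright, here is a hypothetical sketch. The partition roles (an exact "sink" prefix, a compressed middle summary, an exact recent window) and the mean-pooling `compress` stub are assumptions made for illustration, not the paper's actual design; PackForcing uses a learned compression module whose details are not given here.

```python
def compress(kv_blocks, keep):
    """Hypothetical stand-in for learned compression: mean-pools the
    blocks down to at most `keep` summary values."""
    keep = min(keep, len(kv_blocks))
    if keep == 0:
        return []
    size = -(-len(kv_blocks) // keep)  # ceil division -> group size
    groups = [kv_blocks[i:i + size] for i in range(0, len(kv_blocks), size)]
    return [sum(g) / len(g) for g in groups]


class ThreePartitionKVCache:
    """Illustrative bounded cache: a fixed sink prefix, a compressed
    middle summary, and an exact recent window. Total context size
    stays at sink + summary + window regardless of video length."""

    def __init__(self, sink=1, summary=2, window=3):
        self.sink, self.summary_size, self.window = sink, summary, window
        self.blocks = []   # exact blocks, oldest first
        self.summary = []  # compressed middle partition

    def append(self, kv_block):
        self.blocks.append(kv_block)
        # Keep sink + window exact; fold any overflow into the summary.
        overflow = len(self.blocks) - (self.sink + self.window)
        if overflow > 0:
            middle = self.blocks[self.sink:self.sink + overflow]
            del self.blocks[self.sink:self.sink + overflow]
            self.summary = compress(self.summary + middle, self.summary_size)

    def context(self):
        return self.blocks[:self.sink] + self.summary + self.blocks[self.sink:]
```

Unlike pure truncation, evicted middle blocks still contribute (in lossy, summarized form) to the effective context, while the cache size never exceeds a fixed bound.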

References

"Maintaining a large effective context window while strictly bounding the KV cache size remains a critical open problem."

PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference (2603.25730 - Mao et al., 26 Mar 2026), Section 1 (Introduction)