Learned Importance Predictors for Dynamic Context Selection

Develop learned importance prediction models for PackForcing’s dynamic context selection that more reliably capture visual saliency than attention-based affinity scoring, thereby improving which compressed mid-range blocks are retrieved during generation.

Background

PackForcing prioritizes historical mid tokens via dynamic context selection, ranking compressed mid-range blocks by query–key affinity derived from attention scores. While efficient, this heuristic may not fully reflect visual saliency or task relevance.
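
As a minimal sketch of what affinity-based selection could look like, assume each compressed block is summarized by a single key vector and scored against the current query with a scaled dot product. The summary keys, the function names `affinity_scores` and `select_top_k`, and the toy values are illustrative assumptions, not PackForcing's actual implementation.

```python
import math

def affinity_scores(query, block_keys):
    """Scaled dot-product affinity between one query vector and each block's summary key."""
    scale = 1.0 / math.sqrt(len(query))
    return [scale * sum(q * k for q, k in zip(query, key)) for key in block_keys]

def select_top_k(scores, k):
    """Return indices of the k highest-scoring blocks, restored to temporal order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical toy data: 4 compressed mid-range blocks with 2-dimensional summary keys.
query = [1.0, 0.0]
block_keys = [[0.9, 0.1], [-0.5, 0.8], [0.2, 0.2], [1.5, -0.3]]

scores = affinity_scores(query, block_keys)
selected = select_top_k(scores, k=2)
print(selected)  # [0, 3]
```

With these toy values, blocks 0 and 3 have the highest affinity and are retained. A block that is visually salient but poorly aligned with the current query would be dropped, which is the failure mode this open problem targets.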

The authors explicitly flag this limitation and propose learned importance predictors as an open direction for improving selection quality beyond attention-based scoring.
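
One hedged illustration of a learned importance predictor is a small logistic scorer trained on per-block features. Everything here is assumed for illustration: the two features (an attention affinity and a hypothetical saliency proxy), the binary usefulness labels, and the names `predict` and `train`. The source does not specify a model architecture or a training signal.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, features):
    """Predicted importance in (0, 1) for one block's feature vector."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)

def train(examples, labels, lr=0.5, epochs=500):
    """Fit a logistic scorer with plain stochastic gradient descent on the log loss."""
    weights = [0.0] * len(examples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            err = predict(weights, bias, x) - y  # gradient of the log loss w.r.t. the logit
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias

# Hypothetical per-block features: [attention affinity, saliency proxy].
examples = [[0.9, 0.1], [0.8, 0.9], [0.1, 0.8], [0.2, 0.1]]
# Hypothetical labels: a block counts as useful when its saliency proxy is high,
# regardless of its attention affinity.
labels = [0, 1, 1, 0]

weights, bias = train(examples, labels)
importance = [predict(weights, bias, x) for x in examples]
# Rank blocks by learned importance, most important first.
ranking = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
```

In this toy setup, block 2 has low attention affinity but is labeled useful, so a purely affinity-based ranking would tend to miss it while the learned scorer ranks it above the unlabeled blocks. In practice the difficult part is obtaining a supervision signal for block usefulness, which the source leaves open.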

References

Several directions remain open: (i)~the fixed compression ratio ($128\times$ volume / ${\sim}32\times$ token) could be made adaptive to scene complexity; (ii)~attention-based importance scoring may not capture all aspects of visual saliency---learned importance predictors could help; (iii)~scaling to higher resolutions (e.g., $1920{\times}1080$) requires investigating the interaction between spatial compression and quality.

PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference (2603.25730 - Mao et al., 26 Mar 2026) in Appendix, Extended Discussion on Limitations