Using historical context without harming present-scene perception in streaming video understanding

Develop streaming video understanding techniques for vision–language models that operate under causal, online constraints and that leverage historical video context while preserving current-scene understanding. Specifically, determine how to integrate past frames or memory representations without degrading real-time perception performance.

Background

The paper introduces SimpleStream, a minimal recent-context baseline that answers each streaming query with only the last N frames and an off-the-shelf vision–LLM. Across OVO-Bench and StreamingBench, SimpleStream outperforms recent streaming systems while using less memory and maintaining competitive latency.
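The last-N-frames policy described above can be sketched as a fixed-size frame buffer whose contents form the entire query context. This is a hypothetical illustration of the idea, not the authors' implementation; the class and method names (`RecentContextBuffer`, `query_context`) are invented for this example.

```python
from collections import deque


class RecentContextBuffer:
    """Minimal sketch of a SimpleStream-style recent-context policy:
    keep only the last N frames and answer each query from them.
    (Hypothetical illustration; names are not from the paper.)"""

    def __init__(self, n_frames: int):
        # deque with maxlen evicts the oldest frame automatically,
        # so memory stays bounded regardless of stream length.
        self.frames = deque(maxlen=n_frames)

    def add_frame(self, frame):
        self.frames.append(frame)

    def query_context(self):
        # The context handed to the off-the-shelf vision-LLM is just
        # the most recent frames, in temporal order.
        return list(self.frames)


# Simulate a 10-frame stream with a window of N = 4.
buf = RecentContextBuffer(n_frames=4)
for t in range(10):
    buf.add_frame(f"frame_{t}")

print(buf.query_context())  # only the last 4 frames remain
```

Because the buffer is bounded, memory use is constant in stream length, which matches the paper's observation that SimpleStream uses less memory than systems maintaining long or retrieved histories.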

Analyses show that adding longer or retrieved historical context is not uniformly beneficial and often induces a perception–memory trade-off: gains on memory-oriented queries frequently coincide with degraded current-scene perception. This motivates an explicit open problem: exploiting history without harming present-scene understanding.

References

"The central open problem is not how to add more memory, but how to use history without degrading current-scene understanding."

A Simple Baseline for Streaming Video Understanding (2604.02317 - Shen et al., 2 Apr 2026), Conclusion (Section 7)