- The paper introduces FlexMem, a training-free visual memory mechanism that iteratively encodes video into context and local memory for scalable long video understanding.
- It employs a dual-pathway compression strategy using cross-modal attention to enhance feature saliency and maintain temporal context without retraining.
- FlexMem demonstrates resource efficiency by processing over 1000 frames on a single GPU while significantly improving accuracy across multiple long-video benchmarks.
Scaling Long Video Understanding for MLLMs via Visual Memory: An Analysis of FlexMem
Introduction
The exponential growth in the complexity and application scope of Multimodal LLMs (MLLMs) has foregrounded the challenge of long video understanding. Existing MLLMs are fundamentally constrained by input sequence length and computational/memory overhead, limiting their applicability to real-world scenarios involving extensive temporal and contextual dependencies. The paper "Scaling the Long Video Understanding of Multimodal LLMs via Visual Memory Mechanism" (2603.29252) addresses these challenges through a novel, training-free visual memory architecture termed Flexible Memory (FlexMem).
Motivation and Problem Context
Current leading strategies for handling long video sequences in MLLMs can be categorized broadly into retrieval-augmented generation (RAG) approaches and visual feature compression methods. RAG methods, although effective in localizing salient events or frames, struggle with holistic, continual temporal understanding due to their reliance on a knowledge base abstraction and query-frame matching. Visual compression alternatives increase context length by representing long sequences more compactly, yet scale linearly in resource use and thus remain limited by hardware constraints.
The central problem is: How can one enable MLLMs to process videos of theoretically infinite length for arbitrary question answering, without incurring prohibitive computational or architectural burdens?
FlexMem: Technical Contributions
FlexMem is predicated on the principle of emulating a human-like memory mechanism—processing video incrementally, forming and updating structured memories, and retrieving the most relevant fragments for downstream reasoning. The core innovations of FlexMem are:
- Iterative Visual Memory Construction: The video is parsed into spatio-temporal clips, and at each iteration visual information is encoded into two types of compressed memory: context memory (C_i) for propagating sequential history, and local memory (M_i) as a persistent summary stored in a visual memory bank (M_bank).
- Dual-Pathway Compression (DPC): Distinct attention-driven token selection metrics are applied for context and local memory generation. Context compression maximizes historical information transfer (context aggregation score), whereas local compression enhances representational saliency within each clip (local saliency score). Both are derived via cross-modal attention matrices.
- Flexible Memory Reading Mechanisms:
  - Encoding-Based: For each question, cross-attention between the query and stored memory determines relevance, enabling precise memory recall for answer generation.
  - MemIndex (Fast Memory Indexing): To avoid redundant inference across multiple queries, a compact, independently computed index approximates the encoding-based relevance metric, selecting representative layers and tokens for efficient retrieval with negligible performance loss.
- Scalability and Training-Free Plug-and-Play: FlexMem does not require model retraining or architecture modifications. It operates purely at the inference level, making it practical for widespread deployment.
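The write path above can be sketched in a few lines. This is a minimal, illustrative toy using random features: the exact scoring functions, shapes, and hyperparameters (D, K_CTX, K_LOC, the sum-over-attention scores) are assumptions standing in for the paper's context aggregation and local saliency scores, not FlexMem's actual implementation.

```python
# Toy sketch of iterative memory construction with dual-pathway,
# attention-driven token selection. All shapes and scoring choices
# are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

D = 64          # token feature dimension (assumed)
K_CTX = 8       # tokens kept as context memory per iteration (assumed)
K_LOC = 4       # tokens kept as local memory per clip (assumed)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys):
    """Row-stochastic scaled dot-product attention of queries over keys."""
    return softmax(queries @ keys.T / np.sqrt(keys.shape[1]))

def compress_clip(clip_tokens, context_memory, text_tokens):
    """One iteration: produce new context memory C_i and local memory M_i
    for a clip via two different attention-based token scores."""
    # Context pathway: keep clip tokens the running context memory attends
    # to most (a stand-in for the "context aggregation score").
    attn_ctx = cross_attention(context_memory, clip_tokens)  # (|C|, |clip|)
    ctx_score = attn_ctx.sum(axis=0)
    new_context = clip_tokens[np.argsort(-ctx_score)[:K_CTX]]

    # Local pathway: keep clip tokens most salient under cross-modal
    # attention from text tokens (a stand-in for the "local saliency score").
    attn_loc = cross_attention(text_tokens, clip_tokens)     # (|T|, |clip|)
    loc_score = attn_loc.sum(axis=0)
    local_memory = clip_tokens[np.argsort(-loc_score)[:K_LOC]]
    return new_context, local_memory

# Iterate over clips, carrying context memory forward and appending each
# clip's local memory to a persistent memory bank M_bank.
text_tokens = rng.standard_normal((6, D))
context = rng.standard_normal((K_CTX, D))   # initial context memory
memory_bank = []
for _ in range(5):                          # 5 clips of 32 tokens each
    clip = rng.standard_normal((32, D))
    context, local = compress_clip(clip, context, text_tokens)
    memory_bank.append(local)

memory_bank = np.concatenate(memory_bank)   # grows by K_LOC tokens per clip
print(context.shape, memory_bank.shape)     # (8, 64) (20, 64)
```

Note the key property: the context memory stays a fixed size regardless of video length, while the memory bank grows only by K_LOC tokens per clip, which is what makes the per-iteration footprint constant.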
Quantitative Results and Empirical Claims
Across five established long video benchmarks (MLVU, TimeScope, LVBench, Video-MME, LongVideoBench), FlexMem yields significant gains:
- Frame Scalability: Processes >1000 frames on a single GPU (NVIDIA RTX 3090, 24 GB VRAM) with a constant memory footprint, outperforming AdaRETAKE and RAG-based methods by 3.9–5.2% on LLaVA-Video (2603.29252).
- Benchmark Gains: Achieves +32.2% absolute improvement on TimeScope, and +19.7% on LVBench for LLaVA-Video over the baseline, with performance approaching or exceeding closed-source SOTA models like GPT-4o and Gemini-1.5 Pro.
- Resource Efficiency: Maintains 99.5% of unconstrained performance under strict hardware constraints, whereas other methods degrade significantly due to input compression or sampling strategy limitations.
- Online/Streaming QA: When integrated with MemIndex, FlexMem materially improves streaming and historical memory tasks (e.g., OVOBench backward tracing).
These results are significant because they are achieved without additional model training or specialized hardware.
Theoretical and Practical Implications
Theoretical Impact
FlexMem demonstrates that vision-LLMs can be augmented with iterative, structure-preserving memory mechanisms to overcome fixed-sequence bottlenecks. The dual-pathway framework aligns with the distinction between working and episodic memory in cognitive architectures. These techniques open new research avenues for:
- Memory-augmented LLM architectures that decouple memory formation and retrieval.
- Attention-based adaptive compression for scalable, information-preserving VL understanding in infinite-length contexts.
- Task-specific memory indexing mechanisms bridging the gap between offline and online/streaming video-language reasoning.
Practical Scenarios and Deployment
- Enterprise Video Analytics: Surveillance, meeting summarization, and event detection systems benefit directly from the ability to handle hour-long or continuous feeds.
- Streaming and Real-Time Reasoning: MemIndex allows multiple simultaneous queries with minimal recomputation, suiting scenarios like dense temporal querying, interactive video dialogue, and forensic video search.
- Low-Resource Edge Applications: FlexMem’s training-free, plug-and-play nature is particularly advantageous for edge deployment with fixed computational budgets.
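The MemIndex-style read path described earlier, which makes repeated streaming queries cheap, can be sketched as follows. The index construction rule (keeping the highest-norm tokens) and cosine-similarity scoring are deliberately simplified assumptions for illustration; the paper's method statistically fits the encoding-based metric and selects representative layers and tokens.

```python
# Toy sketch of fast, index-based memory retrieval: build a compact index
# once, then score it cheaply per query instead of re-running a full
# encoding pass for every question. Details are assumptions.
import numpy as np

rng = np.random.default_rng(1)
D = 64

# Memory bank of local memories accumulated over a long video (assumed shape).
memory_bank = rng.standard_normal((200, D))

def build_index(bank, n_repr=32):
    """Precompute a compact index: here the n_repr highest-norm tokens stand
    in for statistically selected representative tokens (an assumption)."""
    keep = np.argsort(-np.linalg.norm(bank, axis=1))[:n_repr]
    return bank[keep], keep

index_tokens, index_ids = build_index(memory_bank)

def retrieve(query_tokens, index_tokens, index_ids, top_k=5):
    """Cheap per-query step: rank indexed tokens by mean cosine similarity
    to the query tokens and return their positions in the memory bank."""
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    m = index_tokens / np.linalg.norm(index_tokens, axis=1, keepdims=True)
    scores = (q @ m.T).mean(axis=0)        # one score per indexed token
    best = np.argsort(-scores)[:top_k]
    return index_ids[best]

# Many queries reuse the same precomputed index; only the scoring step
# runs per question, which suits streaming / multi-query settings.
for _ in range(3):
    query = rng.standard_normal((4, D))
    hits = retrieve(query, index_tokens, index_ids)
    assert len(hits) == 5
```

The design point this illustrates is the amortization: index construction is paid once per video, so answering N questions costs N cheap similarity scans rather than N full memory-encoding passes.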
Limitations and Future Outlook
FlexMem, while training-free, depends on the quality of underlying MLLM cross-modal alignment and attention maps. Potential areas for further exploration include:
- Injecting explicit temporal reasoning or hierarchical memory for further scalability.
- End-to-end differentiable memory read-write strategies, tightly coupled with model training, for specialized domains.
- Extension of the dual-pathway approach to audio and sensor modalities, generalizing to all forms of multimodal sequential reasoning.
Beyond these, the methodology suggests broader directions in lifelong and continual learning for MLLMs and robust handling of data streams with variable semantics and noise.
Conclusion
FlexMem marks an advance in the long-standing challenge of long video understanding for MLLMs by introducing a principled, efficient, and model-agnostic visual memory abstraction. It achieves measurable performance improvements under realistic hardware constraints and enables previously infeasible long-horizon video reasoning. As MLLMs evolve towards generalist agents operating in temporally extended environments, such memory architectures are poised to become foundational, informing both theory and application of multimodal AI.
Reference: "Scaling the Long Video Understanding of Multimodal LLMs via Visual Memory Mechanism" (2603.29252).