WorldMM: Dynamic Multimodal Memory Agent
- The paper introduces WorldMM, a dynamic multimodal memory agent integrating episodic, semantic, and visual memory for long-horizon video analysis.
- It employs adaptive retrieval strategies and multi-scale temporal indexing to iteratively refine responses based on query context.
- Experimental results demonstrate significant accuracy gains on video question-answering benchmarks compared to single-step methods.
WorldMM is a dynamic multimodal memory agent architected to address the unique demands of long video understanding, particularly in scenarios that require reasoning over hours or days of visual and semantic content. Unlike earlier approaches that rely predominantly on textual summaries and fixed temporal scales, WorldMM integrates multi-scale episodic, semantic, and high-fidelity visual memories under the control of an adaptive retrieval agent, resulting in significant performance improvements on long-horizon video question-answering tasks (Yeo et al., 2 Dec 2025).
1. System Architecture and Workflow
WorldMM comprises three principal subsystems: multimodal memory construction, adaptive memory retrieval, and response generation. During ingestion, video streams are processed into complementary memory representations:
- Episodic Memory: Multiple directed textual knowledge graphs (KGs) indexed at several temporal resolutions, each capturing (subject, verb, object) triplets extracted by a video LLM.
- Semantic Memory: A single evolving KG consolidates high-level conceptual and habitual facts across segments, updated continually via LLM-driven merges.
- Visual Memory: A dual-index system stores (a) segment-level feature embeddings and (b) precise timestamp→frame image pairs for visual evidence.
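As a minimal sketch of how these three stores might be laid out in code (the class and field names here are illustrative, not the paper's schema):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Triplet:
    subject: str
    verb: str
    obj: str

@dataclass
class EpisodicMemory:
    # One KG (here simplified to a triplet list) per temporal scale.
    graphs_by_scale: dict = field(default_factory=dict)

@dataclass
class SemanticMemory:
    # A single evolving set of consolidated high-level facts.
    triplets: set = field(default_factory=set)

@dataclass
class VisualMemory:
    # (a) segment-level feature embeddings, (b) timestamp -> frame index.
    segment_embeddings: dict = field(default_factory=dict)
    frames_by_timestamp: dict = field(default_factory=dict)

m_e = EpisodicMemory(graphs_by_scale={"30s": [Triplet("person", "opens", "fridge")]})
m_s = SemanticMemory(triplets={Triplet("person", "drinks", "coffee")})
m_v = VisualMemory(segment_embeddings={0: [0.1, 0.9]},
                   frames_by_timestamp={12.5: "frame_12p5.jpg"})
```

The key design point the sketch preserves is that the three stores are kept separate and queried independently, rather than flattened into one index.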
During inference, an adaptive retrieval agent, parameterized by a sequence of retrieval decisions, dynamically selects the best memory source (episodic, semantic, or visual) and temporal granularity according to the query and retrieval history. This agent iteratively issues sub-queries until it determines that enough information has been gathered, at which point a response agent (LLM) generates the answer using the assembled evidence.
Memory Construction and Retrieval Pipeline
```text
Inputs: question q, memories {M_e, M_s, M_v}, max steps N
r_history ← []
for i in 1…N:
    (decision, m_i, sub_query) ← RetrievalAgent(q, r_history)
    if decision == STOP:
        break
    r_i ← RetrieveFromMemory(m_i, sub_query)
    append r_i to r_history
end
Answer ← ResponseAgent(q, r_history)
return Answer
```
2. Memory Types and Representations
WorldMM formalizes three distinct and complementary memory structures, optimized for variable-duration events and multimodal evidence.
2.1 Episodic Memory
For each temporal scale $\tau$, the video is segmented and each chunk is captioned, followed by triplet extraction to build a KG $G_\tau$. Episodic retrieval is executed via Personalized PageRank (PPR), scoring nodes against the query $q$ modeled as a pseudo-node. Top-$k$ candidate captions/triplets are reranked across scales by an LLM. The relevance score $\pi(v)$ for node $v$ follows the PPR recursion

$$\pi(v) = (1-\alpha)\,\mathbb{1}[v=q] + \alpha \sum_{u \to v} w_{uv}\,\pi(u),$$

where $w_{uv}$ denotes KG edge weights and $\alpha$ is the damping factor (Yeo et al., 2 Dec 2025).
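PPR scoring of this kind can be sketched with power iteration; the graph, damping factor, and restart setup below are illustrative, not the paper's configuration:

```python
import numpy as np

def personalized_pagerank(W, query_node, alpha=0.85, iters=100):
    """Score graph nodes against a query pseudo-node via PPR power iteration.

    W: row-stochastic (n, n) edge-weight matrix; query_node: restart index.
    """
    n = W.shape[0]
    restart = np.zeros(n)
    restart[query_node] = 1.0          # restart mass concentrated on the query
    pi = np.full(n, 1.0 / n)
    for _ in range(iters):
        pi = (1 - alpha) * restart + alpha * (W.T @ pi)
    return pi

# Tiny 3-node cycle: query pseudo-node 0 -> caption node 1 -> caption node 2 -> 0.
W = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
scores = personalized_pagerank(W, query_node=0)
ranked = np.argsort(-scores)  # nodes closer to the query score higher
```

Nodes nearer the query pseudo-node accumulate more restart mass, which is what makes PPR a natural query-conditioned relevance score over the KG.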
2.2 Semantic Memory
Using a fixed coarse scale $\tau_s$, semantic triplets are extracted and consolidated via LLM prompts. The evolving graph $M_s$ is updated as

$$M_s \leftarrow (M_s \setminus E_{\text{del}}) \cup E_{\text{add}} \cup E_{\text{upd}},$$

where deletions $E_{\text{del}}$ and updates $E_{\text{upd}}$ are determined by comparing new and existing embeddings. Retrieval scores per edge aggregate PPR across source and target nodes (Yeo et al., 2 Dec 2025).
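A hedged sketch of such an embedding-gated merge follows. The threshold, the toy bag-of-characters embedding, and the in-place-update policy are assumptions for illustration; the paper uses LLM prompts plus learned embeddings:

```python
import numpy as np

def consolidate(memory, new_items, embed, threshold=0.9):
    """Merge new semantic facts into the evolving store.

    A new item that is a near-duplicate of an existing entry (cosine
    similarity >= threshold) replaces it (an update); otherwise it is
    appended. LLM-decided deletions of contradicted facts are omitted.
    """
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    for item in new_items:
        e_new = embed(item)
        match = next((i for i, old in enumerate(memory)
                      if cos(embed(old), e_new) >= threshold), None)
        if match is not None:
            memory[match] = item   # update an existing fact in place
        else:
            memory.append(item)    # genuinely new fact
    return memory

def toy_embed(s):
    # Toy stand-in embedding: bag-of-letters count vector.
    v = np.zeros(26)
    for ch in s.lower():
        if ch.isalpha():
            v[ord(ch) - 97] += 1.0
    return v

mem = ["person drinks coffee each morning"]
mem = consolidate(mem, ["person drinks coffee every morning",
                        "dog sleeps on sofa"], toy_embed)
```

The first new item is near-identical to the stored fact and overwrites it; the second is dissimilar and is added, so the store grows only when genuinely new knowledge arrives.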
2.3 Visual Memory
Visual memory consists of:
- Feature-based index: segments yield feature embeddings $z_i$.
- Timestamp-based index: stores precise (timestamp → frame) pairs for frame-level retrieval.
Semantic queries are converted to text embeddings $z_q$ for cosine-similarity search:

$$\mathrm{sim}(z_q, z_i) = \frac{z_q \cdot z_i}{\lVert z_q \rVert \, \lVert z_i \rVert}.$$

Direct retrieval by timestamp is supported for frame lookup (Yeo et al., 2 Dec 2025).
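The feature-based lookup reduces to a ranked cosine-similarity search; a minimal sketch with illustrative 4-d embeddings:

```python
import numpy as np

def retrieve_segments(query_emb, segment_embs, top_k=2):
    """Rank stored segment embeddings by cosine similarity to a query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    Z = segment_embs / np.linalg.norm(segment_embs, axis=1, keepdims=True)
    sims = Z @ q                       # cosine similarity per segment
    order = np.argsort(-sims)[:top_k]  # best-matching segment indices first
    return order, sims[order]

# Three segment embeddings; the query points mostly toward segment 1.
segments = np.array([[1.0, 0.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0, 0.0],
                     [0.5, 0.5, 0.0, 0.0]])
query = np.array([0.1, 0.9, 0.0, 0.0])
idx, scores = retrieve_segments(query, segments)
```

The returned indices would then map back to segment time spans, from which the timestamp index can serve exact frames.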
3. Adaptive Retrieval Strategies
The retrieval agent operates in an iterative mode, deciding at each step whether the answer is sufficiently supported, which memory type and granularity to probe next, and whether to stop or refine the context. The agent leverages a few-shot LLM prompt with awareness of retrieval history and query intent. Decision criteria include sufficiency of current evidence, choice of modality (event, semantic, visual), and adaptability of temporal scope.
```python
def RetrievalAgent(q, history):
    if HistoryContainsAnswerableEvidence(history):
        return (STOP, None, None)
    output = LLM_decision(q, history)
    return (CONTINUE, output.memory_type, output.search_query)
This iterative, multi-modal, and multi-scale approach enables complex query resolution where events may span arbitrarily long intervals or require visual confirmation.
4. Multi-Scale Temporal Indexing and Controlled Retrieval
WorldMM utilizes a compositional strategy for temporal indexing, constructing overlapping segmentations at granularities suited to capture both fine and coarse temporal dependencies. Through dynamic retrieval, the agent can localize broad intervals via coarse scales and refine with finer segment retrieval. This is operationalized using temporal intersection-over-union (tIoU) to assess retrieval span accuracy:

$$\mathrm{tIoU}(T_{\text{pred}}, T_{\text{gt}}) = \frac{|T_{\text{pred}} \cap T_{\text{gt}}|}{|T_{\text{pred}} \cup T_{\text{gt}}|}.$$

WorldMM achieves average tIoU ≈ 10%, outperforming baselines by a factor of two to three (Yeo et al., 2 Dec 2025).
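For concreteness, tIoU over two `[start, end]` intervals can be computed as:

```python
def tiou(pred, gt):
    """Temporal IoU between two [start, end] intervals (in seconds)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Predicted span overlaps half of a 60 s ground-truth event:
# intersection = 30 s, union = 90 s, so tIoU = 1/3.
score = tiou([30.0, 90.0], [60.0, 120.0])
```

Non-overlapping spans score 0, and only an exact match scores 1, which is why even modest average tIoU values separate methods sharply on long videos.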
5. Experimental Results and Comparative Performance
WorldMM demonstrates superior accuracy and robustness on five long video reasoning benchmarks. Using GPT-5 as the retrieval and response agent (Yeo et al., 2 Dec 2025):
| Model | EgoLifeQA | Ego-R1 | HippoVlog | LVBench | VideoMME(L) | Avg |
|---|---|---|---|---|---|---|
| M3-Agent | 53.5 | 52.0 | 65.5 | 49.3 | 55.3 | 55.1 |
| WorldMM-8B | 56.4 | 52.0 | 69.7 | 55.4 | 66.0 | 59.9 |
| WorldMM-GPT | 65.6 | 65.3 | 78.3 | 61.9 | 76.6 | 69.5 |
Multi-turn retrieval further increases accuracy by 9.3% over single-step approaches. Ablations indicate that the joint use of episodic, semantic, and visual memory improves performance beyond any subset alone; for example, the full model outperforms an episodic-only variant on EgoLifeQA (Yeo et al., 2 Dec 2025).
6. Comparison to Related Multimodal Memory Systems
WorldMM builds on lessons from preceding agents:
- MIRIX: Modular memory structures (episodic, semantic, procedural, resource, KV) coordinated by a multi-agent protocol informed WorldMM’s multi-type, schema-based organization and compaction policies using cosine similarity, time decay, and importance scoring (Wang et al., 10 Jul 2025).
- TeleMem: Batched writing, narrative-grounded extraction, and ReAct-style integration for text and video memory suggest structured clustering and flexible retrieval layers for efficient memory operations (Chen et al., 12 Dec 2025).
- MemVerse: Hierarchical knowledge graphs with continual consolidation, adaptive forgetting, and periodic parametric distillation motivated WorldMM’s approach to scalable KG management and embedding-based retrieval (Liu et al., 3 Dec 2025).
- ViLoMem: Dual-stream separation of visual and logical error patterns, with grow-and-refine updating and explicit two-stage retrieval, demonstrated the value of error-aware multi-store memory for reasoning accuracy (Bo et al., 26 Nov 2025).
WorldMM’s joint reliance on LLM-driven extraction, KG-based consolidation, iterative, multi-scale retrieval, and multimodal (text + visual) evidence sets it apart in terms of scalability and context fidelity.
7. Limitations and Areas for Further Research
WorldMM incurs substantial preprocessing costs due to multi-scale KG construction and semantic graph consolidation, though these operations are amenable to online pipelining. The system’s reliance on LLM heuristics and prompt engineering for both consolidation and adaptive retrieval introduces susceptibility to prompt-induced failure propagation. Privacy concerns arise in real-time deployments, particularly regarding the semantic graph’s potential to encode sensitive behavioral patterns. Further research is directed toward optimizing preprocessing, automating heuristic calibration, and developing robust privacy protections (Yeo et al., 2 Dec 2025).
Summary Table: WorldMM Core Components
| Component | Function | Technique |
|---|---|---|
| Episodic Memory | Event filtering, multi-scale KGs | PPR, LLM triplet extraction |
| Semantic Memory | Habitual fact consolidation and evolution | LLM-based KG merges |
| Visual Memory | Detailed scene/frame evidence | Feature embedding, timestamp index |
| Retrieval Agent | Iterative, adaptive, cross-modal query | LLM few-shot decision prompting |
WorldMM thus provides a principled, modular framework for dynamic, multimodal memory-based reasoning over long-horizon video, advancing context-aware agentic capabilities across several complex benchmarks (Yeo et al., 2 Dec 2025).