WorldMM: Dynamic Multimodal Memory Agent
- The paper introduces WorldMM, a dynamic multimodal memory agent integrating episodic, semantic, and visual memory for long-horizon video analysis.
- It employs adaptive retrieval strategies and multi-scale temporal indexing to iteratively refine responses based on query context.
- Experimental results demonstrate significant accuracy gains on video question-answering benchmarks compared to single-step methods.
WorldMM is a dynamic multimodal memory agent architected to address the unique demands of long video understanding, particularly in scenarios that require reasoning over hours or days of visual and semantic content. Unlike earlier approaches that rely predominantly on textual summaries and fixed temporal scales, WorldMM integrates multi-scale episodic, semantic, and high-fidelity visual memories under the control of an adaptive retrieval agent, resulting in significant performance improvements on long-horizon video question-answering tasks (Yeo et al., 2 Dec 2025).
1. System Architecture and Workflow
WorldMM comprises three principal subsystems: multimodal memory construction, adaptive memory retrieval, and response generation. During ingestion, video streams are processed into complementary memory representations:
- Episodic Memory: Multiple directed textual knowledge graphs (KGs) indexed at several temporal resolutions, each capturing (subject, verb, object) triplets extracted by a video LLM.
- Semantic Memory: A single evolving KG consolidates high-level conceptual and habitual facts across segments, updated continually via LLM-driven merges.
- Visual Memory: A dual-index system stores (a) segment-level feature embeddings and (b) precise timestamp→frame image pairs for visual evidence.
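As a minimal sketch of how these three stores might be laid out in code (the class and field names here are illustrative, not the paper's schema):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Triplet:
    subject: str
    verb: str
    obj: str

@dataclass
class EpisodicMemory:
    # One KG (here simplified to a triplet list) per temporal scale.
    graphs_by_scale: dict = field(default_factory=dict)

@dataclass
class SemanticMemory:
    # A single evolving set of consolidated high-level facts.
    triplets: set = field(default_factory=set)

@dataclass
class VisualMemory:
    # (a) segment-level feature embeddings, (b) timestamp -> frame index.
    segment_embeddings: dict = field(default_factory=dict)
    frames_by_timestamp: dict = field(default_factory=dict)

m_e = EpisodicMemory(graphs_by_scale={"30s": [Triplet("person", "opens", "fridge")]})
m_s = SemanticMemory(triplets={Triplet("person", "drinks", "coffee")})
m_v = VisualMemory(segment_embeddings={0: [0.1, 0.9]},
                   frames_by_timestamp={12.5: "frame_12p5.jpg"})
```

The key design point the sketch preserves is that the three stores are kept separate and queried independently, rather than flattened into one index.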
During inference, an adaptive retrieval agent, parameterized by a sequence of retrieval decisions, dynamically selects the best memory source (episodic, semantic, or visual) and temporal granularity according to the query and retrieval history. This agent iteratively issues sub-queries until it determines that enough information has been gathered, at which point a response agent (LLM) generates the answer using the assembled evidence.
Memory Construction and Retrieval Pipeline
```text
Inputs: question q, memories {M_e, M_s, M_v}, max steps N
r_history ← []
for i in 1…N:
    (decision, m_i, sub_query) ← RetrievalAgent(q, r_history)
    if decision == STOP:
        break
    r_i ← RetrieveFromMemory(m_i, sub_query)
    append r_i to r_history
end
Answer ← ResponseAgent(q, r_history)
return Answer
```
2. Memory Types and Representations
WorldMM formalizes three distinct and complementary memory structures, optimized for variable-duration events and multimodal evidence.
2.1 Episodic Memory
For each temporal scale $\tau$, the video is segmented and each chunk is captioned, followed by triplet extraction to build a KG $G_\tau$. Episodic retrieval is executed via Personalized PageRank (PPR), scoring nodes against the query $q$ modeled as a pseudo-node. Top-$k$ candidate captions/triplets are reranked across scales by an LLM. The relevance score $\pi(v)$ for node $v$ follows the PPR recursion

$$\pi(v) = (1-\alpha)\,\mathbb{1}[v=q] + \alpha \sum_{u \to v} w_{uv}\,\pi(u),$$

where $w_{uv}$ denotes KG edge weights and $\alpha$ is the damping factor (Yeo et al., 2 Dec 2025).
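PPR scoring of this kind can be sketched with power iteration; the graph, damping factor, and restart setup below are illustrative, not the paper's configuration:

```python
import numpy as np

def personalized_pagerank(W, query_node, alpha=0.85, iters=100):
    """Score graph nodes against a query pseudo-node via PPR power iteration.

    W: row-stochastic (n, n) edge-weight matrix; query_node: restart index.
    """
    n = W.shape[0]
    restart = np.zeros(n)
    restart[query_node] = 1.0          # restart mass concentrated on the query
    pi = np.full(n, 1.0 / n)
    for _ in range(iters):
        pi = (1 - alpha) * restart + alpha * (W.T @ pi)
    return pi

# Tiny 3-node cycle: query pseudo-node 0 -> caption node 1 -> caption node 2 -> 0.
W = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
scores = personalized_pagerank(W, query_node=0)
ranked = np.argsort(-scores)  # nodes closer to the query score higher
```

Nodes nearer the query pseudo-node accumulate more restart mass, which is what makes PPR a natural query-conditioned relevance score over the KG.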
2.2 Semantic Memory
Using a fixed coarse scale $\tau_s$, semantic triplets are extracted and consolidated via LLM prompts. The evolving graph $M_s$ is updated as

$$M_s \leftarrow (M_s \setminus E_{\text{del}}) \cup E_{\text{add}} \cup E_{\text{upd}},$$

where deletions $E_{\text{del}}$ and updates $E_{\text{upd}}$ are determined by comparing new and existing embeddings. Retrieval scores per edge aggregate PPR across source and target nodes (Yeo et al., 2 Dec 2025).
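A hedged sketch of such an embedding-gated merge follows. The threshold, the toy bag-of-characters embedding, and the in-place-update policy are assumptions for illustration; the paper uses LLM prompts plus learned embeddings:

```python
import numpy as np

def consolidate(memory, new_items, embed, threshold=0.9):
    """Merge new semantic facts into the evolving store.

    A new item that is a near-duplicate of an existing entry (cosine
    similarity >= threshold) replaces it (an update); otherwise it is
    appended. LLM-decided deletions of contradicted facts are omitted.
    """
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    for item in new_items:
        e_new = embed(item)
        match = next((i for i, old in enumerate(memory)
                      if cos(embed(old), e_new) >= threshold), None)
        if match is not None:
            memory[match] = item   # update an existing fact in place
        else:
            memory.append(item)    # genuinely new fact
    return memory

def toy_embed(s):
    # Toy stand-in embedding: bag-of-letters count vector.
    v = np.zeros(26)
    for ch in s.lower():
        if ch.isalpha():
            v[ord(ch) - 97] += 1.0
    return v

mem = ["person drinks coffee each morning"]
mem = consolidate(mem, ["person drinks coffee every morning",
                        "dog sleeps on sofa"], toy_embed)
```

The first new item is near-identical to the stored fact and overwrites it; the second is dissimilar and is added, so the store grows only when genuinely new knowledge arrives.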
2.3 Visual Memory
Visual memory consists of:
- Feature-based index: segments yield feature embeddings $z_i$.
- Timestamp-based index: stores precise (timestamp → frame) pairs for frame-level retrieval.
Semantic queries are converted to text embeddings $z_q$ for cosine-similarity search:

$$\mathrm{sim}(z_q, z_i) = \frac{z_q \cdot z_i}{\lVert z_q \rVert \, \lVert z_i \rVert}.$$

Direct retrieval by timestamp is supported for frame lookup (Yeo et al., 2 Dec 2025).
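The feature-based lookup reduces to a ranked cosine-similarity search; a minimal sketch with illustrative 4-d embeddings:

```python
import numpy as np

def retrieve_segments(query_emb, segment_embs, top_k=2):
    """Rank stored segment embeddings by cosine similarity to a query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    Z = segment_embs / np.linalg.norm(segment_embs, axis=1, keepdims=True)
    sims = Z @ q                       # cosine similarity per segment
    order = np.argsort(-sims)[:top_k]  # best-matching segment indices first
    return order, sims[order]

# Three segment embeddings; the query points mostly toward segment 1.
segments = np.array([[1.0, 0.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0, 0.0],
                     [0.5, 0.5, 0.0, 0.0]])
query = np.array([0.1, 0.9, 0.0, 0.0])
idx, scores = retrieve_segments(query, segments)
```

The returned indices would then map back to segment time spans, from which the timestamp index can serve exact frames.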
3. Adaptive Retrieval Strategies
The retrieval agent operates in an iterative mode, deciding at each step whether the answer is sufficiently supported, which memory type and granularity to probe next, and whether to stop or refine the context. The agent leverages a few-shot LLM prompt with awareness of retrieval history and query intent. Decision criteria include sufficiency of current evidence, choice of modality (event, semantic, visual), and adaptability of temporal scope.
```python
def RetrievalAgent(q, history):
    if HistoryContainsAnswerableEvidence(history):
        return (STOP, None, None)
    output = LLM_decision(q, history)
    return (CONTINUE, output.memory_type, output.search_query)
This iterative, multi-modal, and multi-scale approach enables complex query resolution where events may span arbitrarily long intervals or require visual confirmation.
4. Multi-Scale Temporal Indexing and Controlled Retrieval
WorldMM utilizes a compositional strategy for temporal indexing, constructing overlapping segmentations at granularities suited to capture both fine and coarse temporal dependencies. Through dynamic retrieval, the agent can localize broad intervals via coarse scales and refine with finer segment retrieval. This is operationalized using temporal intersection-over-union (tIoU) to assess retrieval span accuracy:

$$\mathrm{tIoU}(T_{\text{pred}}, T_{\text{gt}}) = \frac{|T_{\text{pred}} \cap T_{\text{gt}}|}{|T_{\text{pred}} \cup T_{\text{gt}}|}.$$

WorldMM achieves average tIoU ≈ 10%, outperforming baselines by a factor of two to three (Yeo et al., 2 Dec 2025).
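For concreteness, tIoU over two `[start, end]` intervals can be computed as:

```python
def tiou(pred, gt):
    """Temporal IoU between two [start, end] intervals (in seconds)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Predicted span overlaps half of a 60 s ground-truth event:
# intersection = 30 s, union = 90 s, so tIoU = 1/3.
score = tiou([30.0, 90.0], [60.0, 120.0])
```

Non-overlapping spans score 0, and only an exact match scores 1, which is why even modest average tIoU values separate methods sharply on long videos.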
5. Experimental Results and Comparative Performance
WorldMM demonstrates superior accuracy and robustness on five long video reasoning benchmarks. Using GPT-5 as the retrieval and response agent (Yeo et al., 2 Dec 2025):
| Model | EgoLifeQA | Ego-R1 | HippoVlog | LVBench | VideoMME(L) | Avg |
|---|---|---|---|---|---|---|
| M3-Agent | 53.5 | 52.0 | 65.5 | 49.3 | 55.3 | 55.1 |
| WorldMM-8B | 56.4 | 52.0 | 69.7 | 55.4 | 66.0 | 59.9 |
| WorldMM-GPT | 65.6 | 65.3 | 78.3 | 61.9 | 76.6 | 69.5 |
Multi-turn retrieval further increases accuracy by 9.3% over single-step approaches. Ablations indicate that the joint use of episodic, semantic, and visual memory improves performance beyond any subset alone; for example, the full model outperforms an episodic-only variant on EgoLifeQA (Yeo et al., 2 Dec 2025).
6. Comparison to Related Multimodal Memory Systems
WorldMM builds on lessons from preceding agents:
- MIRIX: Modular memory structures (episodic, semantic, procedural, resource, KV) coordinated by a multi-agent protocol informed WorldMM’s multi-type, schema-based organization and compaction policies using cosine similarity, time decay, and importance scoring (Wang et al., 10 Jul 2025).
- TeleMem: Batched writing, narrative-grounded extraction, and ReAct-style integration for text and video memory suggest structured clustering and flexible retrieval layers for efficient memory operations (Chen et al., 12 Dec 2025).
- MemVerse: Hierarchical knowledge graphs with continual consolidation, adaptive forgetting, and periodic parametric distillation motivated WorldMM’s approach to scalable KG management and embedding-based retrieval (Liu et al., 3 Dec 2025).
- ViLoMem: Dual-stream separation of visual and logical error patterns, with grow-and-refine updating and explicit two-stage retrieval, demonstrated the value of error-aware multi-store memory for reasoning accuracy (Bo et al., 26 Nov 2025).
WorldMM’s joint reliance on LLM-driven extraction, KG-based consolidation, iterative, multi-scale retrieval, and multimodal (text + visual) evidence sets it apart in terms of scalability and context fidelity.
7. Limitations and Areas for Further Research
WorldMM incurs substantial preprocessing costs due to multi-scale KG construction and semantic graph consolidation, though these operations are amenable to online pipelining. The system’s reliance on LLM heuristics and prompt engineering for both consolidation and adaptive retrieval introduces susceptibility to prompt-induced failure propagation. Privacy concerns arise in real-time deployments, particularly regarding the semantic graph’s potential to encode sensitive behavioral patterns. Further research is directed toward optimizing preprocessing, automating heuristic calibration, and developing robust privacy protections (Yeo et al., 2 Dec 2025).
Summary Table: WorldMM Core Components
| Component | Function | Technique |
|---|---|---|
| Episodic Memory | Event filtering, multi-scale KGs | PPR, LLM triplet extraction |
| Semantic Memory | Habitual fact consolidation and evolution | LLM-based KG merges |
| Visual Memory | Detailed scene/frame evidence | Feature embedding, timestamp index |
| Retrieval Agent | Iterative, adaptive, cross-modal query | LLM few-shot decision prompting |
WorldMM thus provides a principled, modular framework for dynamic, multimodal memory-based reasoning over long-horizon video, advancing context-aware agentic capabilities across several complex benchmarks (Yeo et al., 2 Dec 2025).