Lifelong Embodied Memory System (LEMS)
- LEMS is a paradigm that integrates long-term structural memory with short-term dynamic updates to support continual learning in embodied agents.
- It employs multi-modal fusion, incremental construction, and hierarchical organization to handle complex, evolving 3D environments efficiently.
- Advanced techniques such as compressed token memory, graph-based retrieval, and dual-memory strategies mitigate catastrophic forgetting and enhance scalability.
A Lifelong Embodied Memory System (LEMS) is an architectural and algorithmic paradigm for memory in embodied agents that enables persistent, scalable, and contextually rich accumulation and use of experience over extended deployments. LEMS unifies long-term structural representations with short-term, dynamic contexts to support continual learning, reasoning, and action in complex, evolving 3D environments. Contemporary LEMS designs exploit multi-modal inputs, memory compression, incremental construction, and hierarchical organization to ensure both efficiency and adaptability across perception, navigation, and higher-level reasoning tasks.
1. Canonical Architectural Principles
A LEMS integrates persistent representations and rapid, transient caches to address the dual challenge of sustained knowledge retention and responsive, context-specific behavior. A typical system comprises two or more modules:
- Long-Term Memory (LTM): Persistent storage of world structure (e.g., 3D scene graphs, spatial knowledge graphs, or memory snapshots), encoding enduring properties such as semantics, metric layout, and affordances. These representations are rarely purged and are optimized for accumulation over months or years (Wang et al., 2024, Yang et al., 2024, Lei et al., 2 Aug 2025).
- Short-Term Memory (STM): Capacity-limited, frequently updated storage for the most recent observations, object states, or local contexts. Often realized as a cache of key-value tuples, FIFO buffers, or dynamic memories with advanced replacement strategies—e.g., LRU, LFU, or W-TinyLFU—optimized for recency or importance (Wang et al., 2024, Yin et al., 2022).
- Multi-Modal Fusion: Inputs (e.g., RGB-D, LiDAR, audio, text) are processed through sensory hierarchies or visual encoders, with feature extraction often performed by pretrained or frozen backbones such as DINOv3-ViT or CLIP (Ren et al., 25 Dec 2025, Zhang et al., 30 Jun 2025).
- Incremental Construction: Scene representations are built incrementally, supporting efficient updates without full recomputation and ensuring scalability with respect to both environment size and operational duration (Yang et al., 2024, Lei et al., 2 Aug 2025).
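As a concrete illustration of the LTM/STM split described above, the following minimal Python sketch pairs a persistent long-term store with a capacity-limited, LRU-evicting short-term cache. The class and method names are hypothetical, and a plain LRU policy stands in for the LFU/W-TinyLFU variants cited above.

```python
from collections import OrderedDict


class ShortTermMemory:
    """Capacity-limited key-value cache with LRU replacement.

    A minimal stand-in for the STM caches described above; real systems
    may use LFU or W-TinyLFU instead of plain LRU.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self._cache = OrderedDict()

    def put(self, key, value):
        if key in self._cache:
            self._cache.move_to_end(key)        # refresh recency
        self._cache[key] = value
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)     # evict least-recently used

    def get(self, key):
        if key not in self._cache:
            return None
        self._cache.move_to_end(key)            # a hit also refreshes recency
        return self._cache[key]


class EmbodiedMemory:
    """Pairs a persistent long-term store with a transient STM cache."""

    def __init__(self, stm_capacity=3):
        self.ltm = {}                            # persistent, rarely purged
        self.stm = ShortTermMemory(stm_capacity)

    def observe(self, key, value, persistent=False):
        self.stm.put(key, value)
        if persistent:
            self.ltm[key] = value                # consolidate into LTM

    def recall(self, key):
        hit = self.stm.get(key)                  # prefer the fresh STM view
        return hit if hit is not None else self.ltm.get(key)
```

The key behavior is that an observation evicted from the bounded STM remains recallable if it was consolidated into the LTM, mirroring the dual-store designs above.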
2. Representational Substrates and Memory Structures
LEMS implementations vary in their representational choices and data structures, reflecting different operational contexts:
| Memory Substrate | Key Characteristics | Example Systems |
|---|---|---|
| 3D Scene Graphs | Hierarchical, object-centric, spatial relationships | KARMA, 3D-Mem |
| Memory Snapshots | Multi-view image clusters with co-visible context | 3D-Mem |
| Occupancy/Frontier Maps | Voxel or grid-based, supporting exploration policies | 3D-Mem, BioSLAM |
| Compressed Visual Tokens | End-to-end learned, patch-based representation | AstraNav-Memory |
| Knowledge Graphs (KG) | Name-centric, semantic, or relational-graph-based | Ella, RoboMemory |
| Temporal Buffers | Episodic stores of recent or past events | RoboMemory, Ella |
Notable approaches:
- 3D-Mem Memory Snapshots: Each snapshot captures a cluster of co-visible objects and context using multi-view images. Features are extracted by a frozen visual encoder. The snapshot set forms the memory, subject to coverage and exclusivity constraints on object clusters; co-visibility clustering selects the frames best suited for object grouping (Yang et al., 2024).
- Compressed Token Memory: AstraNav-Memory encodes each visual frame as a compact set of learned tokens, compressing the original per-frame token count to a far smaller one and achieving a high compression ratio (Ren et al., 25 Dec 2025).
- Scene Graphs and Knowledge Graphs: Scene graphs organize objects, regions, and agents into multilayered, spatially embedded graphs with linkages for traversability, containment, and affinity (Zhang et al., 30 Jun 2025, Lei et al., 2 Aug 2025).
- Dual-Memory in BioSLAM: A static memory (for long-term retention via centroid clusters) and a dynamic memory (short-term buffer for replay) facilitate robust, adaptive place recognition (Yin et al., 2022).
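The coverage and exclusivity constraints behind snapshot memories can be sketched as a greedy selection over co-visibility sets. `select_snapshots` below is a hypothetical helper for illustration, not 3D-Mem's actual clustering algorithm.

```python
def select_snapshots(frame_objects):
    """Greedy frame selection for memory snapshots.

    frame_objects: dict mapping frame_id -> set of co-visible object ids.
    Repeatedly picks the frame covering the most not-yet-assigned objects
    (coverage) and assigns each object to exactly one snapshot (exclusivity).
    A sketch of the idea only.
    """
    unassigned = set().union(*frame_objects.values())
    snapshots = {}
    while unassigned:
        # Frame whose co-visible set covers the most remaining objects.
        best = max(frame_objects, key=lambda f: len(frame_objects[f] & unassigned))
        covered = frame_objects[best] & unassigned
        if not covered:
            break
        snapshots[best] = covered
        unassigned -= covered
    return snapshots
```

Running it on three frames shows each object landing in exactly one snapshot while all objects remain covered.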
3. Memory Update, Consolidation, and Management
Continuous operation imposes strict requirements for efficient update and retrieval:
- Incremental Update Pipelines: New sensory input triggers localized updates—object-level updates followed by reclustering (as in 3D-Mem); STM caches are incrementally refreshed or replaced based on relevance scores (Yang et al., 2024, Wang et al., 2024).
- Capacity Management and Pruning: To bound memory usage, systems adopt utility or recency-based pruning. E.g., 3D-Mem employs recency scores and past retrieval frequencies to evict least-valuable snapshots (Yang et al., 2024); W-TinyLFU in KARMA maintains hit rates under fixed STM budgets (Wang et al., 2024).
- Memory Consolidation: Consolidation mechanisms may merge redundant information (e.g., identical views in 3D-Mem) or compress trajectories (e.g., episodic summarization in EgoMem) (Yang et al., 2024, Yao et al., 15 Sep 2025).
- Multi-memory Fusion: Systems such as RoboMemory implement parallel update and query for spatial (KG), temporal (FIFO buffer), episodic (retrieval-augmented generation), and semantic (“lessons learned”) memories. All modules are queried and integrated within each perception–action loop (Lei et al., 2 Aug 2025).
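Recency- and frequency-based eviction of the kind described above can be sketched as a utility score over snapshots; the scoring weights and decay constant below are illustrative assumptions, not values from the cited systems.

```python
import math


def snapshot_utility(last_access_t, retrieval_count, now,
                     half_life=300.0, alpha=0.5):
    """Utility mixing recency (exponential decay) with past retrieval
    frequency; the lowest-scoring snapshot is evicted when over budget.
    Weights and decay are illustrative, not from any cited system."""
    recency = math.exp(-(now - last_access_t) / half_life)
    frequency = math.log1p(retrieval_count)
    return alpha * recency + (1 - alpha) * frequency


def evict_if_over_budget(memory, budget, now):
    """memory: dict mapping snapshot id -> (last_access_t, retrieval_count).
    Evicts least-valuable snapshots until the budget is respected."""
    while len(memory) > budget:
        victim = min(memory, key=lambda k: snapshot_utility(*memory[k], now))
        del memory[victim]
    return memory
```

Under this score, an old, never-retrieved snapshot loses out to recently touched or frequently retrieved ones.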
4. Retrieval Algorithms and Neural-Symbolic Interfaces
Efficient retrieval is critical to robust planning and reasoning:
- Prefiltering and Embedding-based Similarity: 3D-Mem first filters snapshots by category using vision-language model (VLM) prompting, then computes cosine similarity between image features and query embeddings, retaining the top candidates (Yang et al., 2024).
- Vector-Space Retrieval and Semantic Filtering: STM entries in KARMA are embedded and selected by similarity to query embeddings, affording semantic reasoning and flexible match granularity (Wang et al., 2024).
- Graph-based and Subgraph Expansion: Retrieval of subgraphs by k-hop expansion facilitates context-aware planning, with conflict resolution guaranteeing semantic consistency (e.g., at most one relation per object pair in RoboMemory's KG) (Lei et al., 2 Aug 2025).
- Multimodal Score Aggregation: In Ella, episode retrieval scores combine spatial proximity, semantic content similarity, and temporal recency, with normalization and averaging to yield final relevance (Zhang et al., 30 Jun 2025).
- Neural-Symbolic Interface: Neural features (visual tokens, embeddings, etc.) are flattened or embedded into long-context transformer models along with symbolic content (scene graphs, plan steps, past actions), supporting both symbolic planning and deep retrieval (Ren et al., 25 Dec 2025, Lei et al., 2 Aug 2025).
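A minimal two-stage retrieval pipeline (coarse category prefilter, then cosine-similarity ranking) might look like the sketch below. The data layout and the string-equality prefilter are simplifying assumptions standing in for VLM-based filtering.

```python
import numpy as np


def retrieve(query_vec, query_category, memory, k=2):
    """Two-stage retrieval: (1) prefilter entries by a coarse category
    label, then (2) rank survivors by cosine similarity of embeddings,
    keeping the top-k. Data layout is illustrative."""
    candidates = [(mid, e["vec"]) for mid, e in memory.items()
                  if e["category"] == query_category]
    if not candidates:
        return []
    ids, vecs = zip(*candidates)
    M = np.stack(vecs).astype(float)
    q = np.asarray(query_vec, dtype=float)
    # Cosine similarity; small epsilon guards against zero vectors.
    sims = (M @ q) / (np.linalg.norm(M, axis=1) * np.linalg.norm(q) + 1e-9)
    order = np.argsort(-sims)[:k]
    return [ids[i] for i in order]
```

The prefilter keeps the expensive similarity computation sublinear in total memory size, which is the point the retrieval designs above share.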
5. Lifelong Properties and Catastrophe Mitigation
LEMS addresses catastrophic forgetting and unbounded growth through several principles:
- Incremental Aggregation and Fixed-time Complexity: Systems such as Sparsey provide fixed-time learning and retrieval per new item, even with unbounded memory, via sparse distributed representations (SDRs) and winner-take-all modules (Rinkus, 2018).
- Critical Periods and Metaplasticity: In Sparsey, early layer synapses are frozen after saturation (critical period), while top layers undergo ongoing plasticity with metaplastic decay, slowing forgetting of less-frequent patterns and yielding exponential capacity scaling with hierarchy depth (Rinkus, 2018).
- Dual-memory and Generative Replay: BioSLAM dynamically balances short-term (dynamic) and long-term (static) memories, using gated generative replay weighted by experience “hardness” and familiarity, minimizing disruption of consolidated knowledge (Yin et al., 2022).
- Compression and Sublinear Retrieval: 3D-Mem and AstraNav-Memory achieve memory size scaling that is linear or sub-linear in environment complexity, and retrieval by embedding-based selection instead of exhaustive search (Yang et al., 2024, Ren et al., 25 Dec 2025).
- No Global Consolidation: Many systems avoid global consolidation passes in favor of local updates and utility-based pruning, favoring scalability (Yang et al., 2024, Zhang et al., 30 Jun 2025).
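Hardness- and familiarity-weighted replay can be sketched as weighted sampling over an experience buffer; the weighting rule below is an illustrative assumption, not BioSLAM's exact gating, and the `hardness`/`familiarity` fields (both in [0, 1]) are assumed.

```python
import random


def sample_replay(buffer, batch_size, rng=None):
    """Weighted sampling for experience replay: harder, less familiar
    experiences are replayed more often, protecting consolidated
    knowledge from disruption. Weighting rule is a sketch."""
    rng = rng or random.Random()
    weights = [e["hardness"] * (1.0 - e["familiarity"]) + 1e-6 for e in buffer]
    return rng.choices(buffer, weights=weights, k=batch_size)
```

With this rule, a hard, unfamiliar experience dominates the replay batches drawn from a buffer that also contains easy, familiar ones.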
6. Empirical Outcomes and Benchmarks
Empirical evaluations quantify LEMS effectiveness:
| System | Benchmark | Key Metrics/Results |
|---|---|---|
| 3D-Mem | A-EQA, EM-EQA, GOAT-Bench | LLM-Match improved from 46.9% to 52.6%; SPL from 23.4% to 42.0%; Success from 61.5% to 69.1% (Yang et al., 2024) |
| KARMA | AI2-THOR Composite/Complex | Success Rate gains of 1.3/2.3 over baseline; task time reduced by up to 62.7 (Wang et al., 2024) |
| AstraNav-Memory | GOAT, HM3D-OVON | SR +15.5 pp, SPL +29.2 pp (Val-Unseen, vs MTU3D), context length up to 200 compressed frames (Ren et al., 25 Dec 2025) |
| RoboMemory | EmbodiedBench | Success Rate +25% over Qwen2.5-VL-72B, outperforming Gemini-1.5-Pro by +3%; improvements persist across repeated tasks (Lei et al., 2 Aug 2025) |
| Ella | Social Multi-Agent | Stable memory growth, higher completion and influence rates than unstructured memory baselines (Zhang et al., 30 Jun 2025) |
| EgoMem | Personalized Dialogs | Retrieval/trigger module accuracy of 95% and fact-consistency score of 87% under full-duplex, real-time operation (Yao et al., 15 Sep 2025) |
| BioSLAM | City, Campus Place Recognition | WR 91.2% (city), 76.1% (campus), outperforming SOTA by up to 24% (Yin et al., 2022) |
The systems above demonstrate improvements in navigation, question answering, task success, and memory robustness, with ablations confirming the necessity of dual-memory components and advanced replacement policies.
7. Generalizations, Variations, and Open Challenges
Contemporary LEMS designs support several generalizations:
- Applicability Across Modalities: All examined systems are multi-modal, incorporating RGB-D, LiDAR, audio, and text features as appropriate for the task domain (Yang et al., 2024, Ren et al., 25 Dec 2025, Yao et al., 15 Sep 2025).
- Extensions Beyond Navigation: While navigation remains the canonical benchmark (GOAT-Bench, HM3D-OVON, EmbodiedBench), LEMS are increasingly applied to social interaction (Ella), personalized dialogs (EgoMem), and manipulation (Zhang et al., 30 Jun 2025, Yao et al., 15 Sep 2025).
- Scaling and Compression: Compression techniques (pixel unshuffle, patch merging, DINO features, token flattening) enable temporal context windows spanning hundreds of steps or more (Ren et al., 25 Dec 2025). Excessive compression degrades performance, with optimal trade-offs reached between memory length and fidelity in empirical studies.
- Open Questions: Most systems rely on similarity-based retrieval rather than explicit, multi-hop graph reasoning; direct exploitation of knowledge graph structure for complex querying remains an area for future development (Zhang et al., 30 Jun 2025).
- No True Biological Plausibility: While neural-symbolic integration, SDRs, and metaplasticity mechanisms draw inspiration from biology, most systems (KARMA, AstraNav) adopt computer system techniques (cache replacement, API structuring) rather than neurobiological learning and consolidation protocols (Wang et al., 2024).
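One of the compression primitives mentioned above, pixel unshuffle (space-to-depth), can be shown in a few lines of NumPy: folding each r x r spatial block into the channel dimension cuts the number of spatial tokens by a factor of r squared. This is a generic sketch of the operation, not any cited system's exact encoder.

```python
import numpy as np


def pixel_unshuffle(feat, r):
    """Space-to-depth ('pixel unshuffle'): fold each r x r spatial block
    into the channel dimension, so a (C, H, W) feature map becomes
    (C*r*r, H//r, W//r) and the spatial token count drops by r**2."""
    c, h, w = feat.shape
    assert h % r == 0 and w % r == 0, "spatial dims must be divisible by r"
    x = feat.reshape(c, h // r, r, w // r, r)
    x = x.transpose(0, 2, 4, 1, 3)            # (C, r, r, H/r, W/r)
    return x.reshape(c * r * r, h // r, w // r)
```

For a 4x4 single-channel map with r=2, the 16 spatial positions collapse to 4, and each output position's channels hold the values of one original 2x2 block, so no information is lost, only re-laid-out.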
A plausible implication is that LEMS design principles—modular dual-memory, incremental update, compression, embedding-based retrieval, and task-conditioned utility pruning—constitute a shared substrate for lifelong, scalable embodied intelligence across diverse operational settings.