Dream to Recall: Imagination-Guided Experience Retrieval for Memory-Persistent Vision-and-Language Navigation

Published 9 Oct 2025 in cs.CV, cs.AI, and cs.RO | (2510.08553v1)

Abstract: Vision-and-Language Navigation (VLN) requires agents to follow natural language instructions through environments, with memory-persistent variants demanding progressive improvement through accumulated experience. Existing approaches for memory-persistent VLN face critical limitations: they lack effective memory access mechanisms, instead relying on entire memory incorporation or fixed-horizon lookup, and predominantly store only environmental observations while neglecting navigation behavioral patterns that encode valuable decision-making strategies. We present Memoir, which employs imagination as a retrieval mechanism grounded by explicit memory: a world model imagines future navigation states as queries to selectively retrieve relevant environmental observations and behavioral histories. The approach comprises: 1) a language-conditioned world model that imagines future states serving dual purposes: encoding experiences for storage and generating retrieval queries; 2) Hybrid Viewpoint-Level Memory that anchors both observations and behavioral patterns to viewpoints, enabling hybrid retrieval; and 3) an experience-augmented navigation model that integrates retrieved knowledge through specialized encoders. Extensive evaluation across diverse memory-persistent VLN benchmarks with 10 distinctive testing scenarios demonstrates Memoir's effectiveness: significant improvements across all scenarios, with 5.4% SPL gains on IR2R over the best memory-persistent baseline, accompanied by 8.3x training speedup and 74% inference memory reduction. The results validate that predictive retrieval of both environmental and behavioral memories enables more effective navigation, with analysis indicating substantial headroom (73.3% vs 93.4% upper bound) for this imagination-guided paradigm. Code at https://github.com/xyz9911/Memoir.

Abstract PDF Upgrade to Chat

Summary

The paper introduces a unified framework using imagination-guided queries to selectively retrieve both visual and behavioral experiences for enhanced navigation.
It employs a language-conditioned contrastive world model with multi-step overshooting to improve long-horizon prediction and retrieval quality.
Memoir demonstrates significant benchmark improvements with enhanced SPL, reduced training latency, and lower memory consumption for efficient VLN performance.

Introduction

The paper "Dream to Recall: Imagination-Guided Experience Retrieval for Memory-Persistent Vision-and-Language Navigation" (Memoir) addresses the challenge of enabling embodied agents to progressively improve navigation performance by leveraging accumulated experience across episodes. Traditional VLN agents operate in an episodic fashion, lacking mechanisms for long-term memory integration and thus failing to adapt to persistent, real-world scenarios. Existing memory-persistent VLN methods either incorporate all available memory, leading to computational inefficiency and noise, or rely on fixed-horizon lookups, risking loss of valuable experience. Furthermore, most approaches focus solely on environmental observations, neglecting behavioral histories that encode strategic decision-making patterns.

Memoir introduces a unified framework where a language-conditioned world model imagines future navigation states to generate queries for selective retrieval of both environmental observations and behavioral histories. This paradigm enables adaptive, viewpoint-level memory access, facilitating robust navigation planning grounded in explicit long-term memory.

Figure 1: Overview of Memoir's workflow for experience retrieval via imagination, showing episodic memory bank population and imagination-guided retrieval for navigation planning.

Methodology

Language-Conditioned Contrastive World Model

Memoir employs a contrastive variational world model conditioned on both visual observations and natural language instructions. The model encodes navigation histories into latent state representations and recursively imagines future states, which serve as queries for memory retrieval. The contrastive objective replaces pixel-level reconstruction with semantic compatibility estimation between latent states and observations, enabling efficient and discriminative representation learning. Multi-step overshooting is incorporated to enhance long-horizon predictive capability, improving retrieval quality for extended navigation trajectories.

Hybrid Viewpoint-Level Memory (HVM)

Memoir maintains two complementary memory banks anchored to viewpoints:

Observation Bank: Stores panoramic visual features for each visited viewpoint.
History Bank: Stores inferred agent states and imagined trajectory sequences for each viewpoint, capturing behavioral patterns across episodes.

At each navigation step, the agent updates both banks with current observations and imagined trajectories. Retrieval is performed via:

Observation Retrieval: Topology-guided search using state-observation compatibility scores, filtering and ranking candidate viewpoints based on imagined future states.
History Retrieval: Sequential similarity matching between current imagined trajectories and stored historical patterns, selecting top matches for integration.
Figure 2: Details of imagination-guided experience retrieval, including contrastive training, recursive imagination, dual retrieval, and specialized encoders for decision integration.

Memoir extends the Dual-Scale Graph Transformer (DUET) architecture with three specialized encoders:

Coarse-Scale Encoder: Processes global observations, including retrieved viewpoints.
Fine-Scale Encoder: Processes immediate panoramic features for local action planning.
Navigation-History Encoder: Fuses retrieved historical states with current viewpoint features, enabling history-informed decision-making.

A dynamic fusion mechanism balances contributions from all branches, producing final action scores for navigation.

Experimental Results

Quantitative Performance

Memoir is evaluated on IR2R and GSA-R2R benchmarks, demonstrating consistent improvements over both traditional and memory-persistent baselines. On IR2R unseen environments, Memoir achieves a 5.4% SPL gain over GR-DUET (73.3% vs. 67.9%), with 8.3× training speedup and 74% inference memory reduction. The oracle retrieval upper bound (93.4% SPL) indicates substantial headroom for further improvement.

Figure 3: Performance scaling across tour progression on IR2R, showing Memoir's sustained improvement with accumulated experience.

Memoir also outperforms adaptation-based and memory-based methods on GSA-R2R across diverse user and scene instructions, with average gains of 2.38% SR and 1.59% SPL over GR-DUET.

Qualitative Analysis

Case studies illustrate Memoir's ability to retrieve relevant observations and behavioral histories, enabling successful navigation in scenarios where baselines fail due to ambiguity or lack of strategic context.

Figure 4: Visualization of Memoir's memory retrieval from both banks and trajectory comparison with DUET and GR-DUET.

Computational Efficiency

Memoir's selective retrieval mechanism yields significant resource savings compared to complete memory incorporation strategies. Training memory usage is reduced by 55%, and training latency by 88%, making Memoir suitable for deployment in resource-constrained settings.

Ablation Studies

Ablations confirm the necessity of both observation and history retrieval, the superiority of imagination-guided queries over random or exhaustive memory access, and the benefits of advanced world model architectures (Transformer with overshooting). Incorporating neighbor viewpoints and completing partial observations further enhance navigation performance.

Figure 5: Hyper-parameter study for retrieval mechanisms, showing the impact of filter rates and search width on navigation performance.

Failure Modes and Future Directions

Failure analysis reveals limitations in world model predictive accuracy and retrieval confidence, particularly in distinguishing spatially similar but semantically distinct targets. The agent may over-exploit retrieved experience, failing to explore novel alternatives when memory is misleading.

Figure 6: Visualization of failure modes in imagination-guided memory retrieval, highlighting retrieval ambiguities and exploitation-exploration trade-offs.

Implications and Future Work

Memoir establishes a principled connection between predictive simulation and explicit memory retrieval for embodied navigation. The approach demonstrates that imagination-guided, adaptive memory access enables agents to leverage both environmental and behavioral experience for robust, scalable navigation. The substantial gap between current performance and the oracle upper bound motivates further research in world model scaling, explicit spatial relationship modeling, and confidence-aware retrieval mechanisms.

Potential future developments include:

Large-scale pretraining of world models for improved predictive fidelity.
Integration of spatial reasoning modules to enhance retrieval discrimination.
Dynamic confidence estimation for balancing exploitation and exploration during navigation.

Conclusion

Memoir introduces an imagination-guided experience retrieval paradigm for memory-persistent VLN, leveraging a language-conditioned world model and hybrid viewpoint-level memory to enable adaptive, efficient, and effective navigation. Extensive empirical results validate the approach's superiority in both performance and resource efficiency, while ablation and failure analyses highlight avenues for further advancement in predictive retrieval and embodied intelligence.

Markdown