Papers
Topics
Authors
Recent
Search
2000 character limit reached

Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory

Published 1 May 2026 in cs.CL | (2605.00702v1)

Abstract: LLM agents require long-term user memory for consistent personalization, but limited context windows hinder tracking evolving preferences over long interactions. Existing memory systems mainly rely on static, hand-crafted update rules; although reinforcement learning (RL)-based agents learn memory updates, sparse outcome rewards provide weak supervision, resulting in unstable long-horizon optimization. Drawing on memory schema theory and the functional division between prefrontal regions and hippocampus regions, we introduce MemCoE, a cognition-inspired two-stage optimization framework that learns how memory should be organized and what information to update. In the first stage, we propose Memory Guideline Induction to optimize a global guideline via contrastive feedback interpreted as textual gradients; in the second stage, Guideline-Aligned Memory Policy Optimization uses the induced guideline to define structured process rewards and performs multi-turn RL to learn a guideline-following memory evolution policy. We evaluate on three personalization memory benchmarks, covering explicit/implicit preference and different sizes and noise, and observe consistent improvements over strong baselines with favorable robustness, transferability, and efficiency.

Summary

  • The paper introduces MemCoE, a cognition-inspired two-stage optimization that separates memory guideline induction from policy optimization to enhance personalization.
  • It leverages contrastive textual feedback and RL-based Group Relative Policy Optimization to efficiently manage memory evolution in long, multi-turn dialogues.
  • Experimental results show improved accuracy, long-term preference retention, and scalability compared to static heuristics and other RL-based memory agents.

Cognition-Inspired Two-Stage Optimization for Evolving User Memory in LLM Agents

Motivation and Problem Statement

LLM agents require the capacity to maintain and evolve long-term user memories for effective personalization in multi-turn dialogue and agentic interaction. Context-window limitations of LLMs preclude direct retention of long histories, while existing memory augmentation strategies—whether static heuristics or reinforcement-learning (RL) based—fail to robustly capture evolving user preferences over extended interactions or under noisy, non-stationary inputs. Static rule-based schemes lack adaptability, and RL approaches are bottlenecked by sparse supervision, large action spaces, and instabilities in long-horizon policy optimization. Inspired by memory schema theory and neuroanatomical findings on the division of labor between prefrontal cortex (schema selection/organization) and hippocampus (episodic encoding), this paper introduces MemCoE: a two-stage optimization paradigm with distinct phases for learning "how" memory should be structured and "what" to store at each point in the dialogue trajectory. Figure 1

Figure 1: With a limited context window (top), the agent cannot track preferences; by decoupling memory guideline (prefrontal-like) and episodic agent memory (hippocampal-like) (bottom), organization and content updating are separated for higher fidelity personalization.

Method: MemCoE Two-Stage Optimization

MemCoE operationalizes the cognitive analogy via a split architecture:

  1. Memory Guideline Induction (MGI): Treats the memory update instruction prompt as an optimizable natural-language parameter encoding memory organization schemas. Using a labeled dataset of dialogue-query pairs, contrastive feedback over generated memory-augmented trajectories is converted into batch-aggregated "textual gradients" driving prompt/token optimization. This produces a domain-agnostic, explicit, and transferable memory guideline S⋆\mathcal{S}^\star.
  2. Guideline-Aligned Memory Policy Optimization (GMPO): Conditioned on S⋆\mathcal{S}^\star, this stage uses RL (specifically Group Relative Policy Optimization, GRPO) to induce a memory evolution policy maximally aligned with both the learned guideline and downstream answer correctness, solving for "what" content is extracted, retained, or forgotten in each round via more granular, guideline-aligned and answer-based dense rewards. Figure 2

    Figure 2: The two-stage framework: first, Memory Guideline Induction develops a robust textual guideline; second, GMPO aligns memory policy evolution to this fixed organizing principle via RL over memory-augmented trajectories.

Experimental Validation

MemCoE is evaluated on three personalization memory benchmarks: PersonaMem (long-term preference evolution), PrefEval (explicit/implicit preference following), and PersonaBench (heterogeneous and noisy user grounding). Metrics are task dependent: accuracy for PersonaMem and PrefEval, F1 for PersonaBench.

Empirically, MemCoE demonstrates superior performance relative to both classical RAG, static heuristics, and state-of-the-art RL-based agents (MemAgent, MEM-α\alpha) across all settings—robustness to noise, out-of-domain generalization, and efficient scaling with longer histories are explicitly shown. Figure 3

Figure 3: Efficiency/performance tradeoff on PersonaMem (32K): MemCoE achieves highest accuracy within a regime of comparatively low runtime variance.

Figure 4

Figure 4: Retrieval Top-K analysis: MemCoE outperforms baselines across all retriever settings, with optimality at K=20 due to effective filtering during memory evolution.

Figure 5

Figure 5: Impact of tokens per memory evolution round: performance is maximized at an intermediate token budget, which balances update frequency and context complexity.

Figure 6

Figure 6: Quality of prompt/guideline directly affects PersonaMem accuracy, with the full MemCoE induction pipeline yielding consistent and significant gains over manual and standard LLM-based prompt optimization.

Preference Retention and Robustness Analysis

Long-horizon retention and resilience to preference forgetting/drift are crucial for agentic personalization. Multi-round analysis (PrefEval, explicit tracking) shows that MemCoE exhibits much slower degradation (from 100% to ≈74% over 10 rounds) in preference retention relative to RL memory agents (which rapidly fall to ≈51%): Figure 7

Figure 7: Preference retention curves during multi-round memory evolution: MemCoE retains preference-relevant memory significantly more robustly than RL-based baselines.

Error decomposition reveals most remaining errors are due to memory evolution (rather than answer generation), validating the focus on optimizing the update stage as the primary bottleneck: Figure 8

Figure 8: Error analysis: predominant failure mode is at the memory evolution stage, supporting priority of improvement efforts for upstream retention.

Scalability

Evaluation on longer histories (PersonaMem 128K+) indicates sublinear memory bank scaling and stable memory evolution time, confirming the framework is computationally tractable even for million-token corpora: Figure 9

Figure 9

Figure 9: As dialogue context grows, memory bank size and evolving time remain bounded and predictable.

Cross-LLM Transferability and Ablation

Guidelines induced by MemCoE are empirically shown to transfer successfully across diverse backbone LLMs (Qwen2.5-7B, gpt-4o-mini, Gemini-2.5-flash, GPT-5) without retraining RL policies, underlining that the organizational principle is robust and agnostic to underlying model specifics.

Ablations confirm that both contrastive textual feedback in MGI and guideline-aligned RL are essential. Removing either stage significantly degrades performance, and ablation of all structure reduces results to baseline RL-memory agent levels.

Practical and Theoretical Implications

MemCoE presents a paradigm shift in agentic memory modeling for LLM systems—by separating the "how" (organizational schema) and "what" (content update) with explicit, optimizable interfaces. This supports:

  • Sample-efficient RL: Dense, process-level rewards (rather than outcome-only) yield more stable updates and require fewer samples.
  • Transferable policies: Textual guideline abstraction allows consistent behavior independent of backbone LLM.
  • Efficient scaling: Structured, evidence-bounded updates enable tractable scaling to long histories and noisy data with minimal memory bloat.
  • Personalization robustness: Stronger long-temporal retention and explicit conflict handling facilitate trustworthy adaptive behavior, necessary for safety-critical or privacy-sensitive personalization.

Potential future extensions include guidelines adaptable to multi-objective trade-offs (e.g., stability vs. plasticity, brevity vs. informativeness), integration with graph-structured or hierarchical memory indices, and joint optimization with user-alignment or privacy objectives.

Conclusion

MemCoE introduces and validates a cognition-inspired, two-stage optimization pipeline for LLM agent memory: first learning robust, generalizable organization schemas via batch textual contrast, then aligning content evolution policies under dense, guideline-aware reward with RL. Across benchmarks, MemCoE achieves leading accuracy, efficiency, and robustness to long-horizon noise and distributional drift. This separation of schema induction and episodic content updating marks a practical advance in memory-augmented dialogue systems, and is directly extensible to next-generation agentic architectures.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.