
Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning

Published 27 Aug 2025 in cs.CL and cs.MA (arXiv:2508.19828v2)

Abstract: LLMs have demonstrated impressive capabilities across a wide range of NLP tasks, but they remain fundamentally stateless, constrained by limited context windows that hinder long-horizon reasoning. Recent efforts to address this limitation often augment LLMs with an external memory bank, yet most existing pipelines are static and heuristic-driven, lacking any learned mechanism for deciding what to store, update, or retrieve. We present Memory-R1, a reinforcement learning (RL) framework that equips LLMs with the ability to actively manage and utilize external memory through two specialized agents: a Memory Manager that learns to perform structured memory operations, including adding, updating, deleting, or taking no operation on memory entries; and an Answer Agent that selects the most relevant entries and reasons over them to produce an answer. Both agents are fine-tuned with outcome-driven RL (PPO and GRPO), enabling adaptive memory management and utilization with minimal supervision. With as few as 152 question-answer pairs and a corresponding temporal memory bank for training, Memory-R1 outperforms the strongest existing baseline and demonstrates strong generalization across diverse question types and LLM backbones. Beyond presenting an effective approach, this work provides insights into how RL can unlock more agentic, memory-aware behavior in LLMs, pointing toward richer, more persistent reasoning systems.

Summary

  • The paper introduces an RL framework in which a Memory Manager and an Answer Agent collaboratively optimize memory operations to boost answer accuracy.
  • It employs PPO and GRPO to fine-tune agents for CRUD-style memory updates and selective distillation of relevant dialogue context.
  • The model outperforms baselines on LOCOMO with substantial gains in F1, BLEU-1, and semantic correctness, even with minimal training data.

Memory-R1: Reinforcement Learning for Memory Management in LLM Agents

Introduction

Memory-R1 presents a reinforcement learning (RL) framework for augmenting LLM agents with adaptive, structured memory management and utilization capabilities. The stateless nature of LLMs, constrained by finite context windows, limits their ability to perform long-horizon reasoning and maintain persistent knowledge across multi-session dialogues. Existing approaches typically rely on static, heuristic-driven memory pipelines, which are suboptimal for dynamic, evolving conversational contexts. Memory-R1 addresses these limitations by introducing two RL-fine-tuned agents: a Memory Manager for CRUD-style memory operations and an Answer Agent for selective memory distillation and reasoning.

Methodology

Memory-R1 Architecture

Memory-R1 consists of two specialized components:

  • Memory Manager: Trained via RL (PPO or GRPO), this agent decides whether to ADD, UPDATE, DELETE, or NOOP for each new piece of information extracted from dialogue turns. The manager operates over a temporal memory bank, incrementally evolving the memory state to maximize downstream QA performance.
  • Answer Agent: Also RL-fine-tuned, this agent receives up to 60 candidate memories retrieved via RAG for each question. It applies a Memory Distillation policy to filter and select the most relevant entries, then generates the final answer conditioned on the distilled context.

Both agents are trained with outcome-driven rewards, using exact match between predicted and gold answers as the primary signal. The RL setup enables the agents to learn memory operations and utilization strategies that directly optimize for answer correctness, rather than relying on manually annotated intermediate supervision.

RL Fine-Tuning Procedures

  • PPO (Proximal Policy Optimization): Used for both agents, PPO stabilizes policy updates via a clipped surrogate objective, ensuring robust convergence. The reward is derived from the improvement in answer accuracy after memory operations.
  • GRPO (Group Relative Policy Optimization): An alternative to PPO, GRPO samples groups of candidate actions and computes relative advantages within the group, obviating the need for a learned value function and improving sample efficiency.
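GRPO's group-relative advantage can be sketched numerically. This illustrates the general technique (normalizing each sampled completion's reward against its group), not the paper's exact implementation:

```python
import statistics


def group_relative_advantages(rewards: list) -> list:
    """Compute A_i = (r_i - mean(r)) / std(r) within a sampled group.
    No learned value function (critic) is needed."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0] * len(rewards)  # all rewards equal: no learning signal
    return [(r - mean) / std for r in rewards]


# Four sampled memory operations for the same dialogue turn, scored by
# downstream answer exact match (1 = correct, 0 = incorrect).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # [1.0, -1.0, -1.0, 1.0]
```

Because the advantage is computed relative to sampled peers, GRPO avoids training a separate critic network, which is why it trains only the actor (see Resource Requirements below).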

The reward function for both agents is strictly outcome-based, defined as $R_{\text{answer}} = \mathrm{EM}(y_{\text{pred}}, y_{\text{gold}})$, where EM is the exact-match score.
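In code, this reward reduces to a string comparison after light normalization. The normalization below (lowercasing, stripping whitespace and punctuation) is an assumption; the paper does not specify its exact matching rules:

```python
import string


def em_reward(pred: str, gold: str) -> float:
    """Binary exact-match reward: 1.0 iff the normalized prediction
    equals the normalized gold answer, else 0.0."""
    def normalize(s: str) -> str:
        s = s.lower().strip()
        return s.translate(str.maketrans("", "", string.punctuation))
    return 1.0 if normalize(pred) == normalize(gold) else 0.0


print(em_reward("Golden Retriever.", "golden retriever"))  # 1.0
print(em_reward("a labrador", "golden retriever"))         # 0.0
```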

Data Construction

Training data is constructed from the LOCOMO benchmark, which features multi-turn, multi-session dialogues and associated QA pairs. For the Memory Manager, each training tuple consists of a dialogue turn, a temporal memory bank (preceding 50 turns), and QA pairs. For the Answer Agent, each tuple includes a question, 60 retrieved candidate memories, and the gold answer.

Experimental Results

Benchmarking and Metrics

Memory-R1 is evaluated on the LOCOMO benchmark using LLaMA-3.1-8B-Instruct and Qwen-2.5-7B-Instruct backbones. Metrics include token-level F1, BLEU-1, and LLM-as-a-Judge (semantic correctness). Baselines include LOCOMO, Zep, A-Mem, LangMem, and Mem0, all re-implemented for consistency.

Main Findings

  • Performance: Memory-R1-GRPO achieves an overall F1 of 45.02, BLEU-1 of 37.51, and LLM-as-a-Judge of 62.74 on LLaMA-3.1-8B, outperforming Mem0 by 68.9% (F1), 48.3% (BLEU-1), and 37.1% (Judge). Similar gains are observed on Qwen-2.5-7B.
  • Data Efficiency: Strong generalization is achieved with as few as 152 training QA pairs, demonstrating high sample efficiency.
  • Component Analysis: RL fine-tuning of both Memory Manager and Answer Agent yields substantial improvements over vanilla LLMs. Memory Distillation further enhances answer accuracy by filtering out irrelevant context.
  • Policy Comparison: GRPO converges faster than PPO but both reach comparable final performance.

Ablation and Case Studies

  • RL-trained Memory Manager consolidates overlapping or complementary information via UPDATE operations, avoiding fragmentation and loss of context observed in vanilla managers.
  • RL-trained Answer Agent with Memory Distillation reliably selects relevant memories, improving factual accuracy and robustness to distractors.
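The distillation step described above amounts to filtering retrieved candidates before answering. The toy stand-in below uses word overlap as the relevance score purely to illustrate the interface; in the paper, this selection is made by the RL-trained Answer Agent itself, not a heuristic, and `distill_memories` is a hypothetical name:

```python
def distill_memories(question: str, candidates: list, top_k: int = 3) -> list:
    """Toy stand-in for Memory Distillation: score each retrieved memory
    by word overlap with the question and keep the top_k most relevant."""
    q_words = set(question.lower().split())
    scored = sorted(
        candidates,
        key=lambda m: len(q_words & set(m.lower().split())),
        reverse=True,
    )
    return scored[:top_k]


candidates = [
    "Alice adopted a dog named Rex",
    "Bob moved to Berlin",
    "Rex the dog is a golden retriever",
]
print(distill_memories("What breed is Alice's dog Rex?", candidates, top_k=2))
```

The key design point is that distillation shrinks the 60 retrieved candidates to a small, relevant subset before answer generation, reducing the distractor context the Answer Agent must reason over.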

Implementation Considerations

Resource Requirements

  • Training is performed on 4×H100 GPUs (80GB each), with batch size 128 and micro-batch size 2 per GPU.
  • Maximum prompt and response lengths are set to 4096 and 2048 tokens, respectively.
  • PPO requires actor and critic networks; GRPO only trains the actor.

Deployment Strategies

  • RL fine-tuning can be performed with minimal supervision, making Memory-R1 suitable for real-world applications with limited labeled data.
  • The modular architecture allows integration with various LLM backbones and memory retrieval systems.

Limitations

  • The outcome-based reward design may not capture nuanced memory relevance in cases where answer correctness is insufficiently sensitive to memory operations.
  • Scaling to extremely large memory banks may require further optimization of retrieval and distillation mechanisms.

Implications and Future Directions

Memory-R1 demonstrates that RL is an effective paradigm for teaching LLM agents adaptive memory management and utilization, enabling persistent, long-horizon reasoning. The framework sets a new state of the art on LOCOMO and generalizes across model architectures. Future research may explore:

  • Compositional memory architectures for hierarchical or multi-modal memory.
  • Integration with lifelong learning and continual adaptation.
  • More sophisticated reward functions incorporating intermediate reasoning steps or human feedback.
  • Scaling to open-domain, multi-agent environments.

Conclusion

Memory-R1 establishes RL as a principled approach for equipping LLM agents with agentic, memory-aware behavior. By jointly optimizing memory operations and answer generation, the framework achieves substantial gains in long-term conversational reasoning with minimal supervision. The results highlight the potential of RL for advancing persistent knowledge retention and adaptive reasoning in LLM-based systems.


Knowledge Gaps

Below is a concise list of what remains missing, uncertain, or unexplored, based on the paper’s methods, experiments, and claims:

  • External validity beyond LOCOMO
    • Only one benchmark (LOCOMO) and 10 dialogues were used; the method’s generalization to other long-horizon dialogue corpora, task settings (e.g., customer support logs, programming sessions), or non-dialogue memory tasks is untested.
    • The adversarial subset of LOCOMO was excluded; robustness to adversarial memory traces or misleading context remains unknown.
    • No evaluation on out-of-domain shifts (different styles, topics, user populations) or drastically longer histories than ~26k tokens.
  • Data efficiency and scaling behavior
    • The “152 QA pairs” claim lacks learning curves; it is unclear how performance scales with more/less data, or where diminishing returns begin.
    • Limited seed variability analysis (3 runs); sensitivity to initialization, sampling temperature during RL, and dataset composition is not quantified.
    • No ablation on curriculum or joint vs. staged training order (e.g., MM first vs. AA first vs. interleaved).
  • Memory extraction and update mechanics
    • “LLMExtract” is a black box: its prompt, quality, and error profile are not evaluated; downstream effects of extraction noise are unknown.
    • The “Merge(mj, fi)”/UPDATE operation is underspecified: how conflicting/ambiguous facts are merged, how granularity is chosen, and how contradictions are resolved are not defined or evaluated.
    • No provenance tracking or auditability: memory entries are not linked to source spans/turn IDs, making it hard to verify that updates reflect the dialogue faithfully.
  • Potential reward hacking and faithfulness
    • Memory Manager (MM) is rewarded only via downstream Answer Agent (AA) exact match; the MM could write “convenient” but unfaithful content that yields correct answers. There is no constraint/reward on factual consistency with the source dialogue.
    • No safeguards against catastrophic deletions or irreversible edits; lack of versioning/rollback or confidence-triggered NOOP policies.
  • Retrieval and distillation limitations
    • Retrieval is fixed to top-60 similarity-based RAG; the retriever model is unspecified and not learned/tuned. Effects of different k, embeddings, and indexing strategies are unstudied.
    • Memory Distillation is described functionally but lacks diagnostics: no precision/recall of selected memories, ablations on number of distilled entries, or interpretability analyses of selection decisions.
    • No exploration of learned retrievers or end-to-end training where retrieval, selection, and answering co-adapt.
  • Reward design and optimization
    • Rewards are based solely on exact match (EM), which is brittle for paraphrased/long-form answers and can bias toward short outputs; alternative or composite rewards (semantic similarity, LLM-judge, faithfulness to retrieved spans) are not investigated.
    • Extremely delayed credit assignment: many MM steps precede a single QA reward. There is no analysis of credit assignment challenges, temporal difference baselines, or auxiliary shaping (e.g., local consistency rewards).
    • PPO/GRPO sensitivity (clip ranges, KL penalties, group size G, temperature) and stability under hyperparameter changes are not reported.
  • Joint training and non-stationarity
    • The AA is frozen while training the MM; conversely, MM is fixed when training AA. It is unclear whether joint or alternating training could yield better equilibria or suffer from instability, and how to mitigate non-stationarity if co-trained.
  • Operator set and memory structure
    • Operations limited to {ADD, UPDATE, DELETE, NOOP}; no SEARCH/linking operator, no support for multi-entry updates, entity normalization, or structured schemas (e.g., temporally scoped slots, knowledge graphs).
    • No study of multi-granularity memories (events vs. summaries vs. entities), nor policies for when to compress/summarize vs. retain raw facts.
  • Scalability, efficiency, and system costs
    • Growth dynamics and steady-state size of the memory bank are not measured; no latency or throughput characterization for retrieval, distillation, and RL inference.
    • Compute cost of RL (4×H100) is substantial; no exploration of lighter adapters, off-policy reuse, or distillation of RL policies to smaller inference-time models.
  • Robustness, safety, and privacy
    • No analysis of robustness to noisy, contradictory, or malicious inputs (memory poisoning, prompt injection into stored memories).
    • Privacy and data governance are unaddressed: policies for retention, redaction, selective forgetting, and compliance (e.g., GDPR “right to be forgotten”) are missing.
    • No uncertainty calibration or “abstain/ask-for-clarification” behavior when memory is insufficient or conflicting.
  • Evaluation methodology
    • Heavy reliance on automatic metrics (F1, BLEU-1) and an LLM-as-a-judge; no human evaluation or validation of the judge’s reliability and correlation with human judgments.
    • Baselines were re-implemented; fairness and parity of tuning with their original settings are not audited. Sensitivity to prompt choices across methods is unreported.
  • Modalities and tools
    • The approach is text-only; applicability to multimodal memories (images/screenshots/audio) is untested.
    • Interactions with external tools/knowledge bases (beyond RAG) and how memory policies should integrate with tool-use policies are unexplored.
  • Temporal reasoning and consistency
    • Despite temporal tasks in LOCOMO, there is no explicit modeling of time in memory schemas (timestamps, validity intervals, decay), nor evaluations of temporal consistency after multiple updates.
  • Interpretability and analysis
    • Limited qualitative error analysis; no taxonomy of failure modes (e.g., over-deletion, spurious updates, missed merges).
    • No transparency mechanisms for why the MM chose an operation or why AA selected certain memories; lack of explanations hinders debugging and trust.
  • Reproducibility and artifacts
    • Code, prompts, and trained checkpoints are not clearly released; exact retriever, embedding models, and index settings are unspecified, which hampers replication.

These gaps suggest concrete next steps: introduce provenance-aware, versioned memory with factuality rewards; learn the retriever jointly with distillation and answering; develop composite, semantic and faithfulness-aware rewards; explore joint MM–AA training; add temporal schemas and multi-granularity memories; measure system-level efficiency; stress-test robustness and privacy; and validate findings across diverse datasets, modalities, and human evaluations.
