
Memento: Fine-tuning LLM Agents without Fine-tuning LLMs

Published 22 Aug 2025 in cs.LG and cs.CL | (2508.16153v2)

Abstract: In this paper, we introduce a novel learning paradigm for Adaptive LLM agents that eliminates the need for fine-tuning the underlying LLMs. Existing approaches are often either rigid, relying on static, handcrafted reflection workflows, or computationally intensive, requiring gradient updates of LLM model parameters. In contrast, our method enables low-cost continual adaptation via memory-based online reinforcement learning. We formalise this as a Memory-augmented Markov Decision Process (M-MDP), equipped with a neural case-selection policy to guide action decisions. Past experiences are stored in an episodic memory, either differentiable or non-parametric. The policy is continually updated based on environmental feedback through a memory rewriting mechanism, whereas policy improvement is achieved through efficient memory reading (retrieval). We instantiate our agent model in the deep research setting, namely \emph{Memento}, which attains top-1 on GAIA validation ($87.88\%$ Pass@$3$) and $79.40\%$ on the test set. It reaches $66.6\%$ F1 and $80.4\%$ PM on the DeepResearcher dataset, outperforming the state-of-the-art training-based method, while case-based memory adds $4.7\%$ to $9.6\%$ absolute points on out-of-distribution tasks. Our approach offers a scalable and efficient pathway for developing generalist LLM agents capable of continuous, real-time learning without gradient updates, advancing machine learning towards open-ended skill acquisition and deep research scenarios. The code is available at https://github.com/Agent-on-the-Fly/Memento.

Summary

  • The paper introduces Memento, a memory-augmented framework that enables continual adaptation of LLM agents without fine-tuning the underlying LLMs.
  • It leverages a memory-based Markov Decision Process with both non-parametric and parametric retrieval to integrate case-based reasoning for dynamic tool use and multi-step planning.
  • Empirical results show state-of-the-art performance across benchmarks, with significant improvements in long-horizon research tasks and out-of-distribution generalization.

Memory-Augmented Continual Adaptation for LLM Agents: The Memento Framework

Introduction and Motivation

The paper introduces Memento, a learning paradigm for LLM-based agents that enables continual adaptation without fine-tuning the underlying LLM parameters. The motivation stems from the limitations of current LLM agent paradigms: static, workflow-based systems lack flexibility, while parameter fine-tuning approaches are computationally expensive and impractical for real-time, open-ended adaptation. Memento addresses this by leveraging external, episodic memory and case-based reasoning (CBR), formalized as a memory-augmented Markov Decision Process (M-MDP), to enable agents to learn from experience in a non-parametric, scalable manner.

Memory-Based Markov Decision Process and Case-Based Reasoning

Memento formalizes the agent's decision process as an M-MDP, extending the standard MDP tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ with a memory space $\mathcal{M}$ that stores episodic trajectories. At each timestep, the agent retrieves a relevant case from memory using a learned retrieval policy $\mu$, adapts the retrieved solution via the LLM, executes the action, and appends the new experience to memory. This process is governed by the policy:

$$\pi(a|s,M) = \sum_{c\in M} \mu(c|s,M)\, p_{\text{LLM}}(a|s,c)$$

where $M$ is the case bank, $c$ is a case, and $p_{\text{LLM}}$ is the LLM's action likelihood conditioned on the current state and retrieved case.

Figure 1: A graphical model of memory-based Markov Decision Process.

The retrieval policy $\mu$ is optimized via maximum-entropy RL (soft Q-learning), encouraging both exploitation of high-utility cases and exploration/diversity in retrieval. The Q-function $Q(s, M, c)$ estimates the expected return of selecting case $c$ in state $s$ with memory $M$, and the optimal retrieval policy is a softmax over Q-values:

$$\mu^*(c|s,M) = \frac{\exp(Q^*(s,M,c)/\alpha)}{\sum_{c' \in M} \exp(Q^*(s,M,c')/\alpha)}$$

To address the challenge of high-dimensional, natural language state and case spaces, the Q-function can be approximated via kernel-based episodic control or a neural network, depending on the memory variant.
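The softmax retrieval rule above can be sketched in a few lines, assuming the Q-values for the cases currently in memory have already been estimated; the function names are illustrative, not taken from the paper's code:

```python
import math
import random

def retrieval_policy(q_values, alpha=1.0):
    """Boltzmann (softmax) policy over case Q-values, mirroring
    mu*(c|s,M) = exp(Q/alpha) / sum_c' exp(Q'/alpha)."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(q_values)
    exp_q = [math.exp((q - m) / alpha) for q in q_values]
    z = sum(exp_q)
    return [e / z for e in exp_q]

def sample_case(cases, q_values, alpha=1.0, rng=random):
    """Sample one case from the bank according to the softmax policy."""
    probs = retrieval_policy(q_values, alpha)
    return rng.choices(cases, weights=probs, k=1)[0]
```

Lowering the temperature $\alpha$ sharpens the distribution toward the highest-utility case; raising it encourages the exploration/diversity the maximum-entropy objective rewards.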

Planner–Executor Architecture and Memory Management

Memento is instantiated as a planner–executor framework. The planner is an LLM-based CBR agent that alternates between retrieving relevant cases from memory (Read) and recording new experiences (Write), with the retrieval policy either similarity-based (non-parametric) or Q-function-based (parametric). The executor is an LLM-based client that invokes external tools via the Model Context Protocol (MCP), enabling compositional tool use and dynamic reasoning.

Figure 2: The architecture of Memento with parametric memory, alternating between Case-Based Planning and Tool-Based Execution.

The memory module supports both non-parametric (vectorized similarity search) and parametric (Q-function) retrieval. In the non-parametric setting, retrieval is based on cosine similarity between the current state and stored cases. In the parametric setting, the Q-function is trained online (using cross-entropy loss for binary rewards) to predict the utility of each case, and retrieval is performed by selecting the top-K cases with the highest Q-values.
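A minimal sketch of the non-parametric variant follows, assuming case embeddings are produced by some external encoder; `CaseBank` and its method names are illustrative, not the paper's API:

```python
import math

class CaseBank:
    """Sketch of the non-parametric case memory: Write appends
    (embedding, state, plan, reward) cases; Read returns the top-K
    cases by cosine similarity to the query embedding."""

    def __init__(self):
        self.cases = []  # list of (embedding, state, plan, reward)

    def write(self, embedding, state, plan, reward):
        self.cases.append((embedding, state, plan, reward))

    @staticmethod
    def _cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    def read(self, query_embedding, k=4):
        # K=4 default reflects the paper's reported optimal case count.
        ranked = sorted(self.cases,
                        key=lambda c: self._cosine(query_embedding, c[0]),
                        reverse=True)
        return ranked[:k]
```

In the parametric variant, the cosine ranking in `read` would be replaced by scores from the online-trained Q-function.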

Tool Integration and Deep Research Scenarios

Memento is designed for deep research tasks requiring long-horizon planning, multi-step tool use, and reasoning over heterogeneous data. The MCP-based executor supports a suite of tools for web search, crawling, multimodal document processing, code execution, and mathematical computation. This enables the agent to acquire, process, and reason over external information in real time, supporting complex research workflows.

Empirical Evaluation

Memento is evaluated on four benchmarks: GAIA (long-horizon tool use), DeepResearcher (real-time web research), SimpleQA (factual precision), and HLE (long-tail academic reasoning). The agent achieves:

  • GAIA: 87.88% Pass@3 on validation and 79.40% on the test set, outperforming all open-source agent frameworks.

Figure 3: Memento vs. Baselines on GAIA validation and test sets.

  • DeepResearcher: 66.6% F1 and 80.4% PM, surpassing state-of-the-art training-based systems.
  • SimpleQA: 95.0% accuracy, establishing a new state-of-the-art for factual reliability.
  • HLE: 24.4% PM, ranking second overall and outperforming several strong baselines.

Figure 4: Performance on SimpleQA and HLE, demonstrating Memento's superiority in factual and academic reasoning tasks.

Ablation studies show that both parametric and non-parametric CBR yield consistent, additive improvements across all benchmarks. Notably, case-based memory provides 4.7% to 9.6% absolute gains on out-of-distribution tasks, highlighting its role in generalization.

Continual Learning and Memory Efficiency

Memento demonstrates continual learning capability: as the case bank grows, performance improves over successive iterations, with rapid convergence observed after a few iterations due to the finite environment. The optimal number of retrieved cases is small (K=4), as larger K introduces noise and computational overhead without further gains.

The system is efficient in terms of output token usage, with most computational cost arising from integrating multi-step tool outputs as task complexity increases. The architecture is robust to hallucination and maintains concise, structured planning, with fast planners outperforming slow, deliberative ones in modular settings.

Figure 5: The average number of each task type per level, highlighting the dominance of code, search, and crawl tasks as difficulty level increases.

Theoretical and Practical Implications

Memento provides a principled framework for continual, real-time adaptation of LLM agents without gradient updates. By decoupling agent learning from LLM parameter updates, it enables scalable, low-cost deployment in open-ended environments. The memory-augmented MDP formalism and CBR policy optimization bridge cognitive science and RL, offering a pathway for agents to accumulate, reuse, and generalize from experience.

The strong empirical results, especially on OOD tasks, challenge the assumption that parameter fine-tuning is necessary for agent adaptation. Instead, memory-based approaches can yield comparable or superior performance with greater efficiency and flexibility.

Future Directions

Potential extensions include:

  • Scaling to larger, more diverse memory banks with advanced curation and forgetting mechanisms to mitigate retrieval swamping.
  • Integrating richer forms of memory (e.g., semantic, procedural) and more sophisticated retrieval policies.
  • Applying the framework to multi-agent and collaborative research scenarios.
  • Exploring hybrid approaches that combine memory-based adaptation with lightweight parameter-efficient fine-tuning.

Conclusion

Memento demonstrates that memory-augmented, case-based reasoning enables LLM agents to achieve continual, real-time adaptation without fine-tuning LLM parameters. The framework achieves state-of-the-art results across multiple challenging benchmarks, with strong generalization and efficiency. These findings suggest that external memory and CBR are critical components for scalable, generalist LLM agents, and motivate further research into memory-based agent architectures for open-ended AI.

Explain it Like I'm 14

What this paper is about

This paper introduces Memento, a way to make AI agents (powered by LLMs) get better over time without retraining or changing the AI’s internal brain. Instead of “rewiring” the model, Memento helps the agent learn the way many students do: by keeping a smart notebook of past problems and solutions, and looking up similar cases when facing a new task.

The big questions the paper asks

  • How can we build AI agents that keep improving as they work in the real world, without the high cost and delay of retraining big AI models?
  • Can an agent learn from its own experiences—both successes and mistakes—by storing and reusing them like a case library?
  • Will this memory-first approach work on tough, long tasks that require planning, web browsing, tool use, and combining information from different places?

How the method works, in simple terms

The key idea: Learn from memory, not by retraining

  • Traditional methods often “fine-tune” the LLM’s parameters (its internal settings), which is slow, expensive, and risky (it can forget older skills).
  • Memento keeps the LLM frozen. Learning happens outside the model, in a growing memory of past “cases” (problem, plan, outcome).
  • When a new problem comes up, the agent searches its memory for similar cases, uses them as examples, and adapts the plan to the new situation.

Think of it like a student’s binder:

  • Each past assignment is saved with the question (state), the solution plan (action), and whether it worked (reward).
  • For a new assignment, the student flips through the binder to find similar problems and uses them as a guide.

The agent’s two-part design: Planner + Executor

  • Planner: Decides what to do next (breaks a big task into subtasks) and consults the memory to find helpful past cases.
  • Executor: Carries out each subtask using tools (like web search, code execution, file readers, image/video analysis) through a standard interface (MCP), then reports results back.

They take turns: the planner proposes, the executor tries, the planner reviews, and so on—until the task is done.
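That back-and-forth can be sketched as a simple loop; the planner and executor below are stubs with hypothetical names (a real system would call an LLM for planning and invoke tools via MCP):

```python
def plan(task, results):
    """Hypothetical planner stub: returns the next unfinished subtask,
    or None when the task is done."""
    remaining = [t for t in task["subtasks"] if t not in results]
    return remaining[0] if remaining else None

def execute(subtask):
    """Hypothetical executor stub: would invoke a tool (search, code,
    file reader, ...) and return its output."""
    return f"result-of-{subtask}"

def run(task):
    """Planner and executor take turns until the planner has no next step."""
    results = {}
    while (subtask := plan(task, results)) is not None:
        results[subtask] = execute(subtask)
    return results
```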

The case memory: Write and Read

  • Write: After finishing a task, the agent saves a case: what the problem was, what plan it tried, and whether it worked.
  • Read: For a new task, the agent fetches the most relevant past cases to guide planning.

There are two ways to Read:

  1. Non-parametric (simple and fast): Find cases whose problems are textually/semantically similar (like searching by meaning).
  2. Parametric (smarter ranking): Learn a small scoring function (not the LLM) that predicts which past case will help most now, based on past rewards. This doesn’t retrain the big LLM; it just trains a lightweight scorer to prefer better cases.

How it “learns which memories to trust”

The agent uses a gentle form of reinforcement learning to improve which cases it picks from memory. You can think of it like a music app learning your taste:

  • It tries recommending different “songs” (cases),
  • Sees if the “song” led to a good result,
  • And updates a score so good “songs” are more likely next time.

This is done with a method related to “soft Q-learning,” which encourages trying diverse options while still favoring what works.
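The music-app analogy can be written as a tiny loop, assuming each attempt yields a binary reward; this is an illustrative simplification, not the paper's exact update rule:

```python
import math
import random

def pick(scores, alpha=0.5, rng=random):
    """Softly prefer higher-scoring cases while still exploring others."""
    weights = [math.exp(s / alpha) for s in scores]
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]

def update(scores, chosen, reward, lr=0.3):
    """Nudge the chosen case's score toward the observed reward (0 or 1)."""
    scores[chosen] += lr * (reward - scores[chosen])
```

Over many tasks, cases that keep leading to good outcomes accumulate higher scores and are retrieved more often, while the soft (probabilistic) pick still gives other cases a chance.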

What the experiments show

The team tested Memento on four kinds of tough benchmarks that need planning, tool use, and real-time web research:

  • GAIA: complex, multi-step tasks that often require browsing and multiple tools.
  • DeepResearcher: open-domain web research across several QA datasets.
  • SimpleQA: concise, factual questions.
  • HLE (Humanity’s Last Exam): challenging academic reasoning across many subjects.

Main results (high-level takeaways):

  • Top-tier performance on GAIA: Memento ranked at the top of the validation set (about 87.9% Pass@3) and scored 79.4% on the private test set.
  • Strong open-web research: On the DeepResearcher suite, Memento reached around 66.6% F1 and 80.4% Partial Match, beating training-heavy systems.
  • Very high factual accuracy: About 95% Partial Match on SimpleQA.
  • Better generalization: Adding the case-based memory boosted results by roughly 4.7 to 9.6 percentage points on out-of-distribution tasks (harder, unusual cases).

Why this matters:

  • These are long, real-world tasks that need many steps, tool calls, and judgment. Memento shows you can get high performance without retraining the LLM—just by learning how to use memory well.

Why this is important

  • Cheaper, faster learning: No need to fine-tune huge models. The agent can improve during use by saving and reusing experiences.
  • Human-like strategy: It mirrors how people learn—store examples, recall similar situations, and adapt.
  • Safer updates: Because the core model stays fixed, you avoid the risk of “forgetting” old skills or drifting behavior after retraining.
  • Scalable and flexible: Works across domains (web research, tools, files, images, videos), and keeps adapting in real time.
  • Path to generalist agents: This memory-first design is a practical step toward agents that can keep learning new skills and handle open-ended tasks in the wild.

Key terms made simple

  • Case-Based Reasoning (CBR): Solving new problems by recalling and adapting solutions to similar past problems.
  • Memory (Case Bank): A growing library of problem–plan–result examples the agent can search.
  • Non-parametric vs. Parametric memory:
    • Non-parametric: Simple similarity search (like “find the closest match”).
    • Parametric: A small learned scorer ranks which cases are likely to help most (no big-model retraining).
  • Planner–Executor loop: The planner decides; the executor does; then the planner updates the plan based on results, repeating until done.

In short: Memento shows that giving AI agents a smart, learnable memory can replace expensive retraining, letting them improve on the fly and tackle complex, real-world tasks more like people do.
