
Latent Memory Policy Optimization (LMPO)

Updated 11 February 2026
  • LMPO is a reinforcement learning algorithm that uses a lightweight, trainable memory composer to generate compact, role-aware latent memories for multi-agent systems.
  • It integrates a transformer-based memory composer with token-level PPO to propagate task-level rewards directly through agent-specific latent memory embeddings.
  • LMPO effectively tackles challenges like memory homogenization and token inflation, demonstrating significant empirical gains and efficient inference across diverse benchmarks.

Latent Memory Policy Optimization (LMPO) is a reinforcement learning algorithm designed to propagate task-level optimization signals through compact, role-aware latent memories in multi-agent systems (MAS) powered by LLMs. Introduced as the core optimization strategy in the LatentMem framework, LMPO addresses two persistent challenges in existing memory-augmented MAS: memory homogenization—where all agents share undifferentiated context and lose role specificity—and information overload—where token costs balloon due to unstructured or overly granular historical storage. By coupling a lightweight, trainable memory composer with policy optimization over latent memories, LMPO enables agents to retain high-utility, agent-specific context in an efficient, end-to-end differentiable manner (Fu et al., 3 Feb 2026).

1. Motivation and Integration in LatentMem

LLM-based MAS frameworks typically struggle with (i) homogenized memory that erases distinctions between agent roles, and (ii) context window saturation due to excessive tokenization of agent experiences. LatentMem counters both by bifurcating the memory subsystem into:

  • An experience bank $\mathcal{B}$ that stores raw trajectories without incurring token inflation.
  • A transformer-based memory composer $\mathcal{C}_\phi$ that synthesizes a fixed-length latent memory embedding $m_j = \sigma_\phi(\gamma_{\alpha_j}, \mathcal{T}_q) \in \mathbb{R}^{L' \times D}$ for each agent, conditioned on its role profile $\gamma_{\alpha_j}$ and a small retrieved trajectory set $\mathcal{T}_q$.

These latent tokens are appended to the hidden states of each frozen LLM policy $\pi_{\theta_k}$, allowing the policies to consume retentive, role-sensitive context within fixed token budgets. LMPO governs the learning of the composer parameters $\phi$, supplying gradients from task-level rewards directly through latent memories.
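To make the injection mechanism concrete, here is a minimal numpy sketch. The dimensions are toy-sized, a random linear map stands in for the transformer composer, and all names are illustrative rather than the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions only (the paper uses L' = 8 latent tokens, D = 4096).
L_PRIME, D = 8, 64

def compose_memory(role_profile, retrieved_traj, W):
    """Toy stand-in for the composer: pool the role profile and the
    retrieved-trajectory embeddings, then project to L' latent tokens."""
    pooled = np.concatenate([role_profile, retrieved_traj.mean(axis=0)])  # (2D,)
    return np.tanh(pooled @ W).reshape(L_PRIME, D)                        # (L', D)

W = rng.normal(scale=0.02, size=(2 * D, L_PRIME * D))  # trainable phi (random here)
role = rng.normal(size=D)                              # role profile gamma_k
traj = rng.normal(size=(5, D))                         # retrieved trajectory set T_q

m_j = compose_memory(role, traj, W)

# Injection: the latent tokens are prepended to the agent's hidden-state
# sequence, so the frozen policy conditions on them within a fixed budget.
prompt_hidden = rng.normal(size=(20, D))               # hidden states of prompt p_j
conditioned = np.concatenate([m_j, prompt_hidden], axis=0)
print(conditioned.shape)   # (28, 64): 8 latent tokens + 20 prompt tokens
```

The key property is that `m_j` has a fixed shape regardless of how much raw history sits in the experience bank, which is what bounds the token budget.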

2. Formal RL Objective and Optimization Structure

The LMPO objective is formulated as a token-level, PPO-style surrogate tailored to the unique structure of the memory-injected policy. For a policy rollout trajectory

$$\tau = \{(\alpha_j, p_j, o_j)\}_{j=1}^{H},$$

the probability of $\tau$ under composer parameters $\phi$ is

$$\mathbb{P}_\phi(\tau \mid q, \mathcal{T}_q) = \prod_{j=1}^{H} \prod_{t=1}^{T_j} \tilde{\pi}_{\theta_{\alpha_j}}\!\left(o^{(t)}_j \,\middle|\, p_j, o^{(<t)}_j, m_j(\phi)\right),$$

where $m_j(\phi)$ is produced by the composer and $\tilde{\pi}_\theta$ denotes the wrapped base policy.
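The factorization above means the trajectory log-likelihood is simply a double sum of per-token log-probabilities over steps and tokens, as the following toy check illustrates (the probability values are invented for illustration):

```python
import numpy as np

# Hypothetical per-token probabilities pi(o_j^(t) | p_j, o_j^(<t), m_j(phi))
# for a trajectory with H = 2 steps.
token_probs = [
    np.array([0.9, 0.8, 0.7]),   # step j = 1: T_1 = 3 output tokens
    np.array([0.6, 0.95]),       # step j = 2: T_2 = 2 output tokens
]

# The double product over steps and tokens becomes a double sum in log space.
log_p_tau = sum(np.log(p).sum() for p in token_probs)
p_tau = float(np.exp(log_p_tau))
print(p_tau)   # equals 0.9 * 0.8 * 0.7 * 0.6 * 0.95 (up to float error)
```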

To compare trajectories within a mini-batch, a group-based advantage is computed:

$$\hat A_i = \frac{R(\hat\tau_i) - \frac{1}{G} \sum_{k=1}^{G} R(\hat\tau_k)}{\operatorname{std}\{R(\hat\tau_k)\} + \epsilon},$$

with $G$ the group size, i.e., the number of rollouts sampled per query.
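The group-based advantage is a per-query standardization of rewards, a minimal sketch of which is:

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Group-based advantage: standardize the rewards of the G rollouts
    sampled for the same query (eps guards against zero variance)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four rollouts of one query with binary task rewards.
adv = group_advantages([1.0, 0.0, 1.0, 0.0])
print(adv)   # [ 1. -1.  1. -1.] up to the eps term
```

Note that the advantages always sum to zero within a group, so only relative trajectory quality drives the update.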

The LMPO surrogate loss leverages PPO-style token-wise clipping:

$$\mathcal{J}_{\text{LMPO}}(\phi) = \mathbb{E}_{q, \mathcal{T}_q}\!\left[ \frac{1}{\sum_i \sum_j T_{i,j}} \sum_{i=1}^{G} \sum_{j=1}^{H_i} \sum_{t=1}^{T_{i,j}} \mathcal{L}_{i,j,t}(\phi) \right],$$

with

$$r_{i,j,t}(\phi) = \frac{\tilde{\pi}_\theta\!\left(o_{i,j}^{(t)} \mid \dots, m_j(\phi)\right)}{\tilde{\pi}_\theta\!\left(o_{i,j}^{(t)} \mid \dots, m_j(\phi_{\text{old}})\right)},$$

and PPO clip loss

$$\mathcal{L}_{i,j,t}(\phi) = \min\left\{ r_{i,j,t}(\phi)\,\hat A_i,\ \operatorname{clip}\!\left(r_{i,j,t}(\phi),\, 1-\varepsilon,\, 1+\varepsilon\right) \hat A_i \right\}.$$

Commonly used settings include a clip range $\varepsilon = 0.2$ and no additional KL penalty.
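The clipped surrogate for a single token can be sketched in a few lines (function name and scalar inputs are illustrative):

```python
import numpy as np

def lmpo_token_loss(ratio, advantage, eps=0.2):
    """PPO-style token-wise clipped surrogate: the minimum of the unclipped
    and clipped ratio-weighted advantage (an objective to be maximized)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    return np.minimum(unclipped, clipped)

# A large ratio with positive advantage is capped at (1 + eps) * A ...
print(lmpo_token_loss(1.5, 1.0))    # 1.2
# ... and a small ratio with negative advantage is floored at (1 - eps) * A.
print(lmpo_token_loss(0.5, -1.0))   # -0.8
```

The min with the clipped term removes the incentive to move the memory-conditioned policy more than $\varepsilon$ away from its old ratio in a single update.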

3. Reinforcement Learning Signal Propagation via Latent Memory

Because $m_j(\phi)$ is injected directly as additional tokens into each agent's policy, the log-likelihood of every output token, and hence the entire trajectory's likelihood, becomes a direct function of the composer parameters $\phi$. The RL signal propagates according to:

$$\nabla_\phi \mathcal{J}_{\text{LMPO}} = \mathbb{E}\!\left[ \sum_{i,j,t} \nabla_\phi \log \tilde{\pi}_\theta\!\left(o^{(t)}_{i,j} \mid \cdots, m_j(\phi)\right) \hat A_i\, w_{i,j,t} \right],$$

where $w_{i,j,t}$ embodies the PPO clipping logic.

This design lets task-level reward signals $R(\tau)$ flow through the latent memory, via gradients, back to $\phi$, allowing the composer to shape agent memories in response to collective task utility. The policy backbones remain frozen, focusing optimization strictly on the composition of compact, high-utility memories.

4. Training Loop and Algorithmic Workflow

The end-to-end LMPO training loop operates as follows:

Inputs: 
  – Experience bank 𝓑 = {τᵢ} (initial diverse trajectories)
  – Role profiles {γₖ}, frozen policies π_{θₖ}
  – Composer σ_φ, old parameters φ_old ← φ

Hyperparams: batch size B (queries), G rollouts/query, clip ε, LR α

Repeat for N training iterations:
  Sample B queries {q_b}
  For each query q:
    Retrieve top-K trajectories 𝒯_q from 𝓑 via cosine similarity
    For i = 1…G:
      Rollout τ̂_i ∼ P_{φ}(·|q,𝒯_q) with memory injection
      Compute reward R_i = R(τ̂_i)
    Compute mean μ, std σ of {R_i}, and advantages Â_i = (R_i−μ)/(σ+ϵ), with ϵ a small constant (distinct from the clip range ε)
    For each τ̂_i, each step j, token t:
      Compute r_{i,j,t}(φ) = π_new / π_old
      Compute ℒ_{i,j,t} = min(r * Â_i, clip(r,1±ε)*Â_i)
  Aggregate surrogate objective 𝓛 = mean_{b,i,j,t}[ℒ_{i,j,t}]
  φ ← φ − α · AdamW(∇_φ(−𝓛))   (ascends the surrogate objective)
  φ_old ← φ
End
Output: Trained composer parameters φ*
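As a sanity check on the signal flow of the loop above, the following self-contained Python toy collapses the frozen policies and the latent memory into a single Bernoulli "rollout" whose log-likelihood depends on trainable parameters φ through a retrieved feature vector. All names are illustrative; this is a sketch of the update logic, not the paper's implementation. Since φ_old equals φ at the start of each step, the PPO ratio is 1 and clipping is inactive in this toy:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

D, G, lr = 4, 16, 0.5
x = rng.normal(size=D)   # stands in for the retrieved context T_q
phi = np.zeros(D)        # trainable "composer" parameters

for _ in range(200):
    p = sigmoid(phi @ x)
    actions = (rng.random(G) < p).astype(float)  # G rollouts per query
    rewards = actions                            # task reward R(tau_hat_i)
    # Group-based advantages, exactly as in the loop above.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Policy-gradient step: for a Bernoulli policy, d log pi(a)/d phi = (a - p) x.
    grad = np.mean(adv * (actions - p)) * x
    phi += lr * grad                             # ascend the surrogate

# The rewarded action should now dominate.
print(round(float(sigmoid(phi @ x)), 3))
```

Because the advantages are standardized within each group, rollouts are only compared against their siblings for the same query, mirroring the group structure of the pseudocode.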

Key implementation choices: the underlying LLM policies are never modified, role profiles remain fixed, and the only trainable component is the composer, yielding a parameter-efficient and focused optimization footprint.

5. Model Components and Hyperparameter Specification

The principal components and associated hyperparameters are:

| Module | Architecture/Setting | Notable Hyperparameters |
|---|---|---|
| Experience Bank | MiniLM-L6-v2 encoder (queries/trajectories) | – |
| Memory Composer $\sigma_\phi$ | 4-layer transformer decoder, hidden dim $D$ | $L' = 8$ tokens, $D = 4096$, 16 heads, LoRA on (q, v) projections, rank $r = 16$, $\alpha = 32$, dropout 0.1 |
| Retrieval | Cosine similarity; $K = 1$ by default | $K \in \{1, 3, 5\}$ in ablations |
| LMPO Training | AdamW, clip $\varepsilon = 0.2$, discount $\gamma = 1.0$ | LR 1e-5, macro batch 32, micro batch 8, grad norm 1.0, mixed precision / DeepSpeed / vLLM |

Initialization: the composer is seeded from the backbone LLM and refined by LoRA on attention projections, achieving parameter efficiency and rapid adaptation.
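The retrieval step reduces to a cosine-similarity top-K lookup over precomputed embeddings; a generic sketch (the toy 2-D vectors and the function name are illustrative, with MiniLM-style sentence embeddings assumed in practice):

```python
import numpy as np

def retrieve_top_k(query_emb, traj_embs, k=1):
    """Return indices of the k bank trajectories whose embeddings have the
    highest cosine similarity to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    t = traj_embs / np.linalg.norm(traj_embs, axis=1, keepdims=True)
    sims = t @ q                      # cosine similarities, shape (N,)
    return np.argsort(-sims)[:k]      # indices of the k most similar

# Toy 2-D "embeddings" for three stored trajectories.
bank = np.array([[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]])
print(retrieve_top_k(np.array([0.9, 0.1]), bank, k=2))   # [0 1]
```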

6. Empirical Performance and Benchmark Results

Benchmarking was carried out across both in-domain and distribution-shifted (out-of-domain) tasks, leveraging multiple mainstream MAS frameworks.

  • In-domain: TriviaQA, KodCode, StrategyQA, PopQA
  • Out-of-domain: BigCodeBench, PDDL
  • MAS frameworks: AutoGen, MacNet (training), CAMEL, DyLAN (testing)

Baselines included no-memory policies, Voyager, Generative, JoyAgent, MetaGPT, ChatDev, OAgents, G-Memory, and MARTI (direct multi-agent fine-tuning).

Key findings (Qwen3-4B backbone):

  • On AutoGen/TriviaQA: LMPO yields a +16.20% absolute gain over the no-memory baseline and +9–11% over the best memory-augmented baselines.
  • On MacNet/KodCode: +8.40% absolute improvement, attaining 78.90% accuracy.
  • Out-of-domain generalization: On PDDL, LMPO achieves +7.10%, in contrast with 2–4% drops for other methods.
  • Unseen MAS: +7.90% and +9.21% gains for CAMEL and DyLAN, respectively.
  • Efficiency: 50% fewer tokens and 2x faster inference compared to OAgents/G-Memory.
  • Against MARTI (direct multi-agent LLM fine-tuning), LMPO-composed memories boost TriviaQA by up to +11.73% under matched compute.

7. Limitations, Advantages, and Prospects

Principal advantages of LMPO include:

  • Role-aware customization by conditioning composer inputs on $\gamma$, directly addressing memory homogenization.
  • Fixed-length latent tokens ($L'$ of them) guarantee bounded token budgets, preventing context window saturation.
  • End-to-end differentiability allows task rewards to shape memory utility directly.
  • Strong empirical generalization to new domains and unseen MAS architectures.

Identified limitations:

  • Reliance on online RL rollouts and reward evaluation may induce sample inefficiency.
  • Composer architecture and LoRA rank require tuning to balance expressivity and overfitting.
  • Policies remain frozen; joint composer-policy co-adaptation remains an open extension.

Potential future directions include:

  • Hierarchical latent memories that operate at multiple abstraction layers.
  • Adaptively choosing the retrieval set size $K$ guided by uncertainty.
  • Sharing composers across agents to model inter-role memory dependencies.
  • Meta-learning strategies for rapid composer adaptation to unfamiliar MAS frameworks.

In summary, Latent Memory Policy Optimization is a token-level PPO variant exploiting the differentiability of latent memory to selectively train a small, role-conditioned composer. This procedure yields compact, agent-specific memories that provide substantive performance gains across a diversity of tasks and frameworks while controlling for token and compute cost (Fu et al., 3 Feb 2026).
