Latent Memory Policy Optimization (LMPO)
- LMPO is a reinforcement learning algorithm that uses a lightweight, trainable memory composer to generate compact, role-aware latent memories for multi-agent systems.
- It integrates a transformer-based memory composer with token-level PPO to propagate task-level rewards directly through agent-specific latent memory embeddings.
- LMPO effectively tackles challenges like memory homogenization and token inflation, demonstrating significant empirical gains and efficient inference across diverse benchmarks.
Latent Memory Policy Optimization (LMPO) is a reinforcement learning algorithm designed to propagate task-level optimization signals through compact, role-aware latent memories in multi-agent systems (MAS) powered by LLMs. Introduced as the core optimization strategy in the LatentMem framework, LMPO addresses two persistent challenges in existing memory-augmented MAS: memory homogenization—where all agents share undifferentiated context and lose role specificity—and information overload—where token costs balloon due to unstructured or overly granular historical storage. By coupling a lightweight, trainable memory composer with policy optimization over latent memories, LMPO enables agents to retain high-utility, agent-specific context in an efficient, end-to-end differentiable manner (Fu et al., 3 Feb 2026).
1. Motivation and Integration in LatentMem
LLM-based MAS frameworks typically struggle with (i) homogenized memory that erases distinctions between agent roles, and (ii) context window saturation due to excessive tokenization of agent experiences. LatentMem counters both by bifurcating the memory subsystem into:
- An experience bank that stores raw trajectories without incurring token inflation.
- A transformer-based memory composer that synthesizes a fixed-length latent memory embedding for each agent, conditioned on its role profile $\gamma_k$ and a small retrieved trajectory set $\mathcal{T}_q$.
These latent tokens are appended to the hidden states of each frozen LLM policy $\pi_{\theta_k}$, allowing the policies to consume retentive, role-sensitive context within fixed token budgets. LMPO governs the learning of the composer $\sigma_\phi$, supplying gradients from task-level rewards directly through latent memories; a minimal sketch of this mechanism follows.
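The sketch below is illustrative, not the paper's implementation: a from-scratch `nn.TransformerDecoder` stands in for the LLM-seeded composer described in Section 5, `forward_with_memory` assumes a Hugging Face-style causal LM that accepts `inputs_embeds`, and all dimensions are placeholders.

```python
import torch
import torch.nn as nn

class MemoryComposer(nn.Module):
    """Illustrative composer: learned latent queries cross-attend over the
    embedded role profile and retrieved trajectories, yielding a
    fixed-length latent memory. Dimensions are placeholders."""
    def __init__(self, d_model: int = 1024, n_latent: int = 8,
                 n_layers: int = 4, n_heads: int = 16):
        super().__init__()
        self.latent_queries = nn.Parameter(torch.randn(n_latent, d_model))
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)

    def forward(self, context_emb: torch.Tensor) -> torch.Tensor:
        # context_emb: (B, S, d) -- embedded role profile + retrieved trajectories
        q = self.latent_queries.unsqueeze(0).expand(context_emb.size(0), -1, -1)
        return self.decoder(tgt=q, memory=context_emb)  # (B, n_latent, d)

def forward_with_memory(policy, input_ids: torch.Tensor, latent_mem: torch.Tensor):
    """Prepend the latent memory to the frozen policy's input embeddings
    (Hugging Face causal LMs accept `inputs_embeds`)."""
    tok_emb = policy.get_input_embeddings()(input_ids)   # (B, T, d)
    inputs = torch.cat([latent_mem, tok_emb], dim=1)     # (B, L+T, d)
    return policy(inputs_embeds=inputs)                  # logits over L+T positions
```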
2. Formal RL Objective and Optimization Structure
The LMPO objective is formulated as a token-level, actor-critic variant tailored to the unique structure of the memory-injected policy. For a policy rollout trajectory $\hat{\tau}$ sampled for query $q$ with retrieved trajectory set $\mathcal{T}_q$, the probability of $\hat{\tau}$ under composer parameters $\phi$ factorizes over agent steps $j$ and tokens $t$:

$$P_{\phi}(\hat{\tau} \mid q, \mathcal{T}_q) = \prod_{j} \prod_{t} \pi_{\theta_k}\big(y_{j,t} \mid y_{j,<t},\, q,\, m_k\big), \qquad m_k = \sigma_{\phi}(\gamma_k, \mathcal{T}_q),$$

where $m_k$ is the fixed-length latent memory produced by the composer for the agent $k$ acting at step $j$, and $\pi_{\theta_k}$ denotes the wrapped (frozen) base policy.
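Under this factorization, per-token log-probabilities can be read directly off the memory-injected forward pass. A minimal sketch, assuming the `forward_with_memory` helper above; the slicing accounts for the $L$ prepended latent positions:

```python
import torch
import torch.nn.functional as F

def token_logprobs(logits: torch.Tensor, input_ids: torch.Tensor,
                   n_latent: int) -> torch.Tensor:
    """Per-token log-probs of the generated tokens under memory injection.
    Position n_latent - 1 + t of the (L+T)-long sequence predicts token t."""
    shifted = logits[:, n_latent - 1 : -1, :]             # (B, T, V)
    logp = F.log_softmax(shifted, dim=-1)
    return logp.gather(-1, input_ids.unsqueeze(-1)).squeeze(-1)  # (B, T)
```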
To compare trajectories within a mini-batch, a group-based advantage is computed:

$$\hat{A}_i = \frac{R_i - \operatorname{mean}\!\big(\{R_g\}_{g=1}^{G}\big)}{\operatorname{std}\!\big(\{R_g\}_{g=1}^{G}\big) + \varepsilon},$$

with $G$ as the group size (rollouts per query).
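A minimal sketch of this group-relative standardization, mirroring the advantage line in the training loop of Section 4 (shapes and the $\varepsilon$ default are illustrative):

```python
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Standardize each rollout's task reward within its group of G
    rollouts sampled for the same query; rewards has shape (B, G)."""
    mu = rewards.mean(dim=1, keepdim=True)
    sigma = rewards.std(dim=1, keepdim=True)
    return (rewards - mu) / (sigma + eps)

# Example: two queries, three rollouts each.
adv = group_advantages(torch.tensor([[0.0, 1.0, 1.0], [0.0, 0.0, 1.0]]))
```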
The LMPO surrogate loss leverages PPO-style token-wise clipping:

$$\mathcal{L}(\phi) = -\,\mathbb{E}_{i,j,t}\!\left[\mathcal{L}_{i,j,t}(\phi)\right],$$

with token-level importance ratio

$$r_{i,j,t}(\phi) = \frac{P_{\phi}\!\big(y_{i,j,t} \mid y_{i,j,<t},\, q,\, \mathcal{T}_q\big)}{P_{\phi_{\text{old}}}\!\big(y_{i,j,t} \mid y_{i,j,<t},\, q,\, \mathcal{T}_q\big)}$$

and PPO clip term

$$\mathcal{L}_{i,j,t}(\phi) = \min\!\Big(r_{i,j,t}(\phi)\,\hat{A}_i,\ \operatorname{clip}\!\big(r_{i,j,t}(\phi),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_i\Big).$$
Commonly used settings include a symmetric clip range $\epsilon$ and no additional KL penalty.
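A sketch of the clipped token-level surrogate in PyTorch, assuming per-token log-probabilities under the new and old composer parameters and a padding mask; the surrogate is negated so a standard minimizer performs ascent on it:

```python
import torch

def lmpo_clip_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                   adv: torch.Tensor, clip_eps: float,
                   mask: torch.Tensor) -> torch.Tensor:
    """Token-wise PPO clipped surrogate, negated for minimization.
    logp_new/logp_old/mask: (B, T); adv: (B, 1), broadcast over tokens."""
    ratio = torch.exp(logp_new - logp_old)                     # r_{i,j,t}(phi)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    per_token = -torch.minimum(unclipped, clipped)             # negate: maximize surrogate
    return (per_token * mask).sum() / mask.sum().clamp(min=1)  # mean over valid tokens
```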
3. Reinforcement Learning Signal Propagation via Latent Memory
As the latent memory $m_k$ is injected directly as additional tokens into each agent's policy, the log-likelihood of every output token—and hence the entire trajectory's likelihood—becomes a direct function of the composer parameters $\phi$. The RL signal propagates according to:

$$\nabla_{\phi}\,\mathcal{L}(\phi) = -\,\mathbb{E}_{i,j,t}\!\left[\, g_{i,j,t}\; \nabla_{\phi} \log P_{\phi}\big(y_{i,j,t} \mid y_{i,j,<t},\, q,\, \mathcal{T}_q\big) \right],$$

where $g_{i,j,t}$ embodies the PPO clipping logic ($g_{i,j,t} = r_{i,j,t}(\phi)\,\hat{A}_i$ on unclipped tokens, with the gradient vanishing where the clip is active), and the gradient reaches $\phi$ only through $m_k = \sigma_{\phi}(\gamma_k, \mathcal{T}_q)$.
This design enables task-level reward signals to traverse, via the latent memory, through gradients back to $\phi$, empowering the composer to shape agent memories in response to collective task utility. The policy backbones remain frozen, focusing optimization strictly on the composition of compact, high-utility memories.
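This trainable/frozen split reduces to a few lines in practice. The sketch below reuses the hypothetical helpers from the sections above (`composer`, `forward_with_memory`, `token_logprobs`, `lmpo_clip_loss`) and assumes `context_emb`, `input_ids`, `logp_old`, `adv`, `clip_eps`, `mask`, and an AdamW `optimizer` over composer parameters are already in scope:

```python
# Only the composer receives gradients; the policy backbone is frozen.
for p in policy.parameters():
    p.requires_grad_(False)

latent_mem = composer(context_emb)                  # depends on phi
out = forward_with_memory(policy, input_ids, latent_mem)
logp_new = token_logprobs(out.logits, input_ids, n_latent=latent_mem.size(1))
loss = lmpo_clip_loss(logp_new, logp_old, adv, clip_eps, mask)
loss.backward()   # gradients flow through latent_mem into composer params only
optimizer.step()  # AdamW over the composer's parameters
```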
4. Training Loop and Algorithmic Workflow
The end-to-end LMPO training loop operates as follows:
```
Inputs:
  – Experience bank 𝓑 = {τᵢ} (initial diverse trajectories)
  – Role profiles {γₖ}, frozen policies π_{θₖ}
  – Composer σ_φ, old parameters φ_old ← φ
Hyperparams: batch size B (queries), G rollouts/query, clip ε, LR α

Repeat for N training iterations:
  Sample B queries {q_b}
  For each query q:
    Retrieve top-K trajectories 𝒯_q from 𝓑 via cosine similarity
    For i = 1…G:
      Rollout τ̂_i ∼ P_φ(·|q, 𝒯_q) with memory injection
      Compute reward R_i = R(τ̂_i)
    Compute mean μ, std σ of {R_i}, and advantages Â_i = (R_i − μ)/(σ + ε)
    For each τ̂_i, each step j, token t:
      Compute ratio r_{i,j,t}(φ) = π_φ / π_{φ_old}
      Compute ℒ_{i,j,t} = min(r_{i,j,t}·Â_i, clip(r_{i,j,t}, 1−ε, 1+ε)·Â_i)
  Aggregate loss 𝓛 = −mean_{b,i,j,t}[ℒ_{i,j,t}]
  φ ← φ − α · AdamW(∇_φ 𝓛)
  φ_old ← φ
End
Output: Trained composer parameters φ*
```
Key implementation choices: no modification of the underlying LLM policies, role profiles remain fixed, and the only trainable component is the composer—a parameter-efficient and focused optimization footprint.
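The top-K retrieval step in the loop above could look as follows, assuming the experience bank keeps trajectory texts alongside normalized MiniLM embeddings; the model identifier and calls follow the sentence-transformers library, and `retrieve_topk` is an illustrative helper, not the paper's API:

```python
import torch
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def retrieve_topk(query: str, bank_texts: list[str],
                  bank_embs: torch.Tensor, k: int) -> list[str]:
    """Return the k bank trajectories most cosine-similar to the query.
    bank_embs: (N, d), rows L2-normalized at insertion time."""
    q = encoder.encode(query, convert_to_tensor=True, normalize_embeddings=True)
    sims = bank_embs @ q                         # cosine similarity per trajectory
    idx = torch.topk(sims, k=min(k, len(bank_texts))).indices
    return [bank_texts[i] for i in idx]
```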
5. Model Components and Hyperparameter Specification
The principal components and associated hyperparameters are:
| Module | Architecture/Setting | Notable Hyperparameters |
|---|---|---|
| Experience Bank | MiniLM-L6-v2 encoder (queries/trajectories) | – |
| Memory Composer σ_φ | 4-layer transformer decoder, 16 heads | Fixed-length latent memory tokens; LoRA on q,v projections (α=32, dropout=0.1) |
| Retrieval | Cosine similarity; fixed top-K by default | K varied in ablations |
| LMPO Training | AdamW; PPO clip ε; discount γ=1.0 | LR 1e-5; macro batch 32, micro batch 8; grad norm 1.0; mixed precision/DeepSpeed/vLLM |
Initialization: the composer is seeded from the backbone LLM and refined by LoRA on attention projections, achieving parameter efficiency and rapid adaptation.
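In Hugging Face PEFT terms, this initialization might be expressed roughly as below; the rank value is illustrative (the paper's exact rank is not reproduced here), while α=32, dropout=0.1, and the q/v target modules follow the table above:

```python
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=16,                                  # illustrative rank; see ablations
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],   # attention query/value projections
)
# `composer_backbone` is the LLM-seeded composer module (assumed in scope).
composer_backbone = get_peft_model(composer_backbone, lora_cfg)
composer_backbone.print_trainable_parameters()  # confirm the small trainable footprint
```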
6. Empirical Performance and Benchmark Results
Benchmarking was carried out across both in-domain and distribution-shifted (out-of-domain) tasks, leveraging multiple mainstream MAS frameworks.
- In-domain: TriviaQA, KodCode, StrategyQA, PopQA
- Out-of-domain: BigCodeBench, PDDL
- MAS frameworks: AutoGen, MacNet (training), CAMEL, DyLAN (testing)
Baselines included no-memory policies, Voyager, Generative, JoyAgent, MetaGPT, ChatDev, OAgents, G-Memory, and MARTI (direct multi-agent fine-tuning).
Key findings (Qwen3-4B backbone):
- On AutoGen/TriviaQA: LMPO yields a substantial absolute gain over the no-memory baseline and a further margin over the best memory-augmented baselines.
- On MacNet/KodCode: a clear absolute improvement in accuracy.
- Out-of-domain generalization: on PDDL, LMPO preserves its performance, in contrast with the multi-point drops observed for other methods.
- Unseen MAS: consistent gains on both CAMEL and DyLAN.
- Efficiency: 50% fewer tokens and 2× faster inference compared to OAgents/G-Memory.
- Against MARTI (direct multi-agent LLM fine-tuning), LMPO-composed memories boost TriviaQA accuracy under matched compute.
7. Limitations, Advantages, and Prospects
Principal advantages of LMPO include:
- Role-aware customization by conditioning composer inputs on role profiles $\gamma_k$, directly addressing memory homogenization.
- Fixed-length latent tokens guarantee bounded token budgets, preventing context fatigue.
- End-to-end differentiability allows task rewards to shape memory utility directly.
- Strong empirical generalization to new domains and unseen MAS architectures.
Identified limitations:
- Reliance on online RL rollouts and reward evaluation may induce sample inefficiency.
- Composer architecture and LoRA rank require tuning to balance expressivity and overfitting.
- Policies remain frozen; joint composer-policy co-adaptation remains an open extension.
Potential future directions include:
- Hierarchical latent memories that operate at multiple abstraction layers.
- Adaptively choosing retrieval set size guided by uncertainty.
- Sharing composers across agents to model inter-role memory dependencies.
- Meta-learning strategies for rapid composer adaptation to unfamiliar MAS frameworks.
In summary, Latent Memory Policy Optimization is a token-level PPO variant exploiting the differentiability of latent memory to selectively train a small, role-conditioned composer. This procedure yields compact, agent-specific memories that provide substantive performance gains across a diversity of tasks and frameworks while controlling for token and compute cost (Fu et al., 3 Feb 2026).