
Latent Memory Policy Optimization (LMPO)

Updated 11 February 2026
  • LMPO is a reinforcement learning algorithm that uses a lightweight, trainable memory composer to generate compact, role-aware latent memories for multi-agent systems.
  • It integrates a transformer-based memory composer with token-level PPO to propagate task-level rewards directly through agent-specific latent memory embeddings.
  • LMPO effectively tackles challenges like memory homogenization and token inflation, demonstrating significant empirical gains and efficient inference across diverse benchmarks.

Latent Memory Policy Optimization (LMPO) is a reinforcement learning algorithm designed to propagate task-level optimization signals through compact, role-aware latent memories in multi-agent systems (MAS) powered by LLMs. Introduced as the core optimization strategy in the LatentMem framework, LMPO addresses two persistent challenges in existing memory-augmented MAS: memory homogenization—where all agents share undifferentiated context and lose role specificity—and information overload—where token costs balloon due to unstructured or overly granular historical storage. By coupling a lightweight, trainable memory composer with policy optimization over latent memories, LMPO enables agents to retain high-utility, agent-specific context in an efficient, end-to-end differentiable manner (Fu et al., 3 Feb 2026).

1. Motivation and Integration in LatentMem

LLM-based MAS frameworks typically struggle with (i) homogenized memory that erases distinctions between agent roles, and (ii) context window saturation due to excessive tokenization of agent experiences. LatentMem counters both by bifurcating the memory subsystem into:

  • An experience bank $\mathcal{B}$ that stores raw trajectories without incurring token inflation.
  • A transformer-based memory composer $\mathcal{C}_\phi$ that synthesizes a fixed-length latent memory embedding $m_j = \sigma_\phi(\gamma_{\alpha_j}, \mathcal{T}_q) \in \mathbb{R}^{L'\times D}$ for each agent, conditioned on its role profile $\gamma_k$ and a small retrieved trajectory set $\mathcal{T}_q$.

These latent tokens are appended to the hidden states of each frozen LLM policy $\pi_{\theta_k}$, allowing the policies to consume retentive, role-sensitive context within fixed token budgets. LMPO governs the learning of the composer parameters $\phi$, supplying gradients from task-level rewards directly through latent memories.
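
The fixed-budget property can be illustrated with a minimal shape-level sketch. Everything here is an assumption made for illustration: `compose_memory` is a hypothetical stand-in for the composer (the real one is a transformer), `inject` is a hypothetical helper, and the values of `L_PRIME` and `D` are not taken from the source. The sketch only shows that the memory length stays at $L'$ no matter how many trajectories are retrieved, and that the role profile changes the memory.

```python
L_PRIME = 4   # number of latent memory tokens (illustrative, not from the source)
D = 8         # hidden size (illustrative, not from the source)

def compose_memory(role_profile, trajectories):
    """Hypothetical stand-in for the composer.

    Returns L_PRIME vectors of size D. The values come from a crude
    character-code feature so the example stays dependency-free.
    """
    text = role_profile + " " + " ".join(trajectories)
    feature = sum(ord(c) for c in text) / max(len(text), 1)
    return [
        [((feature * (i + 1)) + d) % 7 / 7.0 for d in range(D)]
        for i in range(L_PRIME)
    ]

def inject(hidden_states, memory):
    """Append the latent memory rows to the policy's hidden-state sequence."""
    assert all(len(row) == D for row in hidden_states + memory)
    return hidden_states + memory

prompt_hidden = [[0.0] * D for _ in range(10)]   # 10 prompt positions
mem_small = compose_memory("coder", ["traj-1"])
mem_large = compose_memory("coder", ["traj-%d" % i for i in range(50)])
augmented = inject(prompt_hidden, mem_large)

# The policy sees exactly L_PRIME extra positions in both cases.
print(len(mem_small), len(mem_large), len(augmented))
```

Retrieving 1 or 50 trajectories yields the same number of latent positions, which is what keeps the per-agent token budget bounded.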

2. Formal RL Objective and Optimization Structure

The LMPO objective is formulated as a token-level, actor-critic variant tailored to the unique structure of the memory-injected policy. For a policy rollout trajectory

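$$\tau = \{(\alpha_j, p_j, o_j)\}_{j=1}^{H},$$

the probability of $\tau$ under composer parameters $\phi$ is

$$P_\phi(\tau) = \prod_{j=1}^{H} \tilde{\pi}_{\alpha_j}(o_j \mid p_j, m_j),$$

where $m_j = \sigma_\phi(\gamma_{\alpha_j}, \mathcal{T}_q)$ is produced by the composer and $\tilde{\pi}_{\alpha_j}$ denotes the wrapped base policy (the frozen LLM with the latent memory injected).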

To compare trajectories within a mini-batch, a group-based advantage is computed:

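$$\hat{A}_i = \frac{R(\tau_i) - \mathrm{mean}\big(\{R(\tau_k)\}_{k=1}^{G}\big)}{\mathrm{std}\big(\{R(\tau_k)\}_{k=1}^{G}\big)},$$

with $G$ as the batch size and $R(\tau_i)$ the task-level reward of trajectory $\tau_i$.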
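
A minimal sketch of a group-based advantage, assuming the common group-normalized form (reward minus the batch mean, divided by the batch standard deviation). The function name `group_advantages`, the use of the population standard deviation, and the small `eps` stabilizer are assumptions for illustration, not details confirmed by the source.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Normalize each trajectory reward against the other rollouts in its batch.

    eps guards against division by zero when every rollout gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)   # population std; an assumption, see lead-in
    return [(r - mu) / (sigma + eps) for r in rewards]

rewards = [1.0, 0.0, 0.0, 1.0]          # e.g. task success per rollout
advantages = group_advantages(rewards)  # successes positive, failures negative
uniform = group_advantages([1.0, 1.0, 1.0])  # no contrast, so no signal
```

When all rollouts in a batch receive the same reward, every advantage is zero and the batch contributes no gradient.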

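The LMPO surrogate objective uses PPO-style token-wise clipping:

$$\mathcal{J}_{\mathrm{LMPO}}(\phi) = \mathbb{E}\left[\frac{1}{\sum_{i=1}^{G}|\tau_i|}\sum_{i=1}^{G}\sum_{t=1}^{|\tau_i|} \ell_{i,t}(\phi)\right],$$

with the token-level probability ratio

$$r_{i,t}(\phi) = \frac{\tilde{\pi}_{\phi}(o_{i,t} \mid o_{i,<t})}{\tilde{\pi}_{\phi_{\mathrm{old}}}(o_{i,t} \mid o_{i,<t})}$$

and PPO clip term

$$\ell_{i,t}(\phi) = \min\Big(r_{i,t}(\phi)\,\hat{A}_i,\ \mathrm{clip}\big(r_{i,t}(\phi),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_i\Big).$$

Here $G$ is the batch size, $o_{i,t}$ is the $t$-th output token of trajectory $\tau_i$, $\hat{A}_i$ is the group-based advantage of $\tau_i$, and $\tilde{\pi}_\phi$ is the base policy wrapped with latent memories composed under $\phi$ (conditioning on the prompt is left implicit). The commonly used settings include a clip range $\epsilon =$ […] and no additional KL penalty.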
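
The token-wise clipping can be sketched as follows. This is a generic PPO-clip computation under stated assumptions: the function names are hypothetical, the average is taken over all tokens in the batch, and the clip range of 0.2 used in the example calls is an illustrative value only, not the setting reported by the source.

```python
import math

def clipped_token_objective(logp_new, logp_old, advantage, clip_eps):
    """PPO-clip term for one token: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped * advantage)

def lmpo_surrogate(trajectories, clip_eps):
    """Average the clipped term over every token in the batch.

    trajectories: list of (logps_new, logps_old, advantage), where the two
    log-prob lists hold one entry per output token of that trajectory.
    """
    total, n_tokens = 0.0, 0
    for logps_new, logps_old, advantage in trajectories:
        for lp_new, lp_old in zip(logps_new, logps_old):
            total += clipped_token_objective(lp_new, lp_old, advantage, clip_eps)
            n_tokens += 1
    return total / n_tokens

# Ratio 1.5 with a positive advantage is capped at 1 + clip_eps.
capped = clipped_token_objective(math.log(1.5), 0.0, 1.0, 0.2)
# Ratio 1.5 with a negative advantage is not capped (pessimistic bound).
uncapped = clipped_token_objective(math.log(1.5), 0.0, -1.0, 0.2)
batch_value = lmpo_surrogate([([0.0, 0.0], [0.0, 0.0], 1.0),
                              ([0.0], [0.0], -1.0)], 0.2)
```

Because the composer is the only trainable part, `logp_new` changes between updates only through the latent memories it produces.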

3. Reinforcement Learning Signal Propagation via Latent Memory

As $m_j$ is injected directly as additional tokens into each agent's policy, the log-likelihood of every output token, and hence the entire trajectory's likelihood, becomes a direct function of the composer parameters $\phi$. The RL signal propagates according to:

$$\nabla_\phi \mathcal{J}(\phi) = \mathbb{E}_{\tau}\left[\sum_{j=1}^{H}\sum_{t} w_{j,t}\,\frac{\partial \log \tilde{\pi}_{\alpha_j}(o_{j,t} \mid p_j, o_{j,<t}, m_j)}{\partial m_j}\,\frac{\partial m_j}{\partial \phi}\right],$$

where $w_{j,t}$ embodies the PPO clipping logic: it is the ratio-weighted advantage for tokens inside the clip range and zero for tokens whose ratio has been clipped.

This design enables task-level reward signals $R(\tau)$ to traverse, via the latent memory, through gradients back to $\phi$, empowering the composer to shape agent memories in response to collective task utility. The policy backbones remain frozen, focusing optimization strictly on the composition of compact, high-utility memories.
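
The chain rule through the latent memory can be checked numerically on a toy model. The sketch below is an assumption-laden illustration, not the LatentMem architecture: the "policy" is a frozen logistic unit with weights `W_FROZEN`, the "composer" is an elementwise scaling by `phi`, and `weight` stands in for the PPO coefficient of one token. It compares the analytic gradient (weight × ∂log π/∂m × ∂m/∂φ) against central finite differences.

```python
import math

W_FROZEN = [0.5, -1.0, 2.0]   # frozen policy weights: never updated
ROLE_X = [1.0, 2.0, -0.5]     # fixed composer inputs (role/trajectory features)

def compose(phi):
    """Toy composer: elementwise scaling, m_i = phi_i * x_i."""
    return [p * x for p, x in zip(phi, ROLE_X)]

def log_prob(phi):
    """log pi(o | m(phi)) for one token under the frozen logistic policy."""
    z = sum(w * m for w, m in zip(W_FROZEN, compose(phi)))
    return -math.log1p(math.exp(-z))   # log sigmoid(z)

def grad_phi(phi, weight):
    """Analytic gradient: weight * dlogpi/dm * dm/dphi."""
    z = sum(w * m for w, m in zip(W_FROZEN, compose(phi)))
    dlogp_dz = 1.0 - 1.0 / (1.0 + math.exp(-z))   # d/dz log sigmoid(z)
    dlogp_dm = [dlogp_dz * w for w in W_FROZEN]   # passes through frozen weights
    dm_dphi = ROLE_X                              # composer Jacobian (diagonal)
    return [weight * g * x for g, x in zip(dlogp_dm, dm_dphi)]

def numeric_grad(phi, weight, h=1e-6):
    """Central finite differences on the same objective."""
    grads = []
    for i in range(len(phi)):
        up, down = list(phi), list(phi)
        up[i] += h
        down[i] -= h
        grads.append(weight * (log_prob(up) - log_prob(down)) / (2 * h))
    return grads

phi = [0.1, 0.2, 0.3]
analytic = grad_phi(phi, weight=1.5)
numeric = numeric_grad(phi, weight=1.5)
```

The frozen weights shape the gradient but are never modified; only `phi` would receive an update.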

4. Training Loop and Algorithmic Workflow

The end-to-end LMPO training loop operates as follows:

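  1. For each task query $q$, retrieve a small trajectory set $\mathcal{T}_q$ from the experience bank $\mathcal{B}$.
  2. At each step $j$, compose the latent memory $m_j = \sigma_\phi(\gamma_{\alpha_j}, \mathcal{T}_q)$ for the acting agent and inject it into that agent's frozen policy.
  3. Roll out the MAS to obtain a trajectory $\tau$ and score it with the task-level reward.
  4. Compute group-based advantages over the mini-batch of trajectories.
  5. Update $\phi$ with the token-level PPO-clip surrogate; gradients flow through the latent memories into the composer.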

Key implementation choices: the underlying LLM policies are not modified, role profiles remain fixed, and the only trainable component is the composer, which keeps the optimization footprint small and focused.
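
As a rough, runnable illustration of such a loop, the toy below trains a single scalar composer parameter against a frozen one-weight policy. All of it is hypothetical scaffolding: the task (reward 1 when the sampled action is 1), the scalar "memory", the `bank` list, and every hyperparameter value are assumptions chosen to keep the example small, and none of them come from the source.

```python
import math
import random

W_FROZEN = 1.0       # frozen policy weight; never updated
ROLE_FEATURE = 1.0   # stand-in for a fixed role profile

def compose(phi, retrieved):
    # Toy composer: one scalar latent memory. `retrieved` mirrors the
    # workflow but is unused in this toy.
    return phi * ROLE_FEATURE

def p_success(phi, retrieved=()):
    # Frozen policy: probability of action 1 given the latent memory.
    return 1.0 / (1.0 + math.exp(-W_FROZEN * compose(phi, retrieved)))

def logp(action, phi):
    p1 = p_success(phi)
    return math.log(p1 if action == 1 else 1.0 - p1)

def dlogp_dphi(action, phi):
    p1 = p_success(phi)
    return ((1.0 - p1) if action == 1 else -p1) * W_FROZEN * ROLE_FEATURE

def train(steps=200, group=8, lr=0.5, clip_eps=0.2, inner_epochs=2, seed=0):
    rng = random.Random(seed)
    phi, bank = 0.0, []
    for _ in range(steps):
        phi_old = phi
        retrieved = bank[-2:]                        # retrieve
        p1 = p_success(phi_old, retrieved)           # compose + inject
        actions = [1 if rng.random() < p1 else 0 for _ in range(group)]
        rewards = [float(a) for a in actions]        # task-level reward
        bank.extend(zip(actions, rewards))
        mu = sum(rewards) / group
        sd = math.sqrt(sum((r - mu) ** 2 for r in rewards) / group)
        if sd == 0.0:
            continue                                 # identical rewards: no signal
        advs = [(r - mu) / sd for r in rewards]      # group-based advantage
        for _ in range(inner_epochs):                # PPO-clip ascent on phi only
            grad = 0.0
            for a, adv in zip(actions, advs):
                ratio = math.exp(logp(a, phi) - logp(a, phi_old))
                clipped = (adv > 0 and ratio > 1 + clip_eps) or \
                          (adv < 0 and ratio < 1 - clip_eps)
                if not clipped:
                    grad += adv * ratio * dlogp_dphi(a, phi)
            phi += lr * grad / group
    return phi

phi_final = train()
```

In this toy every unclipped term pushes `phi` upward, so the success probability of the frozen policy rises even though its own weight is never touched.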

5. Model Components and Hyperparameter Specification

The principal components and associated hyperparameters are:

| Module | Architecture/Setting | Notable Hyperparameters |
|---|---|---|
| Experience Bank | MiniLM-L6-v2 encoder (queries/trajectories) | - |
| Memory Composer $\sigma_\phi$ | 4-layer transformer decoder, hidden dim […] | […] tokens, […], 16 heads, LoRA (q,v), rank […], α=32, dropout=0.1 |
| Retrieval | Cosine similarity; […] by default | […] in ablations |
| LMPO Training | AdamW, clip […], discount γ=1.0 | LR […]-5, macro batch 32, micro 8, grad norm 1.0, mixed precision/DeepSpeed/vLLM |
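
The retrieval row can be illustrated with a small cosine-similarity top-k lookup. This is a sketch under assumptions: the two-dimensional embeddings are made up for the example (the source uses a MiniLM-L6-v2 encoder), and `retrieve`, the bank layout, and `k=2` are hypothetical choices rather than the framework's actual interface or default.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def retrieve(query_vec, bank, k):
    """Return the ids of the k bank entries most similar to the query.

    bank: list of (trajectory_id, embedding) pairs.
    """
    ranked = sorted(bank, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [trajectory_id for trajectory_id, _ in ranked[:k]]

# Made-up embeddings standing in for encoded trajectories.
bank = [
    ("t1", [1.0, 0.0]),
    ("t2", [0.7, 0.7]),
    ("t3", [0.0, 1.0]),
    ("t4", [-1.0, 0.0]),
]
top = retrieve([1.0, 0.1], bank, k=2)
```

Only the retrieved ids (and their raw trajectories) are handed to the composer; the rest of the bank never reaches the policy's context.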

Initialization: the composer is seeded from the backbone LLM and refined by LoRA on attention projections, achieving parameter efficiency and rapid adaptation.
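
To make the LoRA refinement concrete, here is a minimal sketch of the standard low-rank update, $W' = W + (\alpha / r)\,BA$, applied to one projection matrix. The matrices, their sizes, and `RANK = 2` are illustrative assumptions; only α=32 comes from the table above. With the usual zero initialization of $B$, the effective weight starts out identical to the backbone weight, which is consistent with seeding the composer from the backbone LLM.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def lora_effective_weight(w, a, b, alpha, rank):
    """Effective weight W + (alpha / rank) * B @ A; W itself is not modified."""
    scale = alpha / rank
    delta = matmul(b, a)   # (d_out x rank) @ (rank x d_in)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]

RANK = 2     # illustrative; not the rank reported by the source
ALPHA = 32   # from the hyperparameter table

W_Q = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # frozen backbone projection (2 x 3)
A = [[0.5, 0.0, 0.0], [0.0, 0.0, 0.0]]     # trainable, rank x d_in
B_INIT = [[0.0, 0.0], [0.0, 0.0]]          # trainable, d_out x rank, zero at start
B_TRAINED = [[1.0, 0.0], [0.0, 0.0]]       # hypothetical values after training

at_init = lora_effective_weight(W_Q, A, B_INIT, ALPHA, RANK)
after_training = lora_effective_weight(W_Q, A, B_TRAINED, ALPHA, RANK)
```

Only `A` and `B` would be trained, so the number of trainable parameters scales with the rank rather than with the full projection size.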

6. Empirical Performance and Benchmark Results

Benchmarking was carried out across both in-domain and distribution-shifted (out-of-domain) tasks, leveraging multiple mainstream MAS frameworks.

  • In-domain: TriviaQA, KodCode, StrategyQA, PopQA
  • Out-of-domain: BigCodeBench, PDDL
  • MAS frameworks: AutoGen, MacNet (training), CAMEL, DyLAN (testing)

Baselines included no-memory policies, Voyager, Generative, JoyAgent, MetaGPT, ChatDev, OAgents, G-Memory, and MARTI (direct multi-agent fine-tuning).

Key findings (Qwen3-4B backbone):

  • On AutoGen/TriviaQA: LMPO yields a […] absolute gain over the no-memory baseline and […]–[…] over the best memory-augmented baselines.
  • On MacNet/KodCode: […] absolute improvement, attaining […] accuracy.
  • Out-of-domain generalization: On PDDL, LMPO achieves […], in contrast with […]–[…] drops in other methods.
  • Unseen MAS: […] and […] gains for CAMEL and DyLAN, respectively.
  • Efficiency: 50% fewer tokens and 2x faster inference compared to OAgents/G-Memory.
  • Against MARTI (direct multi-agent LLM fine-tuning), LMPO-composed memories boost TriviaQA by up to […] under matched compute.

7. Limitations, Advantages, and Prospects

Principal advantages of LMPO include:

  • Role-aware customization by conditioning composer inputs on the role profile $\gamma_k$, directly addressing memory homogenization.
  • Fixed-length latent tokens $m_j \in \mathbb{R}^{L'\times D}$ guarantee bounded token budgets, preventing context window saturation.
  • End-to-end differentiability allows task rewards to shape memory utility directly.
  • Strong empirical generalization to new domains and unseen MAS architectures.

Identified limitations:

  • Reliance on online RL rollouts and reward evaluation may induce sample inefficiency.
  • Composer architecture and LoRA rank require tuning to balance expressivity and overfitting.
  • Policies remain frozen; joint composer-policy co-adaptation remains an open extension.

Potential future directions include:

  • Hierarchical latent memories that operate at multiple abstraction layers.
  • Adaptively choosing the retrieval set size $|\mathcal{T}_q|$ guided by uncertainty.
  • Sharing composers across agents to model inter-role memory dependencies.
  • Meta-learning strategies for rapid composer adaptation to unfamiliar MAS frameworks.

In summary, Latent Memory Policy Optimization is a token-level PPO variant exploiting the differentiability of latent memory to selectively train a small, role-conditioned composer. This procedure yields compact, agent-specific memories that provide substantive performance gains across a diversity of tasks and frameworks while controlling for token and compute cost (Fu et al., 3 Feb 2026).
