Latent Memory Policy Optimization (LMPO)
- LMPO is a reinforcement learning algorithm that uses a lightweight, trainable memory composer to generate compact, role-aware latent memories for multi-agent systems.
- It integrates a transformer-based memory composer with token-level PPO to propagate task-level rewards directly through agent-specific latent memory embeddings.
- LMPO effectively tackles challenges like memory homogenization and token inflation, demonstrating significant empirical gains and efficient inference across diverse benchmarks.
Latent Memory Policy Optimization (LMPO) is a reinforcement learning algorithm designed to propagate task-level optimization signals through compact, role-aware latent memories in multi-agent systems (MAS) powered by LLMs. Introduced as the core optimization strategy in the LatentMem framework, LMPO addresses two persistent challenges in existing memory-augmented MAS: memory homogenization—where all agents share undifferentiated context and lose role specificity—and information overload—where token costs balloon due to unstructured or overly granular historical storage. By coupling a lightweight, trainable memory composer with policy optimization over latent memories, LMPO enables agents to retain high-utility, agent-specific context in an efficient, end-to-end differentiable manner (Fu et al., 3 Feb 2026).
1. Motivation and Integration in LatentMem
LLM-based MAS frameworks typically struggle with (i) homogenized memory that erases distinctions between agent roles, and (ii) context window saturation due to excessive tokenization of agent experiences. LatentMem counters both by bifurcating the memory subsystem into:
- An experience bank that stores raw trajectories without incurring token inflation.
- A transformer-based memory composer that synthesizes a fixed-length latent memory embedding for each agent, conditioned on its role profile $\gamma_k$ and a small retrieved trajectory set $\mathcal{T}_q$.
These latent tokens are appended to the hidden states of each frozen LLM policy $\pi_{\theta_k}$, allowing the policies to consume retentive, role-sensitive context within fixed token budgets. LMPO governs the learning of the composer $\sigma_\phi$, supplying gradients from task-level rewards directly through latent memories; a minimal sketch of this mechanism follows.
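The sketch below is illustrative, not the paper's implementation: a from-scratch `nn.TransformerDecoder` stands in for the LLM-seeded composer described in Section 5, `forward_with_memory` assumes a Hugging Face-style causal LM that accepts `inputs_embeds`, and all dimensions are placeholders.

```python
import torch
import torch.nn as nn

class MemoryComposer(nn.Module):
    """Illustrative composer: learned latent queries cross-attend over the
    embedded role profile and retrieved trajectories, yielding a
    fixed-length latent memory. Dimensions are placeholders."""
    def __init__(self, d_model: int = 1024, n_latent: int = 8,
                 n_layers: int = 4, n_heads: int = 16):
        super().__init__()
        self.latent_queries = nn.Parameter(torch.randn(n_latent, d_model))
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)

    def forward(self, context_emb: torch.Tensor) -> torch.Tensor:
        # context_emb: (B, S, d) -- embedded role profile + retrieved trajectories
        q = self.latent_queries.unsqueeze(0).expand(context_emb.size(0), -1, -1)
        return self.decoder(tgt=q, memory=context_emb)  # (B, n_latent, d)

def forward_with_memory(policy, input_ids: torch.Tensor, latent_mem: torch.Tensor):
    """Prepend the latent memory to the frozen policy's input embeddings
    (Hugging Face causal LMs accept `inputs_embeds`)."""
    tok_emb = policy.get_input_embeddings()(input_ids)   # (B, T, d)
    inputs = torch.cat([latent_mem, tok_emb], dim=1)     # (B, L+T, d)
    return policy(inputs_embeds=inputs)                  # logits over L+T positions
```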
2. Formal RL Objective and Optimization Structure
The LMPO objective is formulated as a token-level, actor-critic variant tailored to the unique structure of the memory-injected policy. For a policy rollout trajectory $\hat{\tau}$ sampled for query $q$ with retrieved trajectory set $\mathcal{T}_q$, the probability of $\hat{\tau}$ under composer parameters $\phi$ factorizes over agent steps $j$ and tokens $t$:

$$P_{\phi}(\hat{\tau} \mid q, \mathcal{T}_q) = \prod_{j} \prod_{t} \pi_{\theta_k}\big(y_{j,t} \mid y_{j,<t},\, q,\, m_k\big), \qquad m_k = \sigma_{\phi}(\gamma_k, \mathcal{T}_q),$$

where $m_k$ is the fixed-length latent memory produced by the composer for the agent $k$ acting at step $j$, and $\pi_{\theta_k}$ denotes the wrapped (frozen) base policy.
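Under this factorization, per-token log-probabilities can be read directly off the memory-injected forward pass. A minimal sketch, assuming the `forward_with_memory` helper above; the slicing accounts for the $L$ prepended latent positions:

```python
import torch
import torch.nn.functional as F

def token_logprobs(logits: torch.Tensor, input_ids: torch.Tensor,
                   n_latent: int) -> torch.Tensor:
    """Per-token log-probs of the generated tokens under memory injection.
    Position n_latent - 1 + t of the (L+T)-long sequence predicts token t."""
    shifted = logits[:, n_latent - 1 : -1, :]             # (B, T, V)
    logp = F.log_softmax(shifted, dim=-1)
    return logp.gather(-1, input_ids.unsqueeze(-1)).squeeze(-1)  # (B, T)
```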
To compare trajectories within a mini-batch, a group-based advantage is computed:

$$\hat{A}_i = \frac{R_i - \operatorname{mean}\!\big(\{R_g\}_{g=1}^{G}\big)}{\operatorname{std}\!\big(\{R_g\}_{g=1}^{G}\big) + \varepsilon},$$

with $G$ as the group size (rollouts per query).
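A minimal sketch of this group-relative standardization, mirroring the advantage line in the training loop of Section 4 (shapes and the $\varepsilon$ default are illustrative):

```python
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Standardize each rollout's task reward within its group of G
    rollouts sampled for the same query; rewards has shape (B, G)."""
    mu = rewards.mean(dim=1, keepdim=True)
    sigma = rewards.std(dim=1, keepdim=True)
    return (rewards - mu) / (sigma + eps)

# Example: two queries, three rollouts each.
adv = group_advantages(torch.tensor([[0.0, 1.0, 1.0], [0.0, 0.0, 1.0]]))
```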
The LMPO surrogate loss leverages PPO-style token-wise clipping:

$$\mathcal{L}(\phi) = -\,\mathbb{E}_{i,j,t}\!\left[\mathcal{L}_{i,j,t}(\phi)\right],$$

with token-level importance ratio

$$r_{i,j,t}(\phi) = \frac{P_{\phi}\!\big(y_{i,j,t} \mid y_{i,j,<t},\, q,\, \mathcal{T}_q\big)}{P_{\phi_{\text{old}}}\!\big(y_{i,j,t} \mid y_{i,j,<t},\, q,\, \mathcal{T}_q\big)}$$

and PPO clip term

$$\mathcal{L}_{i,j,t}(\phi) = \min\!\Big(r_{i,j,t}(\phi)\,\hat{A}_i,\ \operatorname{clip}\!\big(r_{i,j,t}(\phi),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_i\Big).$$
Commonly used settings include a symmetric clip range $\epsilon$ and no additional KL penalty.
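A sketch of the clipped token-level surrogate in PyTorch, assuming per-token log-probabilities under the new and old composer parameters and a padding mask; the surrogate is negated so a standard minimizer performs ascent on it:

```python
import torch

def lmpo_clip_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                   adv: torch.Tensor, clip_eps: float,
                   mask: torch.Tensor) -> torch.Tensor:
    """Token-wise PPO clipped surrogate, negated for minimization.
    logp_new/logp_old/mask: (B, T); adv: (B, 1), broadcast over tokens."""
    ratio = torch.exp(logp_new - logp_old)                     # r_{i,j,t}(phi)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    per_token = -torch.minimum(unclipped, clipped)             # negate: maximize surrogate
    return (per_token * mask).sum() / mask.sum().clamp(min=1)  # mean over valid tokens
```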
3. Reinforcement Learning Signal Propagation via Latent Memory
As the latent memory $m_k$ is injected directly as additional tokens into each agent's policy, the log-likelihood of every output token—and hence the entire trajectory's likelihood—becomes a direct function of the composer parameters $\phi$. The RL signal propagates according to:

$$\nabla_{\phi}\,\mathcal{L}(\phi) = -\,\mathbb{E}_{i,j,t}\!\left[\, g_{i,j,t}\; \nabla_{\phi} \log P_{\phi}\big(y_{i,j,t} \mid y_{i,j,<t},\, q,\, \mathcal{T}_q\big) \right],$$

where $g_{i,j,t}$ embodies the PPO clipping logic ($g_{i,j,t} = r_{i,j,t}(\phi)\,\hat{A}_i$ on unclipped tokens, with the gradient vanishing where the clip is active), and the gradient reaches $\phi$ only through $m_k = \sigma_{\phi}(\gamma_k, \mathcal{T}_q)$.
This design enables task-level reward signals to traverse, via the latent memory, through gradients back to $\phi$, empowering the composer to shape agent memories in response to collective task utility. The policy backbones remain frozen, focusing optimization strictly on the composition of compact, high-utility memories.
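This trainable/frozen split reduces to a few lines in practice. The sketch below reuses the hypothetical helpers from the sections above (`composer`, `forward_with_memory`, `token_logprobs`, `lmpo_clip_loss`) and assumes `context_emb`, `input_ids`, `logp_old`, `adv`, `clip_eps`, `mask`, and an AdamW `optimizer` over composer parameters are already in scope:

```python
# Only the composer receives gradients; the policy backbone is frozen.
for p in policy.parameters():
    p.requires_grad_(False)

latent_mem = composer(context_emb)                  # depends on phi
out = forward_with_memory(policy, input_ids, latent_mem)
logp_new = token_logprobs(out.logits, input_ids, n_latent=latent_mem.size(1))
loss = lmpo_clip_loss(logp_new, logp_old, adv, clip_eps, mask)
loss.backward()   # gradients flow through latent_mem into composer params only
optimizer.step()  # AdamW over the composer's parameters
```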
4. Training Loop and Algorithmic Workflow
The end-to-end LMPO training loop operates as follows:
```
Inputs:
  – Experience bank 𝓑 = {τᵢ} (initial diverse trajectories)
  – Role profiles {γₖ}, frozen policies π_{θₖ}
  – Composer σ_φ, old parameters φ_old ← φ
Hyperparams: batch size B (queries), G rollouts/query, clip ε, LR α

Repeat for N training iterations:
  Sample B queries {q_b}
  For each query q:
    Retrieve top-K trajectories 𝒯_q from 𝓑 via cosine similarity
    For i = 1…G:
      Rollout τ̂_i ∼ P_φ(·|q, 𝒯_q) with memory injection
      Compute reward R_i = R(τ̂_i)
    Compute mean μ, std σ of {R_i}, and advantages Â_i = (R_i − μ)/(σ + ε)
    For each τ̂_i, each step j, token t:
      Compute ratio r_{i,j,t}(φ) = π_φ / π_{φ_old}
      Compute ℒ_{i,j,t} = min(r_{i,j,t}·Â_i, clip(r_{i,j,t}, 1−ε, 1+ε)·Â_i)
  Aggregate loss 𝓛 = −mean_{b,i,j,t}[ℒ_{i,j,t}]
  φ ← φ − α · AdamW(∇_φ 𝓛)
  φ_old ← φ
End
Output: Trained composer parameters φ*
```
Key implementation choices: no modification of the underlying LLM policies, role profiles remain fixed, and the only trainable component is the composer—a parameter-efficient and focused optimization footprint.
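The top-K retrieval step in the loop above could look as follows, assuming the experience bank keeps trajectory texts alongside normalized MiniLM embeddings; the model identifier and calls follow the sentence-transformers library, and `retrieve_topk` is an illustrative helper, not the paper's API:

```python
import torch
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def retrieve_topk(query: str, bank_texts: list[str],
                  bank_embs: torch.Tensor, k: int) -> list[str]:
    """Return the k bank trajectories most cosine-similar to the query.
    bank_embs: (N, d), rows L2-normalized at insertion time."""
    q = encoder.encode(query, convert_to_tensor=True, normalize_embeddings=True)
    sims = bank_embs @ q                         # cosine similarity per trajectory
    idx = torch.topk(sims, k=min(k, len(bank_texts))).indices
    return [bank_texts[i] for i in idx]
```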
5. Model Components and Hyperparameter Specification
The principal components and associated hyperparameters are:
| Module | Architecture/Setting | Notable Hyperparameters |
|---|---|---|
| Experience Bank | MiniLM-L6-v2 encoder (queries/trajectories) | – |
| Memory Composer σ_φ | 4-layer transformer decoder, 16 heads | Fixed-length latent memory tokens; LoRA on q,v projections (α=32, dropout=0.1) |
| Retrieval | Cosine similarity; fixed top-K by default | K varied in ablations |
| LMPO Training | AdamW; PPO clip ε; discount γ=1.0 | LR 1e-5; macro batch 32, micro batch 8; grad norm 1.0; mixed precision/DeepSpeed/vLLM |
Initialization: the composer is seeded from the backbone LLM and refined by LoRA on attention projections, achieving parameter efficiency and rapid adaptation.
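In Hugging Face PEFT terms, this initialization might be expressed roughly as below; the rank value is illustrative (the paper's exact rank is not reproduced here), while α=32, dropout=0.1, and the q/v target modules follow the table above:

```python
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=16,                                  # illustrative rank; see ablations
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],   # attention query/value projections
)
# `composer_backbone` is the LLM-seeded composer module (assumed in scope).
composer_backbone = get_peft_model(composer_backbone, lora_cfg)
composer_backbone.print_trainable_parameters()  # confirm the small trainable footprint
```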
6. Empirical Performance and Benchmark Results
Benchmarking was carried out across both in-domain and distribution-shifted (out-of-domain) tasks, leveraging multiple mainstream MAS frameworks.
- In-domain: TriviaQA, KodCode, StrategyQA, PopQA
- Out-of-domain: BigCodeBench, PDDL
- MAS frameworks: AutoGen, MacNet (training), CAMEL, DyLAN (testing)
Baselines included no-memory policies, Voyager, Generative, JoyAgent, MetaGPT, ChatDev, OAgents, G-Memory, and MARTI (direct multi-agent fine-tuning).
Key findings (Qwen3-4B backbone):
- On AutoGen/TriviaQA: LMPO yields a substantial absolute gain over the no-memory baseline and a further margin over the best memory-augmented baselines.
- On MacNet/KodCode: a clear absolute improvement in accuracy.
- Out-of-domain generalization: on PDDL, LMPO preserves its performance, in contrast with the multi-point drops observed for other methods.
- Unseen MAS: consistent gains on both CAMEL and DyLAN.
- Efficiency: 50% fewer tokens and 2× faster inference compared to OAgents/G-Memory.
- Against MARTI (direct multi-agent LLM fine-tuning), LMPO-composed memories boost TriviaQA accuracy under matched compute.
7. Limitations, Advantages, and Prospects
Principal advantages of LMPO include:
- Role-aware customization by conditioning composer inputs on role profiles $\gamma_k$, directly addressing memory homogenization.
- Fixed-length latent tokens guarantee bounded token budgets, preventing context fatigue.
- End-to-end differentiability allows task rewards to shape memory utility directly.
- Strong empirical generalization to new domains and unseen MAS architectures.
Identified limitations:
- Reliance on online RL rollouts and reward evaluation may induce sample inefficiency.
- Composer architecture and LoRA rank require tuning to balance expressivity and overfitting.
- Policies remain frozen; joint composer-policy co-adaptation remains an open extension.
Potential future directions include:
- Hierarchical latent memories that operate at multiple abstraction layers.
- Adaptively choosing retrieval set size guided by uncertainty.
- Sharing composers across agents to model inter-role memory dependencies.
- Meta-learning strategies for rapid composer adaptation to unfamiliar MAS frameworks.
In summary, Latent Memory Policy Optimization is a token-level PPO variant exploiting the differentiability of latent memory to selectively train a small, role-conditioned composer. This procedure yields compact, agent-specific memories that provide substantive performance gains across a diversity of tasks and frameworks while controlling for token and compute cost (Fu et al., 3 Feb 2026).