MLPO: Multi-agent Guided Leader Policy Optimization
- The paper introduces MLPO, a framework in which a single trainable leader LLM learns to evaluate and synthesize the outputs of fixed peer agents for improved collaboration.
- It employs a hierarchical RL structure with Group Relative Policy Optimization, leveraging a group reward mechanism for robust, efficient multi-agent coordination.
- Empirical results demonstrate state-of-the-art accuracy on benchmarks with reduced computational cost compared to co-training all agents.
Multi-agent guided Leader Policy Optimization (MLPO) is a hierarchical reinforcement learning framework for collaborative reasoning in multi-agent LLM systems. MLPO trains a single leader LLM to coordinate a team of fixed, untrained peer agents, optimizing the leader’s ability to evaluate and synthesize their outputs for downstream task accuracy. Unlike prior multi-agent RL approaches that co-train all agents—or rely on explicit value networks or dense peer feedback—MLPO leverages a group reward mechanism to guide the leader’s autoregressive policy, enabling both efficient coordination and improved generalization in both multi-agent and standalone settings (Estornell et al., 11 Jul 2025).
## 1. Hierarchical Multi-Agent Framework
MLPO structures the system as a hierarchy comprising one trainable leader LLM and $K$ fixed, off-the-shelf peer agents. Inference proceeds in $T$ discrete rounds, where:

- Round 0: Each peer $i$ independently generates a solution $a_i^{(0)}$ for the input prompt $q$. The leader ingests $q$ and all $a_i^{(0)}$ and outputs $\ell_0$, structured with a chain-of-thought and an `<answer>` tag.
- Rounds $t = 1, \dots, T-1$: Each peer refines its previous answer conditioned on the leader's prior output $\ell_{t-1}$, producing $a_i^{(t)}$. The leader then aggregates these into $\ell_t$. After $T$ rounds, the final answer is read from the leader's `<answer>` block.

Crucially, only the leader is trainable; peer models remain fixed throughout.

## 2. Learning Algorithm and Objective

The training procedure consists of two phases: optional supervised fine-tuning (SFT) followed by policy-gradient RL, specifically Group Relative Policy Optimization (GRPO):

- Policy Formulation: The leader's policy is $\pi_\theta(\ell \mid q, A)$, with $A = \{a_1, \dots, a_K\}$ the set of peer responses and $(q, A)$ the combined token sequence on which the leader conditions.

- Reward Structure: Each leader output trajectory receives a binary scalar reward $R \in \{0, 1\}$, with $R = 1$ for correctly formatted, correct answers; $R = 0$ otherwise.

- GRPO Objective: Extending PPO to group settings, for each prompt $G$ independent leader outputs $\ell_1, \dots, \ell_G$ are sampled. The importance ratio $r_{i,t}(\theta) = \pi_\theta(\ell_{i,t} \mid q, A, \ell_{i,<t}) \,/\, \pi_{\theta_{\mathrm{old}}}(\ell_{i,t} \mid q, A, \ell_{i,<t})$ is computed per token. The centered advantage $\hat{A}_i = R_i - \bar{R}$, with $\bar{R} = \tfrac{1}{G}\sum_{j=1}^{G} R_j$ the group mean reward, is used. The MLPO (GRPO) loss maximized is:

$$\mathcal{J}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \sum_{t=1}^{|\ell_i|} \min\!\Big( r_{i,t}(\theta)\,\hat{A}_i,\ \operatorname{clip}\big(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_i \Big) \right] - \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)$$

Practical considerations include omitting value networks, using grouped normalization ("Dr.GRPO"), KL-penalization for stability, and a PPO-style clipping parameter $\epsilon = 0.2$.

## 3. Implicit Evaluation and Synthesis Capabilities

MLPO eschews explicit value networks or peer feedback. Instead, the leader receives its reward signal based on the end-to-end utility of reasoning over the peer set, resulting in:

- Implicit Peer Response Evaluation: The policy gradient incentivizes the leader to preferentially internalize reasoning patterns that correctly synthesize or reject peer proposals in a reward-maximizing manner.

- Internalized Criticism: Through GRPO's centered advantage signals, the leader acquires an implicit value function, enabling it to both synthesize actionable reasoning and self-evaluate the contribution of each peer response.
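The critic-free credit assignment above can be made concrete with a minimal NumPy sketch of the group-relative advantage and the clipped per-token surrogate. The function names, the group size, and the reward values are illustrative, not from the paper:

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO's critic-free advantage estimate: center each group's
    binary rewards at the group mean, A_i = R_i - mean(R)."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards - rewards.mean()

def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """PPO-style clipped objective for one sampled leader output;
    the same scalar advantage is applied to every token, and tokens
    are summed without length normalization (the "Dr.GRPO" variant)."""
    ratio = np.exp(logp_new - logp_old)           # per-token importance ratio
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    return np.minimum(unclipped, clipped).sum()

# Hypothetical group of G=4 leader rollouts for one prompt: two correct, two not.
rewards = [1.0, 1.0, 0.0, 0.0]
adv = group_relative_advantages(rewards)
print(adv)  # [ 0.5  0.5 -0.5 -0.5]
```

Note that the centered advantages always sum to zero across the group, so correct rollouts are reinforced exactly in proportion to how rare correctness was within that group; this is the implicit value function the leader internalizes.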
## 4. Implementation Specifics

- Leader LLM Architecture: Qwen2.5 7B Instruct (~7B parameters).

- Peer Agents: Llama 3.1 8B, Gemma2 9B, and an additional Qwen2.5 7B; total "team" size ~24B parameters.

- Supervised Fine-Tuning (SFT): Instills backtracking via synthetic demonstrations, training the leader to reconcile correct versus incorrect samples, using ~16 leader samples per prompt across ~1 epoch.

- MLPO Training: For each prompt, 4 responses are sampled per agent, 4 distinct group-prompts are constructed, and "easy" tasks (≥75% peer-correct) are filtered out. Each group sample undergoes 4 leader rollouts. Hyperparameters include batch sizes of 64–128 and ~50K–100K gradient steps.

- Computation: Only the leader is optimized, reducing training FLOPs by approximately a factor of $1/K$ versus multi-agent alternatives that co-train all $K$ agents (e.g., ACC-Collab). At inference, the MLPO system requires $K$ peer queries and one leader query per round—incurring higher per-task token costs than single-agent baselines, yet with significant efficiency improvements over methods requiring all-agent retraining or multi-critic querying.

## 5. Empirical Performance and Ablations

MLPO delivers state-of-the-art results on several benchmarks:

| Setting | MMLU Accuracy | BBH Accuracy | MATH Accuracy |
|----------------------------|:-------------:|:-------------:|:-------------:|
| SFT+MLPO (5 rounds) | 0.782±0.006 | 0.882±0.005 | 0.762±0.005 |
| Single-agent RL (baseline) | 0.75–0.77 | 0.80–0.83 | 0.70–0.72 |
| MLPO Zero-shot (no peers) | 0.757 | 0.855 | 0.729 |
| Classic GRPO Zero-shot | 0.742 | 0.791 | 0.712 |

Ablation results include:

- Multi-round MLPO+ (fine-tuning on later-round peer data) yields further gains: BBH 0.920, MMLU 0.792, MATH 0.771.

- Filtering "easy" tasks provides a ~1.3% BBH gain (0.869 → 0.882).

- Using 4 agent-solution sets per task achieves the optimal trade-off compared to 1 or 8 (BBH 0.917 for 4 sets).

- Leader access to both agent reasoning and final answers outperforms partial-information conditions.

Test-time scaling, under a fixed 40-sample budget, demonstrates 3–5% gains over self-consistency baselines due to peer and leader output diversity.

## 6. Efficiency, Limitations, and Extensions

MLPO's efficiency derives from:

- Training just a single leader, avoiding the computational and stability issues of updating all agents.

- No auxiliary critic or value head, leveraging group reward centering for implicit credit assignment.

- FLOP reduction proportional to $1/K$ compared to multi-agent RL.

Limitations include:

- Increased context window size due to concatenating peer responses.

- Sequential peer→leader inference, reducing parallelism and increasing latency.

- Higher per-prompt memory and compute at inference than single-agent baselines.

Potential research directions and extensions involve selective/sparse peer querying, retrieval and caching of peer solutions, hybrid integration with lightweight differentiable critics, and dynamic adaptation of team size or round count (Estornell et al., 11 Jul 2025).

## 7. Comparative Perspective and Insights

MLPO demonstrates that hierarchical delegation—training only a single, flexible leader LLM—can subsume both the aggregation ("actor") and evaluation ("critic") functions usually distributed among agents or value networks. This design choice circumvents the instability and inefficiency of multi-agent or multi-critic systems, while providing improved accuracy and robust generalization. Empirical findings indicate that leader exposure to heterogeneous peer outputs during training provides a richer exploration space even for subsequent single-agent deployment. Limitations related to context size and sequential computation remain areas for methodological innovation.
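As a concrete illustration of the round-based leader–peer protocol described in Section 1, the following sketch runs $T$ rounds of peer generation and leader synthesis, then reads the final answer from the leader's `<answer>` block. The `peers`/`leader` callables and the prompt formatting are hypothetical stand-ins for the actual model queries:

```python
from typing import Callable, List

def mlpo_inference(prompt: str,
                   peers: List[Callable[[str], str]],
                   leader: Callable[[str], str],
                   num_rounds: int = 5) -> str:
    """Round 0: peers answer independently; rounds t >= 1: peers refine
    against the leader's last synthesis. The leader aggregates every
    round, and only its final <answer> block is returned."""
    leader_output = ""
    for t in range(num_rounds):
        if t == 0:
            answers = [peer(prompt) for peer in peers]
        else:
            answers = [peer(f"{prompt}\nLeader guidance: {leader_output}")
                       for peer in peers]
        joined = "\n".join(f"Peer {i}: {a}" for i, a in enumerate(answers))
        leader_output = leader(f"{prompt}\n{joined}")  # leader synthesizes
    # Extract the final answer from the leader's <answer>...</answer> span.
    start = leader_output.find("<answer>") + len("<answer>")
    end = leader_output.find("</answer>")
    return leader_output[start:end].strip()
```

The sequential structure is visible here: each round issues $K$ peer queries followed by one leader query, which is the source of the latency and context-growth limitations noted in Section 6.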