
MLPO: Multi-agent guided Leader Policy Optimization

Updated 4 February 2026
  • The paper introduces MLPO, a framework in which a single trainable leader LLM learns to evaluate and synthesize the outputs of fixed peer agents for improved collaboration.
  • It employs a hierarchical RL structure with Group Relative Policy Optimization, leveraging a group reward mechanism for robust, efficient multi-agent coordination.
  • Empirical results demonstrate state-of-the-art accuracy on benchmarks with reduced computational cost compared to co-training all agents.

Multi-agent guided Leader Policy Optimization (MLPO) is a hierarchical reinforcement learning framework for collaborative reasoning in multi-agent LLM systems. MLPO trains a single leader LLM to coordinate a team of fixed, untrained peer agents, optimizing the leader's ability to evaluate and synthesize their outputs for downstream task accuracy. Unlike prior multi-agent RL approaches that co-train all agents, or that rely on explicit value networks or dense peer feedback, MLPO leverages a group reward mechanism to guide the leader's autoregressive policy, enabling efficient coordination and improved generalization in both multi-agent and standalone settings (Estornell et al., 11 Jul 2025).
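Concretely, the round-based coordination pattern can be sketched as follows. This is an illustrative sketch, not the authors' code: `leader` and `peers` are hypothetical stand-ins for the LLM calls, and `extract_answer` is an assumed helper that reads the leader's `<answer>` block.

```python
import re

def extract_answer(output: str) -> str:
    """Read the final answer from the leader's <answer> block (assumed format)."""
    match = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    return match.group(1).strip() if match else output

def mlpo_inference(x, leader, peers, T=5):
    """Multi-round coordination: fixed peers propose, the trainable leader synthesizes.

    leader(x, solutions) -> str   # stand-in for the leader LLM L_theta
    peers: list of callables      # stand-ins for the K frozen peer agents
    """
    # Round 0: peers answer the prompt independently; the leader synthesizes.
    solutions = [a(x, prev=None, leader_out=None) for a in peers]
    o = leader(x, solutions)
    # Rounds t > 0: peers refine their answers conditioned on the leader's prior output.
    for _ in range(1, T):
        solutions = [a(x, prev=s, leader_out=o) for a, s in zip(peers, solutions)]
        o = leader(x, solutions)
    return extract_answer(o)
```

Only `leader` would be updated during training; the `peers` callables stay frozen throughout.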

1. Hierarchical Multi-Agent Framework

MLPO structures the system as a hierarchy comprising one trainable leader LLM $L_\theta$ and $K$ off-the-shelf peer agents $\{a_i\}_{i=1}^K$. Inference proceeds in $T$ discrete rounds, where:

  • Round 0: Each peer independently generates a solution $s_i^{(0)} \sim a_i(x)$ for the input prompt $x$. The leader ingests $x$ and all $s_i^{(0)}$ and outputs $o^{(0)} \sim L_\theta(x, s_1^{(0)}, \ldots, s_K^{(0)})$, structured as a chain-of-thought followed by an `<answer>` tag.
  • Rounds $t > 0$: Each peer refines its previous answer $s_i^{(t-1)}$ conditioned on the leader's prior output $o^{(t-1)}$, producing $s_i^{(t)} \sim a_i(x, s_i^{(t-1)}, o^{(t-1)})$. The leader then aggregates these into $o^{(t)}$. After $T$ rounds, the final answer is read from the leader's `<answer>` block.

Crucially, only the leader $L_\theta$ is trainable; peer models remain fixed throughout.

2. Learning Algorithm and Objective

The training procedure consists of two phases: optional supervised fine-tuning (SFT) followed by policy-gradient RL, specifically Group Relative Policy Optimization (GRPO):

  • Policy formulation: The leader's policy is $\pi_\theta(o \mid x, \mathbf{s}) = L_\theta(o \mid x, \mathbf{s})$, with $\mathbf{s}$ the set of peer responses and $o$ the combined token sequence.
  • Reward structure: Each leader output trajectory receives a binary scalar reward $R \in \{0, 1\}$: $R = 1$ for correctly formatted, correct answers; $R = 0$ otherwise.
  • GRPO objective: Extending PPO to group settings, $G$ independent leader outputs $\{o_i\}_{i=1}^G$ are sampled per prompt. The importance ratio $r_{i,t}$ is computed per token, and the centered advantage $\hat{A}_{i,t} = R_i - \bar{R}$ is used, with $\bar{R}$ the group mean reward.
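As a minimal illustration of the group-relative credit assignment, the centered advantage $\hat{A}_{i,t} = R_i - \bar{R}$ can be computed directly from the $G$ binary rewards (a sketch, not the authors' code):

```python
def centered_advantages(rewards):
    """Group-relative advantage: A_i = R_i - mean(R) over G sampled leader outputs.

    With a binary scalar reward per trajectory, every token of output o_i
    shares the same scalar advantage A_i.
    """
    mean_r = sum(rewards) / len(rewards)
    return [r - mean_r for r in rewards]
```

Note that if all $G$ outputs in a group receive the same reward, every advantage is zero and the group contributes no gradient; this is one reason "easy" tasks (where most peers are already correct) are filtered during training.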
The MLPO (GRPO) objective maximized is:

$$\mathcal{J}_{\mathrm{MLPO}}(\theta) = \mathbb{E}_{x, \mathbf{s}, \{o_i\}}\left[\frac{1}{\sum_i |o_i|} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} \min\left(r_{i,t}\, \hat{A}_{i,t},\ \mathrm{clip}(r_{i,t}, 1-\epsilon, 1+\epsilon)\, \hat{A}_{i,t}\right)\right]$$

Practical considerations include omitting value networks, using grouped normalization ("Dr.GRPO"), KL-penalization for stability, and $\epsilon \approx 0.1$–$0.2$.

3. Implicit Evaluation and Synthesis Capabilities

MLPO eschews explicit value networks or peer feedback. Instead, the leader's reward signal reflects the end-to-end utility of reasoning over the peer set, resulting in:

  • Implicit peer response evaluation: The policy gradient incentivizes the leader to internalize reasoning patterns that correctly synthesize or reject peer proposals in a reward-maximizing manner.
  • Internalized criticism: Through GRPO's centered advantage signals, the leader acquires an implicit value function, enabling it both to synthesize actionable reasoning and to assess the contribution of each peer response.

4. Implementation Specifics

  • Leader LLM architecture: Qwen2.5 7B Instruct (~7B parameters).
  • Peer agents: Llama 3.1 8B, Gemma2 9B, and an additional Qwen2.5 7B; total "team" size ~24B parameters.
  • Supervised fine-tuning (SFT): Instills backtracking via synthetic demonstrations, training the leader to reconcile correct versus incorrect samples, using ~16 leader samples per prompt over ~1 epoch.
  • MLPO training: For each prompt, 4 responses are sampled per agent, 4 distinct group-prompts are constructed, and "easy" tasks (≥75% peer-correct) are filtered out. Each group sample undergoes 4 leader rollouts. Hyperparameters include a learning rate of $2 \times 10^{-6}$, batch sizes of 64–128, and ~50K–100K gradient steps.
  • Computation: Only $L_\theta$ is optimized, reducing training FLOPs by roughly a factor of $1/K$ versus alternatives that co-train all agents (e.g., ACC-Collab). At inference, the MLPO system requires $K$ peer queries and 1 leader query per round, incurring higher per-task token costs than single-agent baselines, but with significant efficiency improvements over methods requiring all-agent retraining or multi-critic querying.

5. Empirical Performance and Ablations

MLPO delivers state-of-the-art results on several benchmarks:

| Setting                    | MMLU Accuracy | BBH Accuracy  | MATH Accuracy |
|----------------------------|:-------------:|:-------------:|:-------------:|
| SFT+MLPO (5 rounds)        | 0.782±0.006   | 0.882±0.005   | 0.762±0.005   |
| Single-agent RL (baseline) | 0.75–0.77     | 0.80–0.83     | 0.70–0.72     |
| MLPO zero-shot (no peers)  | 0.757         | 0.855         | 0.729         |
| Classic GRPO zero-shot     | 0.742         | 0.791         | 0.712         |

Ablation results include:

  • Multi-round MLPO+ (fine-tuning on later-round peer data) yields further gains: BBH 0.920, MMLU 0.792, MATH 0.771.
  • Filtering "easy" tasks provides a ~1.3% BBH gain (0.869 → 0.882).
  • Using 4 agent-solution sets per task achieves the best trade-off compared to 1 or 8 (BBH 0.917 for 4 sets).
  • Giving the leader access to both agent reasoning and final answers outperforms partial-information conditions.

Test-time scaling, under a fixed 40-sample budget, yields 3–5% gains over self-consistency baselines due to peer and leader output diversity.

6. Efficiency, Limitations, and Extensions

MLPO's efficiency derives from:

  • Training just a single leader, avoiding the computational and stability issues of updating all $K$ agents.
  • No auxiliary critic or value head, relying on group reward centering for implicit credit assignment.
  • FLOP reduction proportional to $1/K$ compared to multi-agent RL.

Limitations include:

  • Increased context window size due to concatenating $K$ peer responses.
  • Sequential peer→leader inference, reducing parallelism and increasing latency.
  • Higher per-prompt memory and compute at inference than single-agent baselines.

Potential research directions and extensions involve selective/sparse peer querying, retrieval and caching of peer solutions, hybrid integration with lightweight differentiable critics, and dynamic adaptation of team size $K$ or round count $T$ (Estornell et al., 11 Jul 2025).

7. Comparative Perspective and Insights

MLPO demonstrates that hierarchical delegation, training only a single flexible leader LLM, can subsume both the aggregation ("actor") and evaluation ("critic") functions usually distributed among agents or value networks. This design choice circumvents the instability and inefficiency of multi-agent or multi-critic systems while providing improved accuracy and robust generalization. Empirical findings indicate that leader exposure to heterogeneous peer outputs during training provides a richer exploration space even for subsequent single-agent deployment. Limitations related to context size and sequential computation remain areas for methodological innovation.
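The clipped objective from Section 2 can be sketched token-wise as follows. This is an illustrative NumPy implementation written from the stated definitions (binary rewards, group-mean-centered advantages, per-token importance ratios), not the authors' code, and it omits the KL penalty mentioned as a stabilizer:

```python
import numpy as np

def mlpo_loss(logp_new, logp_old, rewards, eps=0.2):
    """Clipped GRPO surrogate for one prompt.

    logp_new, logp_old: per-token log-prob arrays, one array per sampled output o_i
    rewards: binary reward R_i per output; advantage A_i = R_i - mean(R)
    Returns the negated objective J_MLPO, normalized by the total token count,
    so that maximizing J corresponds to minimizing this loss.
    """
    adv = np.asarray(rewards, dtype=float) - np.mean(rewards)  # A_i = R_i - R_bar
    total_tokens = sum(len(lp) for lp in logp_new)
    obj = 0.0
    for lp_new, lp_old, a in zip(logp_new, logp_old, adv):
        ratio = np.exp(np.asarray(lp_new) - np.asarray(lp_old))  # r_{i,t} per token
        clipped = np.clip(ratio, 1 - eps, 1 + eps)
        obj += np.sum(np.minimum(ratio * a, clipped * a))        # min(r*A, clip(r)*A)
    return -obj / total_tokens
```

When the policy has not moved ($r_{i,t} = 1$ everywhere), the surrogate reduces to the length-weighted sum of advantages, matching the formula term by term.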