MLPO: Multi-agent Guided Leader Policy Optimization
- The paper introduces MLPO, a framework in which a single trainable leader LLM learns to evaluate and synthesize the outputs of fixed peer agents for improved collaboration.
- It employs a hierarchical RL structure with Group Relative Policy Optimization, leveraging a group reward mechanism for robust, efficient multi-agent coordination.
- Empirical results demonstrate state-of-the-art accuracy on benchmarks with reduced computational cost compared to co-training all agents.
Multi-agent guided Leader Policy Optimization (MLPO) is a hierarchical reinforcement learning framework for collaborative reasoning in multi-agent LLM systems. MLPO trains a single leader LLM to coordinate a team of fixed, untrained peer agents, optimizing the leader’s ability to evaluate and synthesize their outputs for downstream task accuracy. Unlike prior multi-agent RL approaches that co-train all agents—or rely on explicit value networks or dense peer feedback—MLPO leverages a group reward mechanism to guide the leader’s autoregressive policy, enabling both efficient coordination and improved generalization in both multi-agent and standalone settings (Estornell et al., 11 Jul 2025).
## 1. Hierarchical Multi-Agent Framework
MLPO structures the system as a hierarchy comprising one trainable leader LLM and $K$ fixed, off-the-shelf peer agents. Inference proceeds in $T$ discrete rounds, where:

- Round 0: Each peer $i$ independently generates a solution $a_i^{(0)}$ for the input prompt $q$. The leader ingests $q$ and all $a_i^{(0)}$ and outputs $\ell_0$, structured with a chain-of-thought and an `<answer>` tag.
- Rounds $t = 1, \dots, T-1$: Each peer refines its previous answer conditioned on the leader's prior output $\ell_{t-1}$, producing $a_i^{(t)}$. The leader then aggregates these into $\ell_t$. After $T$ rounds, the final answer is read from the leader's `<answer>` block.

Crucially, only the leader is trainable; peer models remain fixed throughout.

## 2. Learning Algorithm and Objective

The training procedure consists of two phases: optional supervised fine-tuning (SFT) followed by policy-gradient RL, specifically Group Relative Policy Optimization (GRPO):

- Policy Formulation: The leader's policy is $\pi_\theta(\ell \mid q, A)$, with $A = \{a_1, \dots, a_K\}$ the set of peer responses and $(q, A)$ the combined token sequence on which the leader conditions.

- Reward Structure: Each leader output trajectory receives a binary scalar reward $R \in \{0, 1\}$, with $R = 1$ for correctly formatted, correct answers; $R = 0$ otherwise.

- GRPO Objective: Extending PPO to group settings, for each prompt $G$ independent leader outputs $\ell_1, \dots, \ell_G$ are sampled. The importance ratio $r_{i,t}(\theta) = \pi_\theta(\ell_{i,t} \mid q, A, \ell_{i,<t}) \,/\, \pi_{\theta_{\mathrm{old}}}(\ell_{i,t} \mid q, A, \ell_{i,<t})$ is computed per token. The centered advantage $\hat{A}_i = R_i - \bar{R}$, with $\bar{R} = \tfrac{1}{G}\sum_{j=1}^{G} R_j$ the group mean reward, is used. The MLPO (GRPO) loss maximized is:

$$\mathcal{J}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \sum_{t=1}^{|\ell_i|} \min\!\Big( r_{i,t}(\theta)\,\hat{A}_i,\ \operatorname{clip}\big(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_i \Big) \right] - \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)$$

Practical considerations include omitting value networks, using grouped normalization ("Dr.GRPO"), KL-penalization for stability, and a PPO-style clipping parameter $\epsilon = 0.2$.

## 3. Implicit Evaluation and Synthesis Capabilities

MLPO eschews explicit value networks or peer feedback. Instead, the leader receives its reward signal based on the end-to-end utility of reasoning over the peer set, resulting in:

- Implicit Peer Response Evaluation: The policy gradient incentivizes the leader to preferentially internalize reasoning patterns that correctly synthesize or reject peer proposals in a reward-maximizing manner.

- Internalized Criticism: Through GRPO's centered advantage signals, the leader acquires an implicit value function, enabling it to both synthesize actionable reasoning and self-evaluate the contribution of each peer response.
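The critic-free credit assignment above can be made concrete with a minimal NumPy sketch of the group-relative advantage and the clipped per-token surrogate. The function names, the group size, and the reward values are illustrative, not from the paper:

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO's critic-free advantage estimate: center each group's
    binary rewards at the group mean, A_i = R_i - mean(R)."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards - rewards.mean()

def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """PPO-style clipped objective for one sampled leader output;
    the same scalar advantage is applied to every token, and tokens
    are summed without length normalization (the "Dr.GRPO" variant)."""
    ratio = np.exp(logp_new - logp_old)           # per-token importance ratio
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    return np.minimum(unclipped, clipped).sum()

# Hypothetical group of G=4 leader rollouts for one prompt: two correct, two not.
rewards = [1.0, 1.0, 0.0, 0.0]
adv = group_relative_advantages(rewards)
print(adv)  # [ 0.5  0.5 -0.5 -0.5]
```

Note that the centered advantages always sum to zero across the group, so correct rollouts are reinforced exactly in proportion to how rare correctness was within that group; this is the implicit value function the leader internalizes.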
## 4. Implementation Specifics

- Leader LLM Architecture: Qwen2.5 7B Instruct (~7B parameters).

- Peer Agents: Llama 3.1 8B, Gemma2 9B, and an additional Qwen2.5 7B; total "team" size ~24B parameters.

- Supervised Fine-Tuning (SFT): Instills backtracking via synthetic demonstrations, training the leader to reconcile correct versus incorrect samples, using ~16 leader samples per prompt across ~1 epoch.

- MLPO Training: For each prompt, 4 responses are sampled per agent, 4 distinct group-prompts are constructed, and "easy" tasks (≥75% peer-correct) are filtered out. Each group sample undergoes 4 leader rollouts. Hyperparameters include batch sizes of 64–128 and ~50K–100K gradient steps.

- Computation: Only the leader is optimized, reducing training FLOPs by approximately a factor of $1/K$ versus multi-agent alternatives that co-train all $K$ agents (e.g., ACC-Collab). At inference, the MLPO system requires $K$ peer queries and one leader query per round—incurring higher per-task token costs than single-agent baselines, yet with significant efficiency improvements over methods requiring all-agent retraining or multi-critic querying.

## 5. Empirical Performance and Ablations

MLPO delivers state-of-the-art results on several benchmarks:

| Setting | MMLU Accuracy | BBH Accuracy | MATH Accuracy |
|----------------------------|:-------------:|:-------------:|:-------------:|
| SFT+MLPO (5 rounds) | 0.782±0.006 | 0.882±0.005 | 0.762±0.005 |
| Single-agent RL (baseline) | 0.75–0.77 | 0.80–0.83 | 0.70–0.72 |
| MLPO Zero-shot (no peers) | 0.757 | 0.855 | 0.729 |
| Classic GRPO Zero-shot | 0.742 | 0.791 | 0.712 |

Ablation results include:

- Multi-round MLPO+ (fine-tuning on later-round peer data) yields further gains: BBH 0.920, MMLU 0.792, MATH 0.771.

- Filtering "easy" tasks provides a ~1.3% BBH gain (0.869 → 0.882).

- Using 4 agent-solution sets per task achieves the optimal trade-off compared to 1 or 8 (BBH 0.917 for 4 sets).

- Leader access to both agent reasoning and final answers outperforms partial-information conditions.

Test-time scaling, under a fixed 40-sample budget, demonstrates 3–5% gains over self-consistency baselines due to peer and leader output diversity.

## 6. Efficiency, Limitations, and Extensions

MLPO's efficiency derives from:

- Training just a single leader, avoiding the computational and stability issues of updating all agents.

- No auxiliary critic or value head, leveraging group reward centering for implicit credit assignment.

- FLOP reduction proportional to $1/K$ compared to multi-agent RL.

Limitations include:

- Increased context window size due to concatenating peer responses.

- Sequential peer→leader inference, reducing parallelism and increasing latency.

- Higher per-prompt memory and compute at inference than single-agent baselines.

Potential research directions and extensions involve selective/sparse peer querying, retrieval and caching of peer solutions, hybrid integration with lightweight differentiable critics, and dynamic adaptation of team size or round count (Estornell et al., 11 Jul 2025).

## 7. Comparative Perspective and Insights

MLPO demonstrates that hierarchical delegation—training only a single, flexible leader LLM—can subsume both the aggregation ("actor") and evaluation ("critic") functions usually distributed among agents or value networks. This design choice circumvents the instability and inefficiency of multi-agent or multi-critic systems, while providing improved accuracy and robust generalization. Empirical findings indicate that leader exposure to heterogeneous peer outputs during training provides a richer exploration space even for subsequent single-agent deployment. Limitations related to context size and sequential computation remain areas for methodological innovation.
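As a concrete illustration of the round-based leader–peer protocol described in Section 1, the following sketch runs $T$ rounds of peer generation and leader synthesis, then reads the final answer from the leader's `<answer>` block. The `peers`/`leader` callables and the prompt formatting are hypothetical stand-ins for the actual model queries:

```python
from typing import Callable, List

def mlpo_inference(prompt: str,
                   peers: List[Callable[[str], str]],
                   leader: Callable[[str], str],
                   num_rounds: int = 5) -> str:
    """Round 0: peers answer independently; rounds t >= 1: peers refine
    against the leader's last synthesis. The leader aggregates every
    round, and only its final <answer> block is returned."""
    leader_output = ""
    for t in range(num_rounds):
        if t == 0:
            answers = [peer(prompt) for peer in peers]
        else:
            answers = [peer(f"{prompt}\nLeader guidance: {leader_output}")
                       for peer in peers]
        joined = "\n".join(f"Peer {i}: {a}" for i, a in enumerate(answers))
        leader_output = leader(f"{prompt}\n{joined}")  # leader synthesizes
    # Extract the final answer from the leader's <answer>...</answer> span.
    start = leader_output.find("<answer>") + len("<answer>")
    end = leader_output.find("</answer>")
    return leader_output[start:end].strip()
```

The sequential structure is visible here: each round issues $K$ peer queries followed by one leader query, which is the source of the latency and context-growth limitations noted in Section 6.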