
Multi-task Group Relative Policy Optimization

Updated 30 January 2026
  • The paper extends standard PPO by applying group-normalized advantage estimation across heterogeneous tasks, enhancing stability and robust credit assignment.
  • Hierarchical M-GRPO segregates main and sub-agent updates via trajectory alignment, yielding faster convergence and smoother reward curves.
  • Empirical evaluations demonstrate GRPO variants outperform baselines in sample efficiency and accuracy across multi-turn planning, tool-augmented, and multi-objective settings.

Multi-task Group Relative Policy Optimization (GRPO) refers to a family of reinforcement learning (RL) algorithms that extend the core Group Relative Policy Optimization principle (group-normalized advantage estimation and PPO-style policy updates) across heterogeneous task sets or agent hierarchies. Its central aim is to stabilize and accelerate RL for models, notably LLMs and multi-agent systems, in settings where the policy must optimize over diverse, often disjoint, distributions or modular sub-problems. The most prominent hierarchical variant is M-GRPO, which enables distributed, credit-consistent RL for main/sub-agent architectures. Multi-task variants of GRPO address heterogeneous state-action spaces, reward scales, and policy specialization challenges in domains including multi-turn planning, tool-augmented reasoning, vision-language-action learning, and open-domain multi-objective alignment.

1. Foundations: GRPO as a Surrogate for Multi-task Policy Optimization

Group Relative Policy Optimization (GRPO) was introduced to mitigate variance and instability in large-scale RL applications, where standard PPO's reliance on a global value-function baseline leads to poor policy improvement in heterogeneous reasoning environments. The defining operation, given a batch of $K$ rollouts for the same query or task context, is to compute a standardized, group-relative advantage:

$$\hat A_t^g = \frac{R^{(k)} - \mu_q}{\sigma_q}$$

where $R^{(k)}$ is the return for rollout $k$, and $\mu_q$ and $\sigma_q$ are the group mean and standard deviation for that query. This advantage replaces the standard value-function baseline in PPO. The clipped surrogate policy loss is

$$L^{\text{GRPO}}(\theta) = \mathbb{E}\!\left[\min\!\left(r_t(\theta)\,\hat A_t^g,\ \mathrm{clip}\bigl(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon\bigr)\,\hat A_t^g\right)\right]$$

with $r_t(\theta)$ denoting the policy likelihood ratio at time $t$.

In the multi-task setting, each group is associated with a distinct task identity, subtask, or agent, and group-wise normalization ensures per-task learning signals are disentangled, yielding robustness to reward sparsity, delayed credit, and cross-task distributional shifts (Hong et al., 17 Nov 2025, Hu et al., 24 Sep 2025).
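The group-relative advantage and clipped loss above can be sketched in a few lines of NumPy; the function names and the small epsilon added to the denominator are illustrative choices, not part of the published formulation:

```python
import numpy as np

def group_relative_advantages(returns, eps=1e-8):
    """Standardize rollout returns within one group (a single query/task).

    returns: array of shape (K,), one scalar return per rollout.
    Returns the group-relative advantages (R_k - mu_q) / sigma_q.
    """
    returns = np.asarray(returns, dtype=np.float64)
    mu, sigma = returns.mean(), returns.std()
    return (returns - mu) / (sigma + eps)  # eps guards against zero variance

def grpo_clipped_loss(ratio, adv, clip_eps=0.2):
    """PPO-style clipped surrogate using group-relative advantages.

    ratio: per-step policy likelihood ratios pi_theta / pi_old.
    adv:   broadcastable group-relative advantages.
    Returns the negated clipped objective (a quantity to minimize).
    """
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -np.minimum(unclipped, clipped).mean()
```

Because the baseline is the group mean, advantages within a group always sum to (approximately) zero, which is what disentangles the per-task learning signal.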

2. Hierarchical M-GRPO for Multi-Agent and Modular Systems

A fundamental extension is hierarchical Multi-agent GRPO (M-GRPO), devised for vertical multi-agent systems comprising a top-level planner (main agent, $\mathcal{M}$) and multiple specialized sub-agents ($\mathcal{S}$) executing delegated tool-use subproblems. For each query $q$, $K$ rollouts are generated; the main agent trajectory is

$$\tau^{(k)}_{\mathcal{M}} = \{(s^{(k)}_{\mathcal{M},t},\ a^{(k)}_{\mathcal{M},t},\ r^{(k)}_{\mathcal{M},t})\}_{t=1}^{T^{(k)}_{\mathcal{M}}}$$

with $d_k$ sub-agent invocations per rollout, each producing its own trajectory $\tau^{(k)}_{\mathcal{S}_i}$.

Advantages are computed separately for main and sub-agents:

  • Main agent:

$$\hat A^{(k)}_{q,\mathcal{M},t} = \frac{\mathcal{R}_{\mathcal{M}}(o^{(k)}_{\mathcal{M}}) - \mu_{q,\mathcal{M}}}{\sigma_{q,\mathcal{M}}}$$

  • Sub-agent (after alignment):

$$\hat A^{(k),i}_{q,\mathcal{S},t} = \frac{\mathcal{R}_{\mathcal{S}}(o^{(k)}_{\mathcal{S}_i}) - \mu_{q,\mathcal{S}}}{\sigma_{q,\mathcal{S}}}$$

Credit assignment is preserved hierarchically by incorporating the main agent’s correctness into sub-agent rewards. Each agent optimizes its own PPO-style clipped loss over the respective group-aligned trajectories. This stratification eliminates the confounding of gradients, prevents sub-optimal minimization across disjoint objectives, and results in scalable, robust learning across agent boundaries (Hong et al., 17 Nov 2025).
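The separate main/sub-agent advantage computation can be sketched as below. The additive correctness bonus (`bonus`) is an assumed reward-shaping scheme standing in for the paper's exact composition of main-agent correctness into sub-agent rewards:

```python
import numpy as np

def hierarchical_advantages(main_returns, sub_returns, main_correct,
                            bonus=0.5, eps=1e-8):
    """Compute group-relative advantages separately per agent level (M-GRPO sketch).

    main_returns: (K,) returns of the K main-agent rollouts for one query.
    sub_returns:  (K, d) returns of the d aligned sub-agent trajectories per rollout.
    main_correct: (K,) 0/1 main-agent correctness, folded into sub-agent rewards
                  via an assumed additive bonus for hierarchical credit assignment.
    """
    main_returns = np.asarray(main_returns, dtype=float)
    # Sub-agent rewards inherit the main agent's correctness signal.
    sub = (np.asarray(sub_returns, dtype=float)
           + bonus * np.asarray(main_correct, dtype=float)[:, None])
    # Each level is normalized against its own group statistics only.
    adv_main = (main_returns - main_returns.mean()) / (main_returns.std() + eps)
    adv_sub = (sub - sub.mean()) / (sub.std() + eps)
    return adv_main, adv_sub
```

Keeping the two normalizations disjoint is the point: a strong main-agent batch never deflates (or inflates) the sub-agents' advantages, and vice versa.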

3. Trajectory Alignment and Decoupled Distributed Optimization

M-GRPO introduces a trajectory alignment protocol to accommodate the stochastic number of sub-agent invocations per main-agent rollout. All rollouts are padded (via random duplicates) or truncated to a fixed $d$ per batch, yielding a uniform batch tensor and enabling group normalization:

  • If $d_k < d$: sub-trajectories are duplicated at random.
  • If $d_k > d$: surplus sub-trajectories are randomly discarded.
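The padding/truncation rule can be sketched as follows (the function name is illustrative; this assumes each rollout invokes at least one sub-agent):

```python
import random

def align_sub_trajectories(sub_trajs, d, rng=random):
    """Pad (by random duplication) or truncate each rollout's sub-trajectory
    list to exactly d entries, yielding a uniform (K, d) batch layout.

    sub_trajs: list of K lists, one non-empty list of trajectories per rollout.
    d:         target number of sub-trajectories per rollout.
    """
    aligned = []
    for trajs in sub_trajs:
        if len(trajs) < d:
            # Pad with random duplicates of existing sub-trajectories.
            trajs = trajs + [rng.choice(trajs) for _ in range(d - len(trajs))]
        elif len(trajs) > d:
            # Randomly discard surplus sub-trajectories.
            trajs = rng.sample(trajs, d)
        aligned.append(trajs)
    return aligned
```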

Agents are executed on separate compute clusters in a decoupled pipeline: only scalar rewards and rollout IDs are exchanged via a shared store, with no cross-agent backpropagation. Each agent independently accumulates gradients for its policy, which supports both scalability and modular deployment without entangling model updates (Hong et al., 17 Nov 2025).
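Since only scalar rewards and rollout IDs cross the agent boundary, the shared store can be as simple as a thread-safe key-value map. The class and method names below are illustrative; a real multi-cluster deployment would back this with a networked store rather than an in-process lock:

```python
import threading

class RewardStore:
    """Minimal shared-store sketch: agents exchange only scalar rewards
    keyed by rollout ID -- no gradients or activations cross the boundary."""

    def __init__(self):
        self._lock = threading.Lock()
        self._rewards = {}

    def publish(self, rollout_id, reward):
        """Record the scalar reward produced for one rollout."""
        with self._lock:
            self._rewards[rollout_id] = float(reward)

    def fetch(self, rollout_id, default=None):
        """Read a reward by rollout ID; returns `default` if not yet published."""
        with self._lock:
            return self._rewards.get(rollout_id, default)
```

The narrow interface is what makes the pipeline decoupled: each agent's optimizer sees only its own trajectories plus these scalars, so model updates never entangle.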

4. Empirical Evaluation Across Multi-task and Hierarchical Domains

M-GRPO has been evaluated on tool-augmented reasoning suites such as GAIA, XBench-DeepSearch, and WebWalkerQA. Baselines include:

  • Single-agent GRPO: monolithic policy for all tools/tasks.
  • Multi-agent, main-only GRPO: sub-agents are frozen.

Core findings:

  • M-GRPO exhibits much smoother reward curves, faster convergence, and consistently outperforms both baselines by 5–10% absolute accuracy throughout the co-training phase.
  • Trajectory synchronization (alignment) is essential for on-policy correspondence and stability.
  • Joint optimization of main and sub-agents outperforms variants where sub-agents remain fixed, confirming the necessity of hierarchical, non-frozen credit assignment (Hong et al., 17 Nov 2025).

Multi-task GRPO is also instantiated in domains beyond hierarchical agent architectures:

  • Parameter Sharing across Heterogeneous MDPs: When tasks are formulated as finite-horizon MDPs with distinct state/action spaces and reward functions, GRPO groups samples by task identity and optimizes a weighted sum of surrogate objectives, using per-group normalization for balanced progress and cross-task generalization (Hu et al., 24 Sep 2025).
  • Hyperparameter Optimization: GRPOformer applies group-based advantage normalization to hyperparameter search policies over diverse model architectures, synchronizing batch construction across multiple search trials and combining this with policy-churn regularization for stability (Guo et al., 21 Sep 2025).
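A minimal sketch of the weighted per-task surrogate described in the first bullet, assuming the task weights are given and each task contributes one group of rollouts (names and the epsilon term are illustrative):

```python
import numpy as np

def multitask_grpo_loss(task_batches, weights, clip_eps=0.2, eps=1e-8):
    """Weighted sum of per-task GRPO surrogate losses.

    task_batches: dict task_id -> (ratios, returns); each entry holds the
                  per-rollout likelihood ratios and returns for one task group.
    weights:      dict task_id -> scalar task weight (assumed given).
    Normalization is per task group, so reward scales never mix across tasks.
    """
    total = 0.0
    for task_id, (ratios, returns) in task_batches.items():
        r = np.asarray(returns, dtype=float)
        adv = (r - r.mean()) / (r.std() + eps)  # per-group baseline
        ratios = np.asarray(ratios, dtype=float)
        clipped = np.clip(ratios, 1 - clip_eps, 1 + clip_eps) * adv
        total += weights[task_id] * -np.minimum(ratios * adv, clipped).mean()
    return total
```

Because each task is standardized against its own group statistics, a task with large raw rewards cannot dominate the weighted objective.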

These frameworks give rise to empirical phenomena such as monotonic improvement guarantees in multi-turn success probabilities and sample efficiency advantages in single-turn decomposed settings (Hu et al., 24 Sep 2025).

5. Theoretical Properties and Practical Significance

Key theoretical elements:

  • Group normalization naturally balances gradient contributions across diverse tasks, preventing overemphasis on “easier” or high-data regimes and mitigating catastrophic forgetting.
  • Per-group baselines guarantee, under mild regularity, superior expected returns relative to any static reference policy and (in strictly decomposable tasks) monotonic end-to-end success amplification by backward induction (Hu et al., 24 Sep 2025).
  • In hierarchical settings, explicit separation of group baselines for main and sub-agents prevents interaction instabilities associated with shared baselines across non-stationary modules (Hong et al., 17 Nov 2025).

The practical impact includes:

  • Dramatically improved stability and convergence speed in multi-agent RL for tool-augmented LLMs.
  • Support for fully decoupled, distributed policy-training pipelines necessitated by real-world scalability requirements.
  • Systematic, group-wise credit assignment that generalizes across state-, task-, or agent-partitioned domains.

6. Future Directions and Open Considerations

Open research avenues include:

  • Automatic group or task discovery beyond manual specification, via unsupervised clustering or episodic similarity.
  • Generalization of trajectory alignment and hierarchical baselining to arbitrary deep modular networks.
  • Alternatives to reward signal management such as meta-learned scalarization or adaptive variance correction to further stabilize multi-objective or open-domain optimization.

Future work may also refine communication protocols for decentralized agent optimization and explore the unification of M-GRPO with adaptive curriculum learning, off-policy replay, and meta-learning extensions for evolving task distributions (Hong et al., 17 Nov 2025, Hu et al., 24 Sep 2025).


