
Multi-Turn GRPO for Efficient Sequential RL

Updated 6 February 2026
  • The paper introduces mtGRPO, a method that provides dense per-turn, group-relative feedback to address sparse and delayed rewards in multi-turn tasks.
  • It employs group-based normalization at each turn to improve credit assignment, leading to enhanced sample efficiency and convergence in sequential decision environments.
  • The approach is validated across domains such as autonomous driving, tool-integrated reasoning, and multi-agent collaboration, demonstrating significant empirical gains.

Multi-Turn Group Relative Policy Optimization (mtGRPO) generalizes group-based policy optimization to multi-turn sequential decision problems involving complex credit assignment. It provides dense, turn-level, group-relative feedback for each interaction step, supporting sample-efficient and stable reinforcement learning (RL) in scenarios where rewards are sparse or delayed. mtGRPO is a central innovation for high-performance multi-turn reasoning and tool-use with LLMs and multimodal agents, as demonstrated in domains such as autonomous driving, tool-integrated reasoning, and multi-agent collaboration (Li et al., 30 Jan 2026, Zhong et al., 3 Feb 2026, Ding et al., 18 Nov 2025, Hong et al., 17 Nov 2025, Hu et al., 24 Sep 2025).

1. Foundations and Motivation

Classical RL algorithms such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO) perform policy gradient updates using trajectory-level or episode-level rewards. In multi-turn interaction settings—where an agent must reason and act iteratively over several turns—rewards are typically sparse, delayed, and not attributable to individual decisions within the sequence. Standard approaches suffer from two key issues: (1) feedback sparsity, in which most actions receive no learning signal, and (2) poor credit assignment, where reward cannot distinguish which turns contributed positively or negatively to the final outcome (Li et al., 30 Jan 2026, Ding et al., 18 Nov 2025).

mtGRPO addresses these limitations by redefining the RL signal at each interaction step: it introduces per-turn reward collection and computes the group-relative advantage for each turn by contrasting the agent’s performance against a batch-level baseline within the same turn. This structure increases the density and specificity of learning signals, substantially enhancing convergence and generalization in long-horizon, multi-turn regimes (Li et al., 30 Jan 2026, Ding et al., 18 Nov 2025).

2. Multi-Turn Decision Process and Algorithmic Formulation

An mtGRPO episode is represented as a $T$-turn Markov decision process (MDP): at each turn $t \in \{1, \dots, T\}$, the agent observes a state $s_t$, takes an action $a_t$, and immediately receives a reward $r_t$. The state comprises all relevant information, including environmental context, interaction history, and turn-wise feedback. The objective is to maximize the cumulative discounted return $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$, where $\gamma$ is the discount factor (Li et al., 30 Jan 2026, Ding et al., 18 Nov 2025).
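The discounted return above can be computed for every turn with a single backward pass over the reward sequence. A minimal sketch (function name and shapes are illustrative, not from the papers):

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = sum_{k=t}^{T} gamma^(k-t) * r_k for every turn t."""
    G = np.zeros_like(rewards, dtype=float)
    running = 0.0
    # Walk backward so each G_t reuses the already-computed tail G_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

# Example: a sparse terminal reward is discounted back to earlier turns
print(discounted_returns(np.array([0.0, 0.0, 1.0]), gamma=0.9))
```

With `gamma=1.0` this reduces to the undiscounted reward-to-go used in some turn-level return formulations.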

mtGRPO proceeds as follows:

  • For each batch of $N$ rollouts and at each turn $t$, the algorithm collects per-turn rewards $\{r_{i,t}\}_{i=1}^{N}$.
  • It computes the turn-specific batch mean $b_t = \frac{1}{N}\sum_{i=1}^{N} r_{i,t}$ and standard deviation $\sigma_t$.
  • The group-relative advantage for each rollout at turn $t$ is $\hat{A}_{i,t} = (r_{i,t} - b_t)/\sigma_t$.
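The three steps above vectorize directly over a batch of per-turn rewards. A minimal sketch, assuming rewards are stored as an `(N, T)` array (the small epsilon guard is a common implementation convention, not from the papers):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's per-turn reward against the batch
    statistics of the same turn.

    rewards: array of shape (N, T) -- N rollouts, T turns.
    eps keeps the division finite when a turn's rewards collapse
    to a single value (zero standard deviation).
    """
    b = rewards.mean(axis=0, keepdims=True)       # b_t, per-turn batch mean
    sigma = rewards.std(axis=0, keepdims=True)    # sigma_t, per-turn std
    return (rewards - b) / (sigma + eps)          # A_hat_{i,t}

# Two rollouts alternate success/failure at a single turn:
# the winner gets advantage +1, the loser -1
print(group_relative_advantages(np.array([[1.0], [0.0], [1.0], [0.0]])))
```

Note that when every rollout receives the same reward at a turn, the advantages degenerate to zero and that turn contributes no gradient, which motivates the diversity mechanisms discussed in later sections.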

The per-token surrogate objective extends the PPO/GRPO loss to sum over both tokens and turns, with the group-relative advantage applied to each action token generated at the corresponding turn:

$$J_{\text{mtGRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{N}\sum_{i=1}^{N} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\Big( r_t(\theta)\,\hat{A}_{i,t},\ \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_{i,t} \Big) - \beta\, D_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big] \right]$$

where $r_t(\theta)$ is the importance weight, $\epsilon$ is the clipping parameter, $\beta$ is the KL penalty coefficient, and $\pi_{\mathrm{ref}}$ is a reference (e.g., SFT) policy (Li et al., 30 Jan 2026, Ding et al., 18 Nov 2025, Hong et al., 17 Nov 2025).
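The clipped min term of the objective can be sketched for one rollout's tokens as follows (the KL penalty is omitted for brevity; names are illustrative):

```python
import numpy as np

def mtgrpo_surrogate(ratio, adv, eps=0.2):
    """Clipped PPO-style surrogate with turn-level group-relative advantages.

    ratio: importance weights r_t(theta), one per action token
    adv:   advantage A_hat_{i,t}, broadcast to the tokens of turn t
    """
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    # Taking the elementwise min caps the incentive to push the
    # ratio far outside the trust region, exactly as in PPO/GRPO
    return np.minimum(unclipped, clipped).mean()

# A token whose ratio drifted to 1.5 contributes only the clipped
# value 1.2 under a positive advantage
print(mtgrpo_surrogate(np.array([1.0, 1.5]), np.array([1.0, 1.0])))
```

For negative advantages the same min selects the clipped term when the ratio falls below $1-\epsilon$, bounding the update magnitude in both directions.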

Pseudocode for mtGRPO (single-agent case) is as follows (Li et al., 30 Jan 2026):

for each RL iteration:
    # Sample N rollouts under the old policy
    for each rollout i in 1..N:
        for each turn t in 1..T:
            a_i_t = agent.act(s_i_t)
            s_i_(t+1), r_i_t = env.step(a_i_t)
    # Compute per-turn group-relative advantages
    for each turn t in 1..T:
        b_t = mean([r_i_t for i in 1..N])
        sigma_t = std([r_i_t for i in 1..N])
        for each rollout i in 1..N:
            A_hat_i_t = (r_i_t - b_t) / sigma_t
    # Policy update
    optimize J_mtGRPO over collected (states, actions, advantages)
    periodically update the reference policy pi_ref

3. Extensions: Multi-Turn Reasoning, Tool Use, and Multi-Agent Systems

mtGRPO has been extended across several domains, each adapting the group-relative framework to domain-specific challenges.

  • Multi-Turn Trajectory Refinement in Autonomous Driving: In MTDrive, mtGRPO supports an MLLM-based agent performing iterative trajectory refinement, where each turn’s reward is derived from perceptual driving model (PDM) metrics (e.g., collision, drivable area, time-to-collision) and is directly assigned to the tokens generated at that turn, providing efficient and targeted credit assignment (Li et al., 30 Jan 2026).
  • Multi-Turn Tool-Calling Agents: mtGRPO is integrated as Reward-Conditioned GRPO (RC-GRPO), which injects reward-mode tokens (e.g., <|high_reward|>/<|low_reward|>) at each rollout to promote within-group diversity and combat reward variance collapse. This yields robust policy updates even in environments with sparse or bimodal reward distributions (Zhong et al., 3 Feb 2026).
  • Tool-Integrated Reasoning with LLMs: Referred to as Group Turn Policy Optimization (GTPO), the method includes per-turn reward shaping (e.g., partial rewards for code-similar negative trajectories), turn-level returns, and discounting, which leads to superior question-answering and reasoning performance on mathematical benchmarks. Here, the group-relative advantage is computed at each reasoning step (Ding et al., 18 Nov 2025).
  • Hierarchical Multi-Agent Systems: In M-GRPO, the multi-turn framework is extended to handle a main agent (planner) and subordinate tool agents with different frequencies and response times. Group-relative advantages and PPO-style loss are computed separately for each agent, with batch-wide trajectory alignment to maintain synchronization despite stochastic agent invocation counts (Hong et al., 17 Nov 2025).

4. Reward Assignment, Advantage Estimation, and Credit Sharpening Strategies

A central property of mtGRPO is the densification and sharpening of credit assignment via group-relative normalization at each turn.

  • Per-Turn Reward Collection: mtGRPO collects individual step-wise rewards, often aggregating multiple informational channels (e.g., environmental signals, formatting, tool usage metrics), yielding $r_t = w_p \cdot p_t + w_f \cdot f_t$ in driving (Li et al., 30 Jan 2026), or $r_{i,j} = r_{\rm acc} + r_{\rm format}$ in tool reasoning (Ding et al., 18 Nov 2025).
  • Relative Advantage Computation: By normalizing each rollout’s return by the contemporaneous batch mean and standard deviation at the same turn, mtGRPO mitigates low-variance scenarios and ensures sustained gradient updates, even when the policy becomes peaked after extensive SFT (Zhong et al., 3 Feb 2026).
  • Reward Shaping: In tool-based reasoning, partial rewards are awarded for code similarity between failed and successful runs, based on embedding similarity (e.g., Titan Text Embeddings), further densifying the learning signal (Ding et al., 18 Nov 2025).
  • Reward-Conditioned Sampling: In RC-GRPO, explicit conditioning on reward tokens structurally introduces reward variance within each batch group, guaranteeing non-degenerate group normalization and improved advantage spread (Zhong et al., 3 Feb 2026).
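The reward-shaping idea for tool-based reasoning — granting a failed trajectory partial credit proportional to how similar its code is to a successful run — can be sketched as below. The function name, the `partial_weight` parameter, and the use of generic embeddings are illustrative assumptions; GTPO is reported to use Titan Text Embeddings for the similarity computation:

```python
import numpy as np

def shaped_reward(success, emb_fail, emb_success, partial_weight=0.5):
    """Return 1.0 for a successful trajectory; otherwise a partial
    reward scaled by the cosine similarity between the failed run's
    code embedding and a successful run's code embedding."""
    if success:
        return 1.0
    cos = np.dot(emb_fail, emb_success) / (
        np.linalg.norm(emb_fail) * np.linalg.norm(emb_success))
    # Clamp at zero so dissimilar failures get no reward at all
    return partial_weight * max(cos, 0.0)

# A failed run whose code embedding matches a successful one
# earns the full partial credit
print(shaped_reward(False, np.array([1.0, 0.0]), np.array([1.0, 0.0])))
```

Densifying rewards this way gives otherwise zero-reward rollouts a nonzero spread within the group, which in turn keeps the per-turn normalization in Section 4 from degenerating.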

5. System-Level Optimizations and Implementation

mtGRPO’s adoption in high-throughput multimodal RL settings necessitates system-level engineering (Li et al., 30 Jan 2026).

  • Inter-Process Streaming Serialization (IPSS) enables immediate tensor serialization and streaming to training workers as soon as a rollout is completed, minimizing idle time and maximizing device utilization.
  • Intra-Process Tensor Cache (IPTC) consolidates multimodal embeddings and tokenization across co-located modules (actor, reference, log-prob computation), reducing redundant deserialization and memory copies.
  • Both optimizations jointly achieve a $\sim 2.5\times$ speedup in wall-clock training throughput, which is critical for large-scale RL with high-dimensional inputs and multi-turn sequences.
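The core idea behind IPSS — hand each finished rollout to the training worker immediately instead of waiting for the whole batch — can be sketched with a standard producer/consumer queue. This is an illustrative single-process analogue, not the paper's inter-process implementation:

```python
import queue
import threading

rollout_queue = queue.Queue()
results = []

def rollout_worker(n_rollouts):
    # Producer: enqueue each serialized rollout as soon as it finishes,
    # rather than blocking until the whole batch completes.
    for i in range(n_rollouts):
        trajectory = {"id": i, "rewards": [0.1 * i]}  # placeholder rollout
        rollout_queue.put(trajectory)
    rollout_queue.put(None)  # sentinel: no more rollouts

def trainer():
    # Consumer: begin processing immediately, minimizing device idle time.
    while True:
        item = rollout_queue.get()
        if item is None:
            break
        results.append(item["id"])

producer = threading.Thread(target=rollout_worker, args=(4,))
consumer = threading.Thread(target=trainer)
producer.start(); consumer.start()
producer.join(); consumer.join()
print(results)  # [0, 1, 2, 3]
```

In the actual system the producer and consumer live in separate processes and the queue carries serialized tensors, but the overlap of rollout generation with training consumption is the same.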

In multi-agent distributed training, M-GRPO adopts a decoupled pipeline in which each agent operates independently, sharing only scalar reward statistics and trajectory identifiers via a lightweight database, ensuring scalable deployment without cross-server backpropagation or parameter sharing (Hong et al., 17 Nov 2025).

6. Empirical Results and Comparative Performance

mtGRPO and its extensions have demonstrated significant empirical gains across diverse benchmarks:

| Domain | Baseline (SFT/GRPO) | mtGRPO Variant | Reported Gain |
|---|---|---|---|
| Autonomous Driving | SFT: 88.1, GRPO: 94.2 | mtGRPO: 96.2 (oracle) | Exceeds VLM-driving/human |
| Tool Calling | SFT+GRPO: 48.75 | mtGRPO: 85.00 (Qwen) | Surpasses closed-API models |
| Reasoning (GTPO) | GRPO: 49.78 | GTPO: 51.26 | +3.0% average |
| Multi-Agent | Single: 54-58 | M-GRPO: 68-72 | +7% over main-only/frozen |
| Task Planning | Larger model: 0% SR | mtGRPO 1.5B: 70% SR | Outperforms 14B models |

In ablation studies, per-turn advantage normalization is identified as critical for convergence and for keeping policy gradients from vanishing. Qualitative analyses show that error correction (e.g., in trajectory rollouts) is localized, with successive turns targeting previously problematic steps (Li et al., 30 Jan 2026, Ding et al., 18 Nov 2025, Hu et al., 24 Sep 2025).

7. Theoretical Guarantees and Application Implications

Under appropriate assumptions (unique minimal-turn expert trajectories, dense verifiable rewards), mtGRPO provides formal guarantees: improvements in the group-based single-turn objective provably translate to higher multi-turn success probabilities and more sample-efficient policies, as established by backward induction arguments and explicit bounds on success probability (Hu et al., 24 Sep 2025). Empirical evidence supports strong cross-task generalization, minimal task completion times, and stable learning curves.

A plausible implication is that future research extending mtGRPO to more open-ended, partially observable, or hierarchical tasks will need to further refine group assignment, baseline computation, and advantage normalization techniques to accommodate the increased complexity of credit assignment.
