
Turn-Aware PPO for Multi-Turn RL

Updated 14 January 2026
  • Turn-Aware PPO is a reinforcement learning algorithm that redefines the Markov Decision Process at the granularity of turns to improve stability.
  • It employs turn-level advantage estimation via geometric mean aggregation and a clipped surrogate objective to reduce gradient variance and misaligned credit assignment.
  • Empirical evaluations on environments like WebShop and Sokoban show higher cumulative returns, enhanced sample efficiency, and more robust policy improvement compared to token-level PPO.

Turn-Aware PPO (TA-PPO) is a reinforcement learning algorithm designed to address instability and misaligned credit assignment when training LLM agents in multi-turn, interactive task environments. Unlike standard Proximal Policy Optimization (PPO), which operates at the token level, TA-PPO redefines the Markov Decision Process (MDP) at the granularity of turns—where each turn constitutes a full agent response to an environment query. This structural realignment enables lower-variance advantage estimation, robust policy improvement, and stable training dynamics for agentic LLMs in complex multi-turn domains such as web navigation and multi-hop reasoning (Li et al., 18 Dec 2025, Li et al., 25 Nov 2025).

1. Background: Standard PPO and Its Limitations in Multi-Turn Settings

PPO maximizes a clipped surrogate objective to constrain policy updates and ensure stable learning without the computational cost of trust-region approaches. The classical clipped objective is

$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat A_t,\ \mathrm{clip}(r_t(\theta),\,1-\varepsilon,\,1+\varepsilon)\,\hat A_t\big)\Big]$$

where $r_t(\theta)$ is the probability ratio of the new and old policies at token $t$, $\hat A_t$ is the per-token advantage, and $\varepsilon$ controls the update trust region (Schulman et al., 2017). Standard PPO assumes temporally homogeneous transitions (i.e., token-level steps), but multi-turn environments naturally decompose into discrete interaction phases (“turns”) with delayed or sparse rewards and non-stationary transitions. Token-level PPO leads to (1) high-variance advantage estimates, (2) unstable gradient norms, and (3) misaligned importance sampling, which destabilizes training for large LLM agents (Li et al., 18 Dec 2025, Li et al., 25 Nov 2025).
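As a concrete illustration of the clipping behavior, a single-sample sketch of the objective (a minimal example, not tied to any particular library):

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO clipped objective for one sample (value to be maximized)."""
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1 + eps), 1 - eps) * advantage
    # Taking the min makes the objective pessimistic: ratio moves in the
    # advantage's direction gain nothing beyond the clip boundary.
    return min(unclipped, clipped)

# With a positive advantage, a ratio of 1.8 is capped at 1 + eps = 1.2:
print(clipped_surrogate(1.8, 1.0))
```

Note that the `min` is one-sided: it caps the gain from moving the ratio far in the advantage's direction, but never hides a loss from moving against it.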

2. Turn-Level MDP Formulation

TA-PPO refactors the RL environment as a turn-level MDP $(S, A, T, R, \gamma, H)$:

  • States ($s_t$): The history of queries and responses up to turn $t-1$, concatenated with the current query $Q_t$.
  • Actions ($a_t$): The full agent (LLM) response $R_t$, comprising all tokens in that turn.
  • Transition ($T$): The environment supplies the next query $Q_{t+1}$ after observing the agent's response.
  • Rewards ($r_t$): Typically $r_t = 0$ for $t < N$ and $r_N = r_{\mathrm{final}}$; shaped per-token rewards can be summed into $r_t$.
  • Discount ($\gamma$): Applied per turn, so the return from turn $t$ is $G_t = \sum_{k=t}^{N} \gamma^{k-t}\, r_k$.
  • Termination ($H$): Fixed horizon or task completion (e.g., “buy” in WebShop, puzzle solved in Sokoban).
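A rollout at this turn granularity can be sketched as follows; `ToyEnv` and `EchoAgent` are hypothetical stand-ins for a real environment and LLM policy, and the state is abbreviated to the current query rather than the full history:

```python
class ToyEnv:
    """Hypothetical 3-turn environment: zero reward until the final turn."""
    def __init__(self, n_turns=3):
        self.n_turns, self.t = n_turns, 0
    def reset(self):
        self.t = 0
        return "Q1"                      # initial query Q_1
    def step(self, action):
        self.t += 1
        done = self.t == self.n_turns
        reward = 1.0 if done else 0.0    # sparse terminal reward r_N
        return f"Q{self.t + 1}", reward, done

class EchoAgent:
    """Stand-in for an LLM policy: one full response per query."""
    def respond(self, state):
        return f"R({state})"

def rollout(env, agent, horizon=10):
    """Collect one episode as a list of (state, action, reward) turns."""
    turns, state = [], env.reset()
    for _ in range(horizon):
        action = agent.respond(state)             # full turn response a_t
        next_state, reward, done = env.step(action)
        turns.append((state, action, reward))
        state = next_state
        if done:
            break
    return turns

episode = rollout(ToyEnv(), EchoAgent())
```

The point of the sketch is that each list entry is one turn-level transition: the critic and the advantage estimator below see exactly one `(state, action, reward)` per turn, never per token.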

This granularity allows the agent and critic to operate at the same decision timescale, reducing temporal mismatch and variance in policy-gradient estimation (Li et al., 18 Dec 2025).

3. Mathematical Derivation: Turn-Level Advantage Estimation and Surrogate Objective

TA-PPO applies Generalized Advantage Estimation (GAE) at the turn level:

  • Temporal-difference error: $\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$
  • Turn advantage:

$$\hat A^{\mathrm{turn}}_t = \sum_{l=0}^{N-t-1} (\gamma\lambda)^l\, \delta_{t+l}$$

The surrogate objective replaces per-token ratios with turn-wise ratios:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$$

Clipping is applied once per turn:

$$L^{\mathrm{actor}}(\theta) = -\frac{1}{M} \sum_{i,t} \min\big(r^i_t(\theta)\,\hat A^i_t,\ \mathrm{clip}(r^i_t(\theta),\,1-\varepsilon,\,1+\varepsilon)\,\hat A^i_t\big)$$

The critic is trained with a squared-error loss against the turn-level return.
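The turn-level GAE recursion can be written in a few lines (a minimal illustration of the formulas above, not the authors' implementation):

```python
def turn_gae(rewards, values, gamma=0.99, lam=0.9):
    """Turn-level GAE. `rewards[t]` is the reward after turn t; `values`
    holds one critic estimate per turn plus a bootstrap entry V(s_{N+1})
    (0.0 for terminated episodes). Returns one advantage per turn."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Backward recursion accumulates sum_l (gamma * lam)^l * delta_{t+l}.
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Sparse-reward episode of three turns with an untrained critic (V = 0):
print(turn_gae([0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0]))
```

With a zero critic the advantages reduce to discounted returns, so the terminal reward decays by a factor of $\gamma\lambda = 0.891$ per earlier turn.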

4. Turn-Level Importance Sampling and Algorithmic Structure

TA-PPO introduces turn-level importance weights by grouping the tokens of a turn $y^k$ and assigning a geometric-mean weight:

$$w_k^{\mathrm{turn}}(\theta) = \left( \frac{\pi_\theta(y^k \mid x, y^{<k})}{\pi_{\theta_{\mathrm{old}}}(y^k \mid x, y^{<k})} \right)^{1/|y^k|}$$

or equivalently,

$$w_k^{\mathrm{turn}}(\theta) = \exp\left(\frac{1}{|y^k|}\sum_{t=t_k^{\mathrm{start}}}^{t_k^{\mathrm{end}}} \log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\theta_{\mathrm{old}}}(y_t \mid x, y_{<t})}\right)$$
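The log-space form is the one to implement, since it stays numerically stable for long turns (a sketch assuming per-token log-probabilities are already available):

```python
import math

def turn_weight(logp_new, logp_old):
    """Geometric-mean importance weight for one turn: the exponential of
    the length-normalized sum of per-token log-ratios."""
    n = len(logp_new)
    mean_log_ratio = sum(a - b for a, b in zip(logp_new, logp_old)) / n
    return math.exp(mean_log_ratio)

# An unchanged policy gives weight exactly 1:
print(turn_weight([-3.0], [-3.0]))  # prints 1.0
```

The `1/n` normalization is what distinguishes this from a naive product of token ratios: a long turn cannot blow up the weight merely by having many tokens.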

Turn-level advantage is aggregated per turn. The Turn-PPO update loop involves:

  • Sampling trajectories and collecting rewards/advantages at turn boundaries.
  • Computing turn-wise weights and advantages.
  • Optimizing the clipped surrogate objective and critic loss via minibatch SGD (Li et al., 18 Dec 2025, Li et al., 25 Nov 2025).

5. Empirical Findings and Comparative Performance

Empirical evaluations on WebShop and Sokoban demonstrate that TA-PPO:

  • Achieves higher cumulative returns and success rates versus GRPO and token-level PPO.
  • Successfully stabilizes runs where GRPO crashes.
  • Maintains importance ratios near the clipping range $[1-\varepsilon,\, 1+\varepsilon]$, supporting more conservative updates.
  • Outperforms baselines in sample efficiency—requiring fewer environment steps for target performance.

Selected results:

| Environment | Model | GRPO | Token-PPO | Turn-PPO |
|---|---|---|---|---|
| WebShop | Qwen2.5-3B | 0.72 | 0.73 | 0.75 |
| WebShop | Qwen3-1.7B (t.) | Crash | 0.54 | 0.55 |
| Sokoban | Qwen2.5-3B | Crash | 1.93 | 2.29 |
| Sokoban | Qwen2.5-7B | Crash | 2.90 | 3.74 |

Ablation studies confirm that per-turn clipping yields higher stability than per-token approaches (Li et al., 18 Dec 2025).

6. Theoretical Motivation: Variance Reduction and Credit Assignment

TA-PPO’s design is motivated by:

  • Granularity alignment: Agentic LLM tasks exhibit natural turn-based decomposition.
  • Variance reduction: Geometric mean aggregation of token-level ratios smooths policy updates, reducing the likelihood of gradient spikes and clipping bias growth.
  • Improved credit assignment: Turn-level updates produce a balanced trade-off between trajectory-level and token-level credit—a structure shown to empirically stabilize learning (Li et al., 25 Nov 2025).

7. Practical Implementation Guidelines and Limitations

Recommended practices for TA-PPO include:

  • Employing turn-level MDPs when environment responses consist of large text chunks.
  • Initializing the value head from the same pretrained model as the policy for rapid critic convergence.
  • Setting the critic learning rate 5–10× higher than the actor’s to accelerate critic accuracy.
  • Tuning $(\gamma, \lambda)$ at the turn level ($\gamma = 0.99$, $\lambda = 0.9$ typical) for appropriate bias–variance control.
  • Using smaller batch sizes and fewer epochs to prevent overfitting.
  • Setting batch diversity $G = 1$ (one episode per query), which is typically sufficient for scenario coverage.
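The guidelines above might be collected into a single hyperparameter sketch; the key names and the actor learning rate are illustrative assumptions, not the papers' configuration schema:

```python
# Hypothetical TA-PPO hyperparameter sketch; names are illustrative.
ACTOR_LR = 1e-6  # assumed actor learning rate, for scale only

ta_ppo_config = {
    "gamma": 0.99,               # turn-level discount
    "lam": 0.9,                  # turn-level GAE lambda
    "clip_eps": 0.2,             # clipping range for turn-wise ratios
    "actor_lr": ACTOR_LR,
    "critic_lr": 10 * ACTOR_LR,  # critic lr 5-10x the actor's
    "ppo_epochs": 1,             # few epochs to avoid overfitting
    "group_size": 1,             # G = 1 episode per query
}
```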

Current validations are limited to text-only environments; extending TA-PPO to richer tool-use, multi-modal, or hierarchical RL settings presents open research directions (Li et al., 18 Dec 2025). Automatic turn segmentation remains an unresolved challenge when environment boundaries are implicit.

8. Conclusion

TA-PPO successfully reconciles the decision granularity of interactive language agents with the RL optimization framework, yielding stable, sample-efficient training and robust policy improvement in multi-turn domains. By restructuring both advantage calculation and importance weighting at the turn level, this methodology overcomes instability endemic to token-level PPO, specifically for long-horizon, sparse-reward environments characteristic of agentic LLM deployments (Li et al., 18 Dec 2025, Li et al., 25 Nov 2025).
