Turn-Aware PPO for Multi-Turn RL
- Turn-Aware PPO is a reinforcement learning algorithm that redefines the Markov Decision Process at the granularity of turns to improve stability.
- It employs turn-level advantage estimation via geometric mean aggregation and a clipped surrogate objective to reduce gradient variance and misaligned credit assignment.
- Empirical evaluations on environments like WebShop and Sokoban show higher cumulative returns, enhanced sample efficiency, and more robust policy improvement compared to token-level PPO.
Turn-Aware PPO (TA-PPO) is a reinforcement learning algorithm designed to address instability and misaligned credit assignment when training LLM agents in multi-turn, interactive task environments. Unlike standard Proximal Policy Optimization (PPO), which operates at the token level, TA-PPO redefines the Markov Decision Process (MDP) at the granularity of turns—where each turn constitutes a full agent response to an environment query. This structural realignment enables lower-variance advantage estimation, robust policy improvement, and stable training dynamics for agentic LLMs in complex multi-turn domains such as web navigation and multi-hop reasoning (Li et al., 18 Dec 2025, Li et al., 25 Nov 2025).
1. Background: Standard PPO and Its Limitations in Multi-Turn Settings
PPO maximizes a clipped surrogate objective to constrain policy updates and ensure stable learning without the computational cost of trust-region approaches. The classical clipped objective is

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\right)\right],$$

where $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the probability ratio of the new and old policies at token $t$, $\hat{A}_t$ is the per-token advantage, and $\epsilon$ controls the update trust region (Schulman et al., 2017). Standard PPO assumes temporally homogeneous transitions (i.e., token-level steps), but multi-turn environments naturally decompose into discrete interaction phases ("turns") with delayed or sparse rewards and non-stationary transitions. Token-level PPO leads to (1) high-variance advantage estimates, (2) unstable gradient norms, and (3) misaligned importance sampling, which destabilizes training for large LLM agents (Li et al., 18 Dec 2025, Li et al., 25 Nov 2025).
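As a concrete sketch, the token-level clipped objective above can be computed directly from per-token log-probabilities. This is a minimal NumPy illustration; the function name and signature are ours, not from the cited papers:

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Token-level PPO clipped surrogate loss (negated for minimization).

    logp_new / logp_old: per-token log-probs under the new / old policy.
    advantages: per-token advantage estimates A_t.
    """
    ratio = np.exp(logp_new - logp_old)                    # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    # Pessimistic (min) bound, averaged over tokens, negated for a loss.
    return -float(np.mean(np.minimum(unclipped, clipped)))
```

With identical old and new policies the ratio is 1 and the loss reduces to the negated mean advantage; once the ratio leaves $[1-\epsilon, 1+\epsilon]$, the clipped branch caps the update.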
2. Turn-Level MDP Formulation
TA-PPO refactors the RL environment as a turn-level MDP $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$:
- States ($s_k$): The history of queries and responses up to turn $k$, concatenated with the current query $q_k$.
- Actions ($a_k$): The full agent (LLM) response $y_k$, comprising all tokens in that turn.
- Transition ($P$): The environment supplies the next query $q_{k+1}$ after observing the agent's response.
- Rewards ($r_k$): Typically sparse, with $r_k = 0$ for intermediate turns and a terminal reward at the final turn $K$; shaped per-token rewards can be summed into $r_k$.
- Discount ($\gamma$): Applied per turn, so the return from turn $k$ is $G_k = \sum_{j \ge k} \gamma^{\,j-k} r_j$.
- Termination: Fixed horizon or upon completion (e.g., "buy" in WebShop, puzzle solved in Sokoban).
This granularity allows the agent and critic to operate at the same decision timescale, reducing temporal mismatch and variance in policy-gradient estimation (Li et al., 18 Dec 2025).
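The turn-level transition structure can be sketched as a plain data container plus the discounted turn return $G_k$ defined above. This is a hypothetical illustration; the type and field names are ours:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    state: str      # history of queries/responses plus the current query q_k
    action: str     # full LLM response y_k (one turn-level action = many tokens)
    reward: float   # r_k; often 0 until the final turn
    done: bool      # termination flag (task solved or horizon reached)

def turn_return(turns: List[Turn], k: int, gamma: float = 0.99) -> float:
    """Discounted return from turn k: G_k = sum_{j >= k} gamma^(j-k) * r_j."""
    return sum(gamma ** (j - k) * turns[j].reward for j in range(k, len(turns)))
```

Note that one `Turn` is one MDP step here, so the critic is queried once per turn rather than once per token.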
3. Mathematical Derivation: Turn-Level Advantage Estimation and Surrogate Objective
TA-PPO applies Generalized Advantage Estimation (GAE) at the turn level:
- Temporal-difference error: $\delta_k = r_k + \gamma V(s_{k+1}) - V(s_k)$.
- Turn advantage: $\hat{A}_k = \sum_{l \ge 0} (\gamma\lambda)^l \delta_{k+l}$.
The surrogate objective replaces per-token ratios with turn-wise ratios $r_k(\theta) = \frac{\pi_\theta(a_k \mid s_k)}{\pi_{\theta_{\text{old}}}(a_k \mid s_k)}$, and clipping is applied once per turn:

$$L^{\text{turn}}(\theta) = \mathbb{E}_k\left[\min\left(r_k(\theta)\,\hat{A}_k,\ \operatorname{clip}\big(r_k(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_k\right)\right].$$

The critic is trained on the squared error to the turn return, $L_V(\phi) = \mathbb{E}_k\big[(V_\phi(s_k) - G_k)^2\big]$.
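A backward turn-level GAE pass over a trajectory of $K$ turns can be sketched as follows (a minimal NumPy illustration under the definitions above; the function name is ours):

```python
import numpy as np

def turn_level_gae(rewards, values, gamma=0.99, lam=0.95):
    """Turn-level GAE: delta_k = r_k + gamma*V(s_{k+1}) - V(s_k),
    A_k = sum_l (gamma*lam)^l * delta_{k+l}.

    rewards: K turn rewards. values: K+1 critic values, with a bootstrap
    value appended (0 if the episode terminated).
    """
    K = len(rewards)
    adv = np.zeros(K)
    gae = 0.0
    for k in reversed(range(K)):          # accumulate from the last turn back
        delta = rewards[k] + gamma * values[k + 1] - values[k]
        gae = delta + gamma * lam * gae
        adv[k] = gae
    return adv
```

Because the recursion runs over turns rather than tokens, the horizon is short (tens of steps rather than thousands), which is one source of the variance reduction TA-PPO targets.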
4. Turn-Level Importance Sampling and Algorithmic Structure
TA-PPO introduces turn-level importance weights by grouping the tokens of a turn and assigning a geometric-mean weight:

$$w_k(\theta) = \left(\prod_{t=1}^{|a_k|} \frac{\pi_\theta(y_{k,t} \mid s_k, y_{k,<t})}{\pi_{\theta_{\text{old}}}(y_{k,t} \mid s_k, y_{k,<t})}\right)^{1/|a_k|},$$

or equivalently,

$$w_k(\theta) = \exp\left(\frac{1}{|a_k|} \sum_{t=1}^{|a_k|} \log \frac{\pi_\theta(y_{k,t} \mid s_k, y_{k,<t})}{\pi_{\theta_{\text{old}}}(y_{k,t} \mid s_k, y_{k,<t})}\right),$$

where $y_{k,t}$ is the $t$-th token of turn $k$'s response and $|a_k|$ is its length in tokens.
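In log space the geometric mean reduces to the exponentiated average of per-token log-ratios, which is numerically safer than taking the product directly (a minimal sketch; the function name is ours):

```python
import numpy as np

def turn_importance_weight(token_logp_new, token_logp_old):
    """Geometric-mean importance weight for one turn:
    w_k = exp( mean_t (logp_new - logp_old) ),
    equal to (prod_t pi_new/pi_old)^(1/|a_k|) but computed in log space."""
    return float(np.exp(np.mean(token_logp_new - token_logp_old)))
```

Averaging in log space also means a single outlier token ratio moves $w_k$ far less than it would move a raw product, which is the smoothing effect the turn-level scheme relies on.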
Turn-level advantage is aggregated per turn. The Turn-PPO update loop involves:
- Sampling trajectories and collecting rewards/advantages at turn boundaries.
- Computing turn-wise weights and advantages.
- Optimizing the clipped surrogate objective and critic loss via minibatch SGD (Li et al., 18 Dec 2025, Li et al., 25 Nov 2025).
5. Empirical Findings and Comparative Performance
Empirical evaluations on WebShop and Sokoban demonstrate that TA-PPO:
- Achieves higher cumulative returns and success rates versus GRPO and token-level PPO.
- Successfully stabilizes runs where GRPO crashes.
- Maintains low, stable clipping ratios, supporting more conservative updates.
- Outperforms baselines in sample efficiency—requiring fewer environment steps for target performance.
Selected results:
| Environment | Model | GRPO | Token-PPO | Turn-PPO |
|---|---|---|---|---|
| WebShop | Qwen2.5-3B | 0.72 | 0.73 | 0.75 |
| WebShop | Qwen3-1.7B (t.) | Crash | 0.54 | 0.55 |
| Sokoban | Qwen2.5-3B | Crash | 1.93 | 2.29 |
| Sokoban | Qwen2.5-7B | Crash | 2.90 | 3.74 |
Ablation studies confirm that per-turn clipping yields higher stability than per-token approaches (Li et al., 18 Dec 2025).
6. Theoretical Motivation: Variance Reduction and Credit Assignment
TA-PPO’s design is motivated by:
- Granularity alignment: Agentic LLM tasks exhibit natural turn-based decomposition.
- Variance reduction: Geometric mean aggregation of token-level ratios smooths policy updates, reducing the likelihood of gradient spikes and clipping bias growth.
- Improved credit assignment: Turn-level updates produce a balanced trade-off between trajectory-level and token-level credit—a structure shown to empirically stabilize learning (Li et al., 25 Nov 2025).
7. Practical Implementation Guidelines and Limitations
Recommended practices for TA-PPO include:
- Employing turn-level MDPs when environment responses consist of large text chunks.
- Initializing the value head from the same pretrained model as the policy for rapid critic convergence.
- Setting the critic learning rate 5–10× higher than the actor’s to accelerate critic accuracy.
- Tuning the GAE parameters $\gamma$ and $\lambda$ at the turn level for appropriate bias–variance control.
- Using smaller batch sizes and fewer epochs to prevent overfitting.
- Batch diversity (one episode per query) is typically sufficient for scenario coverage.
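The guidelines above might translate into a configuration sketch like the following; all names and values here are illustrative assumptions, not taken from the cited papers:

```python
# Hypothetical Turn-Aware PPO configuration reflecting the guidelines above.
TA_PPO_CONFIG = {
    "actor_lr": 1e-6,
    "critic_lr": 1e-5,        # 5-10x the actor's, to accelerate critic accuracy
    "gamma": 0.99,            # per-turn discount
    "gae_lambda": 0.95,       # turn-level GAE parameter
    "clip_eps": 0.2,          # clipping applied once per turn
    "ppo_epochs": 1,          # fewer epochs to prevent overfitting
    "episodes_per_query": 1,  # batch diversity: one episode per query
    "value_head_init": "policy_pretrained",  # init critic from the policy model
}
```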
Current validations are limited to text-only environments; extending TA-PPO to richer tool-use, multi-modal, or hierarchical RL settings presents open research directions (Li et al., 18 Dec 2025). Automatic turn segmentation remains an unresolved challenge when environment boundaries are implicit.
8. Conclusion
TA-PPO successfully reconciles the decision granularity of interactive language agents with the RL optimization framework, yielding stable, sample-efficient training and robust policy improvement in multi-turn domains. By restructuring both advantage calculation and importance weighting at the turn level, this methodology overcomes instability endemic to token-level PPO, specifically for long-horizon, sparse-reward environments characteristic of agentic LLM deployments (Li et al., 18 Dec 2025, Li et al., 25 Nov 2025).