Turn-Aware PPO for Multi-Turn RL
- Turn-Aware PPO is a reinforcement learning algorithm that redefines the Markov Decision Process at the granularity of turns to improve stability.
- It employs turn-level advantage estimation via geometric mean aggregation and a clipped surrogate objective to reduce gradient variance and misaligned credit assignment.
- Empirical evaluations on environments like WebShop and Sokoban show higher cumulative returns, enhanced sample efficiency, and more robust policy improvement compared to token-level PPO.
Turn-Aware PPO (TA-PPO) is a reinforcement learning algorithm designed to address instability and misaligned credit assignment when training LLM agents in multi-turn, interactive task environments. Unlike standard Proximal Policy Optimization (PPO), which operates at the token level, TA-PPO redefines the Markov Decision Process (MDP) at the granularity of turns—where each turn constitutes a full agent response to an environment query. This structural realignment enables lower-variance advantage estimation, robust policy improvement, and stable training dynamics for agentic LLMs in complex multi-turn domains such as web navigation and multi-hop reasoning (Li et al., 18 Dec 2025, Li et al., 25 Nov 2025).
1. Background: Standard PPO and Its Limitations in Multi-Turn Settings
PPO maximizes a clipped surrogate objective to constrain policy updates and ensure stable learning without the computational cost of trust-region approaches. The classical clipped objective is

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\right)\right],$$

where $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the probability ratio of the new and old policies at token $t$, $\hat{A}_t$ is the per-token advantage, and $\epsilon$ controls the update trust region (Schulman et al., 2017). Standard PPO assumes temporally homogeneous transitions (i.e., token-level steps), but multi-turn environments naturally decompose into discrete interaction phases ("turns") with delayed or sparse rewards and non-stationary transitions. Token-level PPO leads to (1) high-variance advantage estimates, (2) unstable gradient norms, and (3) misaligned importance sampling, which destabilizes training for large LLM agents (Li et al., 18 Dec 2025, Li et al., 25 Nov 2025).
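As a concrete sketch, the token-level clipped objective above can be computed directly from per-token log-probabilities. This is a minimal NumPy illustration; the function name and signature are ours, not from the cited papers:

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Token-level PPO clipped surrogate loss (negated for minimization).

    logp_new / logp_old: per-token log-probs under the new / old policy.
    advantages: per-token advantage estimates A_t.
    """
    ratio = np.exp(logp_new - logp_old)                    # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    # Pessimistic (min) bound, averaged over tokens, negated for a loss.
    return -float(np.mean(np.minimum(unclipped, clipped)))
```

With identical old and new policies the ratio is 1 and the loss reduces to the negated mean advantage; once the ratio leaves $[1-\epsilon, 1+\epsilon]$, the clipped branch caps the update.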
2. Turn-Level MDP Formulation
TA-PPO refactors the RL environment as a turn-level MDP $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$:
- States ($s_k$): The history of queries and responses up to turn $k$, concatenated with the current query $q_k$.
- Actions ($a_k$): The full agent (LLM) response $y_k$, comprising all tokens in that turn.
- Transition ($P$): The environment supplies the next query $q_{k+1}$ after observing the agent's response.
- Rewards ($r_k$): Typically sparse, with $r_k = 0$ for intermediate turns and a terminal reward at the final turn $K$; shaped per-token rewards can be summed into $r_k$.
- Discount ($\gamma$): Applied per turn, so the return from turn $k$ is $G_k = \sum_{j \ge k} \gamma^{\,j-k} r_j$.
- Termination: Fixed horizon or upon completion (e.g., "buy" in WebShop, puzzle solved in Sokoban).
This granularity allows the agent and critic to operate at the same decision timescale, reducing temporal mismatch and variance in policy-gradient estimation (Li et al., 18 Dec 2025).
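The turn-level transition structure can be sketched as a plain data container plus the discounted turn return $G_k$ defined above. This is a hypothetical illustration; the type and field names are ours:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    state: str      # history of queries/responses plus the current query q_k
    action: str     # full LLM response y_k (one turn-level action = many tokens)
    reward: float   # r_k; often 0 until the final turn
    done: bool      # termination flag (task solved or horizon reached)

def turn_return(turns: List[Turn], k: int, gamma: float = 0.99) -> float:
    """Discounted return from turn k: G_k = sum_{j >= k} gamma^(j-k) * r_j."""
    return sum(gamma ** (j - k) * turns[j].reward for j in range(k, len(turns)))
```

Note that one `Turn` is one MDP step here, so the critic is queried once per turn rather than once per token.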
3. Mathematical Derivation: Turn-Level Advantage Estimation and Surrogate Objective
TA-PPO applies Generalized Advantage Estimation (GAE) at the turn level:
- Temporal-difference error: $\delta_k = r_k + \gamma V(s_{k+1}) - V(s_k)$.
- Turn advantage: $\hat{A}_k = \sum_{l \ge 0} (\gamma\lambda)^l \delta_{k+l}$.
The surrogate objective replaces per-token ratios with turn-wise ratios $r_k(\theta) = \frac{\pi_\theta(a_k \mid s_k)}{\pi_{\theta_{\text{old}}}(a_k \mid s_k)}$, and clipping is applied once per turn:

$$L^{\text{turn}}(\theta) = \mathbb{E}_k\left[\min\left(r_k(\theta)\,\hat{A}_k,\ \operatorname{clip}\big(r_k(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_k\right)\right].$$

The critic is trained on the squared error to the turn return, $L_V(\phi) = \mathbb{E}_k\big[(V_\phi(s_k) - G_k)^2\big]$.
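A backward turn-level GAE pass over a trajectory of $K$ turns can be sketched as follows (a minimal NumPy illustration under the definitions above; the function name is ours):

```python
import numpy as np

def turn_level_gae(rewards, values, gamma=0.99, lam=0.95):
    """Turn-level GAE: delta_k = r_k + gamma*V(s_{k+1}) - V(s_k),
    A_k = sum_l (gamma*lam)^l * delta_{k+l}.

    rewards: K turn rewards. values: K+1 critic values, with a bootstrap
    value appended (0 if the episode terminated).
    """
    K = len(rewards)
    adv = np.zeros(K)
    gae = 0.0
    for k in reversed(range(K)):          # accumulate from the last turn back
        delta = rewards[k] + gamma * values[k + 1] - values[k]
        gae = delta + gamma * lam * gae
        adv[k] = gae
    return adv
```

Because the recursion runs over turns rather than tokens, the horizon is short (tens of steps rather than thousands), which is one source of the variance reduction TA-PPO targets.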
4. Turn-Level Importance Sampling and Algorithmic Structure
TA-PPO introduces turn-level importance weights by grouping the tokens of a turn and assigning a geometric-mean weight:

$$w_k(\theta) = \left(\prod_{t=1}^{|a_k|} \frac{\pi_\theta(y_{k,t} \mid s_k, y_{k,<t})}{\pi_{\theta_{\text{old}}}(y_{k,t} \mid s_k, y_{k,<t})}\right)^{1/|a_k|},$$

or equivalently,

$$w_k(\theta) = \exp\left(\frac{1}{|a_k|} \sum_{t=1}^{|a_k|} \log \frac{\pi_\theta(y_{k,t} \mid s_k, y_{k,<t})}{\pi_{\theta_{\text{old}}}(y_{k,t} \mid s_k, y_{k,<t})}\right),$$

where $y_{k,t}$ is the $t$-th token of turn $k$'s response and $|a_k|$ is its length in tokens.
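In log space the geometric mean reduces to the exponentiated average of per-token log-ratios, which is numerically safer than taking the product directly (a minimal sketch; the function name is ours):

```python
import numpy as np

def turn_importance_weight(token_logp_new, token_logp_old):
    """Geometric-mean importance weight for one turn:
    w_k = exp( mean_t (logp_new - logp_old) ),
    equal to (prod_t pi_new/pi_old)^(1/|a_k|) but computed in log space."""
    return float(np.exp(np.mean(token_logp_new - token_logp_old)))
```

Averaging in log space also means a single outlier token ratio moves $w_k$ far less than it would move a raw product, which is the smoothing effect the turn-level scheme relies on.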
Turn-level advantage is aggregated per turn. The Turn-PPO update loop involves:
- Sampling trajectories and collecting rewards/advantages at turn boundaries.
- Computing turn-wise weights and advantages.
- Optimizing the clipped surrogate objective and critic loss via minibatch SGD (Li et al., 18 Dec 2025, Li et al., 25 Nov 2025).
5. Empirical Findings and Comparative Performance
Empirical evaluations on WebShop and Sokoban demonstrate that TA-PPO:
- Achieves higher cumulative returns and success rates versus GRPO and token-level PPO.
- Successfully stabilizes runs where GRPO crashes.
- Maintains low, stable clipping ratios, supporting more conservative updates.
- Outperforms baselines in sample efficiency—requiring fewer environment steps for target performance.
Selected results:
| Environment | Model | GRPO | Token-PPO | Turn-PPO |
|---|---|---|---|---|
| WebShop | Qwen2.5-3B | 0.72 | 0.73 | 0.75 |
| WebShop | Qwen3-1.7B (t.) | Crash | 0.54 | 0.55 |
| Sokoban | Qwen2.5-3B | Crash | 1.93 | 2.29 |
| Sokoban | Qwen2.5-7B | Crash | 2.90 | 3.74 |
Ablation studies confirm that per-turn clipping yields higher stability than per-token approaches (Li et al., 18 Dec 2025).
6. Theoretical Motivation: Variance Reduction and Credit Assignment
TA-PPO’s design is motivated by:
- Granularity alignment: Agentic LLM tasks exhibit natural turn-based decomposition.
- Variance reduction: Geometric mean aggregation of token-level ratios smooths policy updates, reducing the likelihood of gradient spikes and clipping bias growth.
- Improved credit assignment: Turn-level updates produce a balanced trade-off between trajectory-level and token-level credit—a structure shown to empirically stabilize learning (Li et al., 25 Nov 2025).
7. Practical Implementation Guidelines and Limitations
Recommended practices for TA-PPO include:
- Employing turn-level MDPs when environment responses consist of large text chunks.
- Initializing the value head from the same pretrained model as the policy for rapid critic convergence.
- Setting the critic learning rate 5–10× higher than the actor’s to accelerate critic accuracy.
- Tuning the GAE parameters $\gamma$ and $\lambda$ at the turn level for appropriate bias–variance control.
- Using smaller batch sizes and fewer epochs to prevent overfitting.
- Batch diversity (one episode per query) is typically sufficient for scenario coverage.
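The guidelines above might translate into a configuration sketch like the following; all names and values here are illustrative assumptions, not taken from the cited papers:

```python
# Hypothetical Turn-Aware PPO configuration reflecting the guidelines above.
TA_PPO_CONFIG = {
    "actor_lr": 1e-6,
    "critic_lr": 1e-5,        # 5-10x the actor's, to accelerate critic accuracy
    "gamma": 0.99,            # per-turn discount
    "gae_lambda": 0.95,       # turn-level GAE parameter
    "clip_eps": 0.2,          # clipping applied once per turn
    "ppo_epochs": 1,          # fewer epochs to prevent overfitting
    "episodes_per_query": 1,  # batch diversity: one episode per query
    "value_head_init": "policy_pretrained",  # init critic from the policy model
}
```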
Current validations are limited to text-only environments; extending TA-PPO to richer tool-use, multi-modal, or hierarchical RL settings presents open research directions (Li et al., 18 Dec 2025). Automatic turn segmentation remains an unresolved challenge when environment boundaries are implicit.
8. Conclusion
TA-PPO successfully reconciles the decision granularity of interactive language agents with the RL optimization framework, yielding stable, sample-efficient training and robust policy improvement in multi-turn domains. By restructuring both advantage calculation and importance weighting at the turn level, this methodology overcomes instability endemic to token-level PPO, specifically for long-horizon, sparse-reward environments characteristic of agentic LLM deployments (Li et al., 18 Dec 2025, Li et al., 25 Nov 2025).