Turn-level Stage-aware Policy Optimization
- TSPO is a reinforcement learning paradigm that decomposes agent tasks into semantically meaningful stages to overcome credit-assignment challenges.
- It employs per-stage reward shaping and turn-level importance sampling to reduce gradient variance and prevent process homogenization in both LLM and robotic tasks.
- Empirical results show that TSPO substantially improves training stability and performance, yielding higher success rates than token- or trajectory-level methods.
Turn-level Stage-aware Policy Optimization (TSPO) is a family of reinforcement learning and fine-tuning techniques that address the limitations of token-level or trajectory-level optimization in structured, long-horizon tasks. TSPO operates by decomposing complex behaviors into semantically meaningful "stages" or "turns," aligning optimization and credit assignment with these inherently structured subunits. This approach delivers superior stability, interpretability, and credit assignment granularity, particularly in multi-turn reasoning with LLMs and long-horizon robotic manipulation with vision-language-action (VLA) models (Xu et al., 4 Dec 2025, Ma et al., 30 Jan 2026, Li et al., 25 Nov 2025).
1. Motivation and Theoretical Basis
Traditional RL fine-tuning of LLM agents and policy optimization for robots often use outcome-level, per-trajectory, or token-level RL objectives. This can induce several interrelated issues:
- Process Homogenization: Sequential subgoals (e.g., reasoning steps, motion phases) are ignored, collapsing all process paths into uniform reward signals, preventing effective credit assignment to partial successes or sub-stage completion (Ma et al., 30 Jan 2026).
- Intra-Group Homogenization: In group-based PPO variants (e.g., GRPO), binary or sparse outcome rewards frequently lead to entire RL mini-batches with zero reward variance. This causes advantage estimates and gradients to vanish, stalling learning.
- Granularity Mismatch and Instability: Token-level importance sampling in multi-turn environments produces high-variance gradients and allocation misaligned with turn- or stage-wise boundaries, causing PPO collapse or spurious updates in LLMs (Li et al., 25 Nov 2025).
- Sparse or Uninformative Feedback: In robotic trajectories, end-to-end or trajectory-level optimization signals are too coarse to assign credit accurately along semantic sub-stages (e.g., reaching, grasping, placing).
TSPO techniques systematically decompose trajectories into stages or turns, provide local (per-stage, per-turn) reward or preference signals, and apply optimization at the corresponding granularity, thus restoring fine-grained feedback and stabilizing the learning process.
2. Formulation and Algorithms
2.1 Stage/Turn Decomposition
TSPO introduces automatic decomposition of agent-environment interaction trajectories:
- State Partitioning: For multi-turn LLMs, the state at turn $t$ is $s_t = (c_t, o_t)$, with $c_t$ denoting the agent's reasoning/thought and $o_t$ the observation returned by action $a_t$ (e.g., search feedback) (Ma et al., 30 Jan 2026). For robotics, a stage assignment function $g(t) \mapsto k$ maps each timestep to a semantic stage $k$ such as "Reach," "Grasp," etc. (Xu et al., 4 Dec 2025).
- Event-based Stage Boundaries: In manipulation, geometric event rules (e.g., contact, threshold crossing) determine stage transitions (Xu et al., 4 Dec 2025). In dialogue/reasoning, turns are delimited by agent or environment interaction events (Li et al., 25 Nov 2025).
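The event-based stage assignment above can be sketched as follows. The stage names and the specific contact/height rules are illustrative assumptions, not the papers' exact definitions:

```python
# Hypothetical event-based stage assignment for a pick-and-place rollout.
# Geometric event rules (contact, threshold crossing) trigger transitions.

def assign_stages(timesteps):
    """Map each timestep to a semantic stage via geometric event rules."""
    stage = "reach"
    labels = []
    for t in timesteps:
        # Event rule: gripper-object contact switches reach -> grasp.
        if stage == "reach" and t["contact"]:
            stage = "grasp"
        # Event rule: object lifted above a height threshold -> place.
        elif stage == "grasp" and t["object_height"] > 0.05:
            stage = "place"
        labels.append(stage)
    return labels

rollout = [
    {"contact": False, "object_height": 0.0},
    {"contact": True,  "object_height": 0.0},
    {"contact": True,  "object_height": 0.10},
]
print(assign_stages(rollout))  # ['reach', 'grasp', 'place']
```

Once every timestep carries a stage label, per-stage costs, rewards, and importance ratios can be computed over the contiguous spans of each label.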
2.2 Per-Stage and Turn-Level Reward Assignment
TSPO variants employ mechanisms such as:
- First-Occurrence Latent Reward (FOLR): Assigns a reward at the first turn where ground-truth information appears in feedback, enabling partial credit for "near-miss" trajectories (Ma et al., 30 Jan 2026).
- Dense Geometry-based Stage Cost: Computes per-stage geometric costs (e.g., mean distance to goal in manipulation), incorporated in penalized preference or RL losses (Xu et al., 4 Dec 2025).
- Potential-based Reward Shaping: In staged PPO for VLA, each time step receives an augmented reward $r'_t = r_t + \gamma\,\Phi(s_{t+1}) - \Phi(s_t)$, where $\Phi$ is a stage-specific potential (Xu et al., 4 Dec 2025).
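A minimal sketch of potential-based shaping with a stage-specific potential; the linear "fraction of stages completed" potential is an assumption for illustration, not the papers' learned or hand-designed choice:

```python
# Potential-based reward shaping: r'_t = r_t + gamma * Phi(s_{t+1}) - Phi(s_t).
# This form leaves the optimal policy unchanged while densifying credit.

GAMMA = 0.99

def phi(stage_index, num_stages=3):
    """Stage-specific potential: fraction of stages completed (assumed)."""
    return stage_index / num_stages

def shaped_reward(r_t, stage_t, stage_next):
    """Augment the environment reward with the stage-potential difference."""
    return r_t + GAMMA * phi(stage_next) - phi(stage_t)

# Transitioning from stage 0 ("reach") to stage 1 ("grasp") with r_t = 0
# yields a positive shaped reward, giving dense per-stage credit.
bonus = shaped_reward(0.0, 0, 1)   # ~0.33: credit for stage progress
```

Because the shaping term telescopes over a trajectory, it rewards stage transitions without altering which policies are optimal.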
2.3 Optimization Objectives
- Stage-aware TPO (StA-TPO): For each trajectory pair $(\tau^w, \tau^l)$ and each stage $k$, compute a stage-level preference score $s_k(\tau) = \log \frac{\pi_\theta(\tau_k)}{\pi_{\text{ref}}(\tau_k)}$ and a penalized score $\tilde{s}_k(\tau) = s_k(\tau) - \lambda\, c_k(\tau)$, where $c_k$ is the geometric stage cost. The stage-level DPO-style loss is

$$\mathcal{L}_{\text{StA-TPO}} = -\sum_{k} \log \sigma\!\left(\beta\left(\tilde{s}_k(\tau^w) - \tilde{s}_k(\tau^l)\right)\right).$$
- Stage-aware PPO (StA-PPO): PPO surrogate objectives are computed using per-stage/turn importance ratios and advantage estimates; potential-based shaping is incorporated as described above (Xu et al., 4 Dec 2025, Li et al., 25 Nov 2025).
- Group normalization for Per-Turn PPO: For a group of $G$ sampled trajectories per question, per-turn advantages are turn-wise standardized:

$$\hat{A}_{i,t} = \frac{R_{i,t} - \operatorname{mean}_{j \le G}(R_{j,t})}{\operatorname{std}_{j \le G}(R_{j,t}) + \epsilon},$$

where $R_{i,t}$ is the per-turn reward of rollout $i$ at turn $t$.
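Turn-wise standardization across a group of rollouts can be sketched as below; the group size, turn count, and epsilon are illustrative:

```python
import numpy as np

def turnwise_advantages(rewards):
    """rewards: (G, T) array of per-turn rewards for G rollouts over T turns.
    Standardize each turn column across the group so that advantages at
    the same turn index are zero-mean, unit-variance within the group."""
    mu = rewards.mean(axis=0, keepdims=True)      # per-turn mean over group
    sigma = rewards.std(axis=0, keepdims=True)    # per-turn std over group
    return (rewards - mu) / (sigma + 1e-8)

# Three rollouts, two turns: rollouts 1 and 3 earn turn-level credit.
R = np.array([[1.0, 0.0],
              [0.0, 0.0],
              [1.0, 1.0]])
A = turnwise_advantages(R)
```

Normalizing per turn (rather than per trajectory) keeps advantage scales comparable across turns with different reward variance, which is the point of the turn-wise grouping.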
2.4 Stabilization Techniques
To further address the instability of standard PPO:
- Turn-Level Importance Sampling: Replaces per-token ratios with geometric means over tokens in a “turn,” reducing importance weight variance arising from variable turn lengths or structure (Li et al., 25 Nov 2025).
- Clipping-Bias Correction: The surrogate gradient is rescaled by the norm of the clipping-bias correction term, downweighting unreliable, highly off-policy samples (Li et al., 25 Nov 2025).
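The turn-level ratio above can be sketched as a geometric mean of per-token ratios, which in log-space is the length-normalized sum of per-token log-probability differences. The log-probabilities here are illustrative placeholders:

```python
import math

def turn_ratio(logp_new, logp_old):
    """Turn-level importance ratio as the geometric mean of per-token ratios:
    exp(mean_t(logp_new[t] - logp_old[t])). Length normalization keeps the
    ratio on a comparable scale regardless of turn length."""
    n = len(logp_new)
    return math.exp(sum(a - b for a, b in zip(logp_new, logp_old)) / n)

# A long turn no longer multiplies 50 per-token ratios (variance blow-up);
# the normalized ratio stays near 1 for mildly off-policy samples.
short = turn_ratio([-1.0, -1.1], [-1.2, -1.0])
long_ = turn_ratio([-1.0] * 50, [-1.1] * 50)
```

Contrast with the naive per-token product, `exp(sum(...))`, which for the 50-token turn would be `exp(5.0)` (about 148) instead of `exp(0.1)` (about 1.1).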
3. Empirical Performance and Benchmarks
TSPO techniques have been empirically validated across several domains:
| Domain | Task(s) | Model | Best Baseline | TSPO Variant | Best Reported Metric | Source |
|---|---|---|---|---|---|---|
| Robotic Manipulation | SimplerEnv, ManiSkill3 | OpenVLA-7B | PPO | IPI (SFT→StA-TPO→StA-PPO) | 98.0% (SimplerEnv), 96.4% (ManiSkill3) | (Xu et al., 4 Dec 2025) |
| Multi-Turn QA (Search) | NQ, HotpotQA, 2Wiki, etc. | Qwen2.5-7B | Search-R1 | TSPO (FOLR PPO) | 0.444 EM (+13.6%) | (Ma et al., 30 Jan 2026) |
| Multi-Turn Reasoning/Dialogue | NQ, HotpotQA, MedQA, MedMCQA | Llama-3.1 8B | Token PPO | ST-PPO (TSPO) | up to +5 pts EM, +4–6 pts accuracy | (Li et al., 25 Nov 2025) |
Key findings include:
- Stage-aware signals and turn-level optimization yield substantial absolute and relative gains over both standard PPO and previous trajectory-level preference baselines.
- Multi-stage shaping and FOLR mechanisms confer resilience against vanishing gradients and outcome reward sparsity.
- Disabling stage shaping in critical phases degrades performance by >20 points, underscoring the contribution of per-stage credit signals (Xu et al., 4 Dec 2025).
- TSPO confers higher policy entropy, lower KL drift, and smoother training curves (Ma et al., 30 Jan 2026).
4. Procedural Pipelines and Implementation
4.1 Imitation → Preference → Interaction (IPI) Pipeline (for VLA)
A serial three-phase regimen:
- Imitation (SFT): Supervised fine-tuning on demonstrations establishes a safe policy prior.
- Preference (StA-TPO): Offline stage-aware preference optimization with DPO objectives applies per-stage preference labels and penalizes geometric costs.
- Interaction (StA-PPO): On-policy RL with per-stage potential-based shaping consolidates final performance (Xu et al., 4 Dec 2025).
4.2 TSPO in LLM-based Multi-Turn Tool Use
- Sample a mini-batch of questions, with a group of rollouts drawn per question.
- For each rollout: identify first-occurrence of correct answer in retrieved passages, assign FOLR per-turn rewards, compute per-turn or trajectory-level advantages as indicated by group reward variance, and update via PPO with KL penalty (Ma et al., 30 Jan 2026).
- Per-turn boundaries in LLMs are inferred via message delimiter tokens or agent-environment dialogue segmentation (e.g., `loss_mask` or `<eot>`) (Li et al., 25 Nov 2025).
- All implementations report distributed training on multi-GPU hardware (e.g., 8×H100 GPUs) with large batch sizes for scalability.
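The FOLR step in the pipeline above can be sketched as a scan over each rollout's retrieved passages, granting a partial reward at the first turn containing the ground-truth answer. The reward magnitude and the substring-matching rule are illustrative assumptions:

```python
# First-Occurrence Latent Reward (FOLR) assignment sketch: only the first
# turn whose retrieved text contains the gold answer receives the latent
# reward, giving partial credit to "near-miss" trajectories that surface
# the evidence but fail to produce the final answer.

def folr_rewards(retrieved_per_turn, gold_answer, latent_reward=0.5):
    """Return one reward per turn; zero everywhere except the first
    occurrence of the gold answer in retrieved feedback."""
    rewards = [0.0] * len(retrieved_per_turn)
    for t, passage in enumerate(retrieved_per_turn):
        if gold_answer.lower() in passage.lower():
            rewards[t] = latent_reward   # first occurrence only
            break
    return rewards

turns = [
    "The Eiffel Tower is in France.",
    "It was designed by Gustave Eiffel.",
    "Gustave Eiffel also worked on the Statue of Liberty.",
]
print(folr_rewards(turns, "Gustave Eiffel"))  # [0.0, 0.5, 0.0]
```

A rollout whose group would otherwise have zero reward variance (all members failing the final answer) now differentiates members by whether and when they retrieved the evidence, restoring a usable advantage signal.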
5. Extensions, Limitations, and Generalization
TSPO techniques are generalizable to any RL or fine-tuning scenario where multi-phase processes or stage boundaries exist. Further developments include:
- Domain Generalization: Applicable to multi-agent coordination (agents assigned to temporally ordered tasks), curriculum learning (curriculum over stage weights/penalties), and settings with tool-use or API calls (Xu et al., 4 Dec 2025).
- FOLR Generalization: While effective for evidence retrieval (where the answer is present in feedback), FOLR requires adaptation for pure reasoning tasks or other tool modalities (Ma et al., 30 Jan 2026).
- Potential Extensions: Integrating learned process reward models, adapting first-occurrence signals to multimodal or multi-tool settings, and theoretical study of sample complexity improvements relative to conventional sparse-reward approaches (Ma et al., 30 Jan 2026).
- Stabilization: Clipping-bias correction is complementary to stage/turn decomposition and should be integrated to mitigate off-policy instability in large-scale RL (Li et al., 25 Nov 2025).
6. Interpretability and Diagnostics
A key benefit of TSPO is the interpretability of per-stage or per-turn feedback:
- Stage-level costs or FOLR occurrences yield actionable diagnostics (e.g., stage-wise failure rates in manipulation, detection of “near-miss” trajectories in QA).
- Credits and penalties can be tuned at per-stage granularity, facilitating curriculum construction and troubleshooting of agents’ performance bottlenecks (Xu et al., 4 Dec 2025).
7. Summary of Key Variants
The TSPO paradigm encompasses several concrete algorithmic instantiations:
| Variant | Setting | Key Features | Reference |
|---|---|---|---|
| StA-TPO | VLA manipulation, offline RL | Stage-level DPO, stage cost-penalized preferences | (Xu et al., 4 Dec 2025) |
| StA-PPO | VLA manipulation, on-policy RL | Per-stage potential shaping in PPO | (Xu et al., 4 Dec 2025) |
| FOLR-TSPO | Multi-turn QA with LLMs | First-occurrence latent rewards at turn-level | (Ma et al., 30 Jan 2026) |
| ST-PPO | Multi-turn LLM agent training | Turn-level importance sampling, clipping-bias corr. | (Li et al., 25 Nov 2025) |
These variants share the central mechanism of decomposing agent-environment interactions into turns/stages and aligning reward assignment, credit, and optimization accordingly. The result is a set of robust, interpretable, and data-efficient learning protocols that have demonstrated significant performance and stability improvements in their respective domains.