GTPO: Fine-Grained Policy Optimization

Updated 9 February 2026
  • GTPO is a reinforcement learning framework that assigns rewards at token, turn, and trajectory levels for precise policy optimization in LLMs.
  • It employs entropy shaping, conflict correction, and normalization to stabilize updates and reduce gradient variance in complex tasks.
  • Empirical results show GTPO’s superiority over GRPO, yielding higher reward plateaus and robust performance in long-chain and multi-turn reasoning.

GTPO (Group Token/Turn/Trajectory Policy Optimization) refers to a class of reinforcement learning algorithms developed for fine-grained credit assignment and stable policy optimization in LLMs, particularly in the context of complex reasoning, multi-turn interactions, and tool-augmented scenarios. The unifying goal across these methods is to overcome the limitations of prior group-based policy optimizers such as GRPO (Group Relative Policy Optimization), which rely on coarse sequence- or trajectory-level reward assignment that impedes effective learning in long-chain reasoning or multi-turn environments.

1. Conceptual Overview

GTPO denotes several closely related methods sharing the following high-level principles:

  • Fine-grained credit assignment: Unlike standard GRPO, which propagates the same scalar reward to all tokens or turns in a response, GTPO variants allocate rewards or advantage signals at the token, turn, or trajectory level, often using additional shaping or normalization.
  • Stabilization of policy updates: Advanced GTPO implementations address known issues in GRPO such as token-level gradient conflicts, policy entropy blow-up, and variance due to sparse or delayed rewards.
  • Applications: GTPO has been instantiated for long-chain mathematical reasoning, multi-turn tool-integrated reasoning, long-horizon dialogue with global constraints, and general alignment of LLMs for robust task performance in variable environments (Tan et al., 6 Aug 2025, Ding et al., 18 Nov 2025, Shen et al., 2 Feb 2026, Simoni et al., 5 Aug 2025).

2. Major GTPO Algorithmic Variants

2.1 Group Token Policy Optimization (Token-level, Entropy-weighted)

As presented in "GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy" (Tan et al., 6 Aug 2025), Group Token Policy Optimization (GTPO) introduces dynamic entropy-weighted reward shaping at the token level. For each token $o_{i,t}$ in a batch of successful sequence generations, an entropy-shaped reward is assigned:

$$\tilde{x}_{i,t} = r_i + \alpha \cdot \frac{H_{i,t}}{\sum_{k=1}^{n} H_{k,t}} \cdot d_t$$

where $H_{i,t}$ is the token-level entropy, $d_t$ is the count of still-alive sequences at step $t$, and $\alpha$ controls the entropy bonus. Token-level advantages are then computed and used in a PPO-Clip style loss. High-entropy ("uncertain") tokens in correct sequences receive larger shaping rewards, focusing learning on critical reasoning steps.
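The shaping rule above can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions (array shapes, the handling of terminated sequences, and all variable names are ours, not from the paper):

```python
import numpy as np

def entropy_shaped_rewards(seq_rewards, entropies, alive_mask, alpha=0.1):
    """Entropy-weighted token-level reward shaping (sketch of GTPO's
    token-level rule; shapes and masking conventions are assumptions).

    seq_rewards: (n,)   scalar reward r_i per sequence
    entropies:   (n, T) per-token policy entropy H_{i,t}
    alive_mask:  (n, T) 1 if sequence i has not yet terminated at step t
    alpha:       entropy-bonus coefficient
    """
    d_t = alive_mask.sum(axis=0)                    # still-alive count per step
    col_sum = (entropies * alive_mask).sum(axis=0)  # sum_k H_{k,t}
    col_sum = np.where(col_sum > 0, col_sum, 1.0)   # guard against division by zero
    bonus = alpha * (entropies / col_sum) * d_t     # entropy share scaled by d_t
    shaped = seq_rewards[:, None] + bonus           # r_i plus entropy bonus
    return shaped * alive_mask                      # zero out terminated positions
```

Tokens that carry a larger share of the column's entropy receive a larger bonus, which is the intended focusing effect on uncertain reasoning steps.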

2.2 Group Turn Policy Optimization (Turn-level, Return-based, Self-supervised Shaping)

For multi-turn tasks (TIR, dialogue), GTPO assigns rewards and advantages per turn rather than per sequence (Ding et al., 18 Nov 2025, Shen et al., 2 Feb 2026). Distinguishing features:

  • Turn-level rewards $r_{i,j}$ reflect answer accuracy, tool-format correctness, or global constraint satisfaction at each turn.
  • Discounted returns $R_{i,j}$ are normalized within the group for advantage estimation:

$$\widehat{A}_{i,j} = \frac{R_{i,j} - \mathrm{mean}_{i'}\{R_{i',j}\}}{\mathrm{std}_{i'}\{R_{i',j}\}}$$

  • Self-supervised shaping provides partial credit for "near-miss" completions by code-similarity metrics, densifying sparse binary rewards.
  • Reward differencing highlights incremental local improvements in long dialogues and robustifies global constraint adherence (Shen et al., 2 Feb 2026).
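The group normalization step above is straightforward to sketch; the discounting and reward-shaping details that produce the returns vary across GTPO variants and are not modeled here:

```python
import numpy as np

def turn_level_advantages(returns, eps=1e-8):
    """Group-normalized turn-level advantages (illustrative sketch).

    returns: (G, J) discounted return R_{i,j} for each of G rollouts
             in the group, at each of J turns.
    """
    mean = returns.mean(axis=0, keepdims=True)  # mean over the group, per turn
    std = returns.std(axis=0, keepdims=True)    # std over the group, per turn
    return (returns - mean) / (std + eps)       # \hat{A}_{i,j}
```

Normalizing per turn (rather than once per trajectory) is what lets credit differ across turns within the same rollout.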

2.3 Group-relative Trajectory-based Policy Optimization (Conflict-aware, Entropy-filtered)

The GTPO formulation in (Simoni et al., 5 Aug 2025) identifies token-level update conflicts by detecting tokens that appear at the same position in positive- and negative-advantage completions ("conflict tokens"). The loss function reweights:

  • Conflict tokens in negative-advantage completions: updates are skipped,
  • Conflict tokens in positive-advantage completions: updates are doubled,
  • All other tokens: standard updates.

Additionally, high-entropy trajectories (where average per-token entropy exceeds $\ln 2$) are filtered out, and an entropy penalty term encourages low entropy. This eliminates the need for a KL-divergence term and a reference policy.
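The conflict-detection and reweighting rule can be sketched as a per-position pass over a group of completions. This is a simplified illustration (the multiplier values 0/2/1 follow the rule above; padding conventions and function names are our assumptions):

```python
import numpy as np

def conflict_weights(tokens, advantages, pad_id=-1):
    """Per-token update multipliers under the conflict-aware rule (sketch).

    tokens:     (G, T) token ids for G completions of one prompt, padded
    advantages: (G,)   advantage per completion (sign determines pos/neg)
    Returns (G, T) multipliers: 0 = skip, 2 = doubled, 1 = standard.
    """
    G, T = tokens.shape
    weights = np.ones((G, T))
    pos = advantages > 0
    for t in range(T):
        col = tokens[:, t]
        valid = col != pad_id
        # a "conflict token" appears at the same position t in both a
        # positive- and a negative-advantage completion
        conflict = set(col[valid & pos]) & set(col[valid & ~pos])
        for g in range(G):
            if valid[g] and col[g] in conflict:
                weights[g, t] = 2.0 if pos[g] else 0.0
    weights[tokens == pad_id] = 0.0  # never update padding
    return weights
```

Skipping the negative-side update on shared tokens is what prevents the destructive averaging described in Section 5.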

3. Detailed Algorithmic Procedures

3.1 Token/Turn-level Reward Assignment and Normalization

  • Rewards are first assigned at the chosen granularity: entropy-shaped token rewards (Section 2.1), or turn-level accuracy, format, and constraint rewards (Section 2.2).
  • The resulting rewards or returns are then normalized within the sampled group to produce token-, turn-, or trajectory-level advantages, as in the formulas above.

3.2 Policy Loss and Optimization

  • All GTPO variants employ a PPO-style clipped surrogate objective, substituting the corresponding fine-grained advantages (token, turn, trajectory) in place of sequence-level signals. Importance weights are constructed using the ratio of new to old policy probabilities.
  • Hyperparameters such as the entropy-bonus weight, group size, and clipping bounds are tuned per task to balance learning signal and stability.
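The shared objective can be sketched generically; only the advantage granularity differs across variants. The function below is a standard PPO-Clip surrogate, not any one paper's exact loss:

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """PPO-style clipped surrogate with fine-grained (token- or turn-level)
    advantages substituted for sequence-level ones (generic sketch).

    logp_new, logp_old: (N,) log-probs of sampled units under the current
                        and behavior policies
    advantages:         (N,) per-unit advantage estimates
    """
    ratio = np.exp(logp_new - logp_old)            # importance weight
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    return -np.minimum(unclipped, clipped).mean()  # loss to minimize
```

Because the clipping acts per unit, a single turn or token with an extreme importance weight cannot dominate the update.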

3.3 Entropy-Based Filtering and Regularization

  • Entropy filters ensure that trajectories exceeding a critical entropy threshold ($\ln 2$) are masked out if the model was initially low-entropy (Simoni et al., 5 Aug 2025).
  • An explicit negative entropy term can be included in the loss to maintain distributional sharpness.
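The trajectory-level filter can be sketched as a mean-entropy threshold test; the $\ln 2$ threshold is from the source, while the masking convention is our assumption:

```python
import numpy as np

def filter_high_entropy(per_token_entropy, mask, threshold=np.log(2)):
    """Keep-flags for trajectories under the entropy filter (sketch).

    per_token_entropy: (G, T) per-token policy entropies
    mask:              (G, T) 1 for valid (non-padding) tokens
    Returns (G,) booleans: True = keep, False = mask out of the update.
    """
    lengths = mask.sum(axis=1).clip(min=1)                 # valid tokens per trajectory
    mean_H = (per_token_entropy * mask).sum(axis=1) / lengths
    return mean_H <= threshold                             # drop if above ln 2
```

An explicit `-entropy` term added to the loss then pushes the kept trajectories toward sharper distributions.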

4. Empirical Evaluation and Performance

A synthesis of results across GTPO implementations:

Study | Benchmark(s) | Baseline(s) | Main GTPO Gain(s)
(Tan et al., 6 Aug 2025) | Qwen2.5-32B/Math | DAPO, GRPO-S | +10–15pp mean reward plateau; deeper CoT outputs
(Ding et al., 18 Nov 2025) | AIME, MATH500, AMC | GRPO | +3.0pp avg pass rate (TIR); ablation: –2.7pp w/o turn rewards
(Simoni et al., 5 Aug 2025) | GSM8K, MATH, AIME2024 | SFT, GRPO | +4–5pp absolute on pass@k; stable training, no collapse
(Shen et al., 2 Feb 2026) | TRIP-Bench | SFT, GRPO, Gemini-3-Pro | +17pp (loose), +18pp (strict) vs SFT; best on constraint satisfaction

Empirically, GTPO consistently surpasses both SFT and GRPO-style baselines, yielding superior reward plateaus, out-of-distribution generalization, robust formatting, and constraint satisfaction across both single-turn and long-horizon interactive tasks.

5. Theoretical Properties and Ablation Insights

  • Variance reduction: Token- and turn-level normalization lowers policy gradient variance compared to sequence-level baselines (Tan et al., 6 Aug 2025).
  • Gradient rescaling: Entropy shaping produces a rescaled yet aligned gradient with respect to standard DAPO, focusing updates on informative (high-uncertainty) steps.
  • Conflict correction: Conflict-aware update rules prevent destructive averaging over structurally critical tokens, preserving response format and stability (Simoni et al., 5 Aug 2025).
  • Shaping efficacy: Ablations show that reward shaping and advantage granularity are the principal drivers of learning improvements; disabling these reverts performance to GRPO-level.

6. Applications and Limitations

GTPO algorithms have been deployed for:

  • Mathematical and chain-of-thought reasoning,
  • Multi-turn tool-integrated reasoning tasks (TIR),
  • Long-horizon, constraint-driven interactions in real-world scenarios (e.g., travel planning in TRIP-Bench).

Limitations reported include increased computational overhead due to per-token/turn grouping and normalization, reliance on user-simulator fidelity in RL environments, and continued sensitivity to reward shaping hyperparameters. Resource considerations are pronounced for longer contexts and larger base models (Shen et al., 2 Feb 2026).

Relative to GRPO:

  • GTPO enables true fine-grained credit assignment and greater policy stability, and—in conflict-aware variants—can discard reference policies entirely by using direct entropy penalization and filtering.
  • Sequence-level and stepwise baselines (DAPO, PPO, RLHF) demonstrate lower asymptotic performance and exhibit late-stage collapse in reward and formatting metrics.

A plausible implication is that GTPO's modular approach—combining conflict correction, entropy shaping, and partial self-supervised credit—offers a blueprint for further advances in LLM alignment, especially for tasks with inherently long-horizon or compositional structure.

