Process Relative Policy Optimization (PRPO)
- PRPO is a critic-free policy optimization framework that combines outcome rewards with dense process guidance to enable stable training for long-horizon reasoning tasks.
- It employs entropy-based segmentation to partition token sequences into logical reasoning units, ensuring precise local credit assignment through statistical advantage alignment.
- Empirical results show that PRPO improves pass@1 accuracy by up to 7 points on benchmarks like MATH500, AMC2023, and AIME, verifying its practical effectiveness.
Process Relative Policy Optimization (PRPO) is a critic-free policy optimization framework designed to address the limitations of outcome-only and process-only reward assignments in LLM training, especially for multi-step reasoning tasks with sparse final rewards and dense intermediate feedback. PRPO achieves robust fine-grained credit assignment by fusing outcome reliability and process-level guidance through token-level segmentation and statistical advantage alignment, thereby circumventing premature output truncation and achieving superior empirical performance in mathematical reasoning benchmarks (Ding et al., 12 Jan 2026).
1. Motivation and Background
Traditional critic-free policy optimization methods such as Group Relative Policy Optimization (GRPO) broadcast a single normalized final outcome reward to every token in a sampled trajectory $\tau_i$, with the outcome advantage defined as $A_i^{\text{out}} = \frac{R_i - \operatorname{mean}(\{R_j\}_{j=1}^{G})}{\operatorname{std}(\{R_j\}_{j=1}^{G})}$, where $R_i \in \{0, 1\}$ is a sparse indicator of answer correctness. This approach assigns uniform credit to all intermediate steps, resulting in high gradient variance and poor credit allocation across long reasoning chains. Moreover, it cannot distinguish among reasoning trajectories that lead to the same correct answer, limiting its effectiveness in tasks requiring granular stepwise supervision.
Process Reward Models (PRMs) such as Qwen2.5-Math-PRM-7B provide dense token- or segment-level feedback, typically via soft correctness scores $s_k \in [0, 1]$. However, relative normalization within a trajectory (e.g., $\hat{A}_k = (s_k - \bar{s})/\sigma_s$) can produce strongly negative advantages in early sequence segments that the PRM scores poorly, with only a single positive advantage near the end. This dynamic can drive policy gradients, in expectation, toward truncated outputs, producing the "premature collapse" phenomenon: when the early tokens carry a negative mean advantage and only a late token carries a positive advantage too small to offset the accumulated negative mass, the expected policy improvement on the prefix is negative, and the policy learns to cut its outputs short. Thus, PRMs alone destabilize critic-free policy learning in long-horizon tasks.
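The collapse dynamic can be seen numerically in a small sketch (the scores and segment layout below are invented, not the paper's data): z-normalizing dense per-segment scores within a trajectory forces the advantages to sum to zero, so a single positive late advantage is exactly balanced by negative mass spread over the prefix.

```python
# Illustrative sketch (not the paper's code): group-relative
# normalization of dense process scores yields a negative mean
# advantage over the trajectory prefix, pushing the policy toward
# shorter outputs.
import numpy as np

# PRM scores for 5 segments of one rollout: early reasoning steps are
# scored low by the PRM; only the final segment is scored high.
scores = np.array([0.2, 0.25, 0.3, 0.2, 0.9])

# Relative (z-score) normalization across the trajectory's segments.
adv = (scores - scores.mean()) / (scores.std() + 1e-8)

prefix_mean = adv[:-1].mean()   # mean advantage over early segments
# z-scores sum to ~0, so prefix_mean == -adv[-1] / 4 < 0:
# the expected gradient discourages the entire prefix.
print(adv, prefix_mean)
```

Because the normalized advantages sum to zero by construction, no choice of a single positive terminal score can rescue the prefix; only an external shift (PRPO's outcome anchor, below) can.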
The solution is to combine outcome-based reliability with process-level guidance, leveraging the low-variance outcome advantage as a distributional "anchor" and using dense process rewards for local credit assignment. PRPO's central mechanism is a location-parameter shift, ensuring critic-free training remains stable while preventing collapse and yielding efficient trajectory-level learning.
2. Sequence Segmentation and Semantic Splitting
PRPO decomposes each rollout of length $T$ into contiguous segments. Segmentation is performed using entropy spikes in the token probability distribution, with the entropy at each decoding step computed as $H_t = -\sum_{v} p_t(v) \log p_t(v)$. The top-$k$ entropy-spike indices, separated by at least a minimum number of tokens, are selected as semantic cut points defining segment boundaries $0 = b_0 < b_1 < \dots < b_K = T$. This segmentation aligns closely with logical reasoning units such as proof steps or equation transformations, enabling more precise assignment of process rewards within each segment.
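A minimal sketch of this segmentation step, assuming per-step next-token distributions are available; the function names and the toy $k$/minimum-gap values are ours, not the paper's:

```python
# Sketch of entropy-based segmentation: compute per-step Shannon
# entropy, then greedily pick the top-k spikes subject to a minimum
# spacing between cut points.
import numpy as np

def token_entropies(probs: np.ndarray) -> np.ndarray:
    """Shannon entropy H_t = -sum_v p_t(v) log p_t(v) per decoding step."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def entropy_cut_points(entropy: np.ndarray, k: int, min_gap: int) -> list[int]:
    """Pick up to k highest-entropy positions, at least min_gap apart."""
    cuts: list[int] = []
    for t in np.argsort(entropy)[::-1]:        # highest entropy first
        if t == 0:                              # position 0 is already a boundary
            continue
        if all(abs(int(t) - c) >= min_gap for c in cuts):
            cuts.append(int(t))
        if len(cuts) == k:
            break
    return sorted(cuts)

# Toy example: 20 decoding steps over a 4-token vocabulary.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(4), size=20)
H = token_entropies(probs)
boundaries = [0] + entropy_cut_points(H, k=3, min_gap=4) + [len(H)]
# boundaries define contiguous segments [b_{k-1}, b_k)
```

The greedy top-down selection with a spacing constraint is one simple way to realize "top-$k$ spikes separated by a minimum gap"; the paper does not specify the exact tie-breaking rule.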
3. Mathematical Structure of Token-Level Advantages
Let each PRPO trajectory be segmented as above, and let segment $k$ be assigned a process reward score $s_k$ by the oracle PRM. The token-level process reward is $r_t = s_k$ for $t \in [b_{k-1}, b_k)$. This is normalized using a fixed prior mean $\mu_0$ and standard deviation $\sigma_0$, giving $A_t^{\text{proc}} = (r_t - \mu_0)/\sigma_0$. The sparse trajectory-wide outcome advantage $A^{\text{out}}$ is the group-normalized correctness reward, as in GRPO. PRPO then shifts every token's process advantage by the outcome advantage:

$$A_t = A_t^{\text{proc}} + A^{\text{out}}.$$

The resulting per-token fused advantage $A_t$ is used in the policy gradient. This alignment guarantees that the mean fused advantage remains non-negative for correct trajectories, forestalling collapse, and keeps the variance across tokens moderate.
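The fusion above can be sketched as follows; the fixed-prior values $(\mu_0, \sigma_0)$, the segment scores, and the boundaries are placeholders chosen for illustration:

```python
# Sketch of PRPO's token-level advantage fusion: segment PRM scores
# are broadcast to their tokens, normalized by a fixed prior
# (mu0, sigma0), then shifted by the group-relative outcome advantage.
import numpy as np

def outcome_advantage(rewards, i):
    """Group-relative outcome advantage for rollout i, R_j in {0, 1}."""
    r = np.asarray(rewards, dtype=float)
    return (r[i] - r.mean()) / (r.std() + 1e-8)

def fused_advantages(seg_scores, seg_bounds, outcome_adv, mu0=0.5, sigma0=0.2):
    """Broadcast s_k to tokens in [b_{k-1}, b_k), normalize, shift."""
    T = seg_bounds[-1]
    a = np.empty(T)
    for s, lo, hi in zip(seg_scores, seg_bounds[:-1], seg_bounds[1:]):
        a[lo:hi] = (s - mu0) / sigma0      # fixed-prior normalization
    return a + outcome_adv                  # outcome shift ("anchor")

A_out = outcome_advantage([1, 0, 0, 1], i=0)            # a correct rollout
A = fused_advantages([0.4, 0.7, 0.9], [0, 3, 6, 10], A_out)
```

For this correct rollout the shift $A^{\text{out}} \approx 1.0$ lifts even the weakest segment's advantage to a positive value, so the mean fused advantage over the trajectory stays non-negative, which is exactly the anti-collapse property claimed above.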
4. PRPO Algorithm and Operational Workflow
The PRPO algorithm operates in a critic-free fashion, using no value network. The workflow consists of:
- Rollout Generation: For each prompt, sample $G$ trajectories from the current policy $\pi_\theta$, recording reference log-probabilities.
- Entropy-Based Segmentation: For each trajectory, compute token-level entropies, select segmentation points by entropy spikes with minimum spacing, and define segment boundaries.
- PRM Evaluation: Compute segment process scores using the PRM oracle.
- Token-Level Advantage Computation: Broadcast each segment score $s_k$ to its tokens and normalize with the fixed prior to obtain $A_t^{\text{proc}}$; compute the outcome advantage $A^{\text{out}}$ for the trajectory; fuse them to obtain the per-token advantage $A_t = A_t^{\text{proc}} + A^{\text{out}}$.
- Policy Gradient Update: Minimize the clipped surrogate loss $\mathcal{L}(\theta) = -\mathbb{E}_t\left[\min\left(\rho_t A_t,\ \operatorname{clip}(\rho_t, 1-\epsilon, 1+\epsilon)\, A_t\right)\right]$, where $\rho_t$ is the importance ratio between the current and rollout policies, with optional KL regularization.
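The update step can be sketched as a standard clipped surrogate with a simple KL penalty; this is an illustrative NumPy version under our own naming, not the paper's training code:

```python
# Sketch of the clipped surrogate objective used in the final step.
# eps corresponds to the clip ratio (0.2 in the paper's setup).
import numpy as np

def prpo_loss(logp_new, logp_old, adv, eps=0.2, kl_coef=0.001):
    """Negative clipped surrogate plus a crude sample-based KL penalty."""
    ratio = np.exp(logp_new - logp_old)               # importance ratio rho_t
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    surrogate = np.minimum(unclipped, clipped).mean()  # pessimistic bound
    kl = (logp_old - logp_new).mean()                  # rough KL estimate
    return -surrogate + kl_coef * kl
```

When the policy has not moved ($\rho_t = 1$), the loss reduces to the negative mean advantage; once the ratio leaves $[1-\epsilon, 1+\epsilon]$, the clip stops further gradient incentive in that direction, which is the usual PPO-style stabilizer.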
This procedure is executed without a value network. For Qwen2.5-Math-1.5B, PRPO achieves efficient and robust fine-tuning with only eight rollouts per prompt. Hyperparameters include a batch size of 128, a KL coefficient of 0.001, a clip ratio of 0.2, and an entropy-split $k$ of 5.
5. Empirical Performance and Ablation
Experimental results on the MATH500 test set demonstrate that PRPO improves pass@1 accuracy from 61.2% (GRPO baseline) to 64.4%. When using PRM-Avg (average process + outcome rewards) the accuracy is 63.6%, and with a combined PRM-Avg+PRPO assignment, 66.0% is achieved. On AMC2023, AIME2024/25, and with the 7B variant, PRPO yields consistent improvements of 2–7 points in pass@k metrics over GRPO and process-only baselines.
Ablation studies confirm two key findings:
- Entropy-based semantic segmentation is essential: random and uniform splits yield markedly inferior results (2.4% and 29.8%, respectively) compared to 64.4% for entropy-based splits.
- Predefined process normalization (using a fixed $\mu_0$ and $\sigma_0$) avoids collapse and achieves stable improvements, whereas relative normalization can induce late-stage collapse.
6. Theoretical Insights and Comparative Analysis
PRPO's theoretical foundation includes an analysis of the collapse phenomenon in pure process-only policy gradients: when early segment advantages are heavily negative, even a positive terminal advantage cannot prevent expected gradient decline on the trajectory prefix, which truncates outputs. Introducing the outcome shift and aligning the fused advantage distribution ensures that for correct trajectories, the mean fused advantage is always non-negative, thus avoiding collapse and excessive variance.
PRPO is critic-free—no value network is used—in contrast to actor-critic methods. The fine-grained token rewards from process guidance yield low-variance, stable updates, while broadcasting outcome advantages provides a "hard anchor" for global coherence. Thus, PRPO offers practical and theoretically robust fine-tuning for long-horizon, multi-step reasoning models.
7. Implementation Specifics and Applicability
The implementation of PRPO for mathematical reasoning tasks uses Qwen2.5-Math-1.5B as the policy model and Qwen2.5-Math-PRM-7B as the process reward oracle. Training is performed on the MATH training split (12,000 problems) and evaluated on held-out competition benchmarks (AMC2023, AIME2024/25). Eight H200 GPUs are used for batch training with a maximum of 2048 tokens per sequence. PRPO is broadly applicable to LLM fine-tuning in settings characterized by sparse outcome rewards where intermediate logical guidance via process reward models is desired. The fusion of outcome reliability and dense process-level guidance eliminates the detrimental effects of premature collapse and enables efficient large-batch, high-token policy updates without a separate value estimator (Ding et al., 12 Jan 2026).