
Process Relative Policy Optimization (PRPO)

Updated 19 January 2026
  • PRPO is a critic-free policy optimization framework that combines outcome rewards with dense process guidance to enable stable training for long-horizon reasoning tasks.
  • It employs entropy-based segmentation to partition token sequences into logical reasoning units, ensuring precise local credit assignment through statistical advantage alignment.
  • Empirical results show that PRPO improves pass@1 accuracy by up to 7 points on benchmarks like MATH500, AMC2023, and AIME, verifying its practical effectiveness.

Process Relative Policy Optimization (PRPO) is a critic-free policy optimization framework designed to address the limitations of outcome-only and process-only reward assignments in LLM training, especially for multi-step reasoning tasks with sparse final rewards and dense intermediate feedback. PRPO achieves robust fine-grained credit assignment by fusing outcome reliability and process-level guidance through token-level segmentation and statistical advantage alignment, thereby circumventing premature output truncation and achieving superior empirical performance in mathematical reasoning benchmarks (Ding et al., 12 Jan 2026).

1. Motivation and Background

Traditional critic-free policy optimization methods such as Group Relative Policy Optimization (GRPO) broadcast a single normalized final outcome reward to every token in a sampled trajectory $\tau$, with the outcome advantage defined as $A_{\text{outcome}}(\tau) = [R_{\text{outcome}}(\tau) - \mu_{\text{rollout}}] / \sigma_{\text{rollout}}$, where $R_{\text{outcome}}$ is a sparse indicator of answer correctness. This approach assigns uniform credit to all intermediate steps, resulting in high gradient variance and poor credit allocation across long reasoning chains. Moreover, it cannot distinguish among valid reasoning trajectories that lead to the same correct answer, limiting its effectiveness in tasks requiring granular stepwise supervision.
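The outcome-advantage computation above can be sketched as follows; the epsilon guard against all-identical rollouts is an implementation convenience, not something stated in the source:

```python
import numpy as np

def grpo_outcome_advantages(rewards):
    """Normalize each rollout's sparse outcome reward against the rollout group.

    `rewards` holds one outcome reward per sampled trajectory, e.g. 1.0 for a
    correct final answer and 0.0 otherwise. The resulting advantage is then
    broadcast to every token of its trajectory.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    mu, sigma = rewards.mean(), rewards.std()
    return (rewards - mu) / (sigma + 1e-8)  # epsilon guards all-equal rollouts

# Four rollouts of one prompt: two correct, two incorrect.
adv = grpo_outcome_advantages([1.0, 0.0, 1.0, 0.0])
# Every token in rollout 0 receives adv[0], every token in rollout 1 adv[1], etc.
```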

Process Reward Models (PRMs) such as Qwen2.5-Math-PRM-7B provide dense token- or segment-level feedback, typically via soft correctness scores $r_t \in [0,1]$. However, relative normalization (e.g., $A_t = r_t - \mu_{\text{process}}$) can produce strongly negative advantages in early sequence segments that score poorly under the PRM, with only a single positive advantage arriving later. This dynamic can drive policy gradients in expectation toward truncated outputs, leading to the "premature collapse" phenomenon: if the $t^*$ tokens preceding a late position carry mean advantage $-a$ while that position carries $+b$, and $a \cdot t^* > b$, then the expected policy improvement on the prefix is negative, causing output collapse. Thus, PRMs alone destabilize critic-free policy learning in long-horizon tasks.
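The collapse condition can be checked with a toy numeric example; the magnitudes below are made up for illustration, not taken from the paper:

```python
# Hypothetical magnitudes: the first t* tokens each carry mean process
# advantage -a, followed by a single late token with advantage +b.
a, b, t_star = 0.2, 1.5, 10

prefix_sum = -a * t_star + b   # total advantage mass over the prefix
collapses = a * t_star > b     # the collapse condition from the text

# Here prefix_sum is negative, so the expected update shrinks the prefix
# even though the late token's advantage is individually large.
```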

The solution is to combine outcome-based reliability with process-level guidance, leveraging the low-variance outcome advantage as a distributional "anchor" and using dense process rewards for local credit assignment. PRPO's central mechanism is a location-parameter shift, ensuring critic-free training remains stable while preventing collapse and yielding efficient trajectory-level learning.

2. Sequence Segmentation and Semantic Splitting

PRPO decomposes each token sequence rollout $\tau$ of length $T$ into $M$ contiguous segments $s_i = [t_{i-1}, t_i)$. The segmentation is performed using entropy spikes in the token probability distribution, computed at each decoding step as $E_k = -\sum_v \pi(x_k = v \mid x_{<k}) \log \pi(x_k = v \mid x_{<k})$. The top-$k$ entropy-spike indices, separated by at least $m$ tokens, are selected as semantic cut points to define segment boundaries $(0, c_1), (c_1, c_2), \ldots, (c_{M-1}, T)$. This segmentation aligns closely with logical reasoning units such as proof steps or equation transformations, enabling more precise assignment of process rewards within each segment.
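A minimal sketch of the entropy-spike selection, assuming the policy's per-step next-token distributions are available as an array; parameter names here (`k`, `min_gap`) are illustrative stand-ins for the paper's $k$ and $m$:

```python
import numpy as np

def entropy_cut_points(probs, k=5, min_gap=8):
    """Pick segment boundaries at the k largest entropy spikes.

    `probs` is a (T, V) array of next-token distributions from the policy.
    `min_gap` enforces the minimum spacing m between selected cut points.
    """
    ent = -np.sum(probs * np.log(probs + 1e-12), axis=-1)  # E_k per step
    cuts = []
    for idx in np.argsort(ent)[::-1]:          # highest entropy first
        if all(abs(idx - c) >= min_gap for c in cuts):
            cuts.append(int(idx))
        if len(cuts) == k:
            break
    return sorted(cuts)  # boundaries then run (0, c1), (c1, c2), ..., (c_{M-1}, T)

# Mostly near-deterministic steps, with uniform (high-entropy) steps at 5, 20, 35.
probs = np.full((40, 4), [0.97, 0.01, 0.01, 0.01])
probs[[5, 20, 35]] = 0.25
cuts = entropy_cut_points(probs, k=3, min_gap=8)
```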

3. Mathematical Structure of Token-Level Advantages

Let each PRPO trajectory $\tau$ be segmented, and let each segment $s_i$ be assigned a process reward score $r_{\text{process}}^i \in [0,1]$ by the oracle PRM. The token-level process reward is then $R_{\text{PRM}}(t; \tau) := r_{\text{process}}^i$ for $t \in s_i$. This is normalized using a fixed prior mean $\mu_{\text{prior,process}} = 0.5$ and standard deviation $\sigma_{\text{prior,process}} = 1/\sqrt{12} \approx 0.289$, giving $A_{\text{PRM}}(t; \tau) = (R_{\text{PRM}}(t; \tau) - \mu_{\text{prior,process}})/\sigma_{\text{prior,process}}$. The sparse trajectory-wide outcome advantage is $A_{\text{outcome}}(\tau) = R_{\text{outcome}}(\tau) - \mu_{\text{rollout}}$. PRPO then shifts every token's process advantage by the outcome advantage:

$$A'_{\text{PRM}}(t; \tau) := A_{\text{PRM}}(t; \tau) + A_{\text{outcome}}(\tau).$$

The resulting per-token fused advantage, $AF_t(\tau) = A_{\text{PRM}}(t; \tau) + A_{\text{outcome}}(\tau)$, is used in the policy gradient. This alignment guarantees that the expected fused-advantage mean remains non-negative for correct trajectories, forestalling collapse, and keeps variance moderate across all tokens.
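The fusion step can be sketched as below, assuming segment scores and lengths have already been obtained from the PRM and the segmentation; argument names are illustrative:

```python
import numpy as np

PRIOR_MU = 0.5                    # fixed prior mean for scores in [0, 1]
PRIOR_SIGMA = 1.0 / np.sqrt(12.0)  # fixed prior std, approx. 0.289

def fused_advantages(segment_scores, segment_lengths, outcome_advantage):
    """Fuse dense process rewards with the sparse outcome advantage.

    Each segment's PRM score r_process^i in [0, 1] is broadcast to its tokens,
    normalized against the fixed prior, then shifted by A_outcome(tau).
    """
    per_token = np.repeat(np.asarray(segment_scores, float),
                          segment_lengths)            # R_PRM(t; tau)
    a_prm = (per_token - PRIOR_MU) / PRIOR_SIGMA      # A_PRM(t; tau)
    return a_prm + outcome_advantage                  # AF_t(tau)

# Three segments of lengths 3, 4, 2 with PRM scores 0.9, 0.4, 0.8,
# in a rollout whose outcome advantage is +0.5.
af = fused_advantages([0.9, 0.4, 0.8], [3, 4, 2], outcome_advantage=0.5)
```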

4. PRPO Algorithm and Operational Workflow

The PRPO algorithm operates in a critic-free fashion, using no value network. The workflow consists of:

  1. Rollout Generation: For each prompt, sample $R$ trajectories using the current policy model $\pi_\theta$, recording reference log-probabilities.
  2. Entropy-Based Segmentation: For each trajectory, compute token-level entropies, select segmentation points by entropy spikes with minimum spacing, and define segment boundaries.
  3. PRM Evaluation: Compute segment process scores $r_{\text{process}}^i$ using the PRM oracle.
  4. Token-Level Advantage Computation: Assign $R_{\text{PRM}}(t)$ and normalize to $A_{\text{PRM}}(t)$; compute $A_{\text{outcome}}$ for the trajectory; fuse to obtain the per-token advantage $AF_t$.
  5. Policy Gradient Update: Minimize the loss $L = -\mathbb{E}_{i,t}\big[AF_t^{(i)} \cdot \log\big(\pi_\theta(x_t^{(i)} \mid x_{<t}^{(i)}) / \pi_{\text{ref}}(x_t^{(i)} \mid x_{<t}^{(i)})\big)\big]$, with optional KL regularization.
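The surrogate loss in the final step can be sketched as follows; this minimal version omits the optional KL regularization and any clipping, and takes per-token log-probabilities as plain arrays:

```python
import numpy as np

def prpo_loss(adv, logp_theta, logp_ref):
    """Critic-free PRPO surrogate loss over one batch of tokens.

    `adv` holds the fused per-token advantages AF_t; `logp_theta` and
    `logp_ref` are per-token log-probabilities under the current policy
    and the frozen reference model, respectively.
    """
    adv = np.asarray(adv, dtype=float)
    log_ratio = np.asarray(logp_theta, float) - np.asarray(logp_ref, float)
    return -np.mean(adv * log_ratio)  # L = -E[AF_t * log(pi_theta / pi_ref)]

# Two tokens: one with positive advantage and increased likelihood,
# one with negative advantage and decreased likelihood.
loss = prpo_loss([1.0, -1.0], [-1.0, -2.0], [-1.5, -1.5])
```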

For Qwen2.5-Math-1.5B, PRPO achieves efficient and robust fine-tuning with only eight rollouts per prompt. Hyperparameters include batch size 128, learning rate $1\times 10^{-6}$, KL coefficient 0.001, clip ratio 0.2, and entropy-$k$ split 5.

5. Empirical Performance and Ablation

Experimental results on the MATH500 test set demonstrate that PRPO improves pass@1 accuracy from 61.2% (GRPO baseline) to 64.4%. When using PRM-Avg (average process + outcome rewards) the accuracy is 63.6%, and with a combined PRM-Avg+PRPO assignment, 66.0% is achieved. On AMC2023, AIME2024/25, and with the 7B variant, PRPO yields consistent improvements of 2–7 points in pass@k metrics over GRPO and process-only baselines.

Ablation studies confirm two key findings:

  • Entropy-based semantic segmentation is essential: random and uniform splits yield markedly inferior results (2.4% and 29.8%, respectively) compared to 64.4% for entropy-based splits.
  • Predefined process normalization (using fixed $\mu = 0.5$ and $\sigma = 0.289$) avoids collapse and achieves stable improvements, whereas relative normalization can induce late collapse.

6. Theoretical Insights and Comparative Analysis

PRPO's theoretical foundation includes an analysis of the collapse phenomenon in pure process-only policy gradients: when early segment advantages are heavily negative, even a positive terminal advantage cannot prevent expected gradient decline on the trajectory prefix, which truncates outputs. Introducing the outcome shift $\delta = A_{\text{outcome}}$ and aligning the fused advantage distribution ensures that for correct trajectories, the mean fused advantage is always non-negative, thus avoiding collapse and excessive variance.

PRPO is critic-free—no value network is used—in contrast to actor-critic methods. The fine-grained token rewards from process guidance yield low-variance, stable updates, while broadcasting outcome advantages provides a "hard anchor" for global coherence. Thus, PRPO offers practical and theoretically robust fine-tuning for long-horizon, multi-step reasoning models.

7. Implementation Specifics and Applicability

The implementation of PRPO for mathematical reasoning tasks uses Qwen2.5-Math-1.5B as the policy model and Qwen2.5-Math-PRM-7B as the process reward oracle. Training is performed on the MATH training split (12,000 problems) and evaluated on held-out external competitions (AMC2023, AIME2024/25). Training runs on eight H200 GPUs with a maximum of 2048 tokens per sequence. PRPO is broadly applicable to LLM fine-tuning in settings characterized by sparse outcome rewards where intermediate logical guidance via process reward models is desired. The fusion of outcome reliability and dense process-level guidance eliminates the detrimental effects of premature collapse and enables efficient large-batch, high-token policy updates without a separate value estimator (Ding et al., 12 Jan 2026).
