RLPR: Reference Probability Reward in RL

Updated 4 February 2026
  • RLPR is a verifier-free reinforcement learning approach that derives rewards from a model’s intrinsic probability estimates on reference outputs.
  • It aggregates token-level probabilities using the arithmetic mean and employs debiasing and adaptive filtering to stabilize training.
  • Empirical evaluations show RLPR outperforms traditional verifier-based methods, boosting accuracy across mathematical and general-domain benchmarks.

Reinforcement Learning with Reference Probability Reward (RLPR) is a verifier-free reinforcement learning framework for LLMs in which the reward signal is derived from the model's own intrinsic probability estimates for reference (ground-truth) outputs. RLPR generalizes prior work on Reinforcement Learning with Verifiable Rewards (RLVR), extending its applicability from mathematics and programming domains—where symbolic verifiers are feasible—to unconstrained general-domain reasoning, comprehension, and open-ended tasks. By leveraging a model-native, fine-grained estimator of answer quality, RLPR achieves robust, scalable reward assignment without reliance on external verifiers or reward models, while integrating stabilization techniques to manage the high variance of such probability-based rewards (Yu et al., 23 Jun 2025).

1. Motivation and Foundational Concepts

Traditional RL for LLMs employs either rule-based verifiers (e.g., programmatic checkers, code execution engines) or trained reward models to produce scalar rewards. RLVR, in particular, advanced LLM reasoning ability by using such verifiers; however, the approach faces intrinsic limitations:

  • Verifier brittleness: Rule-based verifiers require extensive engineering and do not generalize to the linguistic diversity of natural language outputs.
  • Domain restriction: Neural verifier models require costly dataset annotation and tend to have poor generalization outside narrowly defined domains.
  • Scalability: Integrating verifier logic adds engineering complexity and computational overhead, impeding deployment on a wide range of tasks.

RLPR eliminates the need for external verifiers by using a pre-trained LLM’s own token-level output probabilities on reference answers as a scalar reward. This mechanism enables reinforcement learning for compositional and open-domain tasks, facilitating the extension of RLVR architectures to domains such as reading comprehension, planning, and generalized question answering.

2. Mathematical Framework and Reward Definition

Let $x$ denote the prompt (question), $z$ a sampled “chain-of-thought” reasoning sequence, and $y$ the final answer. The policy $\pi_\theta$ generates an output $o = (z \,\|\, y)$, and the reward assignment comprises the following steps (Yu et al., 23 Jun 2025):

  1. Per-token probabilities: For each token $o'_i$ in the reference sequence $o' = (z \,\|\, y^*)$, where $y^*$ is the ground-truth answer, the policy assigns a probability $p_i = p_\theta(o'_i \mid o'_{<i}, x)$.
  2. Aggregation: Token probabilities associated with the reference answer $y^*$ are aggregated into a scalar reward, $r_\theta(x, y) = f_{seq}(\{p_i \mid o'_i \in y^*\})$.
    • Naive aggregation via the geometric mean (sequence probability) yields high variance, since a single low-probability token collapses the score.
    • Empirically preferred: the arithmetic mean per-token probability (PR), $f_{seq}^{mean} = \frac{1}{|y^*|}\sum_i p_i$, correlates best with answer correctness and yields more stable gradients.
  3. Expected reward maximization: The objective is to maximize
  3. Expected reward maximization: The objective is to maximize

$$J(\theta) = \mathbb{E}_{o \sim \pi_\theta(\cdot \mid x)}\left[r_\theta(x, y)\right]$$

with gradients estimated using the standard REINFORCE estimator:

$$\nabla_\theta J(\theta) = \mathbb{E}_{o \sim \pi_\theta(\cdot \mid x)}\left[r_\theta(x, y)\,\nabla_\theta \log \pi_\theta(o \mid x)\right]$$

This directly connects policy optimization to model-native reward estimation without intermediate reward models or external annotation.
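The aggregation step above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: in practice each $p_i$ comes from a forward pass of the policy over the concatenated reference sequence, whereas here the probabilities are supplied directly.

```python
import math

def sequence_reward(token_probs, method="mean"):
    """Aggregate per-token reference probabilities p_i (tokens of y*)
    into a scalar reward r_theta(x, y)."""
    if method == "mean":
        # Arithmetic mean (PR): the variant RLPR reports as most stable.
        return sum(token_probs) / len(token_probs)
    if method == "geometric":
        # Geometric mean (normalized sequence probability): one rare
        # token drags the whole score down, hence high variance.
        return math.exp(sum(math.log(p) for p in token_probs) / len(token_probs))
    raise ValueError(f"unknown method: {method}")

# Toy illustration: a single low-probability token (0.01) collapses the
# geometric mean but only moderately lowers the arithmetic mean.
probs = [0.9, 0.85, 0.95, 0.01]
r_mean = sequence_reward(probs, "mean")       # 0.6775
r_geom = sequence_reward(probs, "geometric")  # ≈ 0.29
```

The gap between the two aggregates on this toy input mirrors the variance argument in step 2: the product-style score is dominated by its worst token.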

3. Stabilization and Variance Control Techniques

RLPR addresses two key sources of instability in the reference probability reward:

  • Reward debiasing: The raw mean per-token probability $r$ conflates answer quality with prompt difficulty and reference phrasing. To remove static biases, a base reward $r'$ is computed by evaluating the LLM on the reference answer $y^*$ alone (without reasoning $z$). The debiased reward is then

$$\hat{r} = \operatorname{clip}_{[0,1]}(r - r')$$

ensuring stability and comparability across prompts and confining the reward to $[0,1]$.

  • Standard deviation filtering (adaptive curriculum): To avoid degenerate updates from uninformative samples (e.g., trivially easy or impossible prompts), the standard deviation $\sigma_r$ of reward values per prompt is tracked. Prompts with $\sigma_r < \beta$ (a threshold updated via exponential moving average) are filtered out of training, leading to more informative sample selection and preserving effective exploration.
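The two stabilizers can be sketched as follows. This is an illustrative reading of the definitions above, assuming per-rollout rewards have already been computed; the EMA update of the threshold $\beta$ is omitted here.

```python
def debiased_reward(r, r_base):
    """Debiased reward r_hat = clip_[0,1](r - r').

    r:      mean token probability of y* given the prompt plus reasoning z.
    r_base: the same quantity with the reasoning omitted (the static bias)."""
    return min(1.0, max(0.0, r - r_base))

def keep_prompt(rollout_rewards, beta):
    """Adaptive filtering: keep a prompt only if the spread (population
    standard deviation) of its rollout rewards meets the threshold beta."""
    mean = sum(rollout_rewards) / len(rollout_rewards)
    var = sum((r - mean) ** 2 for r in rollout_rewards) / len(rollout_rewards)
    return var ** 0.5 >= beta

# A prompt where every rollout scores identically carries no learning
# signal and is dropped; a prompt with spread-out rewards is kept.
keep_prompt([0.5, 0.5, 0.5, 0.5], beta=0.05)  # False
keep_prompt([0.1, 0.9, 0.5, 0.4], beta=0.05)  # True
```

Clipping to $[0,1]$ means a reasoning trace that makes the reference *less* likely than the no-reasoning baseline simply earns zero, rather than a negative reward.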

An ablation study confirms each component's necessity: omitting token-probability reward, debiasing, or standard deviation filtering yields systematically lower performance and less stable optimization (Yu et al., 23 Jun 2025).

4. Training Procedure and Implementation

The RLPR algorithm adopts a GRPO (Group Relative Policy Optimization)-style policy gradient, analogous to PPO with clip-based objectives. The policy update incorporates both reward and adaptive KL terms to prevent policy collapse. The key steps are as follows (Yu et al., 23 Jun 2025):

  • Initialize $\pi_\theta$ (the pre-trained LLM); set the filtering threshold $\beta$ (e.g., 0.5)
  • Per iteration:
  1. Sample a batch of $M$ prompts $\{x_j\}$ and, for each, $K$ rollouts.
  2. For each rollout, compute the raw per-token reward $r_{j,k}$, the base score $r'_j$, and the debiased, clipped reward $\hat{r}_{j,k}$.
  3. Compute the per-prompt reward standard deviation $\sigma_j$; filter out prompts where $\sigma_j < \beta$.
  4. Update the policy by optimizing the PPO-clip surrogate:

$$L_{PPO} = \mathbb{E}\left[\min\!\big(\hat{r} \cdot \rho(\theta),\ \hat{r} \cdot \operatorname{clip}(\rho(\theta),\, 1-\epsilon,\, 1+\epsilon)\big)\right]$$

    where $\rho(\theta) = \pi_\theta(o \mid x) / \pi_{\theta_{old}}(o \mid x)$ and $\epsilon \approx 0.2$.

  5. Update $\beta$ via an exponential moving average.

Typical hyperparameter settings include batch sizes of 768 rollouts (96 prompts × 8 samples), a learning rate of $5\times10^{-7}$, four policy updates per rollout batch, and an entropy penalty of $1\times10^{-3}$ (Yu et al., 23 Jun 2025).
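Two numerical pieces of the loop above, the clipped surrogate term and the EMA threshold update, can be sketched as below. The smoothing factor `alpha` is an assumed illustrative value; the paper's exact EMA coefficient is not stated in this summary.

```python
def ppo_clip_term(r_hat, rho, eps=0.2):
    """One term of the clipped surrogate:
    min(r_hat * rho, r_hat * clip(rho, 1 - eps, 1 + eps)),
    where rho is the importance ratio pi_theta / pi_theta_old."""
    clipped = max(1.0 - eps, min(1.0 + eps, rho))
    return min(r_hat * rho, r_hat * clipped)

def ema_update(beta, batch_std, alpha=0.9):
    """Exponential-moving-average update of the filtering threshold beta
    toward the observed per-batch reward standard deviation."""
    return alpha * beta + (1.0 - alpha) * batch_std

# With a positive reward, a ratio outside [1 - eps, 1 + eps] is clipped,
# capping the incentive to move the policy too far in one update.
ppo_clip_term(0.5, 1.5)  # 0.6  (uses the clipped ratio 1.2)
ppo_clip_term(0.5, 0.9)  # 0.45 (ratio inside the clip range, unchanged)
```

Note the surrogate is maximized during training; the `min` over the clipped and unclipped terms is what bounds the update size, as in standard PPO.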

5. Empirical Results and Comparative Performance

RLPR has been evaluated across seven public benchmarks: four general-domain (MMLU-Pro, GPQA, TheoremQA, WebInstruct) and three mathematical (MATH-500, Minerva, AIME24), reporting accuracy in average@k as specified by each task (Yu et al., 23 Jun 2025). Representative results for Qwen2.5-7B are:

| Model | Verifier | TheoremQA | Minerva | Overall |
|---|---|---|---|---|
| Qwen2.5-7B (base) | — | 41.4 | 37.6 | 40.9 |
| VeriFree | ℓ-reference | 47.6 | 49.0 | 49.4 |
| General Reasoner | 1.5B verifier model | 52.1 | 51.7 | 52.0 |
| RLPR | none (verifier-free) | 55.4 | 56.5 | 53.6 |

Key findings include:

  • RLPR outperforms VeriFree by +7.8 points on TheoremQA and +7.5 points on Minerva.
  • RLPR outperforms General Reasoner (which uses a dedicated 1.5B verifier model) by +4.8 points on Minerva and +1.6 points overall.
  • On general-domain benchmarks, RLPR yields a +3.5 point average gain over the Qwen2.5-7B base model.
  • Gains are robust across architectures, including Llama3.1-8B and Gemma2-2B.

Ablation experiments show sharp drops in TheoremQA/Minerva average accuracy when any component is removed:

  • No token-prob reward: 33.5 / 34.2
  • No debiasing: 52.7 / 54.1
  • No std-filtering: 52.5 / 55.1
  • Full RLPR: 55.4 / 56.5

6. Connections, Limitations, and Future Directions

RLPR contrasts both with rule-based reward RL and with “reference-based reward model” approaches such as Cooper/VerifyRM, which employ an explicit reward model $R_\varphi(x, r, y)$ trained on positive/negative $(x, r, y)$ triples, with the reference answer $r$ provided as context and the label derived by hybrid annotation and contrastive updates (Hong et al., 7 Aug 2025). Cooper’s approach, while effective at suppressing reward hacking and yielding a strong reward model (89.4% accuracy on VerifyBench-Math), is distinct from RLPR in that Cooper maintains an external reward model rather than relying on intrinsic model probabilities.

A comparable line of research, “Replacing Rewards with Examples” (RCE), frames RL objectives using success examples and recursive classification, but targets classical RL environments rather than LLMs (Eysenbach et al., 2021).

RLPR’s primary advantages are:

  • End-to-end scalability to arbitrary domains without the need for task-specific verifier code or reward model annotation.
  • Robust, stable training via internal probability normalization and adaptive data filtering.
  • Superior empirical performance, even surpassing state-of-the-art verifier-based approaches on multiple benchmarks (Yu et al., 23 Jun 2025).

Current limitations include the requirement for ground-truth reference answers and non-trivial extension to unsupervised or multimodal tasks. Future work may focus on integrating self-generated references, scaling to larger parameter regimes, or developing self-supervised variants for settings lacking references (Yu et al., 23 Jun 2025).
