RLPR: Reference Probability Reward in RL
- RLPR is a verifier-free reinforcement learning approach that derives rewards from a model’s intrinsic probability estimates on reference outputs.
- It aggregates token-level probabilities using the arithmetic mean and employs debiasing and adaptive filtering to stabilize training.
- Empirical evaluations show RLPR outperforms traditional verifier-based methods, boosting accuracy across mathematical and general-domain benchmarks.
Reinforcement Learning with Reference Probability Reward (RLPR) is a verifier-free reinforcement learning framework for LLMs in which the reward signal is derived from the model's own intrinsic probability estimates for reference (ground-truth) outputs. RLPR generalizes prior work on Reinforcement Learning with Verifiable Rewards (RLVR), extending its applicability from mathematics and programming domains—where symbolic verifiers are feasible—to unconstrained general-domain reasoning, comprehension, and open-ended tasks. By leveraging a model-native, fine-grained estimator of answer quality, RLPR achieves robust, scalable reward assignment without reliance on external verifiers or reward models, while integrating stabilization techniques to manage the high variance of such probability-based rewards (Yu et al., 23 Jun 2025).
1. Motivation and Foundational Concepts
Traditional RL for LLMs employs either rule-based verifiers (e.g., programmatic checkers, code execution engines) or trained reward models to produce scalar rewards. RLVR, in particular, advanced LLM reasoning ability by using such verifiers; however, the approach faces intrinsic limitations:
- Verifier brittleness: Rule-based verifiers require extensive engineering and do not generalize to the linguistic diversity of natural language outputs.
- Domain restriction: Neural verifier models require costly dataset annotation and tend to have poor generalization outside narrowly defined domains.
- Scalability: Integrating verifier logic adds engineering complexity and computational overhead, impeding deployment on a wide range of tasks.
RLPR eliminates the need for external verifiers by using a pre-trained LLM’s own token-level output probabilities on reference answers as a scalar reward. This mechanism enables reinforcement learning for compositional and open-domain tasks, facilitating the extension of RLVR architectures to domains such as reading comprehension, planning, and generalized question answering.
2. Mathematical Framework and Reward Definition
Let $x$ denote the prompt (question), $z$ a sampled “chain-of-thought” reasoning sequence, and $y$ the final answer. The policy $\pi_\theta$ generates an output $(z, y)$, and the reward assignment comprises the following steps (Yu et al., 23 Jun 2025):
- Per-token probabilities: For each token $y^*_i$ in the reference sequence $y^* = (y^*_1, \ldots, y^*_n)$, where $y^*$ is the ground truth, the policy assigns a probability $p_i = \pi_\theta(y^*_i \mid x, z, y^*_{<i})$.
- Aggregation: Token probabilities associated with the reference answer are aggregated into a scalar reward $r$.
- Naive aggregation via the geometric mean (the sequence probability $\prod_i p_i$) yields high variance, since a single low-probability token collapses the reward.
- Empirically preferred: the arithmetic mean per-token probability (PR), defined as $r = \frac{1}{n} \sum_{i=1}^{n} p_i$, correlates best with answer correctness and yields more stable gradients.
- Expected reward maximization: The objective is to maximize
$$J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, z \sim \pi_\theta(\cdot \mid x)}\left[ r(x, z) \right],$$
with gradients estimated using the standard REINFORCE estimator:
$$\nabla_\theta J(\theta) = \mathbb{E}_{x,\, z \sim \pi_\theta}\left[ r(x, z)\, \nabla_\theta \log \pi_\theta(z \mid x) \right].$$
This directly connects policy optimization to model-native reward estimation without intermediate reward models or external annotation.
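The aggregation choice above can be illustrated with a toy sketch (not the authors' code): a single low-probability token collapses the product-based sequence probability, while the arithmetic-mean PR reward degrades gracefully.

```python
import math

def pr_reward(token_probs):
    """Arithmetic mean of per-token reference probabilities (PR reward)."""
    return sum(token_probs) / len(token_probs)

def seq_prob_reward(token_probs):
    """Naive product aggregation: the raw sequence probability."""
    return math.prod(token_probs)

# Reference-answer tokens: mostly high probability, one uncertain token.
probs = [0.9, 0.95, 0.85, 0.01, 0.9]

print(pr_reward(probs))        # ≈ 0.722: the mean reward stays informative
print(seq_prob_reward(probs))  # ≈ 0.0065: the product collapses on one token
```

This is why the product-based reward exhibits high variance: its value is dominated by the single least-confident token rather than overall answer quality.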
3. Stabilization and Variance Control Techniques
RLPR addresses two key sources of instability in the reference probability reward:
- Reward debiasing: The raw mean per-token probability conflates answer quality with prompt difficulty and reference phrasing. To remove these static biases, a base reward $r_{\text{base}}$ is computed by evaluating the LLM on the reference answer alone (without the reasoning $z$). The debiased reward is then
$$\hat{r} = \operatorname{clip}(r - r_{\text{base}},\, 0,\, 1),$$
ensuring stability and comparability across prompts and confining the reward to $[0, 1]$.
- Standard deviation filtering (adaptive curriculum): To avoid degenerate updates from uninformative samples (e.g., trivially easy or impossible prompts), the standard deviation $\sigma$ of reward values per prompt is tracked. Prompts with $\sigma < \beta$ (a threshold updated via exponential moving average) are filtered out from training, leading to more informative sample selection and preserving effective exploration.
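A minimal sketch of the two stabilizers, using assumed helper names (`debias`, `keep_prompt`, `update_beta`) and an assumed EMA decay rather than anything specified in the paper:

```python
import statistics

def debias(r, r_base):
    """Debiased reward: clip(r - r_base, 0, 1)."""
    return min(max(r - r_base, 0.0), 1.0)

def keep_prompt(rewards, beta):
    """Std filtering: keep a prompt only if its rollout rewards vary enough."""
    return statistics.pstdev(rewards) >= beta

def update_beta(beta, sigma, alpha=0.9):
    """EMA update of the filtering threshold (alpha is an assumed decay)."""
    return alpha * beta + (1 - alpha) * sigma

# Example: a prompt whose four rollouts all score near the base reward
# is uninformative and gets filtered out of the batch.
rewards = [debias(r, 0.6) for r in (0.62, 0.61, 0.63, 0.62)]
print(keep_prompt(rewards, beta=0.05))  # False
```

Filtering such near-constant-reward prompts keeps the policy gradient focused on prompts where the rollouts actually disagree, which is where the advantage signal is informative.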
An ablation study confirms each component's necessity: omitting token-probability reward, debiasing, or standard deviation filtering yields systematically lower performance and less stable optimization (Yu et al., 23 Jun 2025).
4. Training Procedure and Implementation
The RLPR algorithm adopts a GRPO (Group Relative Policy Optimization)-style policy gradient, analogous to PPO with a clip-based objective. The policy update incorporates both the reward and an adaptive KL term to prevent policy collapse. The key steps are as follows (Yu et al., 23 Jun 2025):
- Initialize $\pi_\theta$ from a pre-trained LLM; set the filtering threshold $\beta$ (e.g., 0.5).
- Per iteration:
- Sample a batch of prompts and, for each, $N$ rollouts.
- For each rollout, compute the raw per-token reward $r$, the base score $r_{\text{base}}$, and the debiased, clipped reward $\hat{r} = \operatorname{clip}(r - r_{\text{base}}, 0, 1)$.
- Compute the per-prompt reward standard deviation $\sigma$; filter out prompts where $\sigma < \beta$.
- Update the policy by optimizing the PPO-clip surrogate:
$$L(\theta) = \mathbb{E}_t\left[ \min\left( \rho_t A_t,\ \operatorname{clip}(\rho_t,\, 1 - \epsilon,\, 1 + \epsilon)\, A_t \right) \right],$$
where $\rho_t = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ and $A_t$ is the group-normalized advantage computed from the debiased rewards $\hat{r}$.
- Update $\beta$ with an exponential moving average of the observed reward standard deviations.
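The group-relative advantage and clipped surrogate in the steps above can be sketched numerically; this is a toy illustration with assumed values, not the released implementation:

```python
import statistics

def group_advantages(rewards):
    """GRPO-style advantage: normalize rewards within one prompt's rollout group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-clip objective for one token: min(rho * A, clip(rho) * A)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

# Debiased rewards for 4 rollouts of the same prompt.
adv = group_advantages([0.8, 0.2, 0.5, 0.5])

# A large policy ratio is clipped, so the update magnitude stays bounded.
print(clipped_surrogate(ratio=1.5, advantage=adv[0]))
```

The group normalization means a rollout is rewarded only for being better than its siblings on the same prompt, which is what makes the per-prompt standard deviation a natural filtering statistic: if $\sigma \approx 0$, every advantage in the group is near zero.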
Typical hyperparameter settings include a batch size of 768 rollouts (96 prompts × 8 samples) and four policy updates per rollout batch, with the learning rate and entropy-penalty coefficient as reported in (Yu et al., 23 Jun 2025).
5. Empirical Results and Comparative Performance
RLPR has been evaluated across seven public benchmarks: four general-domain (MMLU-Pro, GPQA, TheoremQA, WebInstruct) and three mathematical (MATH-500, Minerva, AIME24), reporting accuracy in average@k as specified by each task (Yu et al., 23 Jun 2025). Representative results for Qwen2.5-7B are:
| Model | Verifier | TheoremQA | Minerva | Overall |
|---|---|---|---|---|
| Qwen2.5-7B (base) | — | 41.4 | 37.6 | 40.9 |
| VeriFree (ℓ-reference) | — | 47.6 | 49.0 | 49.4 |
| General Reasoner (model) | 1.5B verifier | 52.1 | 51.7 | 52.0 |
| RLPR | — | 55.4 | 56.5 | 53.6 |
Key findings include:
- RLPR outperforms VeriFree by +7.8 points on TheoremQA and +7.5 points on Minerva.
- RLPR outperforms General Reasoner (using a dedicated 1.5B verifier) by +4.8 points on Minerva and +1.3 overall.
- On general-domain benchmarks, RLPR yields a +3.5 point average gain over the Qwen2.5-7B base model.
- Gains hold across model architectures, including Llama3.1-8B and Gemma2-2B.
Ablation experiments show sharp drops in TheoremQA / Minerva accuracy when any component is removed:
- No token-prob reward: 33.5 / 34.2
- No debiasing: 52.7 / 54.1
- No std-filtering: 52.5 / 55.1
- Full RLPR: 55.4 / 56.5
6. Connections, Limitations, and Future Directions
RLPR contrasts both with rule-based reward RL and with “reference-based reward model” approaches such as Cooper/VerifyRM, which employ an explicit reward model trained on positive/negative triples, with the reference answer provided as context and the label derived by hybrid annotation and contrastive updates (Hong et al., 7 Aug 2025). Cooper’s approach, while effective at suppressing reward hacking and yielding a strong reward model (89.4% accuracy on VerifyBench-Math), is distinct from RLPR in that Cooper maintains an external reward model rather than relying on intrinsic model probabilities.
A comparable line of research, “Replacing Rewards with Examples” (RCE), frames RL objectives using success examples and recursive classification, but targets classical RL environments rather than LLMs (Eysenbach et al., 2021).
RLPR’s primary advantages are:
- End-to-end scalability to arbitrary domains without the need for task-specific verifier code or reward model annotation.
- Robust, stable training via internal probability normalization and adaptive data filtering.
- Superior empirical performance, even surpassing state-of-the-art verifier-based approaches on multiple benchmarks (Yu et al., 23 Jun 2025).
Current limitations include the requirement for ground-truth reference answers and non-trivial extension to unsupervised or multimodal tasks. Future work may focus on integrating self-generated references, scaling to larger parameter regimes, or developing self-supervised variants for settings lacking references (Yu et al., 23 Jun 2025).