Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening

Published 3 Jun 2025 in cs.LG | (2506.02355v2)

Abstract: Reinforcement learning is emerging as a primary driver for improving LLM reasoning capabilities. A fundamental question is whether current reinforcement learning algorithms -- such as Group Relative Policy Optimization (GRPO), the de facto standard algorithm used to improve LLM reasoning -- merely sharpen the base model's distribution around problems it can already solve. We investigate this question in the context of formal theorem proving, which has access to a perfect verifier. We identify a degenerate rank bias in GRPO in which highly probable trajectories are reinforced and rare ones are neglected. This results in distribution sharpening: the model can solve some problems with fewer samples, but underperforms simply sampling more solutions from the original model. To overcome GRPO's rank bias we introduce unlikeliness reward, a simple method for explicitly up-weighting rare but correct solutions. We show that unlikeliness reward mitigates rank bias and improves pass@$N$ across a large range of $N$ in both synthetic and real theorem proving settings. We also uncover an unexpected link between rank bias and a seemingly mundane hyperparameter -- the number of updates per batch -- that leads to a second, complementary mitigation. We combine our insights into a revised GRPO training recipe for formal theorem proving, yielding an open pipeline that achieves competitive performance to DeepSeek-Prover-V1.5-RL on the miniF2F-test benchmark. We release our implementation at https://github.com/AndreHe02/rewarding-unlikely-release

Abstract PDF Upgrade to Chat

Summary

The paper identifies GRPO's rank bias that favors high-probability proofs, leading to suboptimal large-N performance in reinforcement learning.
It introduces an unlikeliness reward that penalizes common correct solutions, effectively uplifting low-probability proofs and improving pass@N metrics.
Additional experiments show that increasing PPO epochs further mitigates bias, enhancing sample diversity and overall problem-solving capacity.

Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening

Introduction

This work systematically examines the multi-sample performance limitations of Group Relative Policy Optimization (GRPO) in reinforcement learning (RL) fine-tuning of LLMs, specifically in formal theorem proving. Contrary to the expectation that RL should systematically uplift all correct solutions, the analysis reveals a pronounced rank bias: GRPO predominantly reinforces high-probability solutions, neglecting rare—but correct—proofs. This bias induces distribution sharpening and impairs pass@ $N$ performance, especially for large values of $N$ , which is critical for settings like theorem proving where verification is exact and the practical protocol involves extensive sampling.

Figure 1: Schematic depiction of GRPO’s rank bias, where model updates reinforce already probable solutions but fail to uplift low-probability correct solutions; the proposed unlikeliness reward explicitly mitigates this bias.

GRPO and Pass@ $N$ : Diagnosing the Rank Bias

Empirical Characterization

Empirical studies on a Lean-based theorem proving task demonstrate that fine-tuning DeepSeek-Prover-V1.5-SFT with standard GRPO protocol results in significant gains in pass@ $N$ for small $N$ (e.g., pass@1 to pass@16), but deteriorates performance for large $N$ compared to the base model. The implication is clear: GRPO sharpens the model’s distribution, increasing the likelihood of already probable solutions but failing to improve the mass on rare, correct completions. This narrowing effect becomes detrimental under protocols that rely on diversity through extensive sampling.

Figure 2: GRPO training leads to improved pass@ $N$ at small values but underperforms the base SFT model for large $N$ .

Theoretical analysis reveals that optimizing pass@ $N$ for large $N$ demands the RL objective not just increase the probability of top solutions, but also specifically uplift low-probability correct responses. Simulations confirm that a marginal $\times$ (1 + $\epsilon$ ) increase in correct solution probability only brings meaningful improvements in pass@ $N$ when those probabilities are in the $O(1/N)$ regime.

Figure 3: Analytical relationship between initial solution probability $p_0$ , the RL-induced multiplicative improvement, and the expected pass@ $N$ gain; significant improvements at large $N$ require uplifting rare solutions.

Quantifying Rank Bias

To isolate the source of the suboptimal pass@ $N$ , the authors introduce the uplift rate $u_j$ , measuring the tendency of GRPO to increase model probability for each rank among positive (correct) samples. The analysis unambiguously shows that GRPO almost never uplifts the model probability of the lowest-ranked (i.e., rarest) correct samples, confirming the existence and severity of rank bias.

Figure 4: Uplift rate $u_j$ as a function of proof rank—GRPO disproportionately boosts high-probability correct samples and neglects rare ones.

Mitigating Rank Bias: Unlikeliness Reward and PPO Epoch Tuning

Unlikeliness Reward

To counteract rank bias, the work proposes the unlikeliness reward, which modifies the reward assignment by penalizing high-probability correct solutions and up-weighting low-probability ones. Specifically, the reward for a correct sample is scaled by a monotonic function of its inverse rank among policy outputs. This intervention directly incentivizes exploration and distributional diversity without impacting the expected gradient direction for correct samples.

Effect of Additional PPO Epochs

A complementary finding is that increasing the number of PPO epochs per batch attenuates rank bias. As high-probability solutions quickly saturate the PPO clipping bound, subsequent optimization steps emphasize low-probability samples, providing them with stronger learning signals. However, this approach brings a substantial increase in computational cost and potential instability, making unlikeliness reward the more sample-efficient and robust mitigation.

Experimental Validation

Main Results

Experiments on a Lean theorem proving benchmark compare several GRPO variants—including ones with unlikeliness reward and increased PPO epochs. Both modifications yield substantial improvements in pass@ $N$ for large $N$ , with the unlikeliness reward being particularly effective. These improvements are not accompanied by significant degradation in single-sample accuracy, and overall, the total number of problems solved during training increases.

Figure 5: Pass@ $N$ performance across GRPO variants on the validation set; the unlikeliness reward and increased PPO epochs both markedly improve large- $N$ accuracy.

Rank Bias and Sample Diversity Analysis

Extending the uplift rate analysis to GRPO variants, unlikeliness reward not only neutralizes the original bias but actually reverses it for the lowest-probability correct solutions. Furthermore, tracking the number of unique proofs generated shows that unlikeliness reward substantially increases sample diversity—a key factor for high pass@ $N$ .

Figure 6: Uplift rate by rank for GRPO variants; the unlikeliness reward dramatically increases reinforcement for rare, correct samples.

Figure 7: Evolution of unique proof diversity during training; unlikeliness reward yields a pronounced resurgence in diversity.

Large-scale Benchmarking

When scaling to a larger and more challenging Lean problem set, the GRPO-Unlikeliness-2 method achieves pass@32 and pass@128 metrics competitive with the strongest published RL models (DeepSeek-Prover-V1.5-RL) on the MiniF2F-test benchmark, establishing the practical efficacy of this simple modification to reward engineering.

Theoretical and Practical Implications

This work clarifies a central deficit in the application of RL to structured, verifiable reasoning tasks: by default, standard objectives like GRPO are misaligned with the desiderata of multi-sample discovery because they reinforce exploitation over exploration. The introduction of the unlikeliness reward directly connects reward structure to pass@ $N$ objectives, offering a methodologically principled and empirically validated solution.

The finding that PPO epoch count affects rank bias also suggests broader implications for the optimizer’s role in RL distributional behavior, motivating future research at the intersection of optimization schedules and reward shaping.

Conclusion

The analysis in "Rewarding the Unlikely: Lifting GRPO Beyond Distribution Sharpening" (2506.02355) identifies and remedies the critical distribution sharpening and rank bias of GRPO in formal theorem proving RL pipelines. The unlikeliness reward emerges as a lightweight yet powerful intervention, enabling models to achieve state-of-the-art or near state-of-the-art multi-sample performance while sustaining sample diversity and maintaining efficiency. This work establishes a new baseline methodology for RL fine-tuning in settings where diversity and exploration are critical, with implications extending to other structured generation tasks where exact verification enables large-scale sampling and evaluation.

Markdown