
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Published 18 Apr 2025 in cs.AI, cs.CL, and cs.CV | (2504.13837v2)

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of LLMs, particularly on mathematics and programming tasks. Similar to how traditional RL helps agents explore and learn new strategies, RLVR is believed to enable LLMs to continuously self-improve, thus acquiring novel reasoning abilities beyond those of the corresponding base models. In this study we critically examine the current state of RLVR by systematically probing the reasoning capability boundaries of RLVR-trained LLMs across various model families, RL algorithms, and math, coding, and visual reasoning benchmarks, using pass@k at large k values as the evaluation metric. Surprisingly, we find that the current training setup does not elicit fundamentally new reasoning patterns. While RLVR-trained models outperform their base models at small k (e.g., k = 1), the base models achieve a higher pass@k score when k is large. Coverage and perplexity analyses show that the observed reasoning abilities originate from and are bounded by the base model. Treating the base model as an upper bound, our quantitative analysis shows that six popular RLVR algorithms perform similarly and remain far from optimal in leveraging the potential of the base model. By contrast, we find that distillation can introduce new reasoning patterns from the teacher and genuinely expand the model's reasoning capabilities. Overall, our findings suggest that current RLVR methods have not yet realized the potential of RL to elicit truly novel reasoning abilities in LLMs. This highlights the need for improved RL paradigms, such as continual scaling and multi-turn agent-environment interaction, to unlock this potential.

Summary

  • The paper shows that RLVR biases the output sampling process rather than generating fundamentally new reasoning patterns in LLMs.
  • Methodology using the pass@k metric across math, code, and visual tasks reveals that base models catch up as more attempts are allowed.
  • Results indicate that distillation can more effectively expand reasoning boundaries compared to the limited gains offered by RLVR.

Introduction

The paper examines the use of Reinforcement Learning with Verifiable Rewards (RLVR) to enhance reasoning capabilities in LLMs, particularly in mathematics and programming domains. Despite RLVR's empirical success in improving these capabilities, its actual impact on incentivizing novel reasoning patterns beyond what base models can achieve has not been critically evaluated. The paper questions whether RLVR indeed leads to the development of new reasoning abilities or merely biases the sampling process to yield correct answers more efficiently.

Reinforcement Learning with Verifiable Rewards (RLVR)

RLVR applies reinforcement learning to LLMs using automatically computable rewards: a response is rewarded when its final answer matches the ground truth in mathematics, or when its program passes unit tests in code generation. The underlying assumption is that RLVR enables LLMs to self-improve, acquiring advanced reasoning behaviors absent from their base models.
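The verifiable-reward setup described above can be sketched as a pair of binary reward functions. This is a minimal illustrative sketch, not the paper's implementation; the function names and the subprocess-based test harness are assumptions:

```python
import subprocess
import sys
import tempfile

def math_reward(model_answer: str, ground_truth: str) -> float:
    """Binary reward: 1.0 iff the final answer matches the reference exactly."""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0

def code_reward(program: str, unit_tests: str) -> float:
    """Binary reward: 1.0 iff the generated program passes its unit tests."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program + "\n\n" + unit_tests)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=10)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0
```

Because both checks run without human grading, rewards of this shape scale to large RL training runs.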

Figure 1: The effect of RLVR on an LLM's reasoning ability. All reasoning paths in the RLVR model are already present in the base model.

Methodology and Metrics

Pass@k Metric

The researchers employed the pass@k metric: a problem counts as solved if at least one of k sampled attempts is correct. Evaluated at large k, this metric probes the boundary of reasoning capability for both base and RL-trained models, testing whether RLVR yields fundamentally new reasoning patterns.
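Pass@k is commonly computed with the unbiased estimator of Chen et al. (2021), which avoids the variance of literally drawing k samples: generate n ≥ k samples per problem, count the c correct ones, and compute the probability that a size-k subset contains at least one. A minimal sketch under that assumption:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n total samples of which
    c are correct, is correct. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer incorrect samples than k: every size-k subset has a correct one.
        return 1.0
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))
```

For example, with n = 2 samples of which c = 1 is correct, pass@1 is 0.5, matching the intuition that a single random draw succeeds half the time.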

Experimental Analysis

RLVR in Mathematical Reasoning

Across various LLM families and benchmarks such as GSM8K and MATH500, RLVR-trained models achieved higher pass@1 rates than their base models. However, as k increased, the base models consistently matched or surpassed the RLVR-trained models.

Figure 2: Pass@k curves of base models and their zero-RL-trained counterparts across multiple mathematical benchmarks.

RLVR in Code Generation and Visual Reasoning

Similar findings hold in code generation and visual reasoning: at larger values of k, base models eventually surpassed RLVR-trained models in pass@k, indicating broader reasoning coverage in the base models.

Figure 3: Pass@k curves of base models and zero-RL counterparts. (Left) Code Generation. (Right) Visual Reasoning.

Root Cause Analysis

The findings showed that RLVR does not extend reasoning capability beyond the boundary established by the base model. Perplexity analyses indicated that the reasoning paths exploited by RLVR are already likely under the base model's distribution, suggesting that RLVR primarily reweights the output distribution rather than expanding reasoning ability.

Figure 4: Perplexity distribution of responses from different sources, evaluated by the base and RL models.

Distillation as an Alternative

Distillation, in contrast to RLVR, was found to genuinely expand reasoning boundaries by introducing new knowledge into the model. This suggests that distillation might be a more effective method than RLVR for enhancing reasoning capabilities in LLMs.
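In contrast to RLVR's reward-based reweighting, distillation trains the student directly on the teacher's reasoning traces. A toy sketch of hard-label distillation loss, with a pure-Python softmax over a tiny vocabulary; all names and the two-token setup are hypothetical, not the paper's training code:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(student_logits, teacher_token_ids):
    """Hard-label distillation: average next-token cross-entropy of the
    student's predicted distribution against the teacher's sampled tokens."""
    total = 0.0
    for logits, tok in zip(student_logits, teacher_token_ids):
        probs = softmax(logits)
        total += -math.log(probs[tok])
    return total / len(teacher_token_ids)
```

Because the gradient pushes probability mass onto teacher tokens regardless of whether the base model already favored them, this objective can inject genuinely new reasoning patterns, whereas RLVR can only reinforce paths the base model already samples.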

Conclusion

The research concludes that RLVR, in its current form, does not contribute fundamentally new reasoning abilities to LLMs but merely enhances sampling efficiency within the reasoning scope of base models. This highlights the necessity for a revised understanding of RLVR's role and suggests a need for more advanced paradigms to exceed the reasoning capacities of LLM base models. Future research might consider integrating alternatives like distillation or novel RL strategies that overcome the limitations identified in this study.

Explain it Like I'm 14

What is this paper about?

This paper looks at a popular way to train LLMs called “Reinforcement Learning with Verifiable Rewards” (RLVR). People often think RLVR helps models learn brand‑new ways to reason, especially for math and coding. The authors test that belief and ask a simple question: does RLVR really teach models new reasoning skills, or does it just make them pick good answers more efficiently from what they already know?

What questions are the researchers asking?

They focus on two main questions, explained in everyday language:

  • Does RLVR actually give LLMs new reasoning abilities that go beyond what their original “base” versions can do?
  • If not, then what exactly does RLVR change in the model?

How did they study it?

The authors tested many models, tasks, and training methods across math, coding, and visual reasoning. Their key tool is a simple idea called “pass@k.”

  • What is pass@k? Imagine you give a model up to k tries to solve a problem. If any try is correct, the model “passes” that problem. For example, pass@1 is like one shot. Pass@128 is like giving the model 128 chances and counting it as solved if it gets it right at least once.
  • Why use large k? A single try (pass@1) can underestimate what a model is capable of. With more tries, you test the model’s “upper bound” (its full potential) more fairly. If a base model can eventually find a correct path with enough tries, that means it had the ability all along—it just needed more sampling.
  • Verifiable rewards: In math, the final numeric answer must match the truth. In code, the program must pass unit tests. These automatic checks make RL training scale without human grading.
  • Checking for lucky guesses: The team filtered out easy-to-guess math questions and manually inspected chains of thought (the model’s step-by-step reasoning) to confirm that correct answers came from valid reasoning, not just random luck.
  • Perplexity analysis: Perplexity is a measure of “how surprising” a response is to a model. Low perplexity means the model thinks the response is likely. The team used this to see if RL-trained reasoning paths were already likely under the base model’s distribution.
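The perplexity check described above has a simple closed form: given the per-token log-probabilities a model assigns to a response, perplexity is the exponential of the average negative log-probability. A minimal sketch, assuming natural-log inputs:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity of a response from its per-token natural-log
    probabilities under a model: exp of the mean negative log-prob.
    Lower values mean the model finds the response more likely."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

If the base model assigns low perplexity to the RL-trained model's responses, those reasoning paths were already likely under the base model's distribution, which is the paper's evidence that RLVR reweights rather than invents.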

What did they find, and why does it matter?

Here are the main findings, explained simply:

  • RLVR boosts accuracy when you only allow a few tries. With pass@1 or other small k values, RL‑trained models often beat their base versions. This means RLVR makes the model “aim” better—it samples correct answers more efficiently.
  • But at large k, base models catch up and even surpass RL models. When you allow tens or hundreds of tries, base models solve as many or more problems than RL‑trained ones. In other words, the base model could already solve those problems—it just needed more sampling.
  • RLVR doesn’t create truly new reasoning paths. The reasoning routes the RL‑trained models use are already inside the base models’ “bag of possibilities.” RLVR mainly reweights the model’s output toward rewarded paths, not inventing new ones.
  • RLVR narrows exploration. RL training makes models focus on certain “good” paths, which improves efficiency but reduces how widely they explore. That can shrink the overall set of problems they can solve when you sample a lot (the “reasoning boundary”).
  • This pattern holds across math, coding, and visual reasoning. It’s not just one model or one task—the same trend shows up broadly.
  • Distillation is different and can add new knowledge. Distillation is like a student learning from a stronger teacher’s detailed solutions. The distilled model often shows a wider reasoning boundary than the base model, suggesting it truly learned new patterns, not just reweighted old ones.
  • Different RL algorithms (like PPO, GRPO, RLOO) perform similarly. They all help sampling efficiency a bit, but none are close to the theoretical “upper bound” of what the base model could do with many tries.
  • Longer RL training improves pass@1 but can shrink the high‑k boundary. Over time, the model gets better at quick wins but worse at broad exploration.

Why this matters: Many people hope RLVR will make models learn fundamentally new ways to think. This study suggests RLVR mostly helps models pick better from what they already can do, rather than expanding their real reasoning capabilities.

What does this mean for the future?

  • RLVR alone may not be enough to push reasoning beyond the base model’s limits. It’s helpful for making models more efficient at finding correct answers quickly, but it may not unlock truly new skills.
  • Distillation looks promising for adding new reasoning patterns. Learning from a stronger “teacher” model’s long, worked‑out solutions can genuinely expand a model’s abilities.
  • We might need new training ideas. Since language is a huge “action space” (far bigger than games like Go or Atari), and RLVR starts from strong pretrained priors, exploration can get stuck inside what the base model already knows. Future methods should find ways to safely explore beyond the base model’s prior, or use alternative approaches that better inject new knowledge.

In short: RLVR acts like a coach that makes a student pick their best-known strategies faster. Distillation acts like a teacher who actually teaches new strategies. If we want LLMs to truly think in new ways, we’ll likely need more than just RLVR.
