ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

Published 30 May 2025 in cs.CL and cs.AI | (2505.24864v1)

Abstract: Recent advances in reasoning-centric LLMs have highlighted reinforcement learning (RL) as a promising method for aligning models with verifiable rewards. However, it remains contentious whether RL truly expands a model's reasoning capabilities or merely amplifies high-reward outputs already latent in the base model's distribution, and whether continually scaling up RL compute reliably leads to improved reasoning performance. In this work, we challenge prevailing assumptions by demonstrating that prolonged RL (ProRL) training can uncover novel reasoning strategies that are inaccessible to base models, even under extensive sampling. We introduce ProRL, a novel training methodology that incorporates KL divergence control, reference policy resetting, and a diverse suite of tasks. Our empirical analysis reveals that RL-trained models consistently outperform base models across a wide range of pass@k evaluations, including scenarios where base models fail entirely regardless of the number of attempts. We further show that reasoning boundary improvements correlates strongly with task competence of base model and training duration, suggesting that RL can explore and populate new regions of solution space over time. These findings offer new insights into the conditions under which RL meaningfully expands reasoning boundaries in LLMs and establish a foundation for future work on long-horizon RL for reasoning. We release model weights to support further research: https://huggingface.co/nvidia/Nemotron-Research-Reasoning-Qwen-1.5B

Abstract PDF Upgrade to Chat

Summary

The paper's main contribution is the introduction of ProRL, which extends reinforcement learning training to unlock novel reasoning pathways in LLMs.
It combines strategies like KL divergence control, reference policy resetting, and dynamic entropy management to overcome short-term sampling limitations.
Results show significant pass@1 improvements across diverse tasks, highlighting enhanced performance in math, code, STEM, and logic puzzles.

Expanding Reasoning Boundaries in LLMs: The Role of ProRL

The paper "ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in LLMs" introduces a methodology - Prolonged Reinforcement Learning (ProRL) - capable of enhancing reasoning capabilities in LLMs far beyond the sampling efficiency improvements typically achieved. ProRL combines techniques such as KL divergence control, reference policy resetting, and diverse task management to explore new solution spaces and achieve novel reasoning strategies that base models cannot access.

Methodological Innovations in ProRL

ProRL distinguishes itself by focusing on sustained reinforcement learning training that goes beyond short-term superficial gains. The methodology leverages existing RL algorithms like GRPO, optimized through additional strategies to mitigate entropy collapse and stabilize long-horizon training:

Entropy Management: To combat premature entropy collapse, which restricts exploration, ProRL employs decoupled clipping and dynamic sampling from the DAPO algorithm. These techniques encourage broader exploration and maintain output diversity, critical for effective policy updates.
KL Divergence with Resetting: Introducing a KL divergence penalty maintains entropy while ensuring stable divergence of the training policy from the reference. Periodically resetting the reference policy helps prevent stagnation, facilitating continuous model improvement over extended training durations.

These elements collectively ensure that ProRL can manage the computational complexities associated with prolonged training while achieving meaningful advancements in reasoning capabilities.

Figure 1: Benefits of prolonged reinforcement learning (ProRL). Left: Pass@1 and Pass@16 scales with ProRL training. Middle: ProRL leads to more novel solutions reflected by higher Creativity Index. Right: The model greatly surpasses the base model across diverse tasks.

Implementation of Nemotron-Research-Reasoning-Qwen-1.5B

The practical application of ProRL is demonstrated through the Nemotron-Research-Reasoning-Qwen-1.5B model, which is trained using extensive RL over diverse problem domains including math, code, STEM, logic puzzles, and instruction following:

Training Setup: Utilizing computational resources from NVIDIA-H100-80GB nodes, the training runs for 16k GPU hours, significantly enhancing the model's proficiency in reasoning tasks compared to its base model and achieving comparable results to larger models.
Task Diversity: An expansive dataset covering five task domains with clear reward signals is used, promoting generalization and presenting rigorous RL algorithm evaluations across fundamentally different environments.

The results indicate substantial performance gains with average pass@1 improvements ranging from 14.7% in mathematical reasoning to 54.8% in logic puzzles, showcasing ProRL’s effectiveness in broadening reasoning boundaries.

Evaluating The Expanded Reasoning Boundaries

ProRL’s true impact comes through its capacity for reasoning expansions, particularly evident in domains where the base model struggles initially:

Performance Metrics: Different evaluation benchmarks reveal three regimes of model performance: Diminish, Plateau, and Sustained gains. ProRL shows sustained performance improvements with prolonged training cycles, notably in complex domains such as code generation, where exploration over diverse instances is crucial.
Discovering Novel Solutions: ProRL enables the emergence of genuinely new reasoning trajectories, evidenced by substantial shifts in reasoning accuracy distributions and the ability to generalize to out-of-domain tasks (Figure 2). $Figure 2$

Figure 2: Pass@k comparison of the base model, an intermediate checkpoint, and the final RL-trained models. Trends are grouped into three regimes: Diminish, Plateau, and Sustained.

Broader Implications and Future Directions

This research dramatically challenges existing assumptions about RL's limitations in developing reasoning capabilities within LLMs. ProRL underscores the potential for sustained RL training to facilitate new pathways in reasoning, offering a foundation for future research into long-horizon RL methodologies. The findings suggest the necessity for models to explore extensively across diverse tasks to develop generalized reasoning capabilities.

In conclusion, ProRL establishes a clear precedent for future exploration in sustained reinforcement learning strategies, providing pathways for developing sophisticated and adaptable AI systems that can solve more complex and diverse reasoning challenges efficiently.

Markdown