
Learning to Reason under Off-Policy Guidance

Published 21 Apr 2025 in cs.LG, cs.AI, and cs.CL | (2504.14945v5)

Abstract: Recent advances in large reasoning models (LRMs) demonstrate that sophisticated behaviors such as multi-step reasoning and self-reflection can emerge via reinforcement learning with verifiable rewards (RLVR). However, existing RLVR approaches are inherently "on-policy", limiting learning to a model's own outputs and failing to acquire reasoning abilities beyond its initial capabilities. To address this issue, we introduce LUFFY (Learning to reason Under oFF-policY guidance), a framework that augments RLVR with off-policy reasoning traces. LUFFY dynamically balances imitation and exploration by combining off-policy demonstrations with on-policy rollouts during training. Specifically, LUFFY combines the Mixed-Policy GRPO framework, which has a theoretically guaranteed convergence rate, with policy shaping via regularized importance sampling to avoid superficial and rigid imitation during mixed-policy training. Compared with previous RLVR methods, LUFFY achieves an over +6.4 average gain across six math benchmarks and an advantage of over +6.2 points on out-of-distribution tasks. Most significantly, we show that LUFFY successfully trains weak models in scenarios where on-policy RLVR fails completely. These results provide compelling evidence that LUFFY transcends the fundamental limitations of on-policy RLVR and demonstrates the great potential of off-policy guidance in RLVR.

Summary

  • The paper introduces LUFFY, a framework that significantly improves reasoning in reinforcement learning by integrating off-policy guidance with on-policy rollouts.
  • The methodology employs regularized importance sampling and policy shaping to maintain high exploration while imitating high-quality reasoning patterns.
  • Empirical evaluations on benchmarks like AIME and OlympiadBench show an average performance gain of over +7.0 points compared to traditional zero-RL methods.

Introduction to Off-Policy Guidance in Reinforcement Learning

The paper "Learning to Reason under Off-Policy Guidance" (2504.14945) introduces a novel framework, LUFFY, that enhances the capabilities of large reasoning models (LRMs) by integrating off-policy reasoning traces into reinforcement learning (RL). This method addresses the limitations of traditional on-policy learning by dynamically balancing imitation and exploration, thus promoting the development of generalizable reasoning skills.

The core challenge addressed by LUFFY lies in overcoming the bottleneck of zero reinforcement learning (zero-RL) approaches, which are confined to on-policy rollouts. Because on-policy methods learn solely from the model's self-generated outputs, they cannot acquire reasoning capabilities beyond what the model can already produce. This motivates incorporating external guidance mechanisms to surpass these inherent limits.

LUFFY Framework and Methodology

LUFFY operates by combining off-policy traces with on-policy rollouts within the RL framework. This integrated approach amplifies imitation of high-quality reasoning patterns while preserving the model's exploratory capacity. The framework employs regularized importance sampling, which prevents premature convergence to suboptimal solutions by dynamically emphasizing low-probability, yet critical, actions (Figure 1).

Figure 1: LUFFY integrates off-policy reasoning traces into reinforcement learning by combining them with on-policy rollouts. Policy shaping emphasizes low-probability but crucial actions, enabling a balance between imitation and exploration for more generalizable reasoning.
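To illustrate how regularized importance sampling can emphasize low-probability but crucial actions, the sketch below assumes a shaping function of the form f(x) = x / (x + γ) applied to token probabilities (the specific functional form and the value of γ are illustrative assumptions, not necessarily the paper's exact implementation). The key property is that the gradient of f, γ / (x + γ)², is largest when x is small, so rarely sampled tokens from the off-policy trace retain a strong learning signal:

```python
import torch

def policy_shaping(probs: torch.Tensor, gamma: float = 0.1) -> torch.Tensor:
    """Regularized importance weight f(x) = x / (x + gamma).

    Unlike the raw probability x, whose gradient w.r.t. x is constant, f has
    gradient gamma / (x + gamma)^2 -- largest for low-probability tokens --
    so crucial but rarely sampled actions are not drowned out during training.
    (gamma = 0.1 is a hypothetical value chosen for illustration.)
    """
    return probs / (probs + gamma)

# Current-policy probabilities of three tokens from an off-policy trace.
probs = torch.tensor([0.01, 0.5, 0.99], requires_grad=True)
weights = policy_shaping(probs)
weights.sum().backward()
# probs.grad is largest for the 0.01-probability token.
```

Running this confirms the gradient ordering: the lowest-probability token receives the largest gradient, which is what discourages the policy from merely copying tokens it already finds likely.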

LUFFY implements policy shaping to maintain entropy and facilitate exploration, which is essential to avoiding superficial imitation. Mixed-policy advantages are computed so that both on-policy and off-policy rollouts are leveraged within a single learning objective, resulting in a robust learning mechanism.
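A minimal sketch of the mixed-policy advantage computation, assuming a GRPO-style group normalization in which on-policy rollouts and off-policy demonstrations for the same prompt are pooled into one group (the exact pooling and normalization details are assumptions for illustration):

```python
import torch

def mixed_policy_advantages(on_rewards: torch.Tensor, off_rewards: torch.Tensor):
    """GRPO-style advantages over a mixed group of rollouts.

    On-policy rollouts and off-policy demonstrations for the same prompt are
    pooled into one group; each rollout's advantage is its reward minus the
    group mean, scaled by the group standard deviation (a common GRPO-style
    normalization; this pooling scheme is an illustrative assumption).
    """
    rewards = torch.cat([on_rewards, off_rewards])
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    n_on = on_rewards.numel()
    return adv[:n_on], adv[n_on:]
```

Under this normalization, a correct off-policy demonstration receives a positive advantage whenever the on-policy rollouts underperform it, which is precisely the regime where imitation should dominate.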

Empirical Evaluation and Results

Extensive empirical evaluations underscore LUFFY's superior performance across multiple benchmarks, including AIME 2024, AMC, and OlympiadBench. LUFFY consistently demonstrates significant improvements over baseline zero-RL methods, evidenced by an average performance gain of over +7.0 points across six competition-level benchmarks (Figure 2).

Figure 2: Overall performance across six competition-level benchmarks (AIME 2024, AIME 2025, AMC, MATH-500, Minerva Math, and OlympiadBench). LUFFY achieves an average score of 49.6, delivering a substantial performance gain of over +7.0 points on average compared to existing zero reinforcement learning methods.

LUFFY also shows remarkable generalization capabilities on out-of-distribution tasks, outperforming approaches based on imitation and on-policy RL by over +6.2 points. The results highlight LUFFY's ability to imitate high-quality reasoning patterns while maintaining explorative capabilities, showcasing its potential in addressing complex reasoning challenges.

Training Dynamics and Exploration

The training dynamics reveal that LUFFY maintains a higher level of entropy throughout the RL process than purely on-policy methods, thereby sustaining a significant degree of exploration. Policy shaping through regularized importance sampling plays a pivotal role in maintaining this exploration, as depicted in the entropy dynamics over training iterations (Figure 3).

Figure 3: Training dynamics of LUFFY compared with on-policy RL. Left: outcome training rewards; Middle: generation length; Right: generation entropy.
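The generation entropy tracked in Figure 3 can be computed from the policy's output distribution at each decoding step; the sketch below shows one standard way to do this (the exact quantity the authors plot may differ):

```python
import torch

def mean_token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Average per-token entropy of the policy's output distribution.

    `logits` has shape (seq_len, vocab_size). High values during RL training
    indicate the policy is still exploring rather than collapsing onto a few
    high-probability continuations.
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # per-token entropy
    return entropy.mean()
```

For a uniform distribution over a vocabulary of size V this returns log V (the maximum), while a near-deterministic distribution returns a value close to zero, so the metric directly measures how much exploration remains.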

Conclusion

In conclusion, LUFFY represents a significant advance in integrating off-policy reasoning traces into RL for enhanced reasoning capabilities. By balancing imitation with exploration, LUFFY surpasses the traditional limitations of on-policy learning in LRMs. Future research may extend this framework to broader domains and further refine the policy shaping mechanism to better balance exploration under off-policy guidance.
