
Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning

Published 10 Feb 2025 in cs.CL and cs.LG (arXiv:2502.06781v1)

Abstract: Reasoning abilities, especially those for solving complex math problems, are crucial components of general intelligence. Recent advances by proprietary companies, such as the o-series models of OpenAI, have made remarkable progress on reasoning tasks. However, the complete technical details remain unrevealed, and the only techniques widely believed to be in use are reinforcement learning (RL) and long chains of thought. This paper proposes a new RL framework, termed OREAL, to pursue the performance limit that can be achieved through Outcome REwArd-based reinforcement Learning for mathematical reasoning tasks, where only binary outcome rewards are easily accessible. We theoretically prove that behavior cloning on positive trajectories from best-of-N (BoN) sampling is sufficient to learn the KL-regularized optimal policy in binary feedback environments. This formulation further implies that the rewards of negative samples should be reshaped to ensure gradient consistency between positive and negative samples. To alleviate the long-standing difficulty of sparse rewards in RL, which is exacerbated by the partial correctness of long chains of thought in reasoning tasks, we further apply a token-level reward model to sample important tokens in reasoning trajectories for learning. With OREAL, for the first time, a 7B model can obtain 94.0 pass@1 accuracy on MATH-500 through RL, on par with 32B models. OREAL-32B also surpasses previous 32B models trained by distillation, with 95.0 pass@1 accuracy on MATH-500. Our investigation also indicates the importance of initial policy models and training queries for RL. Code, models, and data will be released to benefit future research (https://github.com/InternLM/OREAL).

Summary

  • The paper presents the OREAL framework that uses outcome reward-based reinforcement learning with Best-of-N sampling to optimize mathematical reasoning.
  • It introduces reward reshaping and token-level credit assignment to manage sparse, binary feedback and focus on key reasoning steps.
  • Experiments demonstrate that a 7B model achieves 94.0 pass@1 on MATH-500, while OREAL-32B sets a new state-of-the-art with 95.0 pass@1 accuracy.

This paper proposes a framework named OREAL for enhancing mathematical reasoning in LLMs via outcome reward-based reinforcement learning (RL). It builds on recent advances in LLMs and addresses the challenges posed by the sparse, binary feedback typical of mathematical reasoning tasks.

Reinforcement Learning Framework for Mathematical Reasoning

OREAL Framework

OREAL stands for Outcome REwArd-based reinforcement Learning. It targets mathematical reasoning tasks, which are often hindered by the sparsity and binary nature of feedback when only correct or incorrect outcomes are known. OREAL introduces several enhancements to tackle these challenges effectively.

  1. Behavior Cloning from Best-of-N Sampling: The paper proves that behavior cloning on positive trajectories, selected via Best-of-N (BoN) sampling, is sufficient to learn the KL-regularized optimal policy under binary feedback. Because the distribution of a BoN-selected positive trajectory is invariant to the number of samples N in this setting, policy optimization remains efficient and stable.
  2. Reward Reshaping for Negative Samples: To maintain gradient consistency between positive and negative samples, a reward-shaping mechanism is introduced. This ensures consistent updates by compensating for BoN's undersampling of negative gradients, thus allowing effective learning from failed attempts.
  3. Token-Level Credit Assignment: Addressing the sparse rewards in long reasoning chains, OREAL utilizes a token-level reward model for inferring step-wise importance in reasoning trajectories. This model captures the contribution of individual tokens towards the final output, facilitating a more focused learning process.

    Figure 1: Overall performance between OREAL-32B and some competitive baselines.
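The three components above can be sketched in a few lines of pure Python. This is a minimal illustration, not the paper's implementation: the negative-reward scaling formula and the token-weight source are hypothetical stand-ins for OREAL's reward reshaping and token-level reward model.

```python
def best_of_n(sample_fn, verify_fn, n=16):
    """Draw up to n candidate solutions and return the first verified-correct
    one (a 'positive' trajectory for behavior cloning), or None if all fail."""
    for _ in range(n):
        candidate = sample_fn()
        if verify_fn(candidate):
            return candidate
    return None

def shaped_negative_reward(p_correct):
    """Illustrative reward reshaping for failed trajectories: scale the
    negative reward by the success rate p_correct so positive and negative
    gradient contributions stay balanced (hypothetical scaling form)."""
    return -p_correct / max(1.0 - p_correct, 1e-6)

def token_weighted_bc_loss(token_logprobs, token_weights):
    """Behavior-cloning loss over a positive trajectory, weighting each
    token's negative log-likelihood by a score from a (hypothetical)
    token-level reward model, so important reasoning steps dominate."""
    weighted = sum(w * -lp for w, lp in zip(token_weights, token_logprobs))
    return weighted / max(sum(token_weights), 1e-6)
```

In practice `sample_fn` would decode from the policy model and `verify_fn` would be the outcome verifier; the shaped negative reward and the token weights would then feed a policy-gradient update alongside the behavior-cloning term.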

Performance and Implementation

Experimental Results

OREAL demonstrates significant improvements in mathematical reasoning capabilities across different model scales. In a notable achievement, a 7B model trained with OREAL reaches 94.0 pass@1 accuracy on the MATH-500 benchmark, a performance previously attainable only by 32B models. The model OREAL-32B sets a new state-of-the-art with 95.0 pass@1 accuracy, surpassing all previously reported results for this model size.
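For clarity on the metric: pass@1 is simply the percentage of benchmark problems whose single generated answer is judged correct by the verifier, as in this small sketch (the function name is ours):

```python
def pass_at_1(binary_outcomes):
    """pass@1 as a percentage: fraction of problems whose single sampled
    answer received a binary outcome reward of 1."""
    return 100.0 * sum(binary_outcomes) / len(binary_outcomes)
```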

Implementation Details

  • Policy Initialization: The framework initializes policy models using fine-tuned base models (specifically the Qwen2.5 series) to ensure a robust starting point for RL.
  • Reinforcement Learning Process: During RL, a combination of model-based and rule-based verifiers provides binary outcome rewards. These verifiers assess the correctness of output sequences, supporting the training of both the policy and token-level reward models.
  • Skill Enhancement Strategy: Recognizing persistent errors in certain problem types, a skill-based enhancement strategy supplements the RL process by focusing on underrepresented skills in the training data, thus aiding in overall model performance.

    Figure 2: Average test accuracy of 7B models across different training steps.
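A rule-based verifier of the kind mentioned above can be as simple as exact-answer comparison. The sketch below is a toy stand-in (the paper pairs rule-based checks with a model-based verifier; this covers only the numeric-answer case):

```python
import re
from fractions import Fraction

def rule_based_verify(prediction: str, reference: str) -> int:
    """Toy rule-based verifier: extract the last number (integer, decimal,
    or fraction) from each string, compare as exact rationals, and emit a
    binary outcome reward (1 = correct, 0 = incorrect)."""
    def final_number(text):
        nums = re.findall(r"-?\d+(?:/\d+)?(?:\.\d+)?", text)
        return Fraction(nums[-1]) if nums else None

    pred, ref = final_number(prediction), final_number(reference)
    return int(pred is not None and pred == ref)
```

Exact-rational comparison lets superficially different answers (e.g. "1/2" vs. "0.5") match; a production verifier would also normalize symbolic expressions and fall back to a model-based judge for free-form answers.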

Theoretical and Practical Implications

Theoretical Insights

OREAL's reliance on theoretical insights into BoN sampling and reward reshaping offers robust guidelines for leveraging binary outcome feedback effectively. By ensuring gradient consistency and optimizing policy through token-level insights, OREAL provides a scalable approach to mathematical reasoning tasks.
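The KL-regularized setup the section refers to has a well-known closed form, which makes the behavior-cloning result plausible; the following is the standard derivation sketch, with symbols of our choosing:

```latex
% KL-regularized RL objective over responses y to prompt x:
\max_{\pi}\;\; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\!\left[ r(x, y) \right]
  \;-\; \beta\, D_{\mathrm{KL}}\!\left( \pi(\cdot \mid x) \,\Vert\, \pi_{\mathrm{ref}}(\cdot \mid x) \right)

% Its closed-form optimum (standard result):
\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\; \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\left( r(x, y) / \beta \right)

% With binary outcome reward r \in \{0, 1\}, correct trajectories are
% up-weighted by a constant factor e^{1/\beta}, so imitating verified-correct
% (BoN-selected) samples moves the policy toward \pi^{*}.
```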

Practical Implications

The success of OREAL is a testament to the importance of outcome-based RL in improving LLM performance without additional distillation. These advancements have far-reaching implications for developing advanced AI systems capable of complex reasoning tasks.

Figure 3: Token-level reward model score visualization for a correct response.

Conclusion

OREAL represents a significant step forward in applying reinforcement learning to mathematical reasoning in LLMs. By combining theoretical rigor with practical application, OREAL advances the state of the art in reasoning tasks and sets a new performance benchmark for both 7B and 32B models. Future work will likely focus on refining the underlying data and policy models to further enhance RL's efficacy in reasoning tasks.
