
Adaptive Guidance Accelerates Reinforcement Learning of Reasoning Models

Published 16 Jun 2025 in cs.LG, cs.AI, and cs.CL | (2506.13923v2)

Abstract: We study the process through which reasoning models trained with reinforcement learning on verifiable rewards (RLVR) can learn to solve new problems. We find that RLVR drives performance in two main ways: (1) by compressing pass@$k$ into pass@1 and (2) via "capability gain" in which models learn to solve new problems that they previously could not solve even at high $k$. We find that while capability gain exists across model scales, learning to solve new problems is primarily driven through self-distillation. We demonstrate these findings across model scales ranging from 0.5B to 72B parameters on >500,000 reasoning problems with prompts and verifiable final answers across math, science, and code domains. We further show that we can significantly improve pass@$k$ rates by leveraging natural language guidance for the model to consider within context while still requiring the model to derive a solution chain from scratch. Based on these insights, we derive $\text{Guide}$ -- a new class of online training algorithms. $\text{Guide}$ adaptively incorporates hints into the model's context on problems for which all rollouts were initially incorrect and adjusts the importance sampling ratio for the "off-policy" trajectories in order to optimize the policy for contexts in which the hints are no longer present. We describe variants of $\text{Guide}$ for GRPO and PPO and empirically show that Guide-GRPO on 7B and 32B parameter models improves generalization over its vanilla counterpart with up to 4$\%$ macro-average improvement across math benchmarks. We include careful ablations to analyze $\text{Guide}$'s components and theoretically analyze Guide's learning efficiency.

Summary

  • The paper introduces the Guide algorithm that adaptively integrates hints to significantly improve pass@k rates in reasoning tasks.
  • It demonstrates that self-distillation, combined with capability gain, converts guided rollouts into unguided successes during testing.
  • Guide-GRPO achieves up to 4% macro-average improvement and scales effectively to larger models and extended context lengths.


The paper "Adaptive Guidance Accelerates Reinforcement Learning of Reasoning Models" explores reinforcement learning with verifiable rewards (RLVR) as a way to enhance reasoning models across math, science, and coding benchmarks. It introduces a new class of training algorithms, Guide, which adaptively incorporates hints into the model's context, thereby optimizing the learning process during RLVR. By leveraging both self-distillation and capability gain, this approach significantly improves pass@k rates.

Reinforcement Learning with Verifiable Rewards (RLVR)

Self-Distillation vs. Capability Gain

The authors identify two major mechanisms driving improvements in reasoning models trained with RLVR:

  • Self-Distillation: Compresses pass@k into pass@1 by shifting probability mass toward answers already reachable within multiple attempts. The majority of RLVR improvements come from self-distillation, especially in medium-sized models.
  • Capability Gain: Models learn to solve new problems that were previously unsolvable even at high k. Capability gain exists across model scales but is less dominant than self-distillation. Figure 1

    Figure 1: Capability gain (left), self-distillation (middle), and combined progress (capability gain + self-distillation; right) across training steps on all test sets.
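The distinction between these two mechanisms rests on how pass@k is measured. A minimal sketch using the standard unbiased pass@k estimator (a common convention for evaluating reasoning models, not a formula specific to this paper) illustrates why "compressing pass@k into pass@1" matters:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples, drawn without replacement from n rollouts of which c are
    correct, solves the problem."""
    if n - c < k:
        # Every size-k subset must contain at least one correct rollout.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# A problem solved in 8 of 64 rollouts has pass@1 of only 0.125,
# but pass@16 above 0.9 -- self-distillation moves probability mass
# so that the first attempt succeeds more often.
print(pass_at_k(64, 8, 1))   # 0.125
```

Capability gain, by contrast, turns a problem with c = 0 (pass@k of zero for every k) into one with c > 0, which no amount of compression alone can achieve.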

Guide Algorithm

The Guide algorithm provides dynamic and in-context guidance specifically on prompts that were entirely unsolved. It strategically adjusts the policy training using guided trajectories, enabling models to discover solutions they couldn't naturally achieve through naive sampling. The algorithm incorporates off-policy importance corrections to ensure stability:

  • GRPO and PPO Variants: Guide extends GRPO and PPO by incorporating guided rollouts only when prompts fail all attempts, applying importance sampling for off-policy correction. Figure 2

    Figure 2: Impacts of guidance on correct rollouts. Guidance vs. no-guidance pass@k performance on Qwen-2.5-Math-7B.
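The hint-injection rule can be sketched as follows. This is a simplified illustration of the selection logic described above, with a hypothetical interface (the `policy_sample` and `verify` callables are assumptions, not the paper's actual API):

```python
def guide_rollouts(prompt, hint, policy_sample, verify, n=8):
    """Sketch of Guide's adaptive hint injection: sample n rollouts on the
    bare prompt; only if all of them fail, re-sample with the hint in
    context and flag the batch as guided (off-policy), so the training
    loss can apply an importance-sampling correction."""
    rollouts = [policy_sample(prompt) for _ in range(n)]
    rewards = [verify(r) for r in rollouts]
    guided = False
    if not any(rewards):  # all rollouts incorrect -> inject the hint
        guided_prompt = f"{prompt}\nHint: {hint}"
        rollouts = [policy_sample(guided_prompt) for _ in range(n)]
        rewards = [verify(r) for r in rollouts]
        guided = True
    return rollouts, rewards, guided

# Toy demo: a "policy" that only answers correctly when the hint is present.
sample = lambda p: "42" if "Hint" in p else "7"
check = lambda ans: ans == "42"
_, rewards, guided = guide_rollouts("What is 6*7?", "multiply 6 by 7", sample, check)
print(guided, all(rewards))  # True True
```

The key design choice is that hints are injected only on prompts where every rollout failed; prompts the model can already solve continue to train fully on-policy.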

Experimental Results

Guide-GRPO Effectiveness

Experiments demonstrate the efficacy of Guide-GRPO, achieving up to 4% macro-average improvement across math benchmarks compared to vanilla methods:

  • Improved Pass@k Rates: Guide-GRPO enhances pass@k performance by converting guided rollouts into unguided success without relying on guidance during test time.
  • Training Dynamics: Guide-GRPO maintains higher entropy and achieves longer token generation, indicating exploration retention while achieving higher correctness. Figure 3

    Figure 3: Comparison of Guide-GRPO with baseline methods across training steps. Guide-GRPO ultimately outperforms baselines with better solution diversity.
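The entropy tracked in these training dynamics is the mean per-token entropy of the policy's next-token distributions; higher values indicate the policy has not collapsed onto a narrow set of solutions. A minimal sketch of this metric (the paper's exact logging procedure is an assumption here):

```python
from math import log

def mean_token_entropy(prob_dists):
    """Mean Shannon entropy (in nats) over a batch of next-token
    probability distributions; a standard proxy for how much
    exploration the policy retains during RL training."""
    total = 0.0
    for probs in prob_dists:
        total += -sum(p * log(p) for p in probs if p > 0)
    return total / len(prob_dists)

# Uniform over 4 tokens -> maximal entropy log(4); one-hot -> 0.
print(mean_token_entropy([[0.25, 0.25, 0.25, 0.25]]))  # ~1.386
print(mean_token_entropy([[1.0, 0.0, 0.0, 0.0]]))      # 0.0
```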

Scaling with Model Size

Guide-GRPO's consistent improvement across larger models (32B) and context lengths (8K tokens) underscores its scalability potential for future AI systems. Results reveal substantial gains in pass@1 metrics, showcasing enhanced reasoning capabilities even in larger models: Figure 4

Figure 4: Comparison of train-time rewards under different policy loss computation with guided trajectories.
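The "policy loss computation with guided trajectories" compared in Figure 4 hinges on the off-policy correction. A sketch of one plausible form, following standard PPO-style clipping (the exact objective is an assumption; per the abstract, the ratio is adjusted so the policy is optimized for contexts where the hint is no longer present):

```python
from math import exp

def guided_policy_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped importance-weighted loss for guided (off-policy) tokens.
    logp_old: token log-probs under the behavior policy *with* the hint
    in context; logp_new: log-probs of the same tokens evaluated on the
    hint-free context. The ratio reweights guided samples toward the
    unguided deployment distribution."""
    losses = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = exp(ln - lo)
        clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps)
        # Pessimistic min, as in the PPO surrogate objective.
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)

# When new and old log-probs agree, the ratio is 1 and the loss
# reduces to the plain policy-gradient surrogate.
print(guided_policy_loss([0.0, 0.0], [0.0, 0.0], [1.0, -1.0]))  # 0.0
```

Clipping bounds how far a single guided trajectory can push the update, which is one way to keep training stable when hinted rollouts differ sharply from the unguided policy.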

Conclusion

The paper argues that adaptive guidance substantially accelerates learning under RLVR by making otherwise unreachable solutions accessible, which in turn expands the pool of trajectories available for self-distillation. Guide-GRPO thus points toward more sample-efficient training methodologies. Because the improved reasoning skills persist without guidance at test time, Guide-GRPO models hold a distinct advantage, with prospective applications across diverse domains including mathematics and theoretical problem solving.

Future Directions

Future work should explore more personalized guidance strategies and extend Guide's methodology to other domains, such as robotics and agent-based simulations, to verify its generality. Continued scaling studies are also needed to understand Guide's utility across different compute budgets and learning setups. The open-sourced framework makes the method accessible for broader experimentation and collaboration.
