
RL for Reasoning by Adaptively Revealing Rationales

Published 22 Jun 2025 in cs.LG and cs.AI | (2506.18110v1)

Abstract: We propose that reinforcement learning (RL) from partial expert demonstrations is not merely a training heuristic, but a promising framework for solving complex sequence generation tasks. Supervised fine-tuning (SFT) relies on dense ground-truth labels, which become increasingly costly as sequence length grows. RL, on the other hand, struggles with sparse rewards and a combinatorially large output space. We address this by introducing adaptive backtracking (AdaBack), a per-sample curriculum learning algorithm that reveals only a partial prefix of the target output during training. The supervision length is adjusted dynamically for each sample based on the model's past reward signal, allowing it to incrementally learn to complete reasoning chains by conditioning on correct partial solutions. We investigate this intermediate regime between SFT and RL and argue that per-sample curriculum learning is more than a trade-off between efficiency and generality; it can succeed in tasks with long sequences of latent dependencies where SFT and RL both fail to generalize. Using a synthetic task with latent parity constraints, we show that our adaptive curriculum over partial answers reliably solves problems that are otherwise intractable. On mathematical reasoning benchmarks (MATH, GSM8k), we find that curriculum learning enables models to solve problems that RL alone cannot, acquiring new reasoning capabilities through incremental exposure to partial solutions.

Summary

  • The paper introduces AdaBack, a per-sample adaptive curriculum mechanism that adjusts revealed rationales based on reward signals.
  • It bridges the gap between supervised fine-tuning and RL for multi-step reasoning tasks, improving sample efficiency and generalization.
  • Empirical results on synthetic and real-world benchmarks show AdaBack outperforms traditional RL and SFT+RL pipelines in accuracy and diversity.

Adaptive Backtracking for Curriculum RL in Reasoning Tasks

"RL for Reasoning by Adaptively Revealing Rationales" (2506.18110) presents a thorough study of reinforcement learning (RL) for sequence modeling tasks that involve complex reasoning. It focuses on the intractability of sparse-reward RL in structured domains and introduces a per-sample curriculum method, adaptive backtracking (AdaBack), that dynamically adjusts the supervision provided during training. The central contributions are both empirical and conceptual, with strong evidence that AdaBack bridges the gap between supervised fine-tuning (SFT) and RL and, in specific regimes, outperforms both.

Motivation and Problem Framing

Sequence models trained for multi-step reasoning—such as mathematical problem solving, code generation, and algorithmic tasks—face critical challenges:

  • Supervised fine-tuning (SFT) is highly sample-inefficient for long reasoning chains, as collecting ground-truth rationales is costly and the model may overfit to dataset artifacts or fail to generalize beyond exposure.
  • RL methods (e.g., REINFORCE, PPO, GRPO) struggle with combinatorially large output spaces and sparse binary feedback, and tend to reinforce only reasoning paths that the pretrained model already assigns non-negligible likelihood.

Recent literature has highlighted that RL applied directly to sequence models rarely expands their solution space, and primarily reweights known responses rather than eliciting new reasoning capabilities. AdaBack directly addresses this by positioning itself in the intermediate regime: it relaxes the requirements for dense supervision, but avoids naïve end-to-end RL's unproductive exploration.

Method: Adaptive Backtracking (AdaBack)

AdaBack implements a per-example adaptive curriculum: at each training step, only a prefix of the reference rationale is revealed, and the suffix is generated by the model. The length of the revealed prefix is dynamically adjusted for each example, based on recent reward signals. When the model's reward on a given example falls below a threshold, the supervision ratio is increased for that sample; when it exceeds the threshold, supervision is reduced, exposing the model to a longer unassisted completion.
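The prefix-reveal step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `reveal_prefix` and the word-level tokenization are assumptions for clarity.

```python
import math

def reveal_prefix(rationale_tokens, ratio):
    """Return the first `ratio` fraction of the reference rationale as
    supervision; the model must generate the remaining suffix itself."""
    cut = math.ceil(ratio * len(rationale_tokens))
    return rationale_tokens[:cut]

# Hypothetical reference rationale, tokenized at the word level for illustration.
rationale = "step1 step2 step3 step4 step5 step6 step7 step8".split()
prefix = reveal_prefix(rationale, 0.5)
# The training prompt is the question plus this prefix; the model completes
# the rest, and the reward is computed on the full resulting solution.
```

At ratio 1.0 this degenerates to pure SFT-style teacher forcing; at ratio 0.0 it is standard sparse-reward RL, which locates AdaBack on the continuum between the two.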

Technical summary:

  • For each training instance $i$, maintain an interval $[\rho_{\text{min}}^{(i)}, \rho_{\text{max}}^{(i)}]$ defining the allowed supervision ratios.
  • At each step, sample $\rho^{(i)} \sim \text{Uniform}(\rho_{\text{min}}^{(i)}, \rho_{\text{max}}^{(i)})$, reveal that portion of the rationale, and condition model generation on the partial answer.
  • Adjust intervals via a threshold $\tau$ on recent average reward: increase supervision if underperforming, otherwise reduce. This operates as a stochastic binary search.
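The interval update in the last bullet can be sketched as a stochastic binary search over supervision ratios. This is a hedged sketch: the paper specifies the threshold mechanism, but the exact halving rule and the names `sample_ratio` and `update_interval` are assumptions.

```python
import random

def sample_ratio(lo, hi):
    """Sample a supervision ratio uniformly from the per-example interval."""
    return random.uniform(lo, hi)

def update_interval(lo, hi, avg_reward, tau):
    """Binary-search-style interval update (an assumed concrete form):
    below-threshold reward shifts the interval up (more supervision revealed);
    above-threshold reward shifts it down (longer unassisted completion)."""
    mid = (lo + hi) / 2.0
    if avg_reward < tau:
        return mid, hi   # underperforming: reveal more of the rationale
    return lo, mid       # succeeding: reveal less, extend the generated suffix
```

Because the interval halves on each adjustment, the per-sample supervision ratio converges quickly toward the frontier of what the model can currently complete on its own.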

Unlike previous attempts at curriculum RL (e.g., backtracking or static slicing in R3), AdaBack neither requires explicit step boundaries nor handcrafted schedules, making it generalizable to sequence tasks without delimiter cues.

Theoretical and Empirical Claims

A synthetic chain-of-parities task is used to analytically and empirically separate AdaBack from SFT, RL, and their combination. This task—modeled on stepwise parity functions—yields exponentially sparse rewards, making exploration by RL infeasible and requiring sample complexities for SFT that grow super-polynomially with sequence length. AdaBack, by contrast, solves the task efficiently by revealing all but one step early in the curriculum, then iteratively backtracking supervision as competence improves.
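A minimal rendering of such a parity-chain task follows. The exact construction in the paper may differ; `chain_of_parities` here simply emits the running XOR at each step, which is enough to show why the reward is exponentially sparse.

```python
import random

def chain_of_parities(bits):
    """Generate a stepwise parity chain for a binary input: step t holds the
    XOR of bits[0..t]. Each step depends on the previous one, so the chain
    has long-range latent dependencies (an illustrative construction)."""
    chain, parity = [], 0
    for b in bits:
        parity ^= b
        chain.append(parity)
    return chain

bits = [random.randint(0, 1) for _ in range(16)]
chain = chain_of_parities(bits)
# A binary outcome reward pays 1 only if the entire chain is exactly right,
# so a uniform random guesser over the 16 chain bits succeeds with
# probability about 2**-16 — far too sparse for naive RL exploration.
```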

Empirical results show:

  • For $L=16$ binary inputs with $n=1024$ training examples, SFT and RL (including R3) do not learn; AdaBack achieves high reward in under 700 iterations (see Figure 1 in the paper).
  • The mechanism enables gradient-based learners to focus on lower-complexity subproblems at each stage, leveraging incremental exposure for tractable learning.

Experiments on Real-world Reasoning Datasets

Assessments on MATH and GSM8k benchmarks, along with modified variants (Base-7 numeric, Tensor-2 concatenation), demonstrate that:

  • AdaBack consistently exceeds standard RL and SFT+RL pipelines in test accuracy, particularly in settings that stress generalization (e.g., symbolic domain shift, long reasoning chains).
  • For example, on Tensor-2 GSM8k, AdaBack achieves 8.5% (1B) and 49.2% (3B) accuracy, compared to 0.0% with RL alone and 6.9% (1B) and 42.7% (3B) with SFT+RL.
  • In many cases, AdaBack applied to a base model matches or surpasses SFT-initialized models trained with RL, indicating AdaBack supports the emergence of new reasoning trajectories beyond those learned by SFT.
  • Pass@k (diversity metric) improvements under AdaBack support the claim that model output distributions are genuinely broadened, not merely reweighted.
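Pass@k is typically computed with the standard unbiased estimator (Chen et al., 2021): the probability that at least one of k samples, drawn without replacement from n generations of which c are correct, passes. The paper does not specify its estimator, so using this form here is an assumption.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k), the chance that a random
    size-k subset of n generations contains at least one of the c correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill a k-subset
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Higher pass@k at fixed pass@1 indicates a genuinely broader output distribution, which is the sense in which the bullet above distinguishes broadening from mere reweighting.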

Practical Implementation Considerations

For practical deployment in real-world reasoning tasks:

  • AdaBack can be integrated atop any RL pipeline with multi-rollout support (e.g., GRPO, PPO) and does not require step boundary annotations.
  • Reward models must be able to verify correctness given partial responses. For many domains, this is satisfied via evaluators checking for correctness or answer format.
  • AdaBack is robust in low-supervision regimes, making it suitable for settings where full rationales are unavailable or strategies must generalize across symbolic domains.
  • Computationally, the adaptive binary search over supervision ratios is efficient and imposes little overhead beyond standard RL; scheduling and memory costs scale linearly with the number of training examples, but a single shared (global) schedule can be substituted for very large datasets.
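The reward-verification requirement above is mild in practice: an outcome-based checker only needs the final answer, regardless of how much of the rationale was revealed. A hypothetical checker of this kind (not the authors' evaluator) might look like:

```python
import re

def verify_answer(model_output, gold_answer):
    """Outcome-based binary reward (an illustrative assumption): extract the
    last number in the completion and compare it to the gold answer. Works
    whether the model wrote the whole rationale or only a suffix of it."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    return 1.0 if numbers and numbers[-1] == str(gold_answer) else 0.0
```

Benchmarks such as GSM8k commonly use exactly this style of final-answer extraction, which is why AdaBack can plug into existing RL reward pipelines without step-level annotation.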

Limitations

  • Marginal benefit when the base model has already seen most of the data during pretraining: experiments with heavily pretrained or instruction-tuned models (e.g., LLaMA Instruct) on non-novel data show no improvement; AdaBack cannot expand a solution space that pretraining or overfitting has already saturated.
  • In the high-data regime, per-sample adaptation loses effectiveness as examples are revisited rarely. The authors suggest cluster- or embedding-based adaptation as a scalable alternative.
  • Some tasks still require a reward function that is fine-grained enough to guide incremental reasoning; fully outcome-based, noisy, or highly discontinuous rewards may limit curriculum formation.

Implications and Future Directions

Practically, AdaBack lowers the data and engineering burden of developing reasoning-capable LMs for new domains: it enables more effective RL training in the presence of latent multi-step dependencies and data scarcity, and can facilitate the discovery of solution modes inaccessible to prior supervised or RL methods. The approach is generalizable, requiring no curriculum heuristics or step segmentation.

Theoretically, the findings provide a refined understanding of the limits of SFT and RL in sequence modeling. The strong separation between AdaBack and SFT/RL, especially in parity and symbolic generalization, challenges assumptions about what RL alone can achieve in current architectures and posits adaptive curriculum as a critical ingredient for the emergence of compositional reasoning.

Open research directions include scalable per-region adaptation, integration with richer reward models, and extension to unsupervised settings and continuous domains. AdaBack further motivates new benchmarks targeting reasoning compositionality and the incremental emergence of new skills.
