- The paper introduces AdaBack, a per-sample adaptive curriculum mechanism that adjusts revealed rationales based on reward signals.
- It bridges the gap between supervised fine-tuning and RL for multi-step reasoning tasks, improving sample efficiency and generalization.
- Empirical results on synthetic and real-world benchmarks show AdaBack outperforms traditional RL and SFT+RL pipelines in accuracy and diversity.
Adaptive Backtracking for Curriculum RL in Reasoning Tasks
"RL for Reasoning by Adaptively Revealing Rationales" (AdaBack) (2506.18110) studies reinforcement learning (RL) for sequence modeling tasks that require complex reasoning, focusing on the intractability of sparse-reward RL in structured domains. It introduces a per-sample curriculum method—adaptive backtracking (AdaBack)—that dynamically adjusts the supervision revealed during training. The central contributions are both empirical and conceptual, with strong evidence that AdaBack bridges the gap between supervised fine-tuning (SFT) and RL and, in specific regimes, outperforms both.
Motivation and Problem Framing
Sequence models trained for multi-step reasoning—such as mathematical problem solving, code generation, and algorithmic tasks—face critical challenges:
- Supervised fine-tuning (SFT) is highly sample-inefficient for long reasoning chains, as collecting ground-truth rationales is costly and the model may overfit to dataset artifacts or fail to generalize beyond the demonstrated rationales.
- RL methods (e.g., REINFORCE, PPO, GRPO) struggle with combinatorially large output spaces and sparse binary feedback, and tend to reinforce only reasoning paths that the pretrained model already assigns non-negligible likelihood.
Recent literature has highlighted that RL applied directly to sequence models rarely expands their solution space, and primarily reweights known responses rather than eliciting new reasoning capabilities. AdaBack directly addresses this by positioning itself in the intermediate regime: it relaxes the requirements for dense supervision, but avoids naïve end-to-end RL's unproductive exploration.
Method: Adaptive Backtracking (AdaBack)
AdaBack implements a per-example adaptive curriculum: at each training step, only a prefix of the reference rationale is revealed, and the suffix is generated by the model. The length of the revealed prefix is dynamically adjusted for each example, based on recent reward signals. When the model's reward on a given example falls below a threshold, the supervision ratio is increased for that sample; when it exceeds the threshold, supervision is reduced, exposing the model to a longer unassisted completion.
Technical summary:
- For each training instance i, maintain an interval [ρ_min^(i), ρ_max^(i)] of allowed supervision ratios.
- At each step, sample ρ^(i) ~ Uniform(ρ_min^(i), ρ_max^(i)), reveal that fraction of the rationale, and condition the model's generation on the revealed prefix.
- Adjust the interval via a threshold τ on recent average reward: increase supervision if the example is underperforming, otherwise reduce it. This operates as a per-sample stochastic binary search over supervision ratios.
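The per-sample interval update above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: class and method names (`AdaBackScheduler`, `sample_ratio`, `update`) are invented, rewards are assumed scalar in [0, 1], and the update uses a single observed reward rather than a running average for simplicity.

```python
import random

class AdaBackScheduler:
    """Per-sample supervision-ratio intervals, updated like a stochastic
    binary search (illustrative sketch of the mechanism described above)."""

    def __init__(self, num_examples, tau=0.5):
        self.tau = tau
        # Start every example with the full range of supervision ratios.
        self.lo = [0.0] * num_examples
        self.hi = [1.0] * num_examples

    def sample_ratio(self, i):
        # Fraction of the reference rationale to reveal for example i.
        return random.uniform(self.lo[i], self.hi[i])

    def update(self, i, ratio, reward):
        if reward < self.tau:
            # Underperforming: at least this much supervision is needed.
            self.lo[i] = ratio
        else:
            # Succeeding: this much supervision suffices; try revealing less.
            self.hi[i] = ratio
        # Guard against noisy rewards crossing the interval bounds.
        if self.lo[i] > self.hi[i]:
            self.lo[i], self.hi[i] = self.hi[i], self.lo[i]
```

Each success tightens the interval from above and each failure from below, so the sampled ratio converges toward the smallest amount of revealed rationale at which the model clears the reward threshold.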
Unlike previous attempts at curriculum RL (e.g., backtracking or static slicing in R3), AdaBack neither requires explicit step boundaries nor handcrafted schedules, making it generalizable to sequence tasks without delimiter cues.
Theoretical and Empirical Claims
A synthetic chain-of-parities task is used to analytically and empirically separate AdaBack from SFT, RL, and their combination. This task—modeled on stepwise parity functions—yields exponentially sparse rewards, making exploration by RL infeasible and requiring sample complexities for SFT that grow super-polynomially with sequence length. AdaBack, by contrast, solves the task efficiently by revealing all but one step early in the curriculum, then iteratively backtracking supervision as competence improves.
Empirical results show:
- For L=16 binary inputs with n=1024 train examples, SFT and RL (including R3) do not learn; AdaBack achieves high reward in under 700 iterations (see Figure 1 in the paper).
- The mechanism enables gradient-based learners to focus on lower-complexity subproblems at each stage, leveraging incremental exposure for tractable learning.
Experiments on Real-world Reasoning Datasets
Assessments on MATH and GSM8k benchmarks, along with modified variants (Base-7 numeric, Tensor-2 concatenation), demonstrate that:
- AdaBack consistently exceeds standard RL and SFT+RL pipelines in test accuracy, particularly in settings that stress generalization (e.g., symbolic domain shift, long reasoning chains).
- For example, on Tensor-2 GSM8k, AdaBack achieves 8.5% (1B) and 49.2% (3B) accuracy, compared to 0.0% with RL alone and 6.9% (1B) / 42.7% (3B) with SFT+RL.
- In many cases, AdaBack applied to a base model matches or surpasses SFT-initialized models trained with RL, indicating AdaBack supports the emergence of new reasoning trajectories beyond those learned by SFT.
- Pass@k (diversity metric) improvements under AdaBack support the claim that model output distributions are genuinely broadened, not merely reweighted.
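For readers unfamiliar with the metric, pass@k is typically computed with the standard unbiased estimator from the code-generation literature (the paper does not specify its estimator, so this is an assumption): given n sampled generations of which c are correct, it estimates the probability that at least one of k draws is correct.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k), i.e. the
    probability that k samples drawn without replacement from n
    generations include at least one of the c correct ones."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A genuinely broadened output distribution raises pass@k at large k (more distinct correct solutions), whereas pure reweighting mainly moves pass@1.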
Practical Implementation Considerations
For practical deployment in real-world reasoning tasks:
- AdaBack can be integrated atop any RL pipeline with multi-rollout support (e.g., GRPO, PPO) and does not require step boundary annotations.
- Reward models must be able to verify correctness given partial responses. For many domains, this is satisfied via evaluators checking for correctness or answer format.
- AdaBack is robust in low-supervision regimes, making it suitable for settings where full rationales are unavailable or strategies must generalize across symbolic domains.
- Computationally, the adaptive binary search over supervision ratios is efficient and imposes little overhead beyond standard RL; scheduling and memory costs scale linearly with the number of training examples, though supervision ratios can be shared globally or across clusters of examples for very large datasets.
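Putting the points above together, a single AdaBack rollout step can be sketched as follows. This is a schematic integration, not the paper's implementation: `generate` and `verify` are user-supplied stand-ins for the policy model and the outcome verifier, and the token-level prefix cut is an assumed simplification.

```python
def adaback_rollout(example, sample_ratio, generate, verify):
    """One AdaBack rollout (illustrative): reveal a prefix of the
    reference rationale, let the model complete the rest, and score
    the assembled response with an outcome verifier."""
    tokens = example["rationale"]          # reference rationale tokens
    rho = sample_ratio(example["id"])      # per-sample supervision ratio
    cut = int(rho * len(tokens))
    prefix = tokens[:cut]                  # revealed supervision
    completion = generate(example["question"], prefix)
    reward = verify(example["answer"], prefix + completion)
    return rho, reward                     # feed back into the scheduler
```

The returned (rho, reward) pair is exactly what the per-sample interval update needs, so this slots into any multi-rollout RL loop (e.g., GRPO or PPO) without step-boundary annotations.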
Limitations
- Marginal benefit when the base model has already seen most of the data during pretraining: experiments with heavily pretrained or instruction-tuned models (e.g., LLaMA Instruct) on non-novel data show no improvement—AdaBack cannot expand a solution space that pretraining or overfitting has already saturated.
- In the high-data regime, per-sample adaptation loses effectiveness as examples are revisited rarely. The authors suggest cluster- or embedding-based adaptation as a scalable alternative.
- Some tasks still require a reward function that is fine-grained enough to guide incremental reasoning; fully outcome-based, noisy, or highly discontinuous rewards may limit curriculum formation.
Implications and Future Directions
Practically, AdaBack lowers the data and engineering burden of developing reasoning-capable LMs for new domains: it enables more effective RL training in the presence of latent multi-step dependencies and data scarcity, and can facilitate the discovery of solution modes inaccessible to prior supervised or RL methods. The approach is generalizable, requiring no curriculum heuristics or step segmentation.
Theoretically, the findings provide a refined understanding of the limits of SFT and RL in sequence modeling. The strong separation between AdaBack and SFT/RL, especially in parity and symbolic generalization, challenges assumptions about what RL alone can achieve in current architectures and posits adaptive curriculum as a critical ingredient for the emergence of compositional reasoning.
Open research directions include scalable per-region adaptation, integration with richer reward models, and extension to unsupervised settings and continuous domains. AdaBack further motivates new benchmarks targeting reasoning compositionality and the incremental emergence of new skills.