
How Much Backtracking is Enough? Exploring the Interplay of SFT and RL in Enhancing LLM Reasoning

Published 30 May 2025 in cs.AI (arXiv:2505.24273v1)

Abstract: Recent breakthroughs in LLMs have effectively improved their reasoning abilities, particularly on mathematical and logical problems that have verifiable answers, through techniques such as supervised finetuning (SFT) and reinforcement learning (RL). Prior research indicates that RL effectively internalizes search strategies, enabling long chain-of-thought (CoT) reasoning, with backtracking emerging naturally as a learned capability. However, the precise benefits of backtracking, specifically, how significantly it contributes to reasoning improvements and the optimal extent of its use, remain poorly understood. In this work, we systematically investigate the dynamics between SFT and RL on eight reasoning tasks: Countdown, Sudoku, Arc 1D, Geometry, Color Cube Rotation, List Functions, Zebra Puzzles, and Self Reference. Our findings highlight that short CoT sequences used in SFT as a warm-up make a moderate contribution to RL training compared with cold-start RL; however, this contribution diminishes as tasks become increasingly difficult. Motivated by this observation, we construct synthetic datasets varying systematically in the number of backtracking steps and conduct controlled experiments to isolate the influence of either the correctness (content) or the structure (i.e., backtrack frequency). We find that (1) longer CoT sequences with backtracks generally induce better and more stable RL training, and (2) more challenging problems with larger search spaces tend to require more backtracks during the SFT stage. Additionally, we demonstrate through experiments on distilled data that RL training is largely unaffected by the correctness of long CoT sequences, suggesting that RL prioritizes structural patterns over content correctness. Collectively, our results offer practical insights into designing optimal training strategies to effectively scale reasoning in LLMs.

Summary

  • The paper demonstrates that integrating backtracking in SFT warm-ups enhances RL efficacy on tasks like Sudoku and Countdown.
  • The study reveals that RL emphasizes structural chain-of-thought patterns over content accuracy, refining search strategies.
  • Experiments show that increased backtracking in training significantly improves performance on more complex reasoning challenges.

Enhancing LLM Reasoning: Interplay of SFT and RL

The paper explores the interaction between supervised finetuning (SFT) and reinforcement learning (RL) in enhancing the reasoning abilities of LLMs on eight distinct reasoning tasks. The focus is on understanding the role and extent of backtracking in optimizing strategies during these processes.

Introduction

The investigation explores the reasoning improvements facilitated by long chain-of-thought (CoT) sequences in LLMs. Prior efforts in RL for LLMs emphasize their potential to strengthen search strategy internalization, manifesting in sophisticated reasoning traces. The central inquiry focuses on delineating the contribution of backtracking to reasoning efficacy and identifying optimal backtracking frequencies to improve LLM performance (Figure 1).

Figure 1: We perform a controlled post-training pipeline study by curating synthetic datasets for the Sudoku, Countdown, and Arc 1D tasks, varying the number of backtracks.

Methodology

The methodology involves a systematic examination of training data mixtures combining SFT and RL across varied backtracking steps in synthetic datasets. The study encompasses eight reasoning tasks, emphasizing the influence of backtracking on RL performance and the utility of SFT warm-ups. The tasks include Countdown, Sudoku, and Arc 1D, among others.
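The dataset construction with a controlled number of backtracks can be sketched as below. This is an illustrative assumption, not the paper's actual data format: `make_trace`, the step vocabulary, and the "dead end, backtrack" marker are all hypothetical stand-ins for how a trace with exactly `k` backtracks might be synthesized from a known solution path.

```python
import random

def make_trace(solution_path, wrong_branches, k, seed=0):
    """Hypothetical sketch: build a CoT trace containing exactly k backtracks
    by interleaving k dead-end explorations (each closed with an explicit
    'backtrack' marker) into the correct solution path at random depths."""
    rng = random.Random(seed)
    # Choose, with repetition, the depths at which dead ends are inserted.
    depths = sorted(rng.choices(range(len(solution_path)), k=k))
    lines, di = [], 0
    for depth, step in enumerate(solution_path):
        # Emit any scheduled wrong-branch explorations before this step.
        while di < len(depths) and depths[di] == depth:
            lines.append(f"try {rng.choice(wrong_branches)} -> dead end, backtrack")
            di += 1
        lines.append(f"take {step}")
    lines.append("answer found")
    return "\n".join(lines)
```

Varying `k` across otherwise identical problems yields the controlled SFT mixtures the study describes.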

Reasoning Tasks

Tasks examined range from constructing arithmetic expressions in Countdown to determining transformations in Arc 1D tasks, showcasing diverse problem complexities.
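Countdown, for example, asks for an arithmetic expression over a set of numbers that reaches a target, which makes answers cheap to verify. A minimal brute-force sketch, assuming a simplified variant in which every number is used exactly once and division must be exact:

```python
from itertools import permutations, product

def countdown(numbers, target):
    """Exhaustively search left-to-right expressions over `numbers` that
    evaluate to `target`; returns a parenthesized expression or None."""
    ops = {'+': lambda a, b: a + b,
           '-': lambda a, b: a - b,
           '*': lambda a, b: a * b,
           # Only allow exact integer division.
           '/': lambda a, b: a // b if b != 0 and a % b == 0 else None}
    for perm in permutations(numbers):
        for op_seq in product(ops, repeat=len(numbers) - 1):
            acc, expr = perm[0], str(perm[0])
            for op, n in zip(op_seq, perm[1:]):
                acc = ops[op](acc, n)
                if acc is None:  # illegal division: abandon this branch
                    break
                expr = f"({expr} {op} {n})"
            if acc == target:
                return expr
    return None
```

The existence of such a verifier is what makes these tasks suitable for RL with verifiable rewards.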

Model Training

Qwen2.5-3B-Instruct serves as the base model for SFT and the subsequent RL training. Rollout lengths vary with task complexity, ensuring thorough exploration of task-specific reasoning patterns during training.

Results

Self-sampled SFT and RL Enhancements

Self-sampled short CoTs, when used in SFT, enhance RL training performance. However, they show diminishing returns on tasks solvable via classic search algorithms. The study underscores RL's role in amplifying exemplary reasoning patterns acquired during pretraining.
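Self-sampling amounts to rejection sampling against a task verifier. A hypothetical sketch, where `sample_fn`, `verify_fn`, and the record format are illustrative assumptions rather than the paper's implementation:

```python
def self_sample_sft_data(prompts, sample_fn, verify_fn, n_samples=8):
    """Hypothetical sketch: for each prompt, draw up to n_samples candidate
    CoTs from the current model (sample_fn) and keep the first one whose
    final answer passes the task verifier (verify_fn). The retained
    (prompt, completion) pairs form the short-CoT SFT warm-up set."""
    dataset = []
    for prompt in prompts:
        for _ in range(n_samples):
            cot = sample_fn(prompt)
            if verify_fn(prompt, cot):
                dataset.append({"prompt": prompt, "completion": cot})
                break  # one verified trace per prompt suffices for warm-up
    return dataset
```

Prompts whose samples never verify are simply dropped, which biases the warm-up set toward problems the base model can already solve occasionally.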

Performance Insensitivity to CoT Correctness

Whether the CoTs are correct or incorrect minimally affects final RL performance. This suggests that RL focuses more on structural patterns during training than on content correctness, a significant insight for dataset preparation (Figure 2).

Figure 2: Response length comparison between List Functions and Countdown.

Backtracking Necessity in Difficult Tasks

Controlled experiments demonstrate that the need for backtracking demonstrations scales with problem difficulty: complex tasks benefit more substantially from increased backtracking steps in SFT data, enhancing RL training efficacy.
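One way to operationalize "larger search spaces need more backtracks" is to count the dead ends a depth-first solver hits before finding a solution. The toy graph-search sketch below is an illustrative proxy of this idea, not the authors' measurement:

```python
def dfs_count_backtracks(neighbors, start, is_goal, max_depth=20):
    """Depth-first search that counts backtracks (branches abandoned as
    dead ends), a rough proxy for how much backtracking a problem's
    search space demands."""
    backtracks = 0

    def dfs(node, depth, visited):
        nonlocal backtracks
        if is_goal(node):
            return [node]
        if depth == max_depth:
            backtracks += 1
            return None
        for nxt in neighbors(node):
            if nxt in visited:
                continue
            path = dfs(nxt, depth + 1, visited | {nxt})
            if path is not None:
                return [node] + path
        backtracks += 1  # every child failed: abandon this branch
        return None

    return dfs(start, 0, {start}), backtracks
```

Problems with a higher backtrack count under such a solver are exactly the ones the paper finds to need more backtracking demonstrations in SFT.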

Conclusion

The interplay between SFT and RL, modulated by backtracking, holds promise for advancing LLM reasoning abilities. The findings advocate for calibrated SFT stages that incorporate backtracking levels tailored to task difficulty, thereby optimizing RL training outcomes. Future research could extend these insights to larger models and more challenging reasoning datasets, further exploring data scaling and pretraining strategies.
