- The paper establishes that supervised fine-tuning on curated data inherently optimizes a lower bound of the reinforcement learning objective.
- It introduces an importance weighting variant, iw-SFT, which reweights data quality to achieve tighter alignment with RL goals and practical improvements.
- Experimental results demonstrate competitive performance, including a 66.7% score on AIME 2024, highlighting the method's potential in LLM reasoning and control tasks.
Supervised Fine Tuning as Reinforcement Learning
Introduction
The paper "Supervised Fine Tuning on Curated Data is Reinforcement Learning (and can be improved)" (2507.12856) investigates the conceptual convergence between supervised fine-tuning (SFT) and reinforcement learning (RL) frameworks, particularly in the context of large language models (LLMs) and control policies. While behavior cloning through SFT on curated datasets is traditionally viewed as a distinct approach from RL, the authors argue that SFT can be reframed as a mechanism that optimizes a lower bound of the RL objective. Building on this perspective, the paper proposes enhancements to SFT, notably importance weighted supervised fine-tuning (iw-SFT), which achieves closer alignment with the RL objective and improved performance in practical applications.
Connection Between SFT and RL
The core assertion of the paper is that SFT on filtered datasets inherently maximizes a lower bound of the RL objective, particularly in sparse reward settings. This perspective builds on prior literature that casts RL as probabilistic inference and on iterative maximum-likelihood approaches such as reward-weighted regression (RWR). By recognizing SFT as optimizing a bound on the RL objective, the authors extend its applicability and effectiveness, providing theoretical clarity for its observed practical performance. This connection is timely as post-training strategies for LLMs continue to evolve.
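A minimal sketch of why SFT on curated data bounds the RL objective (our notation, not necessarily the paper's exact derivation): let $R(\tau)\in\{0,1\}$ be a sparse success reward, $J(\theta)=\mathbb{E}_{\tau\sim\pi_\theta}[R(\tau)]$ the RL objective, and $\mu$ the distribution of curated trajectories, supported only where $R(\tau)=1$. Then by Jensen's inequality,

```latex
\log J(\theta)
  = \log \mathbb{E}_{\tau \sim \mu}\!\left[\frac{\pi_\theta(\tau)}{\mu(\tau)}\, R(\tau)\right]
  \;\ge\; \mathbb{E}_{\tau \sim \mu}\!\left[\log \pi_\theta(\tau)\right]
        - \mathbb{E}_{\tau \sim \mu}\!\left[\log \mu(\tau)\right],
```

where $R(\tau)=1$ on the support of $\mu$. The second term is constant in $\theta$, so maximizing the SFT log-likelihood $\mathbb{E}_{\tau\sim\mu}[\log \pi_\theta(\tau)]$ maximizes a lower bound on $\log J(\theta)$.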
Importance Weighted Supervised Fine-Tuning
The paper introduces importance weighted supervised fine-tuning (iw-SFT), a variant that modifies traditional SFT by incorporating importance weighting. This iw-SFT approach seeks to tighten the lower bound in the RL framework by sampling data proportional to quality scores and reweighting the maximum likelihood objective. This adaptation provides several benefits:
- Tighter Approximation to RL: By reweighting data based on quality and aligned preferences, iw-SFT optimizes a tighter bound reflective of RL goals, potentially surpassing regular SFT performance.
- Simple Implementation: The proposed method is straightforward to implement, requiring minor modifications to the existing SFT framework.
- Competitive Performance: Experiments in the paper show that iw-SFT can outperform standard SFT, achieving strong results on reasoning and continuous control benchmarks, including 66.7% on AIME 2024.
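The reweighting idea above can be sketched in a few lines. This is an illustrative sketch, not the paper's released code: the function names and the softmax normalization of the weights are our assumptions, and the log quality weights stand in for whatever per-sequence quality or reward scores the curation pipeline provides.

```python
import numpy as np

def iw_sft_loss(token_logprobs, log_quality_weights):
    """Importance-weighted SFT loss (illustrative sketch, not the paper's code).

    token_logprobs: (batch, seq_len) log-probabilities the model assigns to
        the target tokens of each curated sequence.
    log_quality_weights: (batch,) per-sequence log importance weights derived
        from a quality/reward score (hypothetical scoring scheme).
    """
    # Sequence-level negative log-likelihood (the standard SFT objective).
    nll = -token_logprobs.sum(axis=-1)  # shape: (batch,)

    # Normalized importance weights; higher-quality sequences contribute more.
    w = np.exp(log_quality_weights - log_quality_weights.max())
    w = w / w.sum()

    # Reweighted maximum-likelihood objective.
    return float((w * nll).sum())
```

With uniform weights this reduces to the ordinary mean SFT loss; non-uniform weights tilt the objective toward higher-quality data, which is the mechanism behind the tighter bound.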
Experimental Evaluation
The efficacy of iw-SFT was tested across various benchmarks and settings, demonstrating competitive or superior performance compared to advanced RL strategies:
- LLM Reasoning: LLMs trained with iw-SFT exhibit enhanced reasoning abilities, consistent with the method optimizing a tighter bound on the RL objective. Notably, the approach achieved performance improvements on reasoning-focused benchmarks.
- Continuous Control Tasks: In offline RL settings, iw-SFT proved competitive with state-of-the-art algorithms such as AWAC and IQL. This shows the method extends beyond human-alignment tasks for LLMs to training policies for continuous control tasks on the D4RL benchmark.
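In the offline control setting, "curating" data amounts to favoring high-return trajectories. A minimal sketch of such return-proportional sampling follows; the softmax weighting and the temperature parameter are our assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def return_weighted_sampler(returns, temperature=1.0, rng=None):
    """Sample a trajectory index with probability proportional to a softmax
    of trajectory returns (hypothetical curation scheme for offline data)."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(returns, dtype=float) / temperature
    # Numerically stable softmax over trajectory returns.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```

A low temperature concentrates sampling on the best trajectories (approaching filtered behavior cloning), while a high temperature approaches uniform sampling over the offline dataset.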
Future Implications
This work points to significant implications for improving AI model post-training strategies. The theoretical underpinning that relates SFT to RL could lead to more refined alignment protocols that balance stability with performance gains. Future developments may extend these bounding techniques to more complex and diverse datasets, exploring optimizations that leverage user-driven feedback and adaptive learning.
Conclusion
The paper elucidates the relationship between supervised fine-tuning and reinforcement learning and proposes a practical enhancement through importance weighting. By reframing SFT as an approximation to RL optimization, researchers can gain finer control over model alignment while retaining SFT's stability and simplicity. This work opens avenues for hybrid approaches to AI model training, establishing foundational insights that can guide future work on model optimization and reasoning capabilities in large-scale AI systems.