- The paper establishes that supervised fine-tuning on curated data inherently optimizes a lower bound of the reinforcement learning objective.
- It introduces an importance weighting variant, iw-SFT, which reweights data quality to achieve tighter alignment with RL goals and practical improvements.
- Experimental results demonstrate competitive performance, including a 66.7% score on AIME 2024, highlighting the method's potential in LLM reasoning and control tasks.
Supervised Fine Tuning as Reinforcement Learning
Introduction
The paper "Supervised Fine Tuning on Curated Data is Reinforcement Learning (and can be improved)" (2507.12856) investigates the conceptual convergence between supervised fine-tuning (SFT) and reinforcement learning (RL) frameworks, particularly in the context of large language models (LLMs) and control policies. While behavior cloning through SFT on curated datasets is traditionally viewed as a distinct approach from RL, the authors argue that SFT can be reframed as a mechanism that optimizes a lower bound of the RL objective. Building on this perspective, the paper proposes enhancements to SFT, notably importance weighted supervised fine-tuning (iw-SFT), which achieves closer alignment with the RL objective and improved performance in practical applications.
Connection Between SFT and RL
The core assertion of the paper is that SFT on filtered datasets inherently maximizes a lower bound of the RL objective, particularly in sparse reward settings. This perspective builds on prior literature that casts RL as probabilistic inference and on iterative maximum-likelihood approaches such as reward-weighted regression (RWR). By recognizing SFT as optimizing a bound on the RL objective, the authors extend its applicability and effectiveness, providing theoretical clarity for its observed practical performance. This connection is timely as post-training strategies for LLMs continue to evolve.
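A minimal sketch of why SFT on curated data bounds the RL objective (our notation, not necessarily the paper's exact derivation): let $R(\tau)\in\{0,1\}$ be a sparse success reward, $J(\theta)=\mathbb{E}_{\tau\sim\pi_\theta}[R(\tau)]$ the RL objective, and $\mu$ the distribution of curated trajectories, supported only where $R(\tau)=1$. Then by Jensen's inequality,

```latex
\log J(\theta)
  = \log \mathbb{E}_{\tau \sim \mu}\!\left[\frac{\pi_\theta(\tau)}{\mu(\tau)}\, R(\tau)\right]
  \;\ge\; \mathbb{E}_{\tau \sim \mu}\!\left[\log \pi_\theta(\tau)\right]
        - \mathbb{E}_{\tau \sim \mu}\!\left[\log \mu(\tau)\right],
```

where $R(\tau)=1$ on the support of $\mu$. The second term is constant in $\theta$, so maximizing the SFT log-likelihood $\mathbb{E}_{\tau\sim\mu}[\log \pi_\theta(\tau)]$ maximizes a lower bound on $\log J(\theta)$.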
Importance Weighted Supervised Fine-Tuning
The paper introduces importance weighted supervised fine-tuning (iw-SFT), a variant that modifies traditional SFT by incorporating importance weighting. This iw-SFT approach seeks to tighten the lower bound in the RL framework by sampling data proportional to quality scores and reweighting the maximum likelihood objective. This adaptation provides several benefits:
- Tighter Approximation to RL: By reweighting data based on quality and aligned preferences, iw-SFT optimizes a tighter bound reflective of RL goals, potentially surpassing regular SFT performance.
- Simple Implementation: The proposed method is straightforward to implement, requiring minor modifications to the existing SFT framework.
- Competitive Performance: Experiments in the paper show that iw-SFT can outperform standard SFT, achieving strong results on reasoning and continuous control benchmarks, including 66.7% on AIME 2024.
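The reweighting idea above can be sketched in a few lines. This is an illustrative sketch, not the paper's released code: the function names and the softmax normalization of the weights are our assumptions, and the log quality weights stand in for whatever per-sequence quality or reward scores the curation pipeline provides.

```python
import numpy as np

def iw_sft_loss(token_logprobs, log_quality_weights):
    """Importance-weighted SFT loss (illustrative sketch, not the paper's code).

    token_logprobs: (batch, seq_len) log-probabilities the model assigns to
        the target tokens of each curated sequence.
    log_quality_weights: (batch,) per-sequence log importance weights derived
        from a quality/reward score (hypothetical scoring scheme).
    """
    # Sequence-level negative log-likelihood (the standard SFT objective).
    nll = -token_logprobs.sum(axis=-1)  # shape: (batch,)

    # Normalized importance weights; higher-quality sequences contribute more.
    w = np.exp(log_quality_weights - log_quality_weights.max())
    w = w / w.sum()

    # Reweighted maximum-likelihood objective.
    return float((w * nll).sum())
```

With uniform weights this reduces to the ordinary mean SFT loss; non-uniform weights tilt the objective toward higher-quality data, which is the mechanism behind the tighter bound.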
Experimental Evaluation
The efficacy of iw-SFT was tested across various benchmarks and settings, demonstrating competitive or superior performance compared to advanced RL strategies:
- LLM Reasoning: LLMs trained with iw-SFT exhibit enhanced reasoning abilities, consistent with the method optimizing a tighter bound on the RL objective. Notably, the approach achieved performance improvements on reasoning-focused benchmarks.
- Continuous Control Tasks: In offline RL settings, iw-SFT proved competitive with state-of-the-art algorithms such as AWAC and IQL. This shows the method extends beyond human-alignment tasks for LLMs to training policies for continuous control tasks on the D4RL benchmark.
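In the offline control setting, "curating" data amounts to favoring high-return trajectories. A minimal sketch of such return-proportional sampling follows; the softmax weighting and the temperature parameter are our assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def return_weighted_sampler(returns, temperature=1.0, rng=None):
    """Sample a trajectory index with probability proportional to a softmax
    of trajectory returns (hypothetical curation scheme for offline data)."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(returns, dtype=float) / temperature
    # Numerically stable softmax over trajectory returns.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```

A low temperature concentrates sampling on the best trajectories (approaching filtered behavior cloning), while a high temperature approaches uniform sampling over the offline dataset.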
Future Implications
This work points to significant implications for improving AI model post-training strategies. The theoretical underpinning that relates SFT to RL could lead to more refined alignment protocols that balance stability with performance gains. Future developments may extend these bounding techniques to more complex and diverse datasets, exploring optimizations that leverage user-driven feedback and adaptive learning.
Conclusion
The paper elucidates the relationship between supervised fine-tuning and reinforcement learning and proposes a practical enhancement through importance weighting. By reframing SFT as an approximation to RL optimization, researchers can gain finer control over model alignment while retaining SFT's stability and simplicity. This work opens avenues for hybrid approaches to AI model training, establishing foundational insights that can guide future work on model optimization and reasoning capabilities in large-scale AI systems.