RL-then-SFT Coupling in LLMs
- RL-then-SFT coupling is a training paradigm where a large language model is first optimized via reinforcement learning to maximize reward signals, then refined using supervised fine-tuning.
- Theoretical analysis demonstrates that applying SFT after an RL phase degrades the reward objective due to misalignment between supervised and reward-based updates.
- Empirical studies confirm that post-RL SFT can sharply reduce achieved rewards, highlighting the need for balanced, integrated multi-objective optimization strategies.
RL-then-SFT Coupling
RL-then-SFT coupling refers to the sequential or interleaved use of reinforcement learning (RL) followed by supervised fine-tuning (SFT) within the post-training pipeline of LLMs. In this paradigm, a model is first adapted or optimized via RL (using reward-based or preference-driven objectives) and then further fine-tuned with supervised learning on curated datasets, typically to reinforce alignment or retain specific behaviors. The theoretical properties, practical consequences, and optimization dynamics of this order of operations present unique features distinct from the classical SFT-then-RL pipeline, with recent work providing both formal lower bounds and empirical validations of their non-trivial coupling (Niu et al., 12 Jan 2026).
1. Mathematical Characterization of RL-then-SFT Coupling
Let $\pi_\theta(y \mid x)$ denote the LLM's conditional output distribution, parameterized by $\theta$. The RL stage seeks to maximize an expected reward signal, possibly with KL regularization:

$$J_{\mathrm{RL}}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big] - \frac{1}{\beta}\, \mathbb{E}_{x \sim \mathcal{D}}\Big[\mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)\Big],$$

where $r(x, y)$ is a reward function and $\pi_{\mathrm{ref}}$ is a reference policy.
After RL, the model parameters $\theta_{\mathrm{RL}}$ encode this adaptation. If one then applies SFT (minimizing cross-entropy on a (potentially disjoint) supervised dataset $\mathcal{D}_{\mathrm{SFT}}$), the SFT objective becomes:

$$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\,\mathbb{E}_{(x, y) \sim \mathcal{D}_{\mathrm{SFT}}}\big[\log \pi_\theta(y \mid x)\big].$$
SFT further updates $\theta_{\mathrm{RL}}$ to $\theta_{\mathrm{SFT}}$. The core question is: how does this second SFT stage affect the reward performance acquired during RL?
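As a concrete toy illustration of the two objectives (all distributions, rewards, and labels below are hypothetical), the following Python sketch evaluates $J_{\mathrm{RL}}$ and $\mathcal{L}_{\mathrm{SFT}}$ for a categorical policy over four outputs:

```python
import math

def kl(p, q):
    """KL divergence between two categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def j_rl(pi, ref, r, beta):
    """KL-regularized RL objective: E_pi[r] - (1/beta) * KL(pi || ref)."""
    e_r = sum(p * ri for p, ri in zip(pi, r))
    return e_r - kl(pi, ref) / beta   # beta = inverse temperature

def l_sft(pi, data):
    """SFT cross-entropy: average negative log-likelihood of the labels."""
    return -sum(math.log(pi[y]) for y in data) / len(data)

ref  = [0.25, 0.25, 0.25, 0.25]   # reference policy over 4 outputs
r    = [0.0, 0.2, 0.5, 1.0]       # hypothetical reward per output
pi   = [0.1, 0.2, 0.3, 0.4]       # current policy
data = [0, 1, 1, 2]               # hypothetical supervised labels (output indices)

print(f"J_RL  = {j_rl(pi, ref, r, 10.0):.4f}")   # reward objective
print(f"L_SFT = {l_sft(pi, data):.4f}")          # supervised objective
```

The two objectives pull the same distribution $\pi$ in different directions: $J_{\mathrm{RL}}$ rewards mass on high-$r$ outputs, while $\mathcal{L}_{\mathrm{SFT}}$ rewards mass on the labeled outputs.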
2. Theoretical Non-Decoupling: SFT Degrades RL-Optimal Reward
The crux of (Niu et al., 12 Jan 2026) is a formal demonstration that SFT and RL are non-decoupling—i.e., cannot be treated as modular blocks in either order without loss of prior performance. Specifically, their Theorem 3.1 and extensions show that SFT applied after RL (RL-then-SFT) necessarily leads to a decrease in the previously achieved RL objective.
Main result: If $\pi_{\theta_{\mathrm{RL}}}$ is RL-optimal (maximizing $J_{\mathrm{RL}}$), then the new SFT-optimal checkpoint $\pi_{\theta_{\mathrm{SFT}}}$ (obtained via further SFT) satisfies:

$$J_{\mathrm{RL}}(\theta_{\mathrm{SFT}}) \le J_{\mathrm{RL}}(\theta_{\mathrm{RL}}) - \Delta, \qquad \Delta \ge 0,$$

with

$$\pi_{\theta_{\mathrm{RL}}}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big(\beta\, r(x, y)\big),$$

where $Z(x)$ is a normalizer, and $\beta$ is the trade-off (inverse-temperature) parameter.
For typical choices of SFT and RL distributions and objectives, $\Delta > 0$ strictly, so SFT after RL will lower the achieved reward. Unless the SFT dataset and the RL reward are perfectly aligned (a degenerate case), this degradation is unavoidable.
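The content of the bound can be checked numerically on a toy example, assuming the Gibbs form above; the reference policy, rewards, and SFT label distribution below are hypothetical:

```python
import math

def gibbs_tilt(ref, r, beta):
    """RL-optimal policy: pi(y) proportional to ref(y) * exp(beta * r(y))."""
    w = [p * math.exp(beta * ri) for p, ri in zip(ref, r)]
    z = sum(w)                         # normalizer Z
    return [wi / z for wi in w]

def j_rl(pi, ref, r, beta):
    """KL-regularized objective E_pi[r] - (1/beta) * KL(pi || ref)."""
    kl = sum(p * math.log(p / q) for p, q in zip(pi, ref) if p > 0)
    return sum(p * ri for p, ri in zip(pi, r)) - kl / beta

ref  = [0.25, 0.25, 0.25, 0.25]        # reference policy
r    = [0.0, 0.2, 0.5, 1.0]            # hypothetical rewards
beta = 4.0                             # inverse temperature

pi_rl  = gibbs_tilt(ref, r, beta)      # RL-optimal checkpoint
pi_sft = [0.25, 0.50, 0.25, 0.0]       # SFT-optimal = empirical label distribution

delta = j_rl(pi_rl, ref, r, beta) - j_rl(pi_sft, ref, r, beta)
print(f"Delta = {delta:.4f}")          # positive: the SFT-optimal point loses reward
```

Because the Gibbs-tilted policy is the unique maximizer of $J_{\mathrm{RL}}$, any distinct SFT-optimal distribution yields $\Delta > 0$, matching the theory.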
3. Empirical Evidence of RL-then-SFT Degradation
Experimental results on Qwen3-0.6B and the CoLA acceptability dataset validate the above theory (Niu et al., 12 Jan 2026). When SFT is applied after reward-maximizing RL:
- The expected reward (e.g., correct acceptability classification as measured by $r$) drops sharply at the onset of SFT.
- Continued SFT further erodes the reward, sometimes reducing performance below even that of the pre-trained base model.
- This degradation persists even when SFT loss continues to decrease on its own objective—highlighting the misalignment between the supervised and reward-driven signals.
4. Mechanistic Explanation: Probabilistic and Optimization Perspective
The analytical reason for RL-then-SFT incompatibility is the "Gibbs re-weighting" structure of the RL-optimal solution. RL tilts the reference (SFT-trained) policy by the factor $\exp\!\big(\beta\, r(x, y)\big)$, concentrating mass on high-reward outputs not necessarily favored by the SFT distribution. SFT, in turn, reallocates probability mass back toward the empirical supervised likelihood, which (except in degenerate alignment) must undo the reward-based concentration.
Thus, any SFT update that is not perfectly aligned with the RL-induced distribution inevitably transfers mass away from high-reward samples, and the expected reward decreases monotonically under standard SFT training.
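The mass-transfer argument can be visualized with a stylized sketch that interpolates in distribution space (rather than parameter space); the policies and rewards here are hypothetical:

```python
# Mixing an RL-tilted policy toward an SFT target shifts probability mass
# off high-reward outputs, so the expected reward falls monotonically.
r      = [0.0, 0.2, 0.5, 1.0]            # hypothetical rewards
pi_rl  = [0.015, 0.034, 0.113, 0.838]    # reward-concentrated (RL-like) policy
pi_sft = [0.25, 0.50, 0.25, 0.0]         # supervised empirical distribution

def expected_reward(pi):
    return sum(p * ri for p, ri in zip(pi, r))

rewards = []
for step in range(5):                    # t = 0, 0.25, ..., 1.0
    t = step / 4
    mix = [(1 - t) * a + t * b for a, b in zip(pi_rl, pi_sft)]
    rewards.append(expected_reward(mix))

print([round(v, 3) for v in rewards])    # strictly decreasing toward E_sft[r]
```

Because the expected reward is linear in the distribution, every step toward the supervised target lowers it whenever the SFT distribution achieves less reward than the RL-tilted one.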
5. Practical Implications and Recommendations
The non-decoupling result of RL-then-SFT coupling has immediate consequences for model design and post-training protocols:
- RL and SFT cannot be safely composed as plug-and-play blocks. Each must be aware of the other's objective, with proper multi-objective or alternating optimization.
- If SFT must follow RL (for integration of new demonstrations or behavior correction), it should be done with mild weighting, or with a loss composed to balance reward and supervised signals, for example $\mathcal{L}_{\mathrm{mix}}(\theta) = \mathcal{L}_{\mathrm{SFT}}(\theta) - \lambda\, \mathbb{E}_{y \sim \pi_\theta}\big[r(x, y)\big]$ with a small mixing weight $\lambda > 0$.
- Monitoring of the reward objective during SFT is critical. Any sharp drop signals undesirable overwriting of RL-acquired adaptation.
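One way to implement the "balanced loss" recommendation is a single mixed objective. The sketch below (a softmax policy over raw logits; the logits, labels, rewards, and weight $\lambda$ are all hypothetical) subtracts a weighted reward bonus from the SFT negative log-likelihood:

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def mixed_loss(logits, data, r, lam):
    """L_mix = L_SFT - lam * E_pi[r]: supervised NLL minus a reward bonus."""
    pi = softmax(logits)
    l_sft = -sum(math.log(pi[y]) for y in data) / len(data)
    e_r = sum(p * ri for p, ri in zip(pi, r))
    return l_sft - lam * e_r

logits = [0.0, 0.5, 1.0, 2.0]   # hypothetical model logits over 4 outputs
data   = [1, 1, 2]              # hypothetical supervised labels
r      = [0.0, 0.2, 0.5, 1.0]   # reward per output

print(f"lam=0.0: {mixed_loss(logits, data, r, 0.0):.4f}")  # pure SFT loss
print(f"lam=0.5: {mixed_loss(logits, data, r, 0.5):.4f}")  # reward-tempered SFT
```

Setting $\lambda = 0$ recovers plain SFT; increasing $\lambda$ penalizes updates that drain probability mass from high-reward outputs, which is exactly the failure mode the theory predicts for unweighted post-RL SFT.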
Table: Consequences of RL–then–SFT Coupling
| Order | What is Optimized | What Degrades |
|---|---|---|
| SFT → RL | RL reward (at cost of SFT) | SFT cross-entropy loss |
| RL → SFT | SFT loss (at cost of RL) | RL reward |
6. Connections to Related Hybrid and Adaptive Paradigms
Most published high-performing pipelines avoid strict RL-then-SFT or SFT-then-RL. Instead, they employ:
- Interleaved or joint training: SRFT and bilevel optimization methods (e.g., BRIDGE (Chen et al., 8 Sep 2025)) maintain the influence of both signals throughout, controlling trade-offs with explicit coefficients.
- Dynamic or curriculum-inspired routing of data or objectives (e.g., PRISM (Zhao et al., 12 Jan 2026), SASR (Chen et al., 19 May 2025)) such that each example is assigned to the regime most suited to its structure, as judged by internal model signals (e.g., gradient concentration or loss dynamics).
- Alternating multi-objective optimization to maintain Pareto efficiency, selecting schedules, or adaptive mixing based on gradient norms or entropy.
All these approaches sidestep the conflict of hard-staged RL-then-SFT, which is unresolvable in practice, instead seeking to preserve both reward and supervised alignment in a principled, coordinated fashion.
7. Summary and Perspectives
RL-then-SFT coupling is a fundamentally non-separable, antagonistic interaction in LLM post-training and agent alignment. Formally, applying SFT after RL-optimal adaptation will always incur a loss of previously attained reward unless the objectives are perfectly matched. This makes naïve RL-then-SFT sequencing unsuitable for scalable, robust agent alignment. Modern frameworks achieve superior robustness and efficiency through diagnostic routing, dynamic loss trade-offs, and joint optimization. These findings compel practitioners to design RL and SFT pipelines not as modular black boxes, but as intrinsically coupled, mutually influencing processes, best optimized by explicit multi-objective or bilevel formulations (Niu et al., 12 Jan 2026, Zhao et al., 12 Jan 2026, Chen et al., 8 Sep 2025).