RL-then-SFT Coupling in LLMs
- RL-then-SFT coupling is a training paradigm where a large language model is first optimized via reinforcement learning to maximize reward signals, then refined using supervised fine-tuning.
- Theoretical analysis demonstrates that applying SFT after an RL phase degrades the reward objective due to misalignment between supervised and reward-based updates.
- Empirical studies confirm that post-RL SFT can sharply reduce achieved rewards, highlighting the need for balanced, integrated multi-objective optimization strategies.
RL-then-SFT Coupling
RL-then-SFT coupling refers to the sequential or interleaved use of reinforcement learning (RL) followed by supervised fine-tuning (SFT) within the post-training pipeline of LLMs. In this paradigm, a model is first adapted or optimized via RL (using reward-based or preference-driven objectives) and then further fine-tuned with supervised learning on curated datasets, typically to reinforce alignment or retain specific behaviors. The theoretical properties, practical consequences, and optimization dynamics of this order of operations present unique features distinct from the classical SFT-then-RL pipeline, with recent work providing both formal lower bounds and empirical validations of their non-trivial coupling (Niu et al., 12 Jan 2026).
1. Mathematical Characterization of RL-then-SFT Coupling
Let $\pi_\theta(y \mid x)$ denote the LLM's conditional output distribution, parameterized by $\theta$. The RL stage seeks to maximize an expected reward signal, possibly with KL regularization:

$$J_{\mathrm{RL}}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big] - \frac{1}{\beta}\, \mathbb{E}_{x \sim \mathcal{D}}\Big[\mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)\Big],$$

where $r(x, y)$ is a reward function and $\pi_{\mathrm{ref}}$ is a reference policy.
After RL, the model parameters $\theta_{\mathrm{RL}}$ encode this adaptation. If one then applies SFT (minimizing cross-entropy on a (potentially disjoint) supervised dataset $\mathcal{D}_{\mathrm{SFT}}$), the SFT objective becomes:

$$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\,\mathbb{E}_{(x, y) \sim \mathcal{D}_{\mathrm{SFT}}}\big[\log \pi_\theta(y \mid x)\big].$$
SFT further updates $\theta_{\mathrm{RL}}$ to $\theta_{\mathrm{SFT}}$. The core question is: how does this second SFT stage affect the reward performance acquired during RL?
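As a concrete toy illustration of the two objectives (all distributions, rewards, and labels below are hypothetical), the following Python sketch evaluates $J_{\mathrm{RL}}$ and $\mathcal{L}_{\mathrm{SFT}}$ for a categorical policy over four outputs:

```python
import math

def kl(p, q):
    """KL divergence between two categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def j_rl(pi, ref, r, beta):
    """KL-regularized RL objective: E_pi[r] - (1/beta) * KL(pi || ref)."""
    e_r = sum(p * ri for p, ri in zip(pi, r))
    return e_r - kl(pi, ref) / beta   # beta = inverse temperature

def l_sft(pi, data):
    """SFT cross-entropy: average negative log-likelihood of the labels."""
    return -sum(math.log(pi[y]) for y in data) / len(data)

ref  = [0.25, 0.25, 0.25, 0.25]   # reference policy over 4 outputs
r    = [0.0, 0.2, 0.5, 1.0]       # hypothetical reward per output
pi   = [0.1, 0.2, 0.3, 0.4]       # current policy
data = [0, 1, 1, 2]               # hypothetical supervised labels (output indices)

print(f"J_RL  = {j_rl(pi, ref, r, 10.0):.4f}")   # reward objective
print(f"L_SFT = {l_sft(pi, data):.4f}")          # supervised objective
```

The two objectives pull the same distribution $\pi$ in different directions: $J_{\mathrm{RL}}$ rewards mass on high-$r$ outputs, while $\mathcal{L}_{\mathrm{SFT}}$ rewards mass on the labeled outputs.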
2. Theoretical Non-Decoupling: SFT Degrades RL-Optimal Reward
The crux of (Niu et al., 12 Jan 2026) is a formal demonstration that SFT and RL are non-decoupling—i.e., cannot be treated as modular blocks in either order without loss of prior performance. Specifically, their Theorem 3.1 and extensions show that SFT applied after RL (RL-then-SFT) necessarily leads to a decrease in the previously achieved RL objective.
Main result: If $\pi_{\theta_{\mathrm{RL}}}$ is RL-optimal (maximizing $J_{\mathrm{RL}}$), then the new SFT-optimal checkpoint $\pi_{\theta_{\mathrm{SFT}}}$ (obtained via further SFT) satisfies:

$$J_{\mathrm{RL}}(\theta_{\mathrm{SFT}}) \le J_{\mathrm{RL}}(\theta_{\mathrm{RL}}) - \Delta, \qquad \Delta \ge 0,$$

with

$$\pi_{\theta_{\mathrm{RL}}}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big(\beta\, r(x, y)\big),$$

where $Z(x)$ is a normalizer, and $\beta$ is the trade-off (inverse-temperature) parameter.
For typical choices of SFT and RL distributions and objectives, $\Delta > 0$ strictly, so SFT after RL will lower the achieved reward. Unless the SFT dataset and the RL reward are perfectly aligned (a degenerate case), this degradation is unavoidable.
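The content of the bound can be checked numerically on a toy example, assuming the Gibbs form above; the reference policy, rewards, and SFT label distribution below are hypothetical:

```python
import math

def gibbs_tilt(ref, r, beta):
    """RL-optimal policy: pi(y) proportional to ref(y) * exp(beta * r(y))."""
    w = [p * math.exp(beta * ri) for p, ri in zip(ref, r)]
    z = sum(w)                         # normalizer Z
    return [wi / z for wi in w]

def j_rl(pi, ref, r, beta):
    """KL-regularized objective E_pi[r] - (1/beta) * KL(pi || ref)."""
    kl = sum(p * math.log(p / q) for p, q in zip(pi, ref) if p > 0)
    return sum(p * ri for p, ri in zip(pi, r)) - kl / beta

ref  = [0.25, 0.25, 0.25, 0.25]        # reference policy
r    = [0.0, 0.2, 0.5, 1.0]            # hypothetical rewards
beta = 4.0                             # inverse temperature

pi_rl  = gibbs_tilt(ref, r, beta)      # RL-optimal checkpoint
pi_sft = [0.25, 0.50, 0.25, 0.0]       # SFT-optimal = empirical label distribution

delta = j_rl(pi_rl, ref, r, beta) - j_rl(pi_sft, ref, r, beta)
print(f"Delta = {delta:.4f}")          # positive: the SFT-optimal point loses reward
```

Because the Gibbs-tilted policy is the unique maximizer of $J_{\mathrm{RL}}$, any distinct SFT-optimal distribution yields $\Delta > 0$, matching the theory.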
3. Empirical Evidence of RL-then-SFT Degradation
Experimental results on Qwen3-0.6B and the CoLA acceptability dataset validate the above theory (Niu et al., 12 Jan 2026). When SFT is applied after reward-maximizing RL:
- The expected reward (e.g., correct acceptability classification as measured by $r$) drops sharply at the onset of SFT.
- Continued SFT further erodes the reward, sometimes reducing performance below even that of the pre-trained base model.
- This degradation persists even when SFT loss continues to decrease on its own objective—highlighting the misalignment between the supervised and reward-driven signals.
4. Mechanistic Explanation: Probabilistic and Optimization Perspective
The analytical reason for RL-then-SFT incompatibility is the "Gibbs re-weighting" structure of the RL-optimal solution. RL tilts the reference (SFT-trained) policy by the factor $\exp\!\big(\beta\, r(x, y)\big)$, concentrating mass on high-reward outputs not necessarily favored by the SFT distribution. SFT, in turn, reallocates probability mass back toward the empirical supervised likelihood, which (except in degenerate alignment) must undo the reward-based concentration.
Thus, any SFT update that is not perfectly aligned with the RL-induced distribution inevitably transfers mass away from high-reward samples, and the expected reward decreases monotonically under standard SFT training.
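The mass-transfer argument can be visualized with a stylized sketch that interpolates in distribution space (rather than parameter space); the policies and rewards here are hypothetical:

```python
# Mixing an RL-tilted policy toward an SFT target shifts probability mass
# off high-reward outputs, so the expected reward falls monotonically.
r      = [0.0, 0.2, 0.5, 1.0]            # hypothetical rewards
pi_rl  = [0.015, 0.034, 0.113, 0.838]    # reward-concentrated (RL-like) policy
pi_sft = [0.25, 0.50, 0.25, 0.0]         # supervised empirical distribution

def expected_reward(pi):
    return sum(p * ri for p, ri in zip(pi, r))

rewards = []
for step in range(5):                    # t = 0, 0.25, ..., 1.0
    t = step / 4
    mix = [(1 - t) * a + t * b for a, b in zip(pi_rl, pi_sft)]
    rewards.append(expected_reward(mix))

print([round(v, 3) for v in rewards])    # strictly decreasing toward E_sft[r]
```

Because the expected reward is linear in the distribution, every step toward the supervised target lowers it whenever the SFT distribution achieves less reward than the RL-tilted one.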
5. Practical Implications and Recommendations
The non-decoupling result of RL-then-SFT coupling has immediate consequences for model design and post-training protocols:
- RL and SFT cannot be safely composed as plug-and-play blocks. Each must be aware of the other's objective, with proper multi-objective or alternating optimization.
- If SFT must follow RL (for integration of new demonstrations or behavior correction), it should be done with mild weighting, or with a loss composed to balance reward and supervised signals, for example $\mathcal{L}_{\mathrm{mix}}(\theta) = \mathcal{L}_{\mathrm{SFT}}(\theta) - \lambda\, \mathbb{E}_{y \sim \pi_\theta}\big[r(x, y)\big]$ with a small mixing weight $\lambda > 0$.
- Monitoring of the reward objective during SFT is critical. Any sharp drop signals undesirable overwriting of RL-acquired adaptation.
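One way to implement the "balanced loss" recommendation is a single mixed objective. The sketch below (a softmax policy over raw logits; the logits, labels, rewards, and weight $\lambda$ are all hypothetical) subtracts a weighted reward bonus from the SFT negative log-likelihood:

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def mixed_loss(logits, data, r, lam):
    """L_mix = L_SFT - lam * E_pi[r]: supervised NLL minus a reward bonus."""
    pi = softmax(logits)
    l_sft = -sum(math.log(pi[y]) for y in data) / len(data)
    e_r = sum(p * ri for p, ri in zip(pi, r))
    return l_sft - lam * e_r

logits = [0.0, 0.5, 1.0, 2.0]   # hypothetical model logits over 4 outputs
data   = [1, 1, 2]              # hypothetical supervised labels
r      = [0.0, 0.2, 0.5, 1.0]   # reward per output

print(f"lam=0.0: {mixed_loss(logits, data, r, 0.0):.4f}")  # pure SFT loss
print(f"lam=0.5: {mixed_loss(logits, data, r, 0.5):.4f}")  # reward-tempered SFT
```

Setting $\lambda = 0$ recovers plain SFT; increasing $\lambda$ penalizes updates that drain probability mass from high-reward outputs, which is exactly the failure mode the theory predicts for unweighted post-RL SFT.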
Table: Consequences of RL–then–SFT Coupling
| Order | What is Optimized | What Degrades |
|---|---|---|
| SFT → RL | RL reward (at cost of SFT) | SFT cross-entropy loss |
| RL → SFT | SFT loss (at cost of RL) | RL reward |
6. Connections to Related Hybrid and Adaptive Paradigms
Most published high-performing pipelines avoid strict RL-then-SFT or SFT-then-RL. Instead, they employ:
- Interleaved or joint training: SRFT and bilevel optimization methods (e.g., BRIDGE (Chen et al., 8 Sep 2025)) maintain the influence of both signals throughout, controlling trade-offs with explicit coefficients.
- Dynamic or curriculum-inspired routing of data or objectives (e.g., PRISM (Zhao et al., 12 Jan 2026), SASR (Chen et al., 19 May 2025)) such that each example is assigned to the regime most suited to its structure, as judged by internal model signals (e.g., gradient concentration or loss dynamics).
- Alternating multi-objective optimization to maintain Pareto efficiency, selecting schedules, or adaptive mixing based on gradient norms or entropy.
All these approaches sidestep the conflict of hard-staged RL-then-SFT, which is unresolvable in practice, instead seeking to preserve both reward and supervised alignment in a principled, coordinated fashion.
7. Summary and Perspectives
RL-then-SFT coupling is a fundamentally non-separable, antagonistic interaction in LLM post-training and agent alignment. Formally, applying SFT after RL-optimal adaptation will always incur a loss of previously attained reward unless the objectives are perfectly matched. This makes naïve RL-then-SFT sequencing unsuitable for scalable, robust agent alignment. Modern frameworks achieve superior robustness and efficiency through diagnostic routing, dynamic loss trade-offs, and joint optimization. These findings compel practitioners to design RL and SFT pipelines not as modular black boxes, but as intrinsically coupled, mutually influencing processes, best optimized by explicit multi-objective or bilevel formulations (Niu et al., 12 Jan 2026, Zhao et al., 12 Jan 2026, Chen et al., 8 Sep 2025).