RL-then-SFT Coupling Dynamics
- RL-then-SFT Coupling is a sequential fine-tuning approach that applies reinforcement learning to maximize rewards followed by supervised fine-tuning to consolidate expert behaviors.
- Its methodology reveals inherent objective misalignment: RL-induced exploration undermines SFT-imposed consistency, while SFT's cross-entropy minimization erodes the reward-maximizing behavior RL has acquired.
- Empirical studies show that SFT following RL degrades reward quality and behavioral diversity, highlighting the need for joint or interleaved optimization strategies.
Reinforcement Learning (RL)-then-Supervised Fine-Tuning (SFT) Coupling refers to the practice of applying supervised fine-tuning after an initial reinforcement learning phase in the post-training pipeline of LLMs. Unlike the conventional SFT→RL paradigm, RL-then-SFT coupling is hypothesized as a route to first acquire RL-induced behaviors (e.g., reward maximization, exploration) and then consolidate those patterns via SFT. The theoretical and empirical aspects of RL-then-SFT coupling, particularly its interplay with RL optimization regimes, objective misalignment, and its limitations, have been rigorously analyzed in recent work.
1. Mathematical Formalism and Objective Misalignment
RL-then-SFT coupling involves two distinct optimization steps:
- Stage 1: RL fine-tuning

$$\max_{\theta}\; \mathbb{E}_{y \sim \pi_\theta}\big[r(y)\big] \;-\; \beta\, \mathrm{KL}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)$$

This objective maximizes expected reward for sampled outputs and includes (optionally) a KL regularization term to prevent excessive drift from a reference policy $\pi_{\mathrm{ref}}$.
- Stage 2: SFT fine-tuning

$$\min_{\theta}\; \mathcal{L}_{\mathrm{SFT}}(\theta) \;=\; -\,\mathbb{E}_{(x,\, y^{*}) \sim \mathcal{D}}\big[\log \pi_\theta(y^{*} \mid x)\big]$$

This standard cross-entropy loss enforces imitation of expert data $\mathcal{D}$.
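As a concrete illustration, the two stage losses can be sketched for a single-step categorical policy (a minimal NumPy toy; the reward vector, expert distribution, and beta are placeholder values, not drawn from the cited work):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy setup: a policy over K discrete outputs, parameterized by logits.
K = 4
reward = np.array([1.0, 0.2, 0.1, 0.0])  # r(y): outputs RL favors
expert = np.array([0.0, 0.0, 0.7, 0.3])  # ground-truth distribution SFT imitates
logits_ref = np.zeros(K)                 # reference policy (uniform)
beta = 0.1                               # KL regularization strength

def rl_objective(logits):
    """Stage 1: expected reward minus beta * KL(pi || pi_ref); maximized by RL."""
    pi, pi_ref = softmax(logits), softmax(logits_ref)
    kl = np.sum(pi * (np.log(pi + 1e-12) - np.log(pi_ref + 1e-12)))
    return np.dot(pi, reward) - beta * kl

def sft_loss(logits):
    """Stage 2: cross-entropy against the expert distribution; minimized by SFT."""
    pi = softmax(logits)
    return -np.sum(expert * np.log(pi + 1e-12))

# The RL optimum is the Gibbs policy pi_ref * exp(r / beta); the SFT optimum
# is the expert distribution itself -- two different targets.
print(rl_objective(reward / beta), sft_loss(reward / beta))
print(rl_objective(np.log(expert + 1e-12)), sft_loss(np.log(expert + 1e-12)))
```

Because the RL optimum concentrates mass on high-reward outputs while the SFT optimum reproduces the expert distribution, no single set of logits minimizes both losses unless `reward` and `expert` agree.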
The underlying optimization surfaces associated with RL and SFT are structurally misaligned: RL updates frequently break SFT-optimized likelihoods to explore high-reward regions, while SFT enforces consistency with expert tokens, often reducing reward optimization. Recent theoretical analysis establishes that these losses cannot be simultaneously minimized unless reward and data likelihood are perfectly aligned, which is rare in practical corpora (Niu et al., 12 Jan 2026).
2. Non-Decoupling Theorem and Proof Outline
A rigorous coupling theorem is provided in “On the Non-decoupling of Supervised Fine-Tuning and Reinforcement Learning in Post-training” (Niu et al., 12 Jan 2026). The authors prove that decoupling RL and SFT is impossible: any subsequent SFT update following RL necessarily decreases the reward achieved by the RL-trained policy.
Let $\pi_{\mathrm{RL}}$ be the RL-optimized checkpoint and $\pi_{\mathrm{SFT}}$ the subsequent SFT-updated checkpoint. Starting from the RL reward-maximizing solution, any SFT update (that minimizes cross-entropy) will necessarily shift probability mass away from reward-optimal outputs, causing the expected reward to decrease:

$$\mathbb{E}_{y \sim \pi_{\mathrm{SFT}}}\big[r(y)\big] \;<\; \mathbb{E}_{y \sim \pi_{\mathrm{RL}}}\big[r(y)\big]$$
The only case in which reward is unaffected is when the SFT ground-truth distribution is perfectly aligned with the output distribution maximizing the expected reward. The proof leverages Jensen’s inequality and the convexity of the KL-divergence and log-partition functions associated with Gibbs-weighted output distributions. Empirical verification on Qwen3-0.6B demonstrates that running SFT after RL sharply decreases the RL reward on the validation set, matching the theoretical lower bound.
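The theorem's mechanism is easy to reproduce numerically on a toy categorical policy (an illustrative sketch under assumed reward and expert distributions, not the paper's experimental setup): start from the KL-regularized RL optimum, which is a Gibbs policy, and take one cross-entropy gradient step toward an expert distribution that is not reward-optimal; the expected reward drops.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

reward = np.array([1.0, 0.2, 0.1, 0.0])  # r(y)
expert = np.array([0.0, 0.0, 0.7, 0.3])  # SFT ground-truth distribution
beta = 0.1

# RL optimum under KL regularization to a uniform reference: Gibbs weights.
logits_rl = reward / beta
pi_rl = softmax(logits_rl)
reward_rl = np.dot(pi_rl, reward)

# One SFT gradient step: the cross-entropy gradient w.r.t. logits is (pi - expert).
lr = 0.5
logits_sft = logits_rl - lr * (pi_rl - expert)
pi_sft = softmax(logits_sft)
reward_sft = np.dot(pi_sft, reward)

# Probability mass shifts toward the expert outputs and away from the
# reward-optimal output, so expected reward strictly decreases.
print(f"reward before SFT step: {reward_rl:.5f}")
print(f"reward after  SFT step: {reward_sft:.5f}")
```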
3. Optimization Dynamics and Empirical Effects
The practical consequence of sequential RL→SFT is a degradation of reward-maximized behaviors. In experimental protocols, running SFT after RL results in:
- A consistent rise in SFT validation accuracy (due to enforced imitation).
- A simultaneous drop in post-SFT RL reward, often to levels below the RL-only baseline (Niu et al., 12 Jan 2026).
- A collapse of exploratory behaviors induced during RL (diverse or high-reward outputs).
- Increased cross-entropy loss when retested on the RL-optimized outputs.
This phenomenon is observed across model scales, datasets, and both token-level and trajectory-level reward schemes.
4. Mechanism: Distributional Drift and Parameter Re-alignment
At the parameter level, post-RL SFT initiates a “hard re-alignment” toward the SFT ground-truth manifold, erasing probability mass situated in reward-rich but low-likelihood regions. This is analogous to a “consolidation override”: SFT’s greedy optimization shrinks the support of the output distribution, limiting policy entropy and diversity, and undermining RL-induced adaptations (Zhao et al., 12 Jan 2026).
Experiments confirm that RL-optimized probability vectors and singular vectors of parameter matrices experience aggressive rotation during subsequent SFT, resulting in loss of OOD generalization and reward-driven exploration (Jin et al., 8 Sep 2025). The outcome is a model that displays precise imitation but reduced problem-solving novelty.
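The reported rotation of singular vectors can be quantified with principal angles between the top singular subspaces of a weight matrix before and after an update (a generic diagnostic sketch; the random matrices below are stand-ins for real checkpoints):

```python
import numpy as np

def subspace_rotation_deg(W_before, W_after, k=8):
    """Largest principal angle (degrees) between the top-k left singular
    subspaces of two weight matrices."""
    U1, _, _ = np.linalg.svd(W_before, full_matrices=False)
    U2, _, _ = np.linalg.svd(W_after, full_matrices=False)
    # Singular values of U1_k^T U2_k are the cosines of the principal angles.
    cosines = np.linalg.svd(U1[:, :k].T @ U2[:, :k], compute_uv=False)
    return float(np.degrees(np.arccos(np.clip(cosines.min(), -1.0, 1.0))))

rng = np.random.default_rng(1)
W_rl = rng.normal(size=(64, 64))                   # stand-in RL checkpoint
W_mild = W_rl + 1e-3 * rng.normal(size=(64, 64))   # small parameter update
W_sft = W_rl + 2.0 * rng.normal(size=(64, 64))     # aggressive SFT-style update

print("mild update rotation:", subspace_rotation_deg(W_rl, W_mild))
print("aggressive update rotation:", subspace_rotation_deg(W_rl, W_sft))
```

Applied to actual pre- and post-SFT checkpoints, large principal angles in attention or MLP weight matrices would indicate the aggressive rotation described above.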
5. Guidelines, Alternatives, and Joint Optimization
Given the inevitability of optimization interference in RL-then-SFT coupling, the recommended practice is to employ joint or simultaneous optimization regimes. Examples include:
- Pareto-optimal loss scheduling, e.g., $\mathcal{L} = \lambda\, \mathcal{L}_{\mathrm{SFT}} + (1 - \lambda)\, \mathcal{L}_{\mathrm{RL}}$, with adaptive balancing of $\lambda$ (Zhao et al., 12 Jan 2026, Chen et al., 19 May 2025).
- Interleaved mini-batch updates, alternating SFT and RL steps in each gradient pass.
- Gradient-concentration regime routing: route high-conflict samples to RL and diffuse samples to SFT, as proposed by PRISM (Zhao et al., 12 Jan 2026).
- Entropy-aware weighting: downweight SFT terms when policy entropy indicates high certainty (or vice versa), as in SRFT (Fu et al., 24 Jun 2025).
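The alternatives above share a common skeleton: a single loop that blends both gradients instead of sequencing the two stages. A minimal sketch on the same kind of toy categorical policy (the fixed lambda = 0.5 and the loss definitions are illustrative placeholders, not the PRISM, SRFT, or scheduling algorithms themselves):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

reward = np.array([1.0, 0.2, 0.1, 0.0])  # r(y)
expert = np.array([0.0, 0.0, 0.7, 0.3])  # SFT target distribution
beta, lr, lam = 0.1, 0.2, 0.5            # KL weight, step size, SFT/RL balance

logits = np.zeros(4)
for step in range(200):
    pi = softmax(logits)
    # RL ascent direction for E[r] - beta * KL(pi || uniform):
    # grad_j = pi_j * (f_j - E_pi[f]) with f = r - beta * log(K * pi).
    f = reward - beta * np.log(len(pi) * pi + 1e-12)
    g_rl = pi * (f - np.dot(pi, f))
    # SFT descent direction for cross-entropy against the expert distribution.
    g_sft = expert - pi
    logits += lr * (lam * g_sft + (1.0 - lam) * g_rl)

pi = softmax(logits)
print("expected reward:", np.dot(pi, reward))
print("cross-entropy  :", -np.sum(expert * np.log(pi + 1e-12)))
```

The resulting policy trades off both objectives at every step rather than letting a later SFT phase overwrite the RL solution; adaptive schemes replace the fixed lambda with conflict- or entropy-dependent weights.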
Performance tracking should monitor both cross-entropy and reward curves, stopping or annealing when substantial interference is detected.
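One simple realization of such tracking (an illustrative heuristic, not a procedure from the cited papers): flag interference when reward falls by more than a tolerance over a window while the cross-entropy loss is still improving, then stop or anneal the SFT weight.

```python
def detect_interference(reward_curve, ce_curve, window=3, tol=0.05):
    """Flag training steps where reward drops by more than `tol` over
    `window` steps while cross-entropy is still decreasing."""
    flags = []
    for t in range(window, len(reward_curve)):
        reward_drop = reward_curve[t - window] - reward_curve[t]
        ce_improving = ce_curve[t] < ce_curve[t - window]
        flags.append(reward_drop > tol and ce_improving)
    return flags

# Synthetic curves mimicking SFT-after-RL: CE falls while reward collapses.
reward_curve = [0.90, 0.88, 0.80, 0.70, 0.60]
ce_curve = [2.50, 2.00, 1.60, 1.30, 1.10]
print(detect_interference(reward_curve, ce_curve))  # [True, True]
```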
6. Implications for Model Design and Agent Alignment
RL-then-SFT coupling directly impacts agent behavior: sequential pipelines that attempt to clean up or consolidate RL-induced outputs with SFT fail to preserve the reward-maximized states and can reintroduce “imitative rigidity” or catastrophic forgetting of exploratory adaptations. This has implications for robustness, generalization, and scalable alignment in LLM agents. The theoretical impossibility of decoupling RL and SFT underscores the need for advanced dynamics-aware arbitration frameworks (e.g., PRISM) and meta-optimization protocols to maintain both behavioral flexibility and targeted imitation (Zhao et al., 12 Jan 2026).
7. Summary Table: RL-Then-SFT Coupling Key Effects
| Protocol | RL Reward After SFT | SFT Validation Accuracy | Behavioral Diversity | Risk of Forgetting |
|---|---|---|---|---|
| RL-only | High | Low/unchanged | High | Moderate |
| RL then SFT | Decreases sharply | High | Collapses | High |
| Joint SFT–RL (PRISM) | Balanced | High | Preserved | Low |
| Interleaved SFT–RL | Intermediate | Intermediate | Mixed | Moderate |
RL-then-SFT coupling cannot be leveraged for performance improvement unless reward objectives and SFT data are strictly aligned, which is unrealistic in most agent post-training scenarios (Niu et al., 12 Jan 2026). Direct consolidation via SFT after RL negates much of the beneficial adaptation accomplished during RL and should be replaced with more nuanced regime arbitration and joint fine-tuning strategies.