SFT-then-PPO Paradigm
- The SFT-then-PPO paradigm is a two-stage post-training process where supervised fine-tuning first aligns model responses before PPO restores and enhances out-of-distribution reasoning.
- It rebalances model performance by countering SFT-induced over-specialization through reward-driven adjustments that recover lost generalization capabilities.
- Practical implementations leverage efficient techniques and variants like PSFT and self-rewarding PPO, optimizing memory use and computational cost across diverse tasks.
The SFT-then-PPO paradigm refers to a two-stage pipeline for post-training LLMs and vision-LLMs (VLMs), in which standard supervised fine-tuning (SFT) on labeled data precedes on-policy reinforcement learning (RL) using Proximal Policy Optimization (PPO) or its variants. This workflow forms the empirical and methodological backbone of RLHF and related alignment regimes; it is widely used for improving alignment with human preferences, enhancing reasoning robustness, and adapting base models to new domains or user tasks.
1. The SFT-then-PPO Pipeline: Definition and Standard Workflow
The classical SFT-then-PPO pipeline consists of two sequential post-training phases:
- Supervised Fine-Tuning (SFT): The model is finetuned with cross-entropy loss on task-annotated input–output pairs $(x, y)$:

$$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\,\mathbb{E}_{(x,y)\sim\mathcal{D}}\Big[\sum_{t}\log \pi_\theta(y_t \mid x, y_{<t})\Big]$$

SFT rapidly aligns the model to the data distribution and target format.
- Proximal Policy Optimization (PPO): The SFT checkpoint is further finetuned using an external reward $r$—derived from a reward model, environment, or other feedback—via the PPO clipped surrogate loss:

$$\mathcal{L}_{\mathrm{PPO}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\big)\Big]$$

with $r_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ and $\hat{A}_t$ estimated via generalized advantage estimation (GAE).
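The two objectives can be sketched in plain Python as scalar, per-token versions (illustrative only; real implementations vectorize these over token batches in a tensor framework, and all numeric values are hypothetical):

```python
import math

def sft_loss(token_logprobs):
    """Cross-entropy SFT loss: negative mean log-likelihood of the
    reference tokens under the current policy."""
    return -sum(token_logprobs) / len(token_logprobs)

def ppo_clipped_objective(logp_new, logp_old, advantage, eps=0.2):
    """PPO clipped surrogate for a single token/action.
    ratio = pi_theta / pi_theta_old; clipping keeps the update inside a
    trust region of width eps around 1."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    # PPO takes the pessimistic (minimum) of the unclipped and clipped terms,
    # so large ratio moves gain no extra credit in either advantage direction.
    return min(ratio * advantage, clipped * advantage)
```

Note that the `min` is what makes the objective pessimistic: with a positive advantage a ratio above $1+\epsilon$ is clipped down, and with a negative advantage a ratio below $1-\epsilon$ is clipped up, discouraging destructive policy jumps.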
This pipeline is consistently observed in text-only LLMs (Huang et al., 2024, Zhang et al., 24 Oct 2025), VLMs (Zhu et al., 2024, Chen et al., 10 Apr 2025), and aligns with architectures in which PPO is interpreted as preference optimization or reward maximization under trust-region constraints. The reward may derive from explicit human preferences, automatic task metrics, or even internal reward constructions as in self-rewarding PPO (Zhang et al., 24 Oct 2025).
2. Mechanisms and Role of Each Stage
2.1 Supervised Fine-Tuning: Alignment and Format Induction
SFT performs rapid format alignment, minimizing in-distribution (ID) cross-entropy and bringing outputs into the desired structure. This stage finds an "aligned" but typically narrow optimum on the target distribution. However, continued SFT causes the phenomenon of OOD (out-of-distribution) forgetting: OOD generalization metrics peak at an early SFT checkpoint and then degrade, even while ID metrics continue to improve. This loss of OOD capability has been attributed to hard rotation of key singular-vector directions in model weight matrices, rather than shifts in spectral capacity (Jin et al., 8 Sep 2025).
2.2 PPO: Reward Maximization, OOD Restoration, and Generalization
PPO, when initialized from a well-aligned SFT checkpoint, serves a restorative role:
- It recovers OOD reasoning performance lost during prolonged SFT, rather than inventing new OOD behavior.
- Empirically, PPO rotates the singular vectors of key layers back towards their pre-SFT configuration, softening alignment and restoring diversity (Jin et al., 8 Sep 2025).
- The boundaries for effective restoration are sharp: if SFT runs too briefly or for too long, PPO cannot recover the lost OOD performance.
This challenges the notion that "SFT memorizes, RL generalizes"; the mechanisms are better described as SFT specializing and PPO re-balancing generalized capabilities.
3. Practical Implementations and Engineering Details
A standard deployment of SFT-then-PPO in large models requires careful management of memory and compute, with multiple parallel models (policy, critic, reference, reward) in PPO:
| PPO Component | Function | Training Status |
|---|---|---|
| Actor | Generative policy π_θ | Trainable |
| Critic | Value estimator V_ψ | Trainable |
| Reference | PPO KL anchor π_ref | Frozen |
| Reward | Learned reward r_φ | Frozen |
In practice, this increases memory requirements 3–4× over SFT (Santacroce et al., 2023), making innovations such as Hydra-RLHF—which unifies model heads and exploits LoRA adapters—highly impactful: it reduces PPO memory to parity with or below SFT (Santacroce et al., 2023).
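The source of the memory blow-up can be sketched with a back-of-the-envelope accounting (a simplification under stated assumptions: trainable models modeled as weights plus roughly 3× optimizer state for AdamW-style training, frozen models as weights only; activations and KV caches ignored):

```python
def relative_memory(models, optimizer_overhead=3.0):
    """Rough relative memory for a list of (name, trainable) models,
    in units of one model's weight footprint. Trainable models carry
    extra optimizer state (moments + master weights), modeled here as
    `optimizer_overhead` x the weights; frozen models cost weights only.
    Illustrative accounting, not a profiler."""
    total = 0.0
    for _name, trainable in models:
        total += 1.0 + (optimizer_overhead if trainable else 0.0)
    return total

sft_cost = relative_memory([("policy", True)])
ppo_cost = relative_memory([("actor", True), ("critic", True),
                            ("reference", False), ("reward", False)])
# Even in this crude model PPO costs several times SFT; with activations
# and generation buffers included, the practical gap reported is 3-4x.
```

Sharing a frozen backbone across the reference and reward models, as Hydra-style designs do, removes whole weight copies from this sum, which is why such unification recovers most of the gap.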
The TL;DR RLHF pipeline (Huang et al., 2024) exemplifies detailed recipe engineering for SFT→RM→PPO: frozen SFT model and reward head, AdamW optimization, reward normalization, reward/advantage whitening, strict prompt handling, and KL penalties. Batch sizes, advantage whitening, and KL coefficients are tuned to maintain stable PPO optimization and avoid reward hacking or catastrophic forgetting.
4. Variants and Extensions: Trust-Region SFT, Self-Rewarding PPO, and Preference Optimization
4.1 Proximal SFT (PSFT)
Inspired by TRPO/PPO, PSFT replaces pure cross-entropy in SFT with a PPO-style clipped surrogate over demonstration tokens:

$$\mathcal{L}_{\mathrm{PSFT}}(\theta) = -\,\mathbb{E}_{(x,y)\sim\mathcal{D}}\Big[\sum_{t}\min\big(r_t(\theta),\ \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\big)\Big]$$

with $r_t(\theta) = \pi_\theta(y_t \mid x, y_{<t})/\pi_{\theta_{\mathrm{old}}}(y_t \mid x, y_{<t})$. PSFT improves entropy stability and OOD generalization, and leaves more room for downstream RL optimization (Zhu et al., 25 Aug 2025).
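A minimal per-token sketch of the idea, assuming the simplest instantiation in which demonstration tokens are treated as having unit advantage (the published PSFT objective may differ in details):

```python
import math

def psft_token_loss(logp_new, logp_old, eps=0.2):
    """PPO-style clipped surrogate applied to one demonstration token.
    The demonstration is treated as having unit advantage, so the loss
    reduces to the (negated, for minimization) clipped likelihood ratio
    against the previous policy snapshot."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    # Pessimistic min, as in PPO: once the ratio leaves the trust region,
    # further increases stop contributing gradient signal.
    return -min(ratio, clipped)
```

Compared with plain cross-entropy, the clipping caps how far any single token's likelihood can be pushed per update, which is the mechanism behind PSFT's entropy stability.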
4.2 Self-Rewarding PPO
Self-Rewarding PPO (Zhang et al., 24 Oct 2025) aligns the model using demonstration data alone: the reward is the log-policy ratio $r(x, y) = \log\big(\pi_\theta(y \mid x)/\pi_{\mathrm{base}}(y \mid x)\big)$, using the frozen base model as a baseline. PPO maximizes this coherent reward, achieving gains in generalization and data efficiency without preference labels. This approach is particularly effective in low-data regimes and settings where reward annotation is infeasible.
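Under the assumption that sequence log-probabilities factorize over tokens, the self-reward is just a sum of per-token log-ratios between the current policy and the frozen base model (a sketch; the function name and interface are illustrative, not from the paper):

```python
def self_reward(policy_token_logps, base_token_logps):
    """Sequence-level self-reward: sum of per-token log-ratios between
    the current policy and the frozen base model. Positive when the
    fine-tuned policy assigns the demonstration higher likelihood than
    the base model did."""
    assert len(policy_token_logps) == len(base_token_logps)
    return sum(p - b for p, b in zip(policy_token_logps, base_token_logps))
```

Because the base model acts as a fixed baseline, the reward is zero at initialization and grows only where fine-tuning genuinely shifts probability mass toward the demonstrations, which is what makes it a coherent PPO reward without any learned reward model.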
4.3 Preference-Based Optimization: DPO/ThinkPO
When human preference data is available, direct preference optimization (e.g., DPO) replaces on-policy PPO with an offline ranking loss over preferred ($y_w$) and rejected ($y_l$) outputs post-SFT, with an implicit KL anchor to the reference policy:

$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x, y_w, y_l)}\Big[\log \sigma\Big(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)\Big]$$

with $\sigma$ the logistic function and $\beta$ controlling the strength of the implicit KL regularization toward $\pi_{\mathrm{ref}}$ (Yang et al., 17 Feb 2025). DPO is more stable and compute-efficient due to its offline nature and has become a prominent variant in the SFT-then-RLHF family.
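The DPO loss for a single preference pair can be written directly from its definition (a scalar sketch; production implementations batch this over pairs and compute the log-probs with the policy and a frozen reference model):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (chosen, rejected) pair:
    -log sigmoid(beta * margin), where the margin is the difference of
    the implicit rewards, each an (policy - reference) log-prob gap."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    # Numerically this is the binary cross-entropy of a logistic
    # classifier that must rank the chosen output above the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

At initialization the policy equals the reference, the margin is zero, and the loss is $\log 2$; it decreases monotonically as the policy widens the chosen-over-rejected gap relative to the reference.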
5. Limitations, Failure Modes, and Design Considerations
5.1 Cold Start and Data Regimes
Empirical evidence demonstrates that eliminating SFT and applying PPO or DPO directly to a base model leads to poor performance—this is the "cold start" problem (Raghavendra et al., 16 Feb 2025). Allocating even a small share of the annotation budget to SFT can yield marked accuracy improvements on tasks like GSM8K. In low-budget regimes, SFT alone dominates; as the budget increases, shifting more annotation toward preference/RL data optimizes performance.
5.2 Overfitting, "Pseudo-Reasoning," and OOD Collapse
In VL-LMs (Chen et al., 10 Apr 2025), SFT can induce "pseudo-reasoning paths," prolonging and distorting reasoning without true improvements. RL-only pipelines with mixed or carefully designed rewards yield more authentic, concise, and effective reasoning traces. Over-specialization in SFT may collapse policy entropy, making PPO ineffective at restoring OOD capabilities (Jin et al., 8 Sep 2025).
5.3 Efficiency: Memory, Throughput, and Computational Cost
Standard PPO scales memory use to roughly $3×$ that of SFT, limiting batch size and practical deployment. LoRA adapters, negative-SFT (nSFT), and unified backbones (Hydra-RLHF) mitigate these issues, enabling high-throughput PPO with substantially reduced per-sample latency (Santacroce et al., 2023, Zhu et al., 2024).
6. Quantitative Impact and Empirical Results
The SFT-then-PPO paradigm consistently yields improvements over base and SFT-only models across reasoning, summarization, NLU, and multimodal tasks. Representative empirical findings include:
| Setting / Model | Task/Metric | SFT Baseline | SFT-then-PPO | Absolute Gain |
|---|---|---|---|---|
| TL;DR Summarization | GPT-3.5 Win-rate | 50% (SFT) | 65%+ (PPO) | +15% |
| NLU (LLAMA2-7B) | GLUE avg | 78.5 | 84.8 | +6.3 |
| GeneralPoints OOD | OOD accuracy | declines under prolonged SFT | restored by PPO | recoverable only if SFT stops at intermediate entropy (Jin et al., 8 Sep 2025) |
| Math-Reasoning | MATH500 | 87.4 | 91.2 (ThinkPO/DPO) | +3.8 |
| Vision-Language | MMBench (nSFT/PPO) | 64.7 (PPO) | 65.2 (nSFT) | nSFT ≥ PPO |
In multimodal RLHF, negative SFT (nSFT) matches or exceeds PPO at half the memory/time cost, with ablations showing most benefits derive from explicit negative supervision (Zhu et al., 2024).
7. Theoretical Interpretation and Future Directions
Spectral analysis reveals that SFT rotates singular vector spaces to align with training data, eroding OOD reasoning, while PPO gently restores a more robust orientation (Jin et al., 8 Sep 2025). Negative supervision, whether via DPO or synthetic negative responses, forms the mechanistic core of preference alignment (Zhu et al., 2024). Memory- and compute-efficient architectural innovations (Hydra-PPO) and trust-region regularizations (PSFT) represent current best practices for scalable deployment (Santacroce et al., 2023, Zhu et al., 25 Aug 2025).
A plausible implication is that future SFT-then-PPO paradigms will increasingly leverage parameter-efficient fine-tuning, theoretically informed regularizations, and data-centric augmentations (e.g., negative sampling, synthetic error mining) to maximize both efficiency and generalization. As understanding grows, regimes such as direct RL-only post-pretraining in VLMs (Chen et al., 10 Apr 2025), or model-based self-rewarding PPO (Zhang et al., 24 Oct 2025), may supplant the strict two-stage SFT-then-PPO workflow. Nonetheless, in most settings, careful SFT initialization followed by constrained PPO remains foundational for robust, aligned model behavior.