SFT-then-PPO Paradigm
- The SFT-then-PPO paradigm is a two-stage post-training process where supervised fine-tuning first aligns model responses before PPO restores and enhances out-of-distribution reasoning.
- It rebalances model performance by countering SFT-induced over-specialization through reward-driven adjustments that recover lost generalization capabilities.
- Practical implementations leverage efficient techniques and variants like PSFT and self-rewarding PPO, optimizing memory use and computational cost across diverse tasks.
The SFT-then-PPO paradigm refers to a two-stage pipeline for post-training LLMs and vision-LLMs (VLMs), in which standard supervised fine-tuning (SFT) on labeled data precedes on-policy reinforcement learning (RL) using Proximal Policy Optimization (PPO) or its variants. This workflow forms the empirical and methodological backbone of RLHF and related alignment regimes; it is widely used for improving alignment with human preferences, enhancing reasoning robustness, and adapting base models to new domains or user tasks.
1. The SFT-then-PPO Pipeline: Definition and Standard Workflow
The classical SFT-then-PPO pipeline consists of two sequential post-training phases:
- Supervised Fine-Tuning (SFT): The model is finetuned with cross-entropy loss on task-annotated input–output pairs $(x, y)$:

$$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\,\mathbb{E}_{(x,y)\sim\mathcal{D}}\Big[\sum_{t}\log \pi_\theta(y_t \mid x, y_{<t})\Big]$$

SFT rapidly aligns the model to the data distribution and target format.
- Proximal Policy Optimization (PPO): The SFT checkpoint is further finetuned using an external reward $r$—derived from a reward model, environment, or other feedback—via the PPO clipped surrogate loss:

$$\mathcal{L}_{\mathrm{PPO}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\big)\Big]$$

with $r_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ and $\hat{A}_t$ estimated via generalized advantage estimation (GAE).
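The two objectives can be sketched in plain Python as scalar, per-token versions (illustrative only; real implementations vectorize these over token batches in a tensor framework, and all numeric values are hypothetical):

```python
import math

def sft_loss(token_logprobs):
    """Cross-entropy SFT loss: negative mean log-likelihood of the
    reference tokens under the current policy."""
    return -sum(token_logprobs) / len(token_logprobs)

def ppo_clipped_objective(logp_new, logp_old, advantage, eps=0.2):
    """PPO clipped surrogate for a single token/action.
    ratio = pi_theta / pi_theta_old; clipping keeps the update inside a
    trust region of width eps around 1."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    # PPO takes the pessimistic (minimum) of the unclipped and clipped terms,
    # so large ratio moves gain no extra credit in either advantage direction.
    return min(ratio * advantage, clipped * advantage)
```

Note that the `min` is what makes the objective pessimistic: with a positive advantage a ratio above $1+\epsilon$ is clipped down, and with a negative advantage a ratio below $1-\epsilon$ is clipped up, discouraging destructive policy jumps.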
This pipeline is consistently observed in text-only LLMs (Huang et al., 2024, Zhang et al., 24 Oct 2025), VLMs (Zhu et al., 2024, Chen et al., 10 Apr 2025), and aligns with architectures in which PPO is interpreted as preference optimization or reward maximization under trust-region constraints. The reward may derive from explicit human preferences, automatic task metrics, or even internal reward constructions as in self-rewarding PPO (Zhang et al., 24 Oct 2025).
2. Mechanisms and Role of Each Stage
2.1 Supervised Fine-Tuning: Alignment and Format Induction
SFT performs rapid format alignment, minimizing in-distribution (ID) cross-entropy and bringing outputs into the desired structure. This stage finds an "aligned" but typically narrow optimum on the target distribution. However, continued SFT causes the phenomenon of OOD (out-of-distribution) forgetting: OOD generalization metrics peak at an early SFT checkpoint and then degrade, even while ID metrics continue to improve. This loss of OOD capability has been attributed to hard rotation of key singular-vector directions in model weight matrices, rather than shifts in spectral capacity (Jin et al., 8 Sep 2025).
2.2 PPO: Reward Maximization, OOD Restoration, and Generalization
PPO, when initialized from a well-aligned SFT checkpoint, serves a restorative role:
- It recovers OOD reasoning performance lost during prolonged SFT, rather than inventing new OOD behavior.
- Empirically, PPO rotates the singular vectors of key layers back towards their pre-SFT configuration, softening alignment and restoring diversity (Jin et al., 8 Sep 2025).
- The boundaries for effective restoration are sharp: if SFT runs too briefly or for too long, PPO cannot recover the lost OOD performance.
This challenges the notion that "SFT memorizes, RL generalizes"; the mechanisms are better described as SFT specializing and PPO re-balancing generalized capabilities.
3. Practical Implementations and Engineering Details
A standard deployment of SFT-then-PPO in large models requires careful management of memory and compute, with multiple parallel models (policy, critic, reference, reward) in PPO:
| PPO Component | Function | Training Status |
|---|---|---|
| Actor | Generative policy π_θ | Trainable |
| Critic | Value estimator V_ψ | Trainable |
| Reference | PPO KL anchor π_ref | Frozen |
| Reward | Learned reward r_φ | Frozen |
In practice, this increases memory requirements 3–4× over SFT (Santacroce et al., 2023), making innovations such as Hydra-RLHF—which unifies model heads and exploits LoRA adapters—highly impactful: it reduces PPO memory to parity with or below SFT (Santacroce et al., 2023).
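The source of the memory blow-up can be sketched with a back-of-the-envelope accounting (a simplification under stated assumptions: trainable models modeled as weights plus roughly 3× optimizer state for AdamW-style training, frozen models as weights only; activations and KV caches ignored):

```python
def relative_memory(models, optimizer_overhead=3.0):
    """Rough relative memory for a list of (name, trainable) models,
    in units of one model's weight footprint. Trainable models carry
    extra optimizer state (moments + master weights), modeled here as
    `optimizer_overhead` x the weights; frozen models cost weights only.
    Illustrative accounting, not a profiler."""
    total = 0.0
    for _name, trainable in models:
        total += 1.0 + (optimizer_overhead if trainable else 0.0)
    return total

sft_cost = relative_memory([("policy", True)])
ppo_cost = relative_memory([("actor", True), ("critic", True),
                            ("reference", False), ("reward", False)])
# Even in this crude model PPO costs several times SFT; with activations
# and generation buffers included, the practical gap reported is 3-4x.
```

Sharing a frozen backbone across the reference and reward models, as Hydra-style designs do, removes whole weight copies from this sum, which is why such unification recovers most of the gap.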
The TL;DR RLHF pipeline (Huang et al., 2024) exemplifies detailed recipe engineering for SFT→RM→PPO: frozen SFT model and reward head, AdamW optimization, reward normalization, reward/advantage whitening, strict prompt handling, and KL penalties. Batch sizes, advantage whitening, and KL coefficients are tuned to maintain stable PPO optimization and avoid reward hacking or catastrophic forgetting.
4. Variants and Extensions: Trust-Region SFT, Self-Rewarding PPO, and Preference Optimization
4.1 Proximal SFT (PSFT)
Inspired by TRPO/PPO, PSFT replaces pure cross-entropy in SFT with a PPO-style clipped surrogate over demonstration tokens:

$$\mathcal{L}_{\mathrm{PSFT}}(\theta) = -\,\mathbb{E}_{(x,y)\sim\mathcal{D}}\Big[\sum_{t}\min\big(r_t(\theta),\ \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\big)\Big]$$

with $r_t(\theta) = \pi_\theta(y_t \mid x, y_{<t})/\pi_{\theta_{\mathrm{old}}}(y_t \mid x, y_{<t})$. PSFT improves entropy stability and OOD generalization, and leaves more room for downstream RL optimization (Zhu et al., 25 Aug 2025).
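A minimal per-token sketch of the idea, assuming the simplest instantiation in which demonstration tokens are treated as having unit advantage (the published PSFT objective may differ in details):

```python
import math

def psft_token_loss(logp_new, logp_old, eps=0.2):
    """PPO-style clipped surrogate applied to one demonstration token.
    The demonstration is treated as having unit advantage, so the loss
    reduces to the (negated, for minimization) clipped likelihood ratio
    against the previous policy snapshot."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    # Pessimistic min, as in PPO: once the ratio leaves the trust region,
    # further increases stop contributing gradient signal.
    return -min(ratio, clipped)
```

Compared with plain cross-entropy, the clipping caps how far any single token's likelihood can be pushed per update, which is the mechanism behind PSFT's entropy stability.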
4.2 Self-Rewarding PPO
Self-Rewarding PPO (Zhang et al., 24 Oct 2025) aligns the model using demonstration data alone: the reward is the log-policy ratio $r(x, y) = \log\big(\pi_\theta(y \mid x)/\pi_{\mathrm{base}}(y \mid x)\big)$, using the frozen base model as a baseline. PPO maximizes this coherent reward, achieving gains in generalization and data efficiency without preference labels. This approach is particularly effective in low-data regimes and settings where reward annotation is infeasible.
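Under the assumption that sequence log-probabilities factorize over tokens, the self-reward is just a sum of per-token log-ratios between the current policy and the frozen base model (a sketch; the function name and interface are illustrative, not from the paper):

```python
def self_reward(policy_token_logps, base_token_logps):
    """Sequence-level self-reward: sum of per-token log-ratios between
    the current policy and the frozen base model. Positive when the
    fine-tuned policy assigns the demonstration higher likelihood than
    the base model did."""
    assert len(policy_token_logps) == len(base_token_logps)
    return sum(p - b for p, b in zip(policy_token_logps, base_token_logps))
```

Because the base model acts as a fixed baseline, the reward is zero at initialization and grows only where fine-tuning genuinely shifts probability mass toward the demonstrations, which is what makes it a coherent PPO reward without any learned reward model.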
4.3 Preference-Based Optimization: DPO/ThinkPO
When human preference data is available, direct preference optimization (e.g., DPO) replaces on-policy PPO with an offline ranking loss over preferred ($y_w$) and rejected ($y_l$) outputs post-SFT, with an implicit KL anchor to the reference policy:

$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x, y_w, y_l)}\Big[\log \sigma\Big(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)\Big]$$

with $\sigma$ the logistic function and $\beta$ controlling the strength of the implicit KL regularization toward $\pi_{\mathrm{ref}}$ (Yang et al., 17 Feb 2025). DPO is more stable and compute-efficient due to its offline nature and has become a prominent variant in the SFT-then-RLHF family.
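The DPO loss for a single preference pair can be written directly from its definition (a scalar sketch; production implementations batch this over pairs and compute the log-probs with the policy and a frozen reference model):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (chosen, rejected) pair:
    -log sigmoid(beta * margin), where the margin is the difference of
    the implicit rewards, each an (policy - reference) log-prob gap."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    # Numerically this is the binary cross-entropy of a logistic
    # classifier that must rank the chosen output above the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

At initialization the policy equals the reference, the margin is zero, and the loss is $\log 2$; it decreases monotonically as the policy widens the chosen-over-rejected gap relative to the reference.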
5. Limitations, Failure Modes, and Design Considerations
5.1 Cold Start and Data Regimes
Empirical evidence demonstrates that eliminating SFT and applying PPO or DPO directly to a base model leads to poor performance—this is the "cold start" problem (Raghavendra et al., 16 Feb 2025). Allocating even a small share of the annotation budget to SFT can yield marked accuracy improvements on tasks like GSM8K. In low-budget regimes, SFT alone dominates; as the budget increases, shifting more annotation toward preference/RL data optimizes performance.
5.2 Overfitting, "Pseudo-Reasoning," and OOD Collapse
In VL-LMs (Chen et al., 10 Apr 2025), SFT can induce "pseudo-reasoning paths," prolonging and distorting reasoning without true improvements. RL-only pipelines with mixed or carefully designed rewards yield more authentic, concise, and effective reasoning traces. Over-specialization in SFT may collapse policy entropy, making PPO ineffective at restoring OOD capabilities (Jin et al., 8 Sep 2025).
5.3 Efficiency: Memory, Throughput, and Computational Cost
Standard PPO scales memory use to roughly $3×$ that of SFT, limiting batch size and practical deployment. LoRA adapters, negative-SFT (nSFT), and unified backbones (Hydra-RLHF) mitigate these issues, enabling high-throughput PPO with substantially reduced per-sample latency (Santacroce et al., 2023, Zhu et al., 2024).
6. Quantitative Impact and Empirical Results
The SFT-then-PPO paradigm consistently yields improvements over base and SFT-only models across reasoning, summarization, NLU, and multimodal tasks. Representative empirical findings include:
| Setting / Model | Task/Metric | SFT Baseline | SFT-then-PPO | Absolute Gain |
|---|---|---|---|---|
| TL;DR Summarization | GPT-3.5 Win-rate | 50% (SFT) | 65%+ (PPO) | +15% |
| NLU (LLAMA2-7B) | GLUE avg | 78.5 | 84.8 | +6.3 |
| GeneralPoints OOD | OOD accuracy | declines under prolonged SFT | restored by PPO | recoverable only if SFT stops at intermediate entropy (Jin et al., 8 Sep 2025) |
| Math-Reasoning | MATH500 | 87.4 | 91.2 (ThinkPO/DPO) | +3.8 |
| Vision-Language | MMBench (nSFT/PPO) | 64.7 (PPO) | 65.2 (nSFT) | nSFT ≥ PPO |
In multimodal RLHF, negative SFT (nSFT) matches or exceeds PPO at half the memory/time cost, with ablations showing most benefits derive from explicit negative supervision (Zhu et al., 2024).
7. Theoretical Interpretation and Future Directions
Spectral analysis reveals that SFT rotates singular vector spaces to align with training data, eroding OOD reasoning, while PPO gently restores a more robust orientation (Jin et al., 8 Sep 2025). Negative supervision, whether via DPO or synthetic negative responses, forms the mechanistic core of preference alignment (Zhu et al., 2024). Memory- and compute-efficient architectural innovations (Hydra-PPO) and trust-region regularizations (PSFT) represent current best practices for scalable deployment (Santacroce et al., 2023, Zhu et al., 25 Aug 2025).
A plausible implication is that future SFT-then-PPO paradigms will increasingly leverage parameter-efficient fine-tuning, theoretically informed regularizations, and data-centric augmentations (e.g., negative sampling, synthetic error mining) to maximize both efficiency and generalization. As understanding grows, regimes such as direct RL-only post-pretraining in VLMs (Chen et al., 10 Apr 2025), or model-based self-rewarding PPO (Zhang et al., 24 Oct 2025), may supplant the strict two-stage SFT-then-PPO workflow. Nonetheless, in most settings, careful SFT initialization followed by constrained PPO remains foundational for robust, aligned model behavior.