SFT Memorizes, RL Generalizes

Updated 7 January 2026
  • The paper shows that SFT focuses on reproducing chain-of-thought examples, leading to memorization and brittle out-of-distribution performance compared to RL.
  • Quantitative analyses reveal that RL’s reward-driven policy optimization achieves higher safety scores and maintains reasoning depth across various tasks and architectures.
  • Empirical metrics like token entropy and cross-task performance confirm that RL offers robust generalization and better adaptability than SFT’s rigid data-fitting approach.

Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) constitute the two central paradigms in post-training large reasoning models (LRMs), especially for alignment and reasoning robustness. The dichotomy "SFT Memorizes, RL Generalizes" refers to observed differences in the optimization objectives and the resulting behavioral profiles and generalization properties: SFT drives LRMs to reproduce example trajectories, often resulting in memorization and brittle out-of-distribution (OOD) performance; RL, through reward-driven policy optimization, induces more flexible, adaptive reasoning strategies with improved transfer across model families, tasks, and risk categories. This article synthesizes key technical definitions, quantitative metrics, empirical findings, and mechanistic analyses from recent research including "Beyond SFT: Reinforcement Learning for Safer Large Reasoning Models with Better Reasoning Ability" (Jia et al., 1 Dec 2025), presenting a comprehensive account of this foundational distinction.

1. Formal Objectives: SFT as Memorization, RL as Generalization

Supervised Fine-Tuning (SFT) operates by maximizing the log-likelihood of annotated chain-of-thought (CoT) traces over a dataset $\mathcal{D}$, with the canonical loss defined as

$$\mathcal{L}_{\text{SFT}}(\theta) = -\,\mathbb{E}_{(x, t, y) \sim \mathcal{D}} \left[ \log P_\theta(t, y \mid x) \right]$$

where $x$ is the prompt, $t$ the safe reasoning trace, and $y$ the safe answer. This objective implicitly enforces exact reproduction of the sampled CoT exemplars, promoting memorization of the specific trajectory structure present in $\mathcal{D}$.
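Concretely, the SFT objective reduces to an average negative log-likelihood over the tokens of the annotated trace. A minimal pure-Python sketch (the per-token log-probabilities below are hypothetical stand-ins for a model's outputs, not values from the paper):

```python
import math

def sft_loss(token_log_probs):
    """Mean negative log-likelihood of the annotated trace (t, y),
    i.e. the cross-entropy that the SFT objective minimizes."""
    return -sum(token_log_probs) / len(token_log_probs)

# Hypothetical log-probs P_theta assigns to each token of one
# safe chain-of-thought exemplar drawn from the dataset D.
trace_log_probs = [math.log(0.9), math.log(0.7), math.log(0.95)]
loss = sft_loss(trace_log_probs)  # lower loss = closer reproduction
```

Driving this loss toward zero is exactly what pushes the model to reproduce the sampled trajectories token for token.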

In contrast, RL treats the model as a policy $\pi_\theta$ over output trajectories and directly optimizes the expected reward assigned to each trajectory:

$$J(\theta) = \mathbb{E}_{x \sim p_{\text{prompt}},\; (t, y) \sim \pi_\theta(\cdot \mid x)} \left[ R(x, t + y) \right]$$

with $R$ denoting the safety reward; in practice, scoring is performed via a pretrained reward model (e.g. Skywork-Reward-V2). RL uses policy-gradient approaches (e.g. Reinforce++, PPO) with auxiliary KL penalties and clipping for stable updates. The critical distinction is that RL updates are shaped directly by scalar feedback on full trajectories, not merely by reproduction fidelity, which enables discovery of previously unseen safe reasoning modes and broader adaptation (Jia et al., 1 Dec 2025).
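A toy sketch of the resulting per-trajectory update weight, assuming a REINFORCE-style estimator with a per-sample KL approximation $\log \pi_\theta - \log \pi_{\text{ref}}$ against a frozen reference policy (the reward, log-probabilities, and `kl_coef` are illustrative values, not the paper's settings):

```python
def policy_gradient_weight(reward, logp_theta, logp_ref, kl_coef=0.1):
    """Per-trajectory weight for a REINFORCE-style update: the scalar
    reward minus a KL penalty keeping pi_theta near the reference
    policy (approximated per sample as log pi_theta - log pi_ref)."""
    kl_estimate = logp_theta - logp_ref
    return reward - kl_coef * kl_estimate

# One sampled trajectory (t, y) scored by the reward model.
w = policy_gradient_weight(reward=0.8, logp_theta=-12.0, logp_ref=-12.5)
# The gradient step would scale grad(log pi_theta(t, y | x)) by w,
# so high-reward trajectories are reinforced regardless of whether
# they match any annotated exemplar.
```

The key design point is that the weight depends only on the scalar reward (plus a stability penalty), not on token-level agreement with a reference trace.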

2. Quantitative Measures: Assessing Memorization and Generalization

The memorization tendencies of SFT and generalization achieved by RL are quantified with several metrics:

  • Min-K% Probability (Shi et al. 2023): Calculates the likelihood assigned by the model to the lowest-K% probability tokens in the training data; high values in SFT indicate strong memorization of rare tokens.
  • Reflection Token Entropy: At reflection words (e.g. "wait", "hmm", "but"), token-level entropy $H_t = -\sum_j p_{t,j} \log p_{t,j}$ measures the degree of exploratory behavior; SFT sharply reduces entropy ("over-regularizes"), while RL applies adaptive suppression (unsafe contexts) or preservation (reasoning contexts).
  • Cross-model and Cross-task Generalization: Safety gains are measured by transferring SFT or RL-tuned policies across architectures and benchmark tasks—SFT "fits" only the dataset used, RL exhibits robust transfer.
  • Reasoning and Safety Benchmarks: Performance on AttaQ (adversarial harm detection), AIR-Bench (regulatory refusal rates), GPQA-Diamond, MATH500, and AIME24/25 (graduate-level and competition mathematics) enables fine-grained comparison. RL consistently achieves equal or higher safety gains while preserving or improving reasoning competence.
| Model | AttaQ | AIR-Bench | GPQA-Diamond | MATH500 | AIME24 | AIME25 |
|---|---|---|---|---|---|---|
| Base | 0.37 | 0.26 | 49.24 | 92.00 | 46.30 | 30.52 |
| SFT (STAR-1) | 0.76 | 0.59 | 47.54 | 91.80 | 46.88 | 31.87 |
| RL (Reinforce++) | 0.78 | 0.66 | 49.68 | 92.30 | 49.53 | 32.14 |

SFT boosts safety metrics but often degrades reasoning scores and transferability; RL delivers consistent improvements in both domains.
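The Min-K% probability metric from the list above can be sketched directly from its definition in Shi et al. (2023): average the log-likelihood of the k fraction of tokens to which the model assigns the lowest probability. The input values here are hypothetical:

```python
def min_k_percent(token_log_probs, k=0.2):
    """Min-K% prob (Shi et al., 2023): mean log-likelihood of the
    k fraction of lowest-probability tokens in a sequence; higher
    values suggest the sequence was memorized during training."""
    n = max(1, int(len(token_log_probs) * k))
    lowest = sorted(token_log_probs)[:n]
    return sum(lowest) / n

# Hypothetical per-token log-probs for one training sequence.
score = min_k_percent([-0.1, -2.3, -0.5, -4.0, -0.2], k=0.4)
```

The intuition: a model that has memorized a sequence assigns unusually high probability even to its rarest tokens, so the tail average rises.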

3. Token-Level Dynamics: Entropy and Reflection Depth

Detailed analyses of token-level entropy and reflective depth illuminate the behavioral divergence:

  • On unsafe prompts (AttaQ), average entropy at reflection tokens falls from 0.24 (base) to 0.12 (SFT) to 0.09 (RL). RL achieves the lowest entropy, aggressively suppressing unsafe exploration.
  • On reasoning prompts (AIME24), entropy drops from 3.12 (base) to 2.73 (SFT) but remains near-baseline (3.00) under RL, indicating maintenance of reflective capacity.
  • SFT uniformly compresses entropy on both domains, risking under-exploration and reduced problem-solving ability; RL applies context-sensitive entropy modulation.

Reflection depth analysis demonstrates that RL models selectively truncate reasoning on unsafe inputs while sustaining or modestly extending depth on valid problems—a discriminative responsiveness absent in SFT.
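The entropy statistic behind these comparisons follows directly from the definition $H_t = -\sum_j p_{t,j} \log p_{t,j}$. A minimal sketch, where the reflection-word set and toy word-level tokenization are illustrative assumptions rather than the paper's exact procedure:

```python
import math

def token_entropy(probs):
    """Shannon entropy H_t = -sum_j p_{t,j} log p_{t,j} of the
    next-token distribution at one position (natural log, nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def reflection_entropy(distributions, tokens,
                       reflection=frozenset({"wait", "hmm", "but"})):
    """Mean entropy over positions whose emitted token is a
    reflection word; returns 0.0 if no such position exists."""
    ents = [token_entropy(d) for d, tok in zip(distributions, tokens)
            if tok in reflection]
    return sum(ents) / len(ents) if ents else 0.0
```

For example, a uniform distribution over 4 candidate tokens yields $H_t = \ln 4 \approx 1.386$, while a fully collapsed (one-hot) distribution yields 0, matching the entropy-compression pattern attributed to SFT.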

4. Generalization Analysis: Mechanism and Transfer

RL's generalization arises from the reward-shaped exploration of policy space, as opposed to SFT's example-matching principle:

  • Broader Safety Coverage: RL gains extend across regulatory categories (AIR-Bench), whereas SFT exploits surface regularities.
  • Cross-architecture Adaptability: RL reward feedback is model-agnostic, enabling transfer to base policies outside the source distribution; SFT-tuned policies fail under architecture shift.
  • Skill Preservation: RL stabilizes reasoning skills, counteracting catastrophic forgetting induced by SFT's rigid cross-entropy updates.

The policy optimization process in RL continuously adapts to the reward function, whereas SFT's updates risk erasure of previously acquired skills in the pursuit of label conformity.

5. Limitations and Practical Implications

Although the distinction is robust, there are nuances:

  • SFT's scope is limited: Memorization provides fast in-distribution gains but poor robustness, with evidence of over-regularization especially in reasoning-rich tasks.
  • RL requires stabilized starting points: RL post-training is most effective when commenced on a robust SFT baseline; extreme overfitting or underfitting can undermine subsequent RL generalization.
  • Computational trade-offs: RL optimization is more resource-intensive and may require reward model engineering. However, the cross-task and cross-architecture safety gains are substantial and stable.

Practitioners are advised to employ SFT as an alignment warm-up but to rely on RL optimization for robust safety and reasoning generalization, particularly in multi-family or deployment-critical settings.

6. Research Outlook and Synthesis

The empirical evidence and mechanistic insights indicate that SFT memorizes its provided CoT traces—achieving high in-distribution scores, increased likelihood on training tokens, and collapsed exploration on hard reasoning tasks. RL generalizes by dynamically optimizing for high-reward trajectories, suppressing risky branches and sustaining healthy exploratory reasoning where necessary. This leads to marked cross-model and cross-task generalization of both safety and reasoning competence (Jia et al., 1 Dec 2025). Further integration of SFT and RL, potentially via adaptive curriculum schedules, meta-learning, or hybrid losses, remains a challenge and opportunity for future LRM alignment research.

The phrase "SFT Memorizes, RL Generalizes" embodies a rigorous, model- and data-agnostic distinction in LRM alignment objectives, substantiated by both quantitative safety/reasoning benchmarks and theoretical analyses of entropy, depth, and transfer. As the post-training landscape evolves, RL's generalization advantage is increasingly central for robust and reliable deployment of explicit reasoning models.
