SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
Abstract: Supervised fine-tuning (SFT) and reinforcement learning (RL) are widely used post-training techniques for foundation models. However, their roles in enhancing model generalization capabilities remain unclear. This paper studies the difference between SFT and RL on generalization and memorization, focusing on text-based rule variants and visual variants. We introduce GeneralPoints, an arithmetic reasoning card game, and adopt V-IRL, a real-world navigation environment, to assess how models trained with SFT and RL generalize to unseen variants in both textual and visual domains. We show that RL, especially when trained with an outcome-based reward, generalizes across both rule-based textual and visual variants. SFT, in contrast, tends to memorize training data and struggles to generalize to out-of-distribution scenarios. Further analysis reveals that RL improves the model's underlying visual recognition capabilities, contributing to its enhanced generalization in the visual domain. Despite RL's superior generalization, we show that SFT remains essential for effective RL training; SFT stabilizes the model's output format, enabling subsequent RL to achieve its performance gains. These findings demonstrate the capability of RL for acquiring generalizable knowledge in complex, multi-modal tasks.
Top Community Prompts
Explain it Like I'm 14
SFT Memorizes, RL Generalizes — Explained Simply
What is this paper about?
This paper compares two popular ways to train big AI models after they’ve been pre-trained:
- Supervised fine-tuning (SFT): teaching by showing lots of example questions with the “right” answers.
- Reinforcement learning (RL): teaching by letting the model try, then giving it points (rewards) when it does well.
The main idea: SFT tends to “memorize” training examples, while RL helps models “generalize”—that is, do well on new situations they haven’t seen before. The authors test this for both text-only tasks and tasks that use images plus text.
What questions are the researchers asking?
In simple terms, they ask:
- Does SFT make models good only at problems that look like their training data?
- Does RL help models learn flexible rules they can use in new situations?
- Does RL also improve how well models “see” (recognize things in images)?
- Is SFT still useful when training with RL?
- Does letting the model check and fix its answers multiple times help it generalize better?
How did they test this? (Methods in everyday language)
They used two kinds of challenges:
- A math card game (like the “24 game”):
- The model gets four cards and must make a target number (usually 24) using each card once.
- Two versions: text-only (cards described in words) and vision-language (cards shown in an image).
- Rule twist: Sometimes J/Q/K count as 11/12/13; other times they all count as 10. This checks if the model can handle rule changes it hasn’t seen before.
- Visual twist: Train on black-suited cards, test on red-suited cards. This checks visual generalization.
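The card game and its rule twist are easy to make concrete. Below is a minimal brute-force sketch (not from the paper, which trains a model rather than searching) showing how the same four cards can succeed or fail depending on which face-card rule is in effect — the `can_make_target` helper and its parenthesization list are our own illustrative assumptions:

```python
from itertools import permutations, product

def can_make_target(cards, target=24, face_value=None):
    """Check whether the four cards can reach `target` using +, -, *, /
    and each card exactly once. `face_value` maps J/Q/K to numbers,
    mirroring the paper's rule twist (11/12/13 vs. all 10)."""
    face_value = face_value or {'J': 11, 'Q': 12, 'K': 13}
    nums = [face_value[c] if isinstance(c, str) else c for c in cards]
    ops = ['+', '-', '*', '/']
    for a, b, c, d in permutations(nums):
        for o1, o2, o3 in product(ops, repeat=3):
            # Enumerate the five distinct parenthesizations of four operands.
            for expr in (f"(({a}{o1}{b}){o2}{c}){o3}{d}",
                         f"({a}{o1}{b}){o2}({c}{o3}{d})",
                         f"({a}{o1}({b}{o2}{c})){o3}{d}",
                         f"{a}{o1}(({b}{o2}{c}){o3}{d})",
                         f"{a}{o1}({b}{o2}({c}{o3}{d}))"):
                try:
                    if abs(eval(expr) - target) < 1e-6:
                        return expr  # a winning expression exists
                except ZeroDivisionError:
                    pass
    return None  # no combination reaches the target
```

A model that has only memorized solutions under one face-card rule will produce wrong arithmetic when the mapping changes, which is exactly the failure mode the rule twist is designed to expose.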
- A real-world navigation task:
- The model sees street photos or descriptions and follows directions to reach a place.
- Two versions: text-only and vision-language (with real images).
- Rule twist: Train using “absolute” directions (north/east/…) and test using “relative” ones (turn left/right). New action rules = test of generalization.
- Visual twist: Train in one city (like New York), test in other cities.
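To see why the navigation rule twist is a real change and not just new vocabulary, here is a small sketch (our own illustration, not the paper's code) of how an absolute direction translates into a relative turn — the answer depends on the agent's current facing, so a model that memorized absolute commands cannot simply substitute words:

```python
# Compass directions in clockwise order.
DIRS = ["north", "east", "south", "west"]

def absolute_to_relative(facing, target):
    """Convert an absolute direction ("go east") into a relative turn,
    given which way the agent currently faces."""
    diff = (DIRS.index(target) - DIRS.index(facing)) % 4
    return {0: "go forward", 1: "turn right",
            2: "turn around", 3: "turn left"}[diff]
```

For example, "go east" means "turn right" when facing north but "turn left" when facing south — the relative rule set requires tracking state that the absolute rule set never needed.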
How they trained the models:
- They started with a large vision-LLM (Llama 3.2 Vision).
- First, they tried SFT: show the model many examples with correct answers.
- Then, they tried RL: let the model attempt an answer, have a “verifier” (like a referee) check if it’s correct, give reward points, and allow the model to try again.
- This “try → get feedback → revise” loop is called “sequential revision.” Think of it like writing a draft, getting comments, and editing your answer.
- The “outcome-based reward” means the model gets points for the end result being right, not just for sounding good.
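The training loop above can be sketched in a few lines. This is a hedged, simplified illustration of the idea, not the paper's implementation: `model` stands in for the vision-LLM, `verifier` for the referee, and the loop records (answer, reward) pairs that an RL update would consume.

```python
def outcome_reward(answer, verifier):
    """Outcome-based reward: 1 if the final answer is correct, else 0.
    Only the end result is scored, not how the answer is phrased."""
    return 1.0 if verifier(answer) else 0.0

def sequential_revision(model, prompt, verifier, max_revisions=3):
    """The try -> get feedback -> revise loop, as a sketch.
    `model` is any callable (prompt, history) -> answer string."""
    history = []
    for _ in range(max_revisions):
        answer = model(prompt, history)          # attempt (sees past tries)
        reward = outcome_reward(answer, verifier)  # referee scores the outcome
        history.append((answer, reward))
        if reward == 1.0:                        # correct answer: stop revising
            break
    return history  # (answer, reward) pairs for the RL update
```

Allowing more passes through this loop (more verification steps) is the knob the paper later reports as improving generalization.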
What did they find, and why does it matter?
Big picture: RL helps models generalize; SFT tends to memorize.
Key results:
- Rule changes (text rules):
- RL improved performance on unseen rules in both the card game and navigation tasks.
- SFT often dropped a lot when rules changed, meaning it had learned the training setup too narrowly.
- Visual changes (images, new cities, new colors):
- RL handled new visual settings much better than SFT.
- In a navigation benchmark across cities, their RL approach boosted success from about 44% to about 78%—a very large jump.
- RL improved “seeing”:
- In the card game with images, RL made the model better at recognizing what’s on the cards (a visual skill), which then led to better problem-solving.
- SFT still matters:
- SFT helps the model follow instructions and produce answers in a clean format (like giving a response in the right structure). Without this, RL struggled to get started because the model’s outputs were messy, making feedback and scoring harder.
- More chances to check and fix helps:
- Letting the model verify and revise its answers more times (more “verification steps”) led to better generalization.
Why it matters:
- If you want AI that adapts to new rules, new places, or new visuals, RL is more effective.
- If you only use SFT, your model may look good on practice problems but stumble on new, slightly different ones.
- Combining SFT (for stable, well-formatted answers) with RL (for flexible thinking) works best.
What does this mean for the future?
- For building reliable assistants, tutors, or robots that operate in the real world, training with feedback (RL) helps them handle surprises and variations.
- SFT is still useful to teach the model how to follow instructions and format answers, but RL is key for learning general skills, not just copying.
- Letting models “think, check, and fix” (more verification steps) is a powerful way to boost generalization.
- These ideas can improve AI in many areas: math reasoning, map navigation, reading diagrams, understanding photos, and more.
In short: Teaching AI with feedback (RL) helps it learn the rules of the game, not just the answers to last year’s test.