SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
Abstract: Supervised fine-tuning (SFT) and reinforcement learning (RL) are widely used post-training techniques for foundation models. However, their roles in enhancing model generalization capabilities remain unclear. This paper studies the difference between SFT and RL on generalization and memorization, focusing on text-based rule variants and visual variants. We introduce GeneralPoints, an arithmetic reasoning card game, and adopt V-IRL, a real-world navigation environment, to assess how models trained with SFT and RL generalize to unseen variants in both textual and visual domains. We show that RL, especially when trained with an outcome-based reward, generalizes across both rule-based textual and visual variants. SFT, in contrast, tends to memorize training data and struggles to generalize to out-of-distribution scenarios. Further analysis reveals that RL improves the model's underlying visual recognition capabilities, contributing to its enhanced generalization in the visual domain. Despite RL's superior generalization, we show that SFT remains essential for effective RL training; SFT stabilizes the model's output format, enabling subsequent RL to achieve its performance gains. These findings demonstrate the capability of RL for acquiring generalizable knowledge in complex, multi-modal tasks.
Top Community Prompts
Explain it Like I'm 14
SFT Memorizes, RL Generalizes — Explained Simply
What is this paper about?
This paper compares two popular ways to train big AI models after they’ve been pre-trained:
- Supervised fine-tuning (SFT): teaching by showing lots of example questions with the “right” answers.
- Reinforcement learning (RL): teaching by letting the model try, then giving it points (rewards) when it does well.
The main idea: SFT tends to “memorize” training examples, while RL helps models “generalize”—that is, do well on new situations they haven’t seen before. The authors test this for both text-only tasks and tasks that use images plus text.
What questions are the researchers asking?
In simple terms, they ask:
- Does SFT make models good only at problems that look like their training data?
- Does RL help models learn flexible rules they can use in new situations?
- Does RL also improve how well models “see” (recognize things in images)?
- Is SFT still useful when training with RL?
- Does letting the model check and fix its answers multiple times help it generalize better?
How did they test this? (Methods in everyday language)
They used two kinds of challenges:
- A math card game (like the “24 game”):
- The model gets four cards and must make a target number (usually 24) using each card once.
- Two versions: text-only (cards described in words) and vision-language (cards shown in an image).
- Rule twist: Sometimes J/Q/K count as 11/12/13; other times they all count as 10. This checks if the model can handle rule changes it hasn’t seen before.
- Visual twist: Train on black-suited cards, test on red-suited cards. This checks visual generalization.
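The card game and its rule twist are easy to make concrete. Below is a minimal brute-force sketch (not from the paper, which trains a model rather than searching) showing how the same four cards can succeed or fail depending on which face-card rule is in effect — the `can_make_target` helper and its parenthesization list are our own illustrative assumptions:

```python
from itertools import permutations, product

def can_make_target(cards, target=24, face_value=None):
    """Check whether the four cards can reach `target` using +, -, *, /
    and each card exactly once. `face_value` maps J/Q/K to numbers,
    mirroring the paper's rule twist (11/12/13 vs. all 10)."""
    face_value = face_value or {'J': 11, 'Q': 12, 'K': 13}
    nums = [face_value[c] if isinstance(c, str) else c for c in cards]
    ops = ['+', '-', '*', '/']
    for a, b, c, d in permutations(nums):
        for o1, o2, o3 in product(ops, repeat=3):
            # Enumerate the five distinct parenthesizations of four operands.
            for expr in (f"(({a}{o1}{b}){o2}{c}){o3}{d}",
                         f"({a}{o1}{b}){o2}({c}{o3}{d})",
                         f"({a}{o1}({b}{o2}{c})){o3}{d}",
                         f"{a}{o1}(({b}{o2}{c}){o3}{d})",
                         f"{a}{o1}({b}{o2}({c}{o3}{d}))"):
                try:
                    if abs(eval(expr) - target) < 1e-6:
                        return expr  # a winning expression exists
                except ZeroDivisionError:
                    pass
    return None  # no combination reaches the target
```

A model that has only memorized solutions under one face-card rule will produce wrong arithmetic when the mapping changes, which is exactly the failure mode the rule twist is designed to expose.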
- A real-world navigation task:
- The model sees street photos or descriptions and follows directions to reach a place.
- Two versions: text-only and vision-language (with real images).
- Rule twist: Train using “absolute” directions (north/east/…) and test using “relative” ones (turn left/right). New action rules = test of generalization.
- Visual twist: Train in one city (like New York), test in other cities.
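To see why the navigation rule twist is a real change and not just new vocabulary, here is a small sketch (our own illustration, not the paper's code) of how an absolute direction translates into a relative turn — the answer depends on the agent's current facing, so a model that memorized absolute commands cannot simply substitute words:

```python
# Compass directions in clockwise order.
DIRS = ["north", "east", "south", "west"]

def absolute_to_relative(facing, target):
    """Convert an absolute direction ("go east") into a relative turn,
    given which way the agent currently faces."""
    diff = (DIRS.index(target) - DIRS.index(facing)) % 4
    return {0: "go forward", 1: "turn right",
            2: "turn around", 3: "turn left"}[diff]
```

For example, "go east" means "turn right" when facing north but "turn left" when facing south — the relative rule set requires tracking state that the absolute rule set never needed.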
How they trained the models:
- They started with a large vision-LLM (Llama 3.2 Vision).
- First, they tried SFT: show the model many examples with correct answers.
- Then, they tried RL: let the model attempt an answer, have a “verifier” (like a referee) check if it’s correct, give reward points, and allow the model to try again.
- This “try → get feedback → revise” loop is called “sequential revision.” Think of it like writing a draft, getting comments, and editing your answer.
- The “outcome-based reward” means the model gets points for the end result being right, not just for sounding good.
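The training loop above can be sketched in a few lines. This is a hedged, simplified illustration of the idea, not the paper's implementation: `model` stands in for the vision-LLM, `verifier` for the referee, and the loop records (answer, reward) pairs that an RL update would consume.

```python
def outcome_reward(answer, verifier):
    """Outcome-based reward: 1 if the final answer is correct, else 0.
    Only the end result is scored, not how the answer is phrased."""
    return 1.0 if verifier(answer) else 0.0

def sequential_revision(model, prompt, verifier, max_revisions=3):
    """The try -> get feedback -> revise loop, as a sketch.
    `model` is any callable (prompt, history) -> answer string."""
    history = []
    for _ in range(max_revisions):
        answer = model(prompt, history)          # attempt (sees past tries)
        reward = outcome_reward(answer, verifier)  # referee scores the outcome
        history.append((answer, reward))
        if reward == 1.0:                        # correct answer: stop revising
            break
    return history  # (answer, reward) pairs for the RL update
```

Allowing more passes through this loop (more verification steps) is the knob the paper later reports as improving generalization.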
What did they find, and why does it matter?
Big picture: RL helps models generalize; SFT tends to memorize.
Key results:
- Rule changes (text rules):
- RL improved performance on unseen rules in both the card game and navigation tasks.
- SFT often dropped a lot when rules changed, meaning it had learned the training setup too narrowly.
- Visual changes (images, new cities, new colors):
- RL handled new visual settings much better than SFT.
- In a navigation benchmark across cities, their RL approach boosted success from about 44% to about 78%—a very large jump.
- RL improved “seeing”:
- In the card game with images, RL made the model better at recognizing what’s on the cards (a visual skill), which then led to better problem-solving.
- SFT still matters:
- SFT helps the model follow instructions and produce answers in a clean format (like giving a response in the right structure). Without this, RL struggled to get started because the model’s outputs were messy, making feedback and scoring harder.
- More chances to check and fix helps:
- Letting the model verify and revise its answers more times (more “verification steps”) led to better generalization.
Why it matters:
- If you want AI that adapts to new rules, new places, or new visuals, RL is more effective.
- If you only use SFT, your model may look good on practice problems but stumble on new, slightly different ones.
- Combining SFT (for stable, well-formatted answers) with RL (for flexible thinking) works best.
What does this mean for the future?
- For building reliable assistants, tutors, or robots that operate in the real world, training with feedback (RL) helps them handle surprises and variations.
- SFT is still useful to teach the model how to follow instructions and format answers, but RL is key for learning general skills, not just copying.
- Letting models “think, check, and fix” (more verification steps) is a powerful way to boost generalization.
- These ideas can improve AI in many areas: math reasoning, map navigation, reading diagrams, understanding photos, and more.
In short: Teaching AI with feedback (RL) helps it learn the rules of the game, not just the answers to last year’s test.