Mechanism of Memorization-Driven Generalization in Long-CoT Supervised Fine-Tuning

Establish the causal mechanism underlying the repetition advantage in supervised fine-tuning of pretrained language models on long chain-of-thought demonstrations: specifically, why driving training token accuracy to near 100% through multi-epoch repetition coincides with improved downstream generalization on reasoning benchmarks, without additional catastrophic forgetting.

Background

The paper reports a robust “repetition advantage” in long chain-of-thought supervised fine-tuning: for a fixed update budget, training for many epochs on smaller datasets consistently outperforms single-epoch training on larger datasets across models (Olmo3-7B, Qwen3-8B, Qwen3-4B) and benchmarks (AIME’24/’25, GPQA).
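The fixed-update-budget comparison can be made concrete with a small sketch: under a fixed number of optimizer updates, a smaller dataset is simply revisited for more epochs. The budget, batch size, and dataset sizes below are hypothetical illustrations, not the paper's actual configuration.

```python
def epochs_for(dataset_size: int, update_budget: int, batch_size: int) -> float:
    """Number of epochs a fixed update budget buys on a dataset of a given size."""
    updates_per_epoch = -(-dataset_size // batch_size)  # ceil division
    return update_budget / updates_per_epoch

# Hypothetical setup: 2000 optimizer updates total, batch size 32.
BUDGET, BATCH = 2000, 32
for n in (8_000, 16_000, 64_000):
    print(f"{n:>6} examples -> {epochs_for(n, BUDGET, BATCH):.1f} epochs")
# 8k examples are seen 8 times; 64k examples are seen only once,
# which is the single-epoch large-data baseline the paper compares against.
```

The repetition advantage says the 8-epoch/8k configuration tends to beat the 1-epoch/64k one at equal update cost.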

Performance gains saturate when training token accuracy approaches 100%, even as validation loss rises, and termination rates improve with repetition. Despite classical signs of overfitting (e.g., train–validation loss gaps, reduced entropy), downstream reasoning accuracy improves, and catastrophic forgetting is lower for multi-epoch repetition than for single-epoch large-data training.
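The dissociation between the two metrics above is worth spelling out: training token accuracy is a teacher-forced argmax match, while loss depends on the model's confidence, so accuracy can sit at 100% while loss (and entropy) keep moving. A minimal sketch of both metrics, with toy logits:

```python
import numpy as np

def token_accuracy_and_nll(logits: np.ndarray, targets: np.ndarray):
    """Teacher-forced token accuracy and mean negative log-likelihood.

    logits: (T, V) next-token logits; targets: (T,) gold token ids.
    """
    # Numerically stable softmax over the vocabulary axis.
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    acc = float((logits.argmax(axis=-1) == targets).mean())
    nll = float(-np.log(probs[np.arange(len(targets)), targets]).mean())
    return acc, nll

# Toy illustration: both models predict every target token correctly
# (accuracy 1.0), yet their losses differ because confidence differs.
confident = np.array([[5.0, 0.0], [0.0, 5.0]])
hesitant = np.array([[0.1, 0.0], [0.0, 0.1]])
targets = np.array([0, 1])
print(token_accuracy_and_nll(confident, targets))  # acc 1.0, low NLL
print(token_accuracy_and_nll(hesitant, targets))   # acc 1.0, higher NLL
```

This is why saturation is diagnosed on token accuracy rather than loss: once every training token is the argmax, further epochs only sharpen the distribution, which is where the rising validation loss and reduced entropy come from.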

The authors do not identify a definitive causal explanation for why full memorization of training trajectories correlates with improved generalization in reasoning tasks, and they explicitly pose this as an open problem.

References

"We argue that explaining why memorization under repetition improves generalization in reasoning SFT is an important open problem."

Data Repetition Beats Data Scaling in Long-CoT Supervised Fine-Tuning (arXiv:2602.11149, Kopiczko et al., 11 Feb 2026), Conclusion.