Data Repetition Beats Data Scaling in Long-CoT Supervised Fine-Tuning

This presentation challenges conventional wisdom in machine learning by demonstrating that repeatedly training language models on small datasets for many epochs dramatically outperforms the standard approach of training on large datasets for few epochs. Through rigorous experiments on mathematical and scientific reasoning benchmarks, the research reveals that data repetition, even to the point of complete memorization, produces superior reasoning performance while using less compute. This counterintuitive finding has immediate practical implications for how we fine-tune large language models and raises fundamental questions about overfitting, generalization, and the nature of learning in neural networks.

Script

What if everything we thought we knew about training language models was backwards? The standard approach says more unique data is always better, but this research reveals something startling: repeating a small dataset dozens of times produces dramatically better reasoning than a single pass over a much larger one.

Let's start by understanding the challenge the authors set out to address.

Machine learning orthodoxy has long held that maximizing the number of unique training examples is the path to better models. This principle has shaped how researchers approach supervised fine-tuning for complex reasoning tasks, where models learn step-by-step problem-solving from annotated demonstrations.

The authors challenged this assumption with a systematic experiment.

Here's where it gets interesting. The researchers tested two competing strategies under identical compute budgets: the conventional approach of training on thousands of unique examples once, versus the repetition approach of training on hundreds of examples 32 or 64 times. The repetition approach won decisively, delivering gains of 12 to 26 percentage points on rigorous mathematical benchmarks.

The heatmaps make this concrete. Each diagonal represents the same amount of computation, distributed differently between dataset size and number of epochs. Moving from left to right along any diagonal means fewer unique samples but more repetition, and performance consistently climbs. The gains plateau around 32 to 64 epochs, and the pattern holds across both accuracy metrics and pass rates.
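To illustrate what "the same amount of computation" means along one diagonal, here is a minimal sketch that lists dataset-size and epoch pairs sharing a fixed budget. It assumes compute is proportional to the number of sample passes (unique samples times epochs), which holds only if samples are of similar length. The function name `fixed_budget_configs` and the budget of 25,600 passes are hypothetical, chosen for illustration and not taken from the paper.

```python
def fixed_budget_configs(total_sample_passes, epoch_options):
    """Return (num_unique_samples, epochs) pairs whose product equals the budget.

    Assumption: one "sample pass" costs roughly the same for every sample,
    so samples * epochs is a proxy for training compute.
    """
    configs = []
    for epochs in epoch_options:
        # Keep only epoch counts that divide the budget evenly.
        if total_sample_passes % epochs == 0:
            configs.append((total_sample_passes // epochs, epochs))
    return configs


# Hypothetical budget for illustration only.
diagonal = fixed_budget_configs(25_600, [1, 2, 4, 8, 16, 32, 64])
for num_samples, epochs in diagonal:
    passes = num_samples * epochs
    print(f"{num_samples:>6} unique samples x {epochs:>2} epochs = {passes} passes")
```

Every row printed by this loop costs the same under the stated assumption. Only the split between unique data and repetition changes, which is the comparison each heatmap diagonal makes.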

What makes this even more puzzling is that all the classic warning signs of overfitting appear during training. Even though the models memorize the training data completely and show a clear gap between training and validation performance, their ability to solve novel reasoning problems gets better, not worse.

Following that surprising result, these plots reveal when to stop training. Token-level accuracy on the training set increases primarily with epochs, and downstream reasoning performance saturates exactly when models achieve complete memorization. This suggests a practical stopping rule: train until your model has fully memorized the demonstrations, then stop.
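The stopping rule described here can be sketched as a check on token-level training accuracy. This is an illustrative sketch, not the authors' implementation. The helper names, the `ignore_id=-100` masking convention, and the 0.999 threshold are all assumptions; the script only says to stop once the demonstrations are fully memorized.

```python
def token_accuracy(predicted_ids, target_ids, ignore_id=-100):
    """Fraction of target tokens predicted correctly across all sequences.

    Assumption: positions labeled `ignore_id` (for example prompt tokens)
    are excluded from the count.
    """
    correct = 0
    total = 0
    for pred_seq, tgt_seq in zip(predicted_ids, target_ids):
        for pred, tgt in zip(pred_seq, tgt_seq):
            if tgt == ignore_id:
                continue
            total += 1
            correct += int(pred == tgt)
    return correct / total if total else 0.0


def should_stop(train_token_acc, threshold=0.999):
    """Stop once training token accuracy reaches the (hypothetical) threshold."""
    return train_token_acc >= threshold


# Toy example: 3 of the 4 counted tokens match, so training would continue.
acc = token_accuracy([[1, 2, 3], [4, 5]], [[1, 2, 0], [-100, 5]])
print(acc, should_stop(acc))
```

In a real training loop the predicted token ids would come from the model's argmax predictions under teacher forcing on the training set, evaluated at the end of each epoch.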

Bringing this together, the implications extend far beyond efficiency gains. This work fundamentally challenges how we think about generalization in neural networks, suggesting that deep internalization of fewer high-quality examples beats shallow exposure to many. The theoretical mechanism remains an open question that could reshape our understanding of how language models learn to reason.

Data repetition reveals that in machine learning, sometimes less data, trained on for longer, is genuinely more. Visit EmergentMind.com to explore the full paper and join the conversation about what this means for the future of model training.