Achieving DeepSeek R1-32B-level performance with only 1,000 samples

Determine whether the reasoning performance reported for DeepSeek R1-32B can be matched using supervised fine-tuning on only 1,000 distilled reasoning examples, rather than training on approximately 800,000 or more samples. Concretely, assess whether a 32B-parameter model trained on a 1,000-example dataset such as s1K can reach R1-32B's benchmark results on AIME24, MATH500, and GPQA Diamond without the large-scale training data used by R1-32B.

Background

In the paper, the authors introduce s1-32B, a 32B-parameter model trained with supervised fine-tuning on a carefully curated dataset of 1,000 reasoning examples (s1K). Despite its sample efficiency, s1-32B achieves competitive results relative to several strong baselines.

The concurrently released DeepSeek R1-32B shows stronger performance but was trained on orders of magnitude more reasoning data (about 800,000 samples). The authors explicitly raise the question of whether similar performance could be obtained using only 1,000 samples, framing a sample-efficiency frontier question: how little data suffices to achieve state-of-the-art reasoning.
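To make the scale gap concrete, the two training-set sizes stated above can be compared directly. This is a minimal sketch using only the approximate counts from the text (1,000 examples for s1K; about 800,000 reasoning samples for R1-32B):

```python
# Approximate training-set sizes from the text.
S1K_SAMPLES = 1_000      # s1K: curated reasoning examples used for s1-32B
R1_SAMPLES = 800_000     # reported approximate count for DeepSeek R1-32B

# Sample-efficiency gap the question hinges on.
ratio = R1_SAMPLES / S1K_SAMPLES
print(f"R1-32B was trained on roughly {ratio:.0f}x more reasoning samples than s1-32B")
```

This is the "800×" factor quoted from the paper below; the open question is whether that factor in data translates into a corresponding gap in benchmark performance.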

References

However, it is trained on 800 × more reasoning samples. It is an open question whether one can achieve their performance with just 1,000 samples.

s1: Simple test-time scaling  (2501.19393 - Muennighoff et al., 31 Jan 2025) in Section 4 (Results), subsection "Performance", Sample-efficiency paragraph