Achieving DeepSeek R1-32B-level performance with only 1,000 samples
Determine whether the reasoning performance reported for DeepSeek R1-32B can be matched by supervised fine-tuning on only 1,000 distilled reasoning examples, rather than the roughly 800,000 samples used to train R1-32B. Concretely, assess whether a 32B-parameter model fine-tuned on a 1,000-example dataset such as s1K can reach R1-32B's reported results on AIME24, MATH500, and GPQA Diamond.
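The experiment this question implies can be sketched as a small supervised fine-tuning run. Below is a minimal, hedged sketch using the Hugging Face TRL SFTTrainer; the dataset name (simplescaling/s1K) follows the paper's release, but the field names (question, thinking_trajectories, attempt) and all hyperparameters are assumptions drawn from the public dataset card, not the authors' exact recipe.

```python
# Minimal SFT sketch: fine-tune a 32B model on the 1,000-example s1K dataset.
# Assumes the Hugging Face TRL SFTTrainer API; field names and hyperparameters
# are illustrative, not the s1 paper's exact training configuration.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("simplescaling/s1K", split="train")  # 1,000 examples

def to_text(example):
    # Concatenate question, reasoning trace, and final answer into one string.
    # Field names assumed from the public s1K dataset card.
    return {
        "text": (
            f"Question: {example['question']}\n"
            f"Reasoning: {example['thinking_trajectories'][0]}\n"
            f"Answer: {example['attempt']}"
        )
    }

dataset = dataset.map(to_text)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-32B-Instruct",  # base model used in the s1 paper
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="s1-32b-sft",
        num_train_epochs=5,              # a few epochs over 1,000 samples
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=1e-5,
        bf16=True,
    ),
)
trainer.train()
```

Answering the question then amounts to evaluating the resulting checkpoint on AIME24, MATH500, and GPQA Diamond (e.g., with an evaluation harness such as lm-evaluation-harness) and comparing the scores against those reported for R1-32B.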
References
However, it is trained on 800× more reasoning samples. It is an open question whether one can achieve their performance with just 1,000 samples.
— s1: Simple test-time scaling
(arXiv:2501.19393, Muennighoff et al., 31 Jan 2025), Section 4 (Results), subsection "Performance", sample-efficiency paragraph