Lower bound on sample size for effective LIMO fine-tuning

Determine the minimum number of high-quality supervised fine-tuning samples required to maintain effective mathematical reasoning performance when fine-tuning Qwen2.5-32B-Instruct using the LIMO dataset and training recipe, as assessed on benchmarks such as AIME24 and MATH500.

Background

The paper investigates sample efficiency by ranking questions in the LIMO-Pool and constructing training sets of varying sizes (400, 800, 1,200, 1,600, and 2,000 samples), each used to fine-tune Qwen2.5-32B-Instruct with an identical training recipe. Results show dramatic gains from as few as 400 samples and diminishing returns beyond 800, leading the authors to note that while small datasets can elicit strong reasoning, the exact lower bound required to preserve effective performance remains unresolved.
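The subset-construction protocol above can be sketched as follows. This is an illustrative sketch only, not the paper's implementation: the `score` field stands in for whatever quality-ranking signal the authors use over the LIMO-Pool, and the pool here is synthetic.

```python
# Hypothetical sketch: rank a candidate pool by a quality score and take
# nested top-k training sets of increasing size, as in the paper's RQ5 setup.
# `score` is an assumed stand-in for the paper's question-quality ranking.

def build_nested_subsets(pool, sizes=(400, 800, 1200, 1600, 2000)):
    """Return the top-k slice of the ranked pool for each requested size."""
    ranked = sorted(pool, key=lambda q: q["score"], reverse=True)
    return {k: ranked[:k] for k in sizes}

# Toy usage with a synthetic 2,000-question pool.
pool = [{"id": i, "score": (i * 37) % 101} for i in range(2000)]
subsets = build_nested_subsets(pool)

# Each subset has the requested size, and smaller subsets are prefixes
# of larger ones, so every run shares its highest-ranked questions.
assert [len(subsets[k]) for k in sorted(subsets)] == [400, 800, 1200, 1600, 2000]
assert subsets[400] == subsets[800][:400]
```

Because all subsets are prefixes of one ranking, any performance difference between sizes is attributable to the added (lower-ranked) samples rather than to resampling noise.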

This open question focuses on identifying the minimum dataset size that sustains the competitive pass@1 accuracy achieved by the LIMO approach on established mathematical reasoning benchmarks, thereby clarifying the limits of data-efficient supervised fine-tuning under the paper’s methodology.
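For concreteness, the pass@1 metric referenced above is, in its simplest single-sample form, just the fraction of benchmark problems the model answers correctly on its first attempt. A minimal sketch (not the paper's evaluation harness):

```python
# Minimal sketch of single-sample pass@1: one generation per problem,
# scored as the fraction answered correctly on that first attempt.

def pass_at_1(results):
    """results: list of booleans, one per benchmark problem,
    True if the model's single sampled answer is correct."""
    if not results:
        raise ValueError("results must be non-empty")
    return sum(results) / len(results)

# Toy example: 3 of 4 problems solved on the first attempt.
assert pass_at_1([True, False, True, True]) == 0.75
```

Benchmarks like AIME24 have few problems (30), so pass@1 estimates at small dataset sizes carry substantial variance, which is one reason the exact lower bound is hard to pin down.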

References

Our experiments reveal that a surprisingly small number (i.e. 800) of samples can elicit competition-level mathematical reasoning, though the lower bound for maintaining effective performance remains an open question.

LIMO: Less is More for Reasoning  (2502.03387 - Ye et al., 5 Feb 2025) in Subsubsection RQ5: Sample Efficiency (Section Experiment → Analysis)