- The paper presents a winning AIMO-2 solution that achieved 34/50 correct answers by integrating a large-scale math dataset with novel reasoning techniques.
- It details a comprehensive pipeline that combines data curation, tool-integrated reasoning using Python code, and iterative filtering to ensure high-quality solutions.
- The approach advances math reasoning models through generative solution selection and efficient inference optimizations, achieving state-of-the-art accuracy on math reasoning benchmarks.
This paper presents the winning solution for the AI Mathematical Olympiad - Progress Prize 2 (AIMO-2) competition, achieving 34/50 correct answers on the private test set (arXiv:2504.16891). The approach centers on three key innovations: creating a large-scale math reasoning dataset, developing a method for Tool-Integrated Reasoning (TIR), and implementing Generative Solution Selection (GenSelect).
1. Data Preparation: OpenMathReasoning Dataset
A large dataset, OpenMathReasoning, was created to train the models.
- Problem Collection & Refinement: Problems were sourced from the Art of Problem Solving (AoPS) forums (excluding "Middle School Math"). An LLM pipeline (using Qwen2.5-32B-Instruct) was employed for:
- Extracting problems from forum posts.
- Classifying problems (proof, multiple-choice, binary, valid/invalid). Invalid, MCQ, and binary problems were removed.
- Converting proof problems into answer-based formats.
- Extracting final answers from forum discussions for non-proof problems.
- Decontaminating the dataset by removing problems similar to existing benchmarks.
- This resulted in 540K unique problems (Table 1). The full pipeline is available in the NeMo-Skills repository.
- Validation Set (Comp-Math-24-25): A validation set of 256 problems was created from 2024/2025 AIME and HMMT competitions to minimize contamination and align with AIMO-2 style problems (Table 3).
- Solution Synthesis:
- Chain-of-Thought (CoT): 3.2 million long-reasoning CoT solutions were generated by prompting DeepSeek-R1 and QwQ-32B (up to 32 candidates per problem, temperature 0.7, top-p 0.95, max 16384 tokens). Solutions not matching the expected answer (verified by Qwen2.5-32B-Instruct) were filtered out. For problems without extracted answers, the majority answer across candidates was used as ground truth (Table 4).
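The answer-filtering fallback described above can be sketched as a simple vote over extracted answers. This is a minimal illustration, not the paper's code; the `min_votes` threshold is an assumption.

```python
from collections import Counter

def majority_answer(candidate_answers, min_votes=2):
    """Pick the most common final answer across candidate solutions.

    Used as a pseudo ground truth for problems whose answer could not be
    extracted from the forum discussion. Returns None if no answer reaches
    the (assumed) minimum vote count.
    """
    votes = Counter(a for a in candidate_answers if a is not None)
    if not votes:
        return None
    answer, count = votes.most_common(1)[0]
    return answer if count >= min_votes else None

def filter_by_answer(solutions, expected):
    """Keep only solutions whose extracted answer matches the expected one."""
    return [s for s in solutions if s["answer"] == expected]
```

With a known answer, `filter_by_answer` implements the correctness filter; without one, `majority_answer` supplies the reference answer first.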
2. Tool-Integrated Reasoning (TIR)
Integrating Python code execution with reasoning was challenging as top reasoning models (DeepSeek-R1, QwQ-32B) resisted generating TIR solutions via prompting alone.
- Initial TIR Generation: An instruction-following model (LIMO-Qwen-32B), lightly fine-tuned on reasoning data to preserve instruction following, was prompted to generate initial TIR solutions using a specific prompt (Appendix \ref{sec:tir_instruction}). 1.2M solutions were generated.
- Filtering Pipeline: Generated TIR solutions often used code trivially. A filtering pipeline was developed:
- LLM-based classification (Qwen2.5-32B-Instruct) of code blocks:
- Novelty: Does the code produce a new result or just verify prior steps? (Appendix \ref{sec:TIR_usage_classification})
- Significance: Is the code crucial or easily replaceable by CoT steps? (Appendix \ref{sec:TIR_significance_evaluation})
- Rule-based filtering: Remove incorrect answers, solutions without code, and initially, solutions with >2 code blocks.
- Tag replacement: code-fence markers were mapped to special tags, `` ```python `` -> `<tool_call>` and the closing `` ``` `` -> `</tool_call>`.
- This resulted in 15k high-quality "stage-0" TIR samples.
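The rule-based filtering and tag-replacement steps are plain text transformations; a minimal sketch follows (the regex and the exact fence handling are assumptions, not the paper's code):

```python
import re

# Matches a fenced Python code block and captures its body.
CODE_BLOCK = re.compile(r"```python\n(.*?)```", re.DOTALL)

def replace_tags(solution: str) -> str:
    """Swap markdown code fences for the <tool_call> tags used in training."""
    return CODE_BLOCK.sub(lambda m: f"<tool_call>\n{m.group(1)}</tool_call>", solution)

def keep_solution(solution: str, is_correct: bool, max_code_blocks: int = 2) -> bool:
    """Rule-based filter: correct answer, at least one code block, at most two."""
    n_blocks = len(CODE_BLOCK.findall(solution))
    return is_correct and 1 <= n_blocks <= max_code_blocks
```

The LLM-based novelty/significance checks would run before this step; only solutions passing all filters receive the tag rewrite.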
- Iterative Improvement:
- QwQ-32B was fine-tuned on the 15k stage-0 data.
- This model generated 700k TIR solutions, filtered down to 260k (removing incorrect/no-code solutions; novelty/significance filters were dropped here).
- An intermediate 14B model (fine-tuned on CoT data) was trained on these 260k solutions.
- A final round of generation and filtering with this 14B model produced the final 1.7M TIR dataset.
- Controlling Code Executions: A method was developed to control the number of code blocks allowed per generation. The prompt included a message stating remaining allowed executions (Appendix \ref{sec:tir_solution_code_execution_limit}). Generations exceeding the limit were removed during training, teaching the model to respect the constraint (Appendix \ref{lst:tir-code-limit-reached}).
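The execution-limit mechanism combines a prompt message with a training-time filter; a sketch under assumptions (the message wording is hypothetical, the paper's exact text is in its appendix):

```python
def remaining_executions_message(used: int, limit: int) -> str:
    """Hypothetical wording of the remaining-executions prompt message."""
    left = limit - used
    if left <= 0:
        return "You have exceeded the allowed number of code executions."
    return f"You have {left} code execution(s) remaining."

def within_code_limit(solution: str, limit: int) -> bool:
    """Training-time filter: drop samples exceeding the allowed code blocks."""
    return solution.count("<tool_call>") <= limit
```

Filtering out over-limit generations is what teaches the model to stop issuing code calls once the stated budget is exhausted.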
3. Generative Solution Selection (GenSelect)
To improve upon majority voting and approach pass@k performance, a model was trained to select the best solution from multiple candidates.
- Improved Summaries: Original solution summaries (e.g., from DeepSeek-R1) were often too brief. New, longer summaries (up to 2048 tokens) were generated for all solutions in the dataset using Qwen2.5-32B-Instruct, filtering for answer consistency (Appendix \ref{sec:new-summaries}, \ref{sec:summary_comparison}).
- Selection Data Generation:
- For each problem, groups of 2-16 summaries were sampled, ensuring at least one correct and one incorrect solution per group (8 groups per problem).
- QwQ-32B was prompted to reason and select the best solution from each group (Appendix \ref{sec:math_genrm_selection}).
- This generated 1M selections, filtered to 566K by keeping only instances where the correct solution was chosen.
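The group-sampling step above can be sketched as follows; the group count and size range match the paper, but the sampling details (seeding, without-replacement draw) are assumptions.

```python
import random

def sample_groups(solutions, n_groups=8, min_size=2, max_size=16, seed=0):
    """Sample candidate groups for GenSelect training.

    `solutions` is a list of (summary, is_correct) pairs. Each group must
    contain at least one correct and one incorrect solution so that the
    comparison is informative; problems lacking either kind are skipped.
    """
    rng = random.Random(seed)
    correct = [s for s in solutions if s[1]]
    incorrect = [s for s in solutions if not s[1]]
    if not correct or not incorrect:
        return []
    groups = []
    for _ in range(n_groups):
        size = rng.randint(min_size, min(max_size, len(solutions)))
        group = [rng.choice(correct), rng.choice(incorrect)]
        rest = [s for s in solutions if s not in group]
        group += rng.sample(rest, min(size - 2, len(rest)))
        rng.shuffle(group)
        groups.append(group)
    return groups
```

Selections where the model picks an incorrect solution are then discarded, which is what shrinks 1M raw selections to the 566K kept for training.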
- Efficiency Improvement: Generating full reasoning traces for selection is computationally expensive. To reduce cost, the comparison reasoning traces from QwQ-32B were themselves summarized using Qwen2.5-32B-Instruct (Appendix \ref{sec:gen_summarization}). Training on these summarized comparisons yielded only a 1-2% accuracy drop while significantly speeding up GenSelect inference.
- GenSelect inference is efficient (mostly pre-fill) and works best with <16 candidates; for more candidates, sampling subsets and using majority voting is suggested (Figure \ref{fig:genselect}).
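The suggested strategy for larger candidate pools can be sketched as subset selection plus a vote; `genselect` below is a stand-in for the model call, and the subset count is an assumption.

```python
import random
from collections import Counter

def select_with_voting(candidates, genselect, subset_size=16, n_subsets=4, seed=0):
    """Run GenSelect on random subsets and majority-vote over the winners.

    `candidates` maps a solution id to its final answer; `genselect` is a
    placeholder for the model call returning the chosen id from a subset.
    """
    rng = random.Random(seed)
    ids = list(candidates)
    picks = []
    for _ in range(n_subsets):
        subset = rng.sample(ids, min(subset_size, len(ids)))
        picks.append(candidates[genselect(subset)])
    return Counter(picks).most_common(1)[0][0]
```

Because GenSelect inference is mostly pre-fill, running it a few times on subsets is still far cheaper than generating new solutions.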
4. OpenMath-Nemotron Models
A series of models (1.5B, 7B, 14B, 32B) were trained based on Qwen2.5-Base.
- Training:
- Supervised fine-tuning (SFT) on the combined 5.5M OpenMathReasoning dataset (CoT, TIR, GenSelect samples). Each task used a unique prompt (Appendix \ref{sec:inference-prompts}) enabling mode switching at inference.
- Context length extended using RoPE base adjustment.
- Training used AdamW, cosine decay, sequence packing (NeMo-Aligner), and checkpoint averaging.
- A second SFT round was performed for 4 epochs (except for the 32B model) on a harder 2.2M-sample subset (Olympiad problems, low pass-rate problems, solutions >5k tokens), improving CoT accuracy but slightly degrading TIR (Table \ref{tab:training-accuracy}).
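The RoPE base adjustment mentioned above extends context by slowing the rotary frequencies. The paper does not give the formula, so the sketch below only illustrates the standard RoPE frequency computation and why a larger base covers longer contexts; the base values are illustrative.

```python
import math

def rope_inv_freq(head_dim: int, base: float):
    """Inverse frequencies for rotary position embeddings (standard RoPE)."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

def max_wavelength(head_dim: int, base: float) -> float:
    """Wavelength (in positions) of the slowest-rotating RoPE dimension.

    Positions beyond this wavelength alias, so extending the usable
    context requires raising the base.
    """
    return 2 * math.pi / min(rope_inv_freq(head_dim, base))
```

Raising the base from, say, 10k to 500k stretches the slowest wavelength well past a 128k-token context, which is the intuition behind base adjustment for context extension.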
- Evaluation: Models were evaluated on Comp-Math-24-25 and HLE-Math (Table \ref{tab:eval-results}). TIR generally improves majority@k but can hurt pass@1 for smaller models due to more unfinished generations (Table \ref{tab:unfinished-solutions}). GenSelect consistently improves results.
5. Kaggle Submission Implementation
The winning submission used an intermediate 14B model and several practical optimizations for the competition's constraints (4xL4 GPUs, 5-hour limit, offline notebook).
- Model: An earlier 14B model, trained differently from the released models: 8 epochs of SFT on a CoT subset (DeepSeek-R1 solutions only, no converted proofs), then 400 steps of SFT on the 15k stage-0 TIR data.
- Model Merging: A simple linear merge (CoT*0.3 + TIR*0.7) of the CoT and TIR checkpoints proved effective, improving accuracy while reducing generation length and code usage compared to the pure TIR model (Table \ref{tab:model-merging}). Mergekit was used for experiments.
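The linear merge is just a weighted average of matching parameters. A minimal sketch, with plain floats standing in for tensors (with PyTorch state dicts the arithmetic is identical, applied tensor-wise); the 0.3/0.7 weights are the paper's.

```python
def linear_merge(state_a, state_b, w_a=0.3, w_b=0.7):
    """Weighted average of two model state dicts with identical keys.

    state_a: CoT checkpoint parameters, state_b: TIR checkpoint parameters.
    """
    assert state_a.keys() == state_b.keys(), "architectures must match"
    return {k: w_a * state_a[k] + w_b * state_b[k] for k in state_a}
```

In practice this is what tools like mergekit do for the `linear` merge method, per-tensor rather than per-float.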
- Inference Optimizations:
- TensorRT-LLM: Used for engine conversion, providing in-flight batching, custom kernels, paged KV cache. FP8 quantization offered the best speed/accuracy trade-off (Table \ref{tab:quant-perf-metrics}).
- Speculative Decoding: ReDrafter was used, training a small RNN drafter on 100k generated solutions to propose up to 3 tokens/step (~65% acceptance), significantly increasing tokens/sec. A near-greedy sampling strategy (temperature set to 0) was used.
- Model Serving: A FastAPI backend using Nemo-Skills handled time-constrained generation per question, async batching with early stopping (canceling remaining requests if a consensus answer emerged), and straggler mitigation.
- Time Management: A time buffer system allocated a base time per question, adding unused time to a buffer that subsequent questions could draw from.
- Code Execution: A Nemo-Skills wrapper called a sandboxed Flask environment for parallel Python execution with limits (6 calls/gen, 2s/call, 200 char output).
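The time-buffer scheme can be sketched as a small allocator: each question gets a base budget, and unused time rolls into a shared buffer that later questions draw from. The per-question draw cap and the numbers below are assumptions, not the submission's actual constants.

```python
class TimeBudget:
    """Per-question time allocator with a shared rollover buffer."""

    def __init__(self, base_seconds: float, max_draw: float = 120.0):
        self.base = base_seconds
        self.buffer = 0.0
        self.max_draw = max_draw  # assumed cap on extra time per question

    def allot(self) -> float:
        """Budget for the next question: base time plus buffered extra."""
        extra = min(self.buffer, self.max_draw)
        self.buffer -= extra
        return self.base + extra

    def record(self, allotted: float, used: float) -> None:
        """Return any unused time to the shared buffer."""
        self.buffer += max(0.0, allotted - used)
```

Questions answered quickly (e.g. by early-stopping on a consensus answer) thus subsidize harder questions later in the 5-hour window.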
6. Conclusion and Release
The paper details the pipeline combining large-scale data generation (CoT), iterative refinement for Tool-Integrated Reasoning (TIR), and Generative Solution Selection (GenSelect) to build state-of-the-art math reasoning models. The full OpenMathReasoning dataset (540K problems, 3.2M CoT, 1.7M TIR, 566K GenSelect solutions), the OpenMath-Nemotron models, and the code (NeMo-Skills) are released under a commercially permissive license.