- The paper presents a winning AIMO-2 solution that achieved 34/50 correct answers by integrating a large-scale math dataset with novel reasoning techniques.
- It details a comprehensive pipeline that combines data curation, tool-integrated reasoning using Python code, and iterative filtering to ensure high-quality solutions.
- The approach advances math reasoning models through generative solution selection and efficient inference optimizations, achieving state-of-the-art accuracy on math reasoning benchmarks.
This paper presents the winning solution for the AI Mathematical Olympiad - Progress Prize 2 (AIMO-2) competition, achieving 34/50 correct answers on the private test set (arXiv:2504.16891). The approach centers on three key innovations: creating a large-scale math reasoning dataset, developing a method for Tool-Integrated Reasoning (TIR), and implementing Generative Solution Selection (GenSelect).
1. Data Preparation: OpenMathReasoning Dataset
A large dataset, OpenMathReasoning, was created to train the models.
- Problem Collection & Refinement: Problems were sourced from the Art of Problem Solving (AoPS) forums (excluding "Middle School Math"). An LLM pipeline (using Qwen2.5-32B-Instruct) was employed for:
- Extracting problems from forum posts.
- Classifying problems (proof, multiple-choice, binary, valid/invalid). Invalid, MCQ, and binary problems were removed.
- Converting proof problems into answer-based formats.
- Extracting final answers from forum discussions for non-proof problems.
- Decontaminating the dataset by removing problems similar to existing benchmarks.
- This resulted in 540K unique problems (Table 1). The full pipeline is available in the NeMo-Skills repository.
- Validation Set (Comp-Math-24-25): A validation set of 256 problems was created from 2024/2025 AIME and HMMT competitions to minimize contamination and align with AIMO-2 style problems (Table 3).
- Solution Synthesis:
- Chain-of-Thought (CoT): 3.2 million long-reasoning CoT solutions were generated by prompting DeepSeek-R1 and QwQ-32B (up to 32 candidates per problem, temperature 0.7, top-p 0.95, max 16384 tokens). Solutions not matching the expected answer (verified by Qwen2.5-32B-Instruct) were filtered out. For problems without extracted answers, the majority answer across candidates was used as ground truth (Table 4).
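The answer-filtering fallback described above can be sketched as a simple vote over extracted answers. This is a minimal illustration, not the paper's code; the `min_votes` threshold is an assumption.

```python
from collections import Counter

def majority_answer(candidate_answers, min_votes=2):
    """Pick the most common final answer across candidate solutions.

    Used as a pseudo ground truth for problems whose answer could not be
    extracted from the forum discussion. Returns None if no answer reaches
    the (assumed) minimum vote count.
    """
    votes = Counter(a for a in candidate_answers if a is not None)
    if not votes:
        return None
    answer, count = votes.most_common(1)[0]
    return answer if count >= min_votes else None

def filter_by_answer(solutions, expected):
    """Keep only solutions whose extracted answer matches the expected one."""
    return [s for s in solutions if s["answer"] == expected]
```

With a known answer, `filter_by_answer` implements the correctness filter; without one, `majority_answer` supplies the reference answer first.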
2. Tool-Integrated Reasoning (TIR)
Integrating Python code execution with reasoning was challenging as top reasoning models (DeepSeek-R1, QwQ-32B) resisted generating TIR solutions via prompting alone.
- Initial TIR Generation: An instruction-following model (LIMO-Qwen-32B), lightly fine-tuned on reasoning data to preserve instruction following, was prompted to generate initial TIR solutions using a specific prompt (Appendix \ref{sec:tir_instruction}). 1.2M solutions were generated.
- Filtering Pipeline: Generated TIR solutions often used code trivially. A filtering pipeline was developed:
- LLM-based classification (Qwen2.5-32B-Instruct) of code blocks:
- Novelty: Does the code produce a new result or just verify prior steps? (Appendix \ref{sec:TIR_usage_classification})
- Significance: Is the code crucial or easily replaceable by CoT steps? (Appendix \ref{sec:TIR_significance_evaluation})
- Rule-based filtering: Remove incorrect answers, solutions without code, and initially, solutions with >2 code blocks.
- Tag replacement: code-fence markers were mapped to special tags, `` ```python `` -> `<tool_call>` and the closing `` ``` `` -> `</tool_call>`.
- This resulted in 15k high-quality "stage-0" TIR samples.
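The rule-based filtering and tag-replacement steps are plain text transformations; a minimal sketch follows (the regex and the exact fence handling are assumptions, not the paper's code):

```python
import re

# Matches a fenced Python code block and captures its body.
CODE_BLOCK = re.compile(r"```python\n(.*?)```", re.DOTALL)

def replace_tags(solution: str) -> str:
    """Swap markdown code fences for the <tool_call> tags used in training."""
    return CODE_BLOCK.sub(lambda m: f"<tool_call>\n{m.group(1)}</tool_call>", solution)

def keep_solution(solution: str, is_correct: bool, max_code_blocks: int = 2) -> bool:
    """Rule-based filter: correct answer, at least one code block, at most two."""
    n_blocks = len(CODE_BLOCK.findall(solution))
    return is_correct and 1 <= n_blocks <= max_code_blocks
```

The LLM-based novelty/significance checks would run before this step; only solutions passing all filters receive the tag rewrite.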
- Iterative Improvement:
- QwQ-32B was fine-tuned on the 15k stage-0 data.
- This model generated 700k TIR solutions, filtered down to 260k (removing incorrect/no-code solutions; novelty/significance filters were dropped here).
- An intermediate 14B model (fine-tuned on CoT data) was trained on these 260k solutions.
- A final round of generation and filtering with this 14B model produced the final 1.7M TIR dataset.
- Controlling Code Executions: A method was developed to control the number of code blocks allowed per generation. The prompt included a message stating remaining allowed executions (Appendix \ref{sec:tir_solution_code_execution_limit}). Generations exceeding the limit were removed during training, teaching the model to respect the constraint (Appendix \ref{lst:tir-code-limit-reached}).
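The execution-limit mechanism combines a prompt message with a training-time filter; a sketch under assumptions (the message wording is hypothetical, the paper's exact text is in its appendix):

```python
def remaining_executions_message(used: int, limit: int) -> str:
    """Hypothetical wording of the remaining-executions prompt message."""
    left = limit - used
    if left <= 0:
        return "You have exceeded the allowed number of code executions."
    return f"You have {left} code execution(s) remaining."

def within_code_limit(solution: str, limit: int) -> bool:
    """Training-time filter: drop samples exceeding the allowed code blocks."""
    return solution.count("<tool_call>") <= limit
```

Filtering out over-limit generations is what teaches the model to stop issuing code calls once the stated budget is exhausted.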
3. Generative Solution Selection (GenSelect)
To improve upon majority voting and approach pass@k performance, a model was trained to select the best solution from multiple candidates.
- Improved Summaries: Original solution summaries (e.g., from DeepSeek-R1) were often too brief. New, longer summaries (up to 2048 tokens) were generated for all solutions in the dataset using Qwen2.5-32B-Instruct, filtering for answer consistency (Appendix \ref{sec:new-summaries}, \ref{sec:summary_comparison}).
- Selection Data Generation:
- For each problem, groups of 2-16 summaries were sampled, ensuring at least one correct and one incorrect solution per group (8 groups per problem).
- QwQ-32B was prompted to reason and select the best solution from each group (Appendix \ref{sec:math_genrm_selection}).
- This generated 1M selections, filtered to 566K by keeping only instances where the correct solution was chosen.
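The group-sampling step above can be sketched as follows; the group count and size range match the paper, but the sampling details (seeding, without-replacement draw) are assumptions.

```python
import random

def sample_groups(solutions, n_groups=8, min_size=2, max_size=16, seed=0):
    """Sample candidate groups for GenSelect training.

    `solutions` is a list of (summary, is_correct) pairs. Each group must
    contain at least one correct and one incorrect solution so that the
    comparison is informative; problems lacking either kind are skipped.
    """
    rng = random.Random(seed)
    correct = [s for s in solutions if s[1]]
    incorrect = [s for s in solutions if not s[1]]
    if not correct or not incorrect:
        return []
    groups = []
    for _ in range(n_groups):
        size = rng.randint(min_size, min(max_size, len(solutions)))
        group = [rng.choice(correct), rng.choice(incorrect)]
        rest = [s for s in solutions if s not in group]
        group += rng.sample(rest, min(size - 2, len(rest)))
        rng.shuffle(group)
        groups.append(group)
    return groups
```

Selections where the model picks an incorrect solution are then discarded, which is what shrinks 1M raw selections to the 566K kept for training.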
- Efficiency Improvement: Generating full reasoning traces for selection is computationally expensive. To reduce cost, the comparison reasoning traces from QwQ-32B were themselves summarized using Qwen2.5-32B-Instruct (Appendix \ref{sec:gen_summarization}). Training on these summarized comparisons yielded only a 1-2% accuracy drop while significantly speeding up GenSelect inference.
- GenSelect inference is efficient (mostly pre-fill) and works best with <16 candidates; for more candidates, sampling subsets and using majority voting is suggested (Figure \ref{fig:genselect}).
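The suggested strategy for larger candidate pools can be sketched as subset selection plus a vote; `genselect` below is a stand-in for the model call, and the subset count is an assumption.

```python
import random
from collections import Counter

def select_with_voting(candidates, genselect, subset_size=16, n_subsets=4, seed=0):
    """Run GenSelect on random subsets and majority-vote over the winners.

    `candidates` maps a solution id to its final answer; `genselect` is a
    placeholder for the model call returning the chosen id from a subset.
    """
    rng = random.Random(seed)
    ids = list(candidates)
    picks = []
    for _ in range(n_subsets):
        subset = rng.sample(ids, min(subset_size, len(ids)))
        picks.append(candidates[genselect(subset)])
    return Counter(picks).most_common(1)[0][0]
```

Because GenSelect inference is mostly pre-fill, running it a few times on subsets is still far cheaper than generating new solutions.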
4. OpenMath-Nemotron Models
A series of models (1.5B, 7B, 14B, 32B) were trained based on Qwen2.5-Base.
- Training:
- Supervised fine-tuning (SFT) on the combined 5.5M OpenMathReasoning dataset (CoT, TIR, GenSelect samples). Each task used a unique prompt (Appendix \ref{sec:inference-prompts}) enabling mode switching at inference.
- Context length extended using RoPE base adjustment.
- Training used AdamW, cosine decay, sequence packing (NeMo-Aligner), and checkpoint averaging.
- A second SFT round was performed for 4 epochs (except for the 32B model) on a harder 2.2M-sample subset (Olympiad problems, low pass-rate problems, solutions >5k tokens), improving CoT accuracy but slightly degrading TIR (Table \ref{tab:training-accuracy}).
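The RoPE base adjustment mentioned above extends context by slowing the rotary frequencies. The paper does not give the formula, so the sketch below only illustrates the standard RoPE frequency computation and why a larger base covers longer contexts; the base values are illustrative.

```python
import math

def rope_inv_freq(head_dim: int, base: float):
    """Inverse frequencies for rotary position embeddings (standard RoPE)."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

def max_wavelength(head_dim: int, base: float) -> float:
    """Wavelength (in positions) of the slowest-rotating RoPE dimension.

    Positions beyond this wavelength alias, so extending the usable
    context requires raising the base.
    """
    return 2 * math.pi / min(rope_inv_freq(head_dim, base))
```

Raising the base from, say, 10k to 500k stretches the slowest wavelength well past a 128k-token context, which is the intuition behind base adjustment for context extension.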
- Evaluation: Models were evaluated on Comp-Math-24-25 and HLE-Math (Table \ref{tab:eval-results}). TIR generally improves majority@k but can hurt pass@1 for smaller models due to more unfinished generations (Table \ref{tab:unfinished-solutions}). GenSelect consistently improves results.
5. Kaggle Submission Implementation
The winning submission used an intermediate 14B model and several practical optimizations for the competition's constraints (4xL4 GPUs, 5-hour limit, offline notebook).
- Model: An earlier 14B model, trained differently from the released models: 8 epochs of SFT on a CoT subset (DeepSeek-R1 solutions only, no converted proofs), then 400 steps of SFT on the 15k stage-0 TIR data.
- Model Merging: A simple linear merge (CoT*0.3 + TIR*0.7) of the CoT and TIR checkpoints proved effective, improving accuracy while reducing generation length and code usage compared to the pure TIR model (Table \ref{tab:model-merging}). Mergekit was used for experiments.
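The linear merge is just a weighted average of matching parameters. A minimal sketch, with plain floats standing in for tensors (with PyTorch state dicts the arithmetic is identical, applied tensor-wise); the 0.3/0.7 weights are the paper's.

```python
def linear_merge(state_a, state_b, w_a=0.3, w_b=0.7):
    """Weighted average of two model state dicts with identical keys.

    state_a: CoT checkpoint parameters, state_b: TIR checkpoint parameters.
    """
    assert state_a.keys() == state_b.keys(), "architectures must match"
    return {k: w_a * state_a[k] + w_b * state_b[k] for k in state_a}
```

In practice this is what tools like mergekit do for the `linear` merge method, per-tensor rather than per-float.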
- Inference Optimizations:
- TensorRT-LLM: Used for engine conversion, providing in-flight batching, custom kernels, paged KV cache. FP8 quantization offered the best speed/accuracy trade-off (Table \ref{tab:quant-perf-metrics}).
- Speculative Decoding: ReDrafter was used, training a small RNN drafter on 100k generated solutions to propose up to 3 tokens/step (~65% acceptance), significantly increasing tokens/sec. A near-greedy sampling strategy (temperature set to 0) was used.
- Model Serving: A FastAPI backend using Nemo-Skills handled time-constrained generation per question, async batching with early stopping (canceling remaining requests if a consensus answer emerged), and straggler mitigation.
- Time Management: A time buffer system allocated a base time per question, adding unused time to a buffer that subsequent questions could draw from.
- Code Execution: A Nemo-Skills wrapper called a sandboxed Flask environment for parallel Python execution with limits (6 calls/gen, 2s/call, 200 char output).
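The time-buffer scheme can be sketched as a small allocator: each question gets a base budget, and unused time rolls into a shared buffer that later questions draw from. The per-question draw cap and the numbers below are assumptions, not the submission's actual constants.

```python
class TimeBudget:
    """Per-question time allocator with a shared rollover buffer."""

    def __init__(self, base_seconds: float, max_draw: float = 120.0):
        self.base = base_seconds
        self.buffer = 0.0
        self.max_draw = max_draw  # assumed cap on extra time per question

    def allot(self) -> float:
        """Budget for the next question: base time plus buffered extra."""
        extra = min(self.buffer, self.max_draw)
        self.buffer -= extra
        return self.base + extra

    def record(self, allotted: float, used: float) -> None:
        """Return any unused time to the shared buffer."""
        self.buffer += max(0.0, allotted - used)
```

Questions answered quickly (e.g. by early-stopping on a consensus answer) thus subsidize harder questions later in the 5-hour window.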
6. Conclusion and Release
The paper details the pipeline combining large-scale data generation (CoT), iterative refinement for Tool-Integrated Reasoning (TIR), and Generative Solution Selection (GenSelect) to build state-of-the-art math reasoning models. The full OpenMathReasoning dataset (540K problems, 3.2M CoT, 1.7M TIR, 566K GenSelect solutions), the OpenMath-Nemotron models, and the code (NeMo-Skills) are released under a commercially permissive license.