
SAND-Math: Using LLMs to Generate Novel, Difficult and Useful Mathematics Questions and Answers

Published 28 Jul 2025 in cs.CL (arXiv:2507.20527v2)

Abstract: The demand for LLMs capable of sophisticated mathematical reasoning is growing across industries. However, the development of performant mathematical LLMs is critically bottlenecked by the scarcity of difficult, novel training data. We introduce SAND-Math (Synthetic Augmented Novel and Difficult Mathematics problems and solutions), a pipeline that addresses this by first generating high-quality problems from scratch and then systematically elevating their complexity via a new Difficulty Hiking step. We demonstrate the effectiveness of our approach through two key findings. First, augmenting a strong baseline with SAND-Math data significantly boosts performance, outperforming the next-best synthetic dataset by ↑17.85 absolute points on the AIME25 benchmark. Second, in a dedicated ablation study, we show our Difficulty Hiking process is highly effective: by increasing average problem difficulty from 5.02 to 5.98, this step lifts AIME25 performance from 46.38% to 49.23%. The full generation pipeline, final dataset, and a fine-tuned model form a practical and scalable toolkit for building more capable and efficient mathematical reasoning LLMs. The SAND-Math dataset is released at https://huggingface.co/datasets/amd/SAND-MATH

Summary

  • The paper introduces a synthetic pipeline using LLMs to autonomously generate novel and difficult math problems, addressing data scarcity in mathematical training.
  • It employs multi-stage filtering including self-consistency, de-duplication, and difficulty hiking to ensure problem correctness, diversity, and elevated complexity.
  • Experimental results demonstrate that augmenting existing datasets with SAND-Math samples substantially boosts model performance, while standalone finetuning nearly matches curated human-authored datasets.

Introduction

The paper "SAND-Math: Using LLMs to Generate Novel, Difficult and Useful Mathematics Questions and Answers" introduces a synthetic data augmentation pipeline designed to address the scarcity of high-quality, challenging math problems essential for training effective mathematical LLMs. This scarcity stems largely from the field's reliance on seed problems drawn from existing datasets. The paper explores how the metacognitive abilities of LLMs can be harnessed to generate complex questions autonomously, bypassing that traditional reliance on seeds.

Multi-stage Pipeline for SAND-Math

The SAND-Math pipeline comprises a series of carefully designed stages, each contributing to the creation and refinement of challenging math problems.

  • Question Generation: Initial questions are generated using a teacher LLM with carefully designed prompts. By leveraging the LLM’s internal grasp of complex mathematical structures, this step ensures diversity and complexity in the initial pool of problems.
  • Solution Generation and Correctness Filtering: Multiple solutions for each question are generated, allowing for self-consistency filtering. A question is retained only if all generated answers converge, serving as a proxy for correctness.
  • De-duplication and Decontamination: This step removes near-duplicates within the dataset to ensure diversity and applies semantic similarity checks to prevent leakage when compared against evaluation benchmarks.
  • Difficulty and Novelty Filtering: Problems are rated against a target model, and those the model struggles with are retained for the final dataset. A final novelty filter removes questions similar to publicly available datasets.
  • Difficulty Hiking: A novel step introduced in this pipeline. The model is re-prompted to integrate additional constraints and concepts into a question, systematically elevating its difficulty (Figure 1).

    Figure 1: Data Generation and Filtering pipeline for SAND-Math. Successive steps in the pipeline filter the initial pool of questions to meet the novelty, correctness, and difficulty requirements.
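The solution-generation and correctness-filtering step can be sketched as an answer-agreement check. In this minimal Python sketch, `sample_answer` is a hypothetical callable standing in for whatever solver-LLM client the pipeline actually uses; the paper's proxy for correctness is that all sampled answers converge.

```python
from collections import Counter

def self_consistency_filter(question, sample_answer, n=4):
    """Retain `question` only if independently sampled solutions agree.

    `sample_answer(question)` is assumed to return the final answer
    string extracted from one sampled solution; it is a stand-in for
    the actual solver-LLM call.
    """
    answers = [sample_answer(question) for _ in range(n)]
    counts = Counter(answers)
    top_answer, votes = counts.most_common(1)[0]
    # Strict convergence, as described in the paper: keep the question
    # only when every sampled solution reaches the same final answer.
    if votes == n:
        return top_answer, True
    return None, False
```

A looser majority-vote variant (`votes > n // 2`) would trade some label noise for higher yield; the pipeline as described uses the strict all-agree criterion.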

Experiments and Results

The effectiveness of SAND-Math is demonstrated through rigorous experiments, emphasizing both standalone performance and augmentation efficiency.

  • Standalone Finetuning: When evaluated as a standalone dataset, SAND-Math shows competitive performance, nearly matching curated human-authored datasets like openr1_math even when trained with limited data samples.
  • Data Augmentation Performance: SAND-Math proves most valuable as a supplement to existing datasets, significantly boosting model performance beyond what is achievable with either real-world or other synthetic datasets (Figure 2).

    Figure 2: Difficulty distribution of SAND-Math (500) dataset compared with other math datasets.

Impact of Difficulty Hiking

The paper details an ablation study highlighting the stark impact of Difficulty Hiking. Problems processed through this method contribute to the largest performance improvements, validating the strategy of increasing problem difficulty to achieve better generalization without additional data volume (Figure 3).

Figure 3: Impact of Difficulty Hiking on Data Distribution. Comparison of question difficulty ratings for a sample of SAND-Math data: before hiking, after hiking, and after_w_lf (with a length filter of 32k).

Figure 4: Performance trend when augmenting the LIMO training data with SAND-Math (SM) samples. The 'DH' (Difficulty Hiked) condition shows greater improvement with 1500 additional samples.
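At its core, Difficulty Hiking is a re-prompting step. The template below is an illustrative stand-in (the paper's exact prompt wording is not reproduced in this summary): it asks the teacher LLM to rewrite a question so that it integrates additional constraints and concepts while remaining well-posed.

```python
def difficulty_hike_prompt(question, extra_concepts):
    """Build a re-prompt asking a teacher LLM to raise a question's
    difficulty by weaving in additional concepts and constraints.

    Illustrative only: the actual SAND-Math prompt is not published
    in this summary.
    """
    concepts = ", ".join(extra_concepts)
    return (
        "Rewrite the following problem so that it is strictly harder, "
        f"integrating these additional concepts: {concepts}. "
        "Keep it well-posed, with a single verifiable final answer.\n\n"
        f"Problem: {question}"
    )
```

The hiked question would then pass back through the correctness, de-duplication, and difficulty filters, consistent with the length filter and post-hike difficulty ratings shown in Figure 3.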

Conclusion

The SAND-Math pipeline marks a significant advancement in synthetic data generation for mathematical LLMs. By facilitating the creation of novel, complex questions independent of existing datasets, it not only enhances model performance but also sets a precedent for future approaches in data augmentation. The successful implementation of Difficulty Hiking heralds a new direction in AI model training strategies, suggesting potential extensions beyond mathematics.

In conclusion, SAND-Math not only fills the gap of difficult math problems in training datasets but also demonstrates the broad applicability of leveraging LLM potential to automate and enhance training data synthesis across various domains. Future directions include scaling the pipeline across larger datasets and exploring domain transfer to other complex reasoning tasks.
