
NuminaMath-7B-TIR Fine-Tuning

Updated 29 December 2025
  • NuminaMath-7B-TIR is a fine-tuning framework leveraging a 50,000-entry subset of the NuminaMath dataset to improve reasoning via structured critique data.
  • The methodology employs GPT-4o-1120 generated critiques on mathematically noisy solutions and evaluates using exact-match accuracy through zero-temperature greedy decoding.
  • Comparative experiments show that Critique Fine-Tuning yields consistent accuracy gains, with improvements up to +13.3% over traditional supervised fine-tuning.

NuminaMath-7B-TIR refers to the evaluation and fine-tuning of 7B-parameter transformer models on the NuminaMath dataset using methodologies articulated in the context of Critique Fine-Tuning (CFT), as detailed by (Wang et al., 29 Jan 2025). The "TIR" (tool-integrated reasoning) protocol is not employed in the referenced work; all evaluation procedures use exact-match accuracy under standard zero-temperature greedy decoding. The focus is specifically on leveraging the structured critique data within NuminaMath to enhance mathematical reasoning in LLMs.

1. Structure and Origin of the NuminaMath Dataset

NuminaMath is a large-scale public math-instruction corpus totaling 860,000 problem–solution pairs. The dataset draws extensively from prominent evaluation suites and competition problem sources such as GSM8K, MATH, and AIME. Each entry in the native dataset is a triplet encompassing:

  • x: a mathematics contest problem in plain text, often containing LaTeX-formatted expressions,
  • y: a “noisy” solution, potentially flawed with algebraic or logical errors,
  • c: a step-wise expert critique of the presented solution.

For the CFT process, a subset of 50,000 entries from NuminaMath is randomly sampled. The sampled data exclusively utilizes the “noisy” y as the solution to be critiqued, with correct (“gold”) answers discarded from the prompt. Critiques c are synthesized using the GPT-4o-1120 model, producing granular assessments that identify correct and incorrect solution steps, locate errors, and render a conclusive judgment. This replicates the critique-generation paradigm also used for the WebInstruct and MetaMath subsets (Wang et al., 29 Jan 2025).
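The subsetting step described above can be sketched as follows. This is a minimal illustration, not the released pipeline; the function and field names (`sample_cft_subset`, `"x"`, `"y"`, `"gold"`) are assumptions for the sketch:

```python
import random

def sample_cft_subset(dataset, k=50_000, seed=0):
    """Randomly sample k entries and keep only the fields used for CFT.

    Each input record is assumed to hold a problem "x", a noisy solution
    "y", and possibly a gold answer; the gold answer is dropped from the
    prompt, and the critique c is synthesized later by the teacher model.
    """
    rng = random.Random(seed)          # fixed seed for reproducibility
    subset = rng.sample(dataset, k)
    return [{"x": r["x"], "y": r["y"]} for r in subset]

# Toy usage with a stand-in corpus of 100 records:
toy = [{"x": f"problem {i}", "y": f"solution {i}", "gold": i} for i in range(100)]
pairs = sample_cft_subset(toy, k=5)
```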

2. Critique Generation and Data Curation Methodology

For the CFT subset, the data preparation pipeline consists of:

  • Extraction of 50,000 (x, y) pairs, where y is the initially provided (possibly erroneous) solution from NuminaMath.
  • Automated critique assignment using GPT-4o-1120, prompted with: “Please critique whether the following solution to the question is correct. Question: [problem text]. Solution: [noisy response]. Critique: 1.… 2.… Conclusion: Correct/Incorrect.”
  • No synthetic noise is injected into responses; all critiques reflect genuine model or dataset errors in y.
  • Sample entries contain granular, line-by-line error analysis, concluding with a summary of correctness and completeness.

This methodology ensures each example supports fine-grained discriminative learning, as opposed to the imitation of verified solutions.
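The prompt-construction step above can be sketched as a small template helper. The template text follows the wording quoted in Section 2; the helper name `build_critique_prompt` is illustrative, and sending the prompt to GPT-4o-1120 is left out:

```python
# Template wording follows the critique prompt quoted above; the teacher
# model is expected to reply with numbered step-level comments and a final
# "Conclusion: Correct/Incorrect" verdict.
CRITIQUE_PROMPT = (
    "Please critique whether the following solution to the question is correct. "
    "Question: {question} "
    "Solution: {solution} "
    "Critique: 1.... 2.... Conclusion: Correct/Incorrect."
)

def build_critique_prompt(question: str, solution: str) -> str:
    return CRITIQUE_PROMPT.format(question=question, solution=solution)

prompt = build_critique_prompt("Compute 2 + 2.", "2 + 2 = 5.")
```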

3. Model Architecture and Fine-Tuning Regimen

The CFT experiments employ Qwen2.5-Math-7B, a 7B-parameter autoregressive transformer model pretrained on mathematical corpora but not explicitly instruction-tuned. Fine-tuning proceeds as follows:

  • Objective: Maximize the log-likelihood log P(c | [x; y]; θ), where the model is conditioned on both the prompt (x) and the noisy solution (y), targeting the critique (c).
  • Data: 50,000 (x, y, c) tuples curated from NuminaMath with GPT-4o-generated critiques.
  • Batch size: 512 (global).
  • Learning rate: cosine decay scheduling with a 10% warm-up phase.
  • Number of epochs: 1, with checkpoints selected on MATH-500 validation performance.
  • Compute resources are not specified; analogous runs suggest small A100-40 GB GPU clusters.

The 7B-parameter scale is retained for all comparative experiments within this ablation.
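The objective log P(c | [x; y]; θ) is commonly implemented by masking the conditioning tokens out of the cross-entropy target. The sketch below shows that label-masking convention on plain token-id lists; the function name and the use of -100 as the ignore index are assumptions (the -100 convention is standard in Hugging Face-style training loops, but the paper does not specify the implementation):

```python
IGNORE_INDEX = -100  # token positions excluded from the loss

def make_cft_example(prompt_ids, solution_ids, critique_ids):
    """Build (input_ids, labels) so the loss covers only the critique c.

    The model is conditioned on [x; y] (problem plus noisy solution); those
    positions are masked out of the cross-entropy target, leaving
    log P(c | [x; y]; theta) as the effective training objective.
    """
    context = prompt_ids + solution_ids          # [x; y]
    input_ids = context + critique_ids           # full sequence fed to the model
    labels = [IGNORE_INDEX] * len(context) + list(critique_ids)
    return input_ids, labels

# Toy token ids: prompt x = [1, 2], noisy solution y = [3, 4, 5], critique c = [6, 7]
inp, lab = make_cft_example([1, 2], [3, 4, 5], [6, 7])
```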

4. Evaluation Protocol and Metrics

The referenced study does not utilize any "TIR" (tool-integrated reasoning) protocol. All model evaluation is performed via standard zero-temperature greedy decoding (temperature = 0.0) on held-out test splits. The central evaluation metric is exact-match accuracy: the proportion of test problems for which the model’s generated solution exactly matches the correct reference. This standard aligns with prevailing benchmarks for mathematical reasoning corpora such as MATH, GSM8K, OlympiadBench, AIME24, and AMC23.
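Exact-match scoring reduces to comparing normalized final answers. A minimal sketch, assuming a simple whitespace/punctuation normalization (real harnesses also canonicalize LaTeX, fractions, and equivalent numeric forms):

```python
def normalize(ans: str) -> str:
    # Minimal normalization: trim, drop a trailing period, remove spaces,
    # lowercase. Real evaluation code does far more (LaTeX canonicalization).
    return ans.strip().rstrip(".").replace(" ", "").lower()

def exact_match_accuracy(predictions, references):
    """Fraction of problems where the predicted answer matches the reference."""
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

# "42" vs "42." match after normalization; "7" vs "8" do not.
acc = exact_match_accuracy(["\\frac{1}{2}", "42", "7"],
                           ["\\frac{1}{2}", "42.", "8"])
```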

5. Performance Comparison: Critique Fine-Tuning vs. Supervised Fine-Tuning

Ablation results for Qwen2.5-Math-7B trained on the 50,000-entry NuminaMath subset reveal that CFT confers consistent, substantial accuracy improvements relative to supervised fine-tuning (SFT, i.e., imitation learning). Table 1 summarizes the observed performance:

Task             SFT Accuracy (%)   CFT Accuracy (%)   Δ = CFT − SFT (pp)
MATH             70.8               74.2               +3.4
Minerva-Math     28.3               30.5               +2.2
GSM8K            88.3               89.1               +0.8
OlympiadBench    36.3               37.2               +0.9
AIME24           10.0               23.3               +13.3
AMC23            50.0               62.5               +12.5
Average          47.3               52.8               +5.5

Across all tasks, CFT yields an average absolute gain of 5.5 percentage points over SFT, with per-task improvements ranging from +0.8 points (GSM8K) to +13.3 points (AIME24). Improvements are consistent across all problem types, and the gains persist under ablations of the noisy-answer source and the critique-generation model.
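The averages and per-task deltas in Table 1 can be recomputed directly from the per-task accuracies:

```python
# Per-task accuracies (%) from Table 1.
sft = {"MATH": 70.8, "Minerva-Math": 28.3, "GSM8K": 88.3,
       "OlympiadBench": 36.3, "AIME24": 10.0, "AMC23": 50.0}
cft = {"MATH": 74.2, "Minerva-Math": 30.5, "GSM8K": 89.1,
       "OlympiadBench": 37.2, "AIME24": 23.3, "AMC23": 62.5}

avg_sft = round(sum(sft.values()) / len(sft), 1)          # 47.3
avg_cft = round(sum(cft.values()) / len(cft), 1)          # 52.8
deltas = {t: round(cft[t] - sft[t], 1) for t in sft}      # per-task CFT − SFT
```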

6. Format and Nature of CFT Data Examples

The data records used in CFT training consist of:

  • Query (x): mathematical contest problem in plain text, possibly including LaTeX-formatted expressions.
  • Noisy Response (y): the original NuminaMath solution candidate.
  • Critique (c): expert evaluation synthesized by GPT-4o-1120, detailing the following:
    • Identification and validation of individual solution steps,
    • Explicit marking of algebraic, logical, or procedural errors,
    • Summative judgment (“Conclusion: Correct/Incorrect”).

A representative (generic) data example, illustrative rather than drawn verbatim from the dataset, is as follows:

Query: Solve for x: 2x + 6 = 10.
Noisy Response: Subtracting 6 from both sides gives 2x = 4, so x = 4.
Critique: 1. Subtracting 6 from both sides is correct, giving 2x = 4. 2. The final step is wrong: dividing both sides by 2 gives x = 2, not x = 4. Conclusion: Incorrect.

Such structure reinforces model sensitivity to both step-level and holistic reasoning errors.
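Because every critique ends with a fixed-form verdict, the judgment can be recovered mechanically, which is useful when auditing or filtering CFT records. A small sketch (the helper name `parse_conclusion` is an assumption, not from the paper):

```python
import re

def parse_conclusion(critique: str):
    """Extract the final Correct/Incorrect verdict from a critique, or None."""
    m = re.search(r"Conclusion:\s*(Correct|Incorrect)", critique, re.IGNORECASE)
    return m.group(1).capitalize() if m else None

verdict = parse_conclusion(
    "1. The setup is right.\n"
    "2. Step 2 drops a factor of 2.\n"
    "Conclusion: Incorrect."
)
```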

7. Comparative and Broader Significance

Comparative analysis across all tested 7B-parameter models (DeepSeek-Math-7B, Qwen2.5-7B, Qwen2.5-Math-7B) demonstrates that CFT consistently outperforms SFT by margins of 3–11 absolute percentage points, depending on the dataset and evaluation suite. The gains persist across variants in noisy-answer provenance and teacher critique model (GPT-4o-mini or GPT-4o-1120). On NuminaMath exclusively, Qwen2.5-Math-7B-CFT improves average test accuracy from 47.3% to 52.8% (+5.5 points).

A plausible implication is that structured critique-based learning, rather than imitation of canonical answers, drives more robust acquisition of generalized reasoning and error correction strategies in mathematical LLMs. The findings generalize to other benchmarked domains (WebInstruct, MetaMath), suggesting a data- and compute-efficient pathway for further maturation of mathematical reasoning capabilities (Wang et al., 29 Jan 2025).
