Nemotron-CC-Math: High-Fidelity Math Corpus
- Nemotron-CC-Math is a large-scale, high-fidelity math corpus preserving precise LaTeX formatting and code structure via advanced extraction and cleaning techniques.
- The dataset employs layout-aware rendering using Lynx and LLM-based cleaning to maintain mathematical and code integrity from diverse web sources.
- Its quality-filtered subsets, containing up to 133 billion tokens, significantly boost LLM performance in mathematical reasoning and code synthesis benchmarks.
Nemotron-CC-Math is a large-scale, high-fidelity mathematical corpus for pretraining LLMs, constructed from Common Crawl with a purpose-built, domain-agnostic scientific text extraction pipeline. The dataset addresses key deficits in prior web-scale math corpora, such as degraded mathematical structure, lossy formatting, and inefficient heuristics, by leveraging layout-aware rendering, advanced LLM-based cleaning, and systematic deduplication and decontamination. Nemotron-CC-Math is released in two quality-filtered subsets, comprising up to 133 billion tokens and demonstrating measurable gains in mathematical reasoning, code synthesis, and general-domain tasks when used to pretrain LLMs (Mahabadi et al., 20 Aug 2025).
1. Construction Pipeline and Workflow
Nemotron-CC-Math employs a multi-stage, domain-agnostic extraction and cleaning pipeline to ensure robust preservation of mathematical and code structure. The workflow begins with a seed list of math-related URLs drawn from leading open-math datasets, including OpenWebMath, FineMath, MegaMath, and InfiMM-WebMath. For each URL, raw HTML is retrieved from 98 Common Crawl snapshots spanning 2014–2024.
Text rendering is performed using Lynx, a terminal-based browser, which executes CSS-based layout rules and directly interprets embedded MathJax, KaTeX, and MathML content. This approach mitigates the brittle heuristics and structural losses typical of DOM scraping, faithfully capturing inline and block-level formulas, code indentation, and Unicode symbols.
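A minimal sketch of what such a Lynx-based rendering step might look like. The flags shown (`-dump`, `-nolist`, `-width`) are standard Lynx options; the pipeline's exact invocation and settings are not specified in the source, so this is illustrative only.

```python
import shutil
import subprocess

def build_lynx_cmd(html_path: str, width: int = 120) -> list:
    """Assemble a Lynx invocation: -dump prints the rendered page to
    stdout, -nolist suppresses the trailing link index, and -width
    controls wrapping so code indentation and display math survive."""
    return ["lynx", "-dump", "-nolist", f"-width={width}", html_path]

def render_html(html_path: str) -> str:
    """Render an HTML file to layout-aware plain text (requires lynx)."""
    if shutil.which("lynx") is None:
        raise RuntimeError("lynx binary not found on PATH")
    result = subprocess.run(build_lynx_cmd(html_path),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Because Lynx applies CSS layout rules before emitting text, the dump preserves spatial structure (indentation, block math placement) that DOM scrapers typically discard.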
A specialized cleaning phase processes the Lynx output with Phi-4 (a 14B-parameter LLM), guided by a structured prompt. This stage excises navigation elements, advertisements, and boilerplate, standardizes all math expressions into LaTeX delimited by single dollars for inline math and double dollars for display math, and rectifies common notational inconsistencies. Code snippets are preserved with their original syntactic structure, including fenced code blocks.
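The actual cleaning prompt is part of the released pipeline; the sketch below is a hypothetical stand-in showing the shape of such a request. The prompt wording and the OpenAI-style chat request structure are assumptions, not the pipeline's actual artifacts.

```python
# Hypothetical cleaning prompt; the real prompt ships with NeMo-Curator.
CLEANING_PROMPT = """You are given text rendered from a web page.
1. Remove navigation elements, advertisements, and boilerplate.
2. Convert all mathematics to LaTeX: $...$ inline, $$...$$ display.
3. Preserve code snippets verbatim inside fenced code blocks.
Return only the cleaned document.

Text:
{document}
"""

def build_cleaning_request(document: str, model: str = "phi-4") -> dict:
    """Package a document into a chat-completion-style request
    (request shape assumed; adapt to your serving stack)."""
    return {
        "model": model,
        "messages": [{"role": "user",
                      "content": CLEANING_PROMPT.format(document=document)}],
        "temperature": 0.0,  # deterministic cleanup, not creative rewriting
    }
```

Zero temperature is a natural choice here, since the task is normalization rather than generation.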
Mathematical integrity is further protected: MathML, MathJax, and KaTeX content is faithfully transcribed, with LLM fallback rewriting to LaTeX for formula images or malformed markup. Inline and display math are semantically differentiated, and notation is harmonized, e.g., unifying Unicode and HTML entities to canonical LaTeX and correcting inconsistencies (such as always writing \mathrm{d}x).
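In the pipeline this harmonization is delegated to the LLM prompt; a rule-based sketch of a small subset of such fixes (the symbol table and regex are illustrative, not the pipeline's actual rules) might look like:

```python
import re

# Illustrative subset of Unicode-to-LaTeX canonicalizations.
UNICODE_TO_LATEX = {
    "∫": r"\int", "×": r"\times", "≤": r"\leq", "≥": r"\geq",
    "±": r"\pm", "→": r"\to", "∞": r"\infty", "π": r"\pi",
}

def normalize_math(expr: str) -> str:
    """Map common Unicode math symbols to canonical LaTeX macros and
    enforce the \\mathrm{d}x convention for differentials."""
    for uni, tex in UNICODE_TO_LATEX.items():
        expr = expr.replace(uni, tex + " ")
    # Rewrite a bare differential like "dx" as "\mathrm{d}x".
    expr = re.sub(r"\bd([a-z])\b", r"\\mathrm{d}\1", expr)
    return expr.strip()
```

Consistent canonical forms matter for pretraining: a model sees one notation for one concept rather than several surface variants.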
The dataset then undergoes quality filtering using a 5-point FineMath classifier, producing two subsets: Nemotron-CC-Math-4+ (highest quality; scores 4–5) and Nemotron-CC-Math-3+ (broader coverage; scores 3–5). Fuzzy deduplication applies MinHash LSH (20 bands, 13 hash functions per band) over 24-gram shingles, removing near-duplicate documents with Jaccard similarity above 0.8. Decontamination uses Qwen2.5-32B embeddings, excising any document with cosine similarity above 0.9 to popular benchmark prompts or answers, minimizing train-test leakage.
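The similarity criterion behind the deduplication step can be sketched with exact Jaccard over shingles. Word-level shingling is an assumption here; at corpus scale the pipeline approximates this comparison with MinHash LSH rather than computing it pairwise.

```python
def shingles(text: str, n: int = 24) -> set:
    """Word-level n-gram shingles (the pipeline uses 24-grams)."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: str, b: str, n: int = 24) -> float:
    """Exact Jaccard similarity between two documents' shingle sets."""
    sa, sb = shingles(a, n), shingles(b, n)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def near_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """Flag near-duplicates above the pipeline's 0.8 threshold."""
    return jaccard(a, b) >= threshold
```

MinHash LSH turns this O(N²) pairwise check into a bucketed lookup: documents whose signatures collide in any of the 20 bands become candidate pairs, tuned so collisions are likely precisely when Jaccard exceeds ~0.8.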
2. Dataset Composition and Statistics
Nemotron-CC-Math is released in two primary versions distinguished by their quality filtering:
- Nemotron-CC-Math-3+: 133.26 billion tokens from 101 million documents.
- Nemotron-CC-Math-4+: 52.32 billion tokens from 45 million documents.
The topical distribution was estimated from a sample of 150,000 documents, with mathematics comprising 60.3%, computer science 12.0%, physics 11.2%, statistics 7.5%, economics 3.2%, chemistry 1.7%, and other fields 4.1%. The corpus spans approximately 980,000 unique domains, with 43% of documents from the top 100 domains.
Incidental code coverage is significant: Nemotron-CC-Math-4+ contains approximately 1.44 million code block examples, and Nemotron-CC-Math-3+ includes around 4.3 million. This broad inclusion enables improvements in code reasoning for LLMs in addition to mathematical problem-solving.
3. Comparative Analysis with Existing Math Datasets
Compared with previous open-source math corpora, Nemotron-CC-Math offers substantially greater scale at comparable or higher quality, as the following statistics show.
| Dataset | Documents (M) | Tokens (B) |
|---|---|---|
| OpenWebMath | 6.3 | 14.7 |
| InfiMM-WebMath-4+ | 6.3 | 8.5 |
| FineMath-4+ | 6.7 | 9.6 |
| MegaMath-Pro | 15.0 | 15.1 |
| Nemotron-CC-Math-4+ | 45.1 | 52.3 |
Nemotron-CC-Math-4+ is 5.5× larger than FineMath-4+, and 3.5× larger than MegaMath-Pro, the next-largest high-quality subset. The LLM-driven cleaning in Nemotron-CC-Math preserves the full LaTeX structure of mathematical expressions, avoiding common issues in prior pipelines that strip or misformat math content. The combined use of Lynx-based rendering and LLM standardization enables comprehensive recovery of formulas from a broader range of source encodings, including images and malformed markup, increasing both coverage and fidelity relative to previous efforts.
4. Empirical Pretraining Results
Pretraining experiments with the Nemotron-T 8B model use data blends in which math-corpus tokens are upweighted to 30%, under two math-token budgets: 100B and 300B. Performance was evaluated on several established benchmarks: MATH (EM), GSM8K (EM), MBPP+ (avg@20), HumanEval+ (avg@20), and MMLU-Stem (EM).
| Benchmark | FineMath-4+ | MegaMath-Pro | Nemotron-CC-Math-4+ |
|---|---|---|---|
| MATH (EM) | 35.8 | 34.0 | 40.6 |
| GSM8K (EM) | 76.0 | 73.5 | 76.3 |
| MBPP+ (avg@20) | 28.9 | 46.0 | 45.1 |
| HumanEval+ (avg@20) | 32.2 | 31.0 | 34.8 |
| MMLU-Stem (EM) | 61.6 | 60.9 | 62.7 |
For models trained on 100B math tokens, Nemotron-CC-Math-4+ demonstrates +4.8 to +12.6 point EM gains on MATH benchmarks versus FineMath-4+ and MegaMath-Pro, and +4.6 to +14.3 avg@20 improvements on code generation in MBPP+. At higher pretraining budgets, gains are further accentuated, particularly in domains requiring mathematical reasoning and code synthesis. General-knowledge and STEM benchmarks (MMLU, MMLU-Stem) also improve, presumably due to the high structural fidelity and wide topical coverage.
5. Release, Licensing, and Recommended Usage
Nemotron-CC-Math is distributed under an Apache-2.0 license as part of the NeMo-Curator toolkit and is available via Hugging Face at https://huggingface.co/datasets/nvidia/Nemotron-CC-Math-v1. The extraction and cleaning pipeline, including configuration and prompt details, can be reproduced using the NeMo-Curator repository and the provided “math_extraction_pipeline” notebook.
It is recommended to blend Nemotron-CC-Math at 20–40% of total pretraining tokens, selecting the 3+ or 4+ subset according to computational budget and quality requirements. Although the dataset is extensively decontaminated, users are advised to rerun decontamination against any new downstream benchmarks. The topical skew toward mathematical forums and educational sites is well suited to math-reasoning LLM development but may motivate domain-specific corpus augmentation for other scientific applications. Occasional hallucinated-notation errors may arise from the LLM-based cleanup; examples and templates are provided to facilitate further refinement if required.
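Rerunning decontamination against a new benchmark reduces, at its core, to an embedding-similarity filter. A minimal sketch, assuming embeddings have already been computed with some encoder (the release used Qwen2.5-32B embeddings) and reusing the 0.9 cosine threshold:

```python
import math

def cosine(u, v) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu and nv else 0.0

def is_contaminated(doc_emb, benchmark_embs, threshold: float = 0.9) -> bool:
    """Flag a training document whose embedding is too close to any
    benchmark prompt or answer embedding."""
    return any(cosine(doc_emb, b) >= threshold for b in benchmark_embs)
```

At scale, the brute-force `any(...)` scan would be replaced by an approximate nearest-neighbor index, but the decision rule is the same.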
6. Limitations and Considerations
Several limitations are noted. Despite comprehensive decontamination strategies, complete elimination of benchmark contamination cannot be guaranteed; users should update decontamination procedures as benchmarks evolve. The topical balance is weighted toward mathematical and computational-science web sources, which is appropriate for reasoning research but less so for specialized domains. LLM cleaning can sporadically introduce minor notational artifacts; instruction prompts and empirical examples are provided for adaptation or further fine-tuning. A plausible implication is that, while the dataset sets a new standard for open-source mathematical pretraining corpora, domain-specific needs not fully represented in web sources may still require supplemental data collection and curation.
7. Significance and Research Applications
Nemotron-CC-Math offers a state-of-the-art benchmark for high-fidelity, large-scale math pretraining, directly enabling improved performance in LLMs across mathematics, code reasoning, and general knowledge. Its methodology, combining domain-agnostic extraction, layout-aware rendering, and LLM-based structural cleaning, represents a significant advance in scalable corpus construction for scientific text. The open-source release, comprehensive deduplication, and systematic decontamination support widespread adoption and reproducibility. The dataset forms a cornerstone for future advancements in automated mathematical reasoning and scientific code synthesis for LLMs (Mahabadi et al., 20 Aug 2025).