📢 NVIDIA Releases Nemotron-CC-Math Pre-Training Dataset: A High-Quality, Web-Scale Math Corpus for Pretraining Large Language Models
➡️ Dataset page: Nemotron-CC-Math
📜 License: NVIDIA Open Data License Agreement
🧠 Paper: Nemotron-cc-math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset
Highlights
We’re excited to release Nemotron-CC-Math, a large-scale, high-quality math corpus extracted from Common Crawl. This dataset significantly raises the bar for open-source math pretraining corpora, outperforming prior datasets like FineMath, MegaMath, and OpenWebMath on math benchmarks, while achieving comparable or better results on code and general reasoning.
We included Nemotron-CC-Math in the pretraining mixture for the NVIDIA Nano V2 12B/9B models. To illustrate its impact, we pretrained Nano-sized models using this dataset and compared them to other open models across various math benchmarks: GSM8K CoT, MATH, MATH Level 5, and AIME 2024. Including Nemotron-CC-Math in the pretraining mixture clearly boosts math reasoning performance across all tasks, especially on more challenging datasets like MATH Level 5 and AIME.
✨ Why Build a New Math Corpus?
High-quality math datasets are critical for improving reasoning, symbolic understanding, and general intelligence in large language models (LLMs). However, most existing open math corpora suffer from:
- Brittle extraction pipelines
- Lossy HTML-to-text conversions
- Missing or corrupted equations
- Inconsistent formatting and low data fidelity
Many of the best-performing math-capable LLMs (e.g., Minerva, DeepSeekMath, Qwen-Math) rely on large, unreleased math corpora. To support the open research community, we built Nemotron-CC-Math from scratch using a new domain-agnostic extraction pipeline designed for scientific content.
🔍 What’s Inside the Dataset?
Nemotron-CC-Math comes in two variants — nemotron-cc-math-3plus and nemotron-cc-math-4plus — created by classifying data with our FineMath classifier. In this scheme, 3plus corresponds to samples scoring 3, 4, or 5, while 4plus includes only samples scoring 4 or 5. Our dataset is constructed from 98 Common Crawl snapshots (2014–2024). In total, we process content from over 980,000 unique domains, making it one of the most diverse math corpora available. We also regenerated the Nemotron-MIND dataset using nemotron-cc-math-4plus, our high-quality subset, which yielded consistent gains over previous Nemotron-MIND.
Dataset | # Tokens | # Documents |
---|---|---|
nemotron-cc-math-3plus | 133B | 101.15M |
nemotron-cc-math-4plus | 52B | 45.10M |
nemotron-mind-v1 | 73B | 88.73M |
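The 3plus/4plus variants are simple thresholds over the per-document FineMath-style quality score. A minimal sketch of the bucketing logic (the field name `quality_score` is an assumption for illustration, not the dataset's actual schema):

```python
def split_by_quality(docs):
    """Bucket documents into the 3plus / 4plus variants.

    3plus keeps scores 3-5, 4plus keeps scores 4-5, so 4plus is
    always a subset of 3plus.
    """
    three_plus = [d for d in docs if d["quality_score"] >= 3]
    four_plus = [d for d in docs if d["quality_score"] >= 4]
    return three_plus, four_plus

docs = [
    {"id": "a", "quality_score": 2},  # dropped from both variants
    {"id": "b", "quality_score": 3},  # 3plus only
    {"id": "c", "quality_score": 5},  # 3plus and 4plus
]
three_plus, four_plus = split_by_quality(docs)
```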
🔨 How We Built It
We developed a robust, scalable pipeline tailored to mathematical and scientific content. The key components:
1. Lynx-based Rendering
Instead of relying on brittle DOM parsing, we use the lynx text browser to render HTML into structured text, preserving equations, symbols, and indentation.
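The exact lynx invocation used in the pipeline is not published; a plausible sketch is a `-dump` rendering with the link list suppressed and a wide wrap column so aligned math and indentation survive (lynx must be installed for the actual rendering step):

```python
import shutil
import subprocess

def lynx_command(html_path, width=120):
    """Command line for a layout-preserving text rendering.

    -dump writes the rendered page to stdout, -nolist suppresses the
    trailing numbered link list, and -width sets the wrap column.
    """
    return ["lynx", "-dump", "-nolist", f"-width={width}", html_path]

def lynx_render(html_path, width=120):
    """Run lynx (must be on PATH) and return the rendered text."""
    if shutil.which("lynx") is None:
        raise RuntimeError("lynx is not installed")
    result = subprocess.run(
        lynx_command(html_path, width),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```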
2. LLM-based Cleaning
We pass rendered documents through a lightweight language model (Phi-4, 14B) to:
- Remove boilerplate (headers, footers, navbars, etc.)
- Normalize mathematical expressions into consistent LaTeX
- Improve formatting and clarity
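The cleaning prompt itself is described in the paper; as an illustrative stand-in (not the actual prompt), the three steps above can be packed into a single instruction block sent to the cleaning model:

```python
# Hypothetical prompt construction for the LLM cleaning stage.
CLEANING_INSTRUCTIONS = (
    "You are cleaning a web page rendered by a text browser.\n"
    "1. Remove boilerplate such as headers, footers, and navigation menus.\n"
    "2. Rewrite every mathematical expression as standard LaTeX.\n"
    "3. Otherwise preserve the main content and improve formatting.\n"
)

def build_cleaning_prompt(rendered_page: str) -> str:
    """Assemble the prompt passed to the cleaning model (e.g. Phi-4)."""
    return f"{CLEANING_INSTRUCTIONS}\n---\n{rendered_page}\n---\nCleaned page:"

prompt = build_cleaning_prompt("Home | About | Login\nSolve 2x + 3 = 7.")
```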
3. Quality Filtering
We use a math-specific quality classifier (from FineMath) to assign a quality score from 1–5 to each page.
4. Fuzzy Deduplication
We apply MinHash-based Locality Sensitive Hashing (LSH) via NeMo-Curator to remove near-duplicate documents at scale.
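NeMo-Curator handles this at web scale; the underlying idea can be illustrated with a minimal stdlib MinHash + banded LSH, where documents whose signatures collide in any band become near-duplicate candidates (parameter choices here are illustrative, not the pipeline's):

```python
import hashlib
from collections import defaultdict

def shingles(text, n=3):
    """Word n-grams ('shingles') used as a document's feature set."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash(features, num_perm=64):
    """One minimum per seeded hash function; agreement rate estimates Jaccard similarity."""
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f.encode(), digest_size=8, salt=salt).digest(),
                "big")
            for f in features))
    return sig

def lsh_candidates(signatures, bands=16):
    """Group doc ids whose signatures agree on all rows of some band."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(doc_id)
    return {frozenset(group) for group in buckets.values() if len(group) > 1}
```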
5. Benchmark Decontamination
To ensure trustworthy evaluation, we apply LLM-based contamination detection against test benchmarks like MATH, GSM8K, MMLU and MMLU-Pro, using the same techniques as LLM Decontaminator.
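The pipeline's detection is LLM-based, as in LLM Decontaminator; a common complementary baseline (sketched here for intuition, not the paper's method) flags any training document sharing a long n-gram with a benchmark item:

```python
def ngram_set(text, n=13):
    """All word n-grams of a text; n=13 is a common decontamination choice."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(document, benchmark_items, n=13):
    """True if the document shares any n-gram with any test item."""
    doc_ngrams = ngram_set(document, n)
    return any(doc_ngrams & ngram_set(item, n) for item in benchmark_items)
```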
📈 Results: How Good Is the Data?
We ran mid-training ablations on 8B-sized models using this corpus and compared against prior math pretraining datasets, including OpenWebMath, MegaMath, and FineMath.
🧮 Math Reasoning
Dataset | MATH (EM) | GSM8K (EM) |
---|---|---|
OpenWebMath | 34.2 | 76.42 |
FineMath-3+ | 34.6 | 79.45 |
MegaMath-Web | 31.6 | 78.24 |
Nemotron-CC-Math-3+ | 44.20 | 80.06 |
💻 Code Generation
Dataset | HumanEval+ (average@20) | MBPP+ (average@20) |
---|---|---|
OpenWebMath | 33.54 | 37.59 |
FineMath-3+ | 34.18 | 29.19 |
MegaMath-Web | 32.29 | 38.89 |
Nemotron-CC-Math-3+ | 37.16 | 43.51 |
🧠 General Knowledge (MMLU)
Dataset | MMLU (EM) | MMLU-STEM (EM) |
---|---|---|
OpenWebMath | 65.20 | 59.20 |
FineMath-3+ | 67.92 | 62.29 |
MegaMath-Web | 65.44 | 59.88 |
Nemotron-CC-Math-3+ | 68.20 | 64.26 |
📊 Nemotron-MIND Improvements
Using Nemotron-CC-Math-4plus to regenerate Nemotron-MIND leads to substantial improvements across math, code, and general reasoning tasks:
Dataset | #Unique Tokens (B) | MMLU Pro | MMLU | MMLU STEM | Code | Math-500 | GSM8K |
---|---|---|---|---|---|---|---|
Nemotron-MIND | 126 | 36.1 | 66.1 | 60 | 43.4 | 33.4 | 80.7 |
Nemotron-MIND-V1 | 73 | 39.7 | 67.5 | 63.7 | 44.2 | 47.8 | 84.5 |
🔍 Qualitative Examples
We present a side-by-side comparison between our dataset and prior work (MegaMath). The illustrative samples highlight how our pipeline preserves mathematical equations, in contrast to existing approaches, where such structure is often lost or distorted.
📦 Get Started
The dataset is uploaded as three Hugging Face dataset subsets: 3 (documents with quality label 3), 4plus (documents with quality labels 4 and 5), and 4plus_MIND (the MIND method applied to the 4plus subset). To build the 3plus subset, load both the 3 and 4plus subsets.
You can download the dataset directly from the Hugging Face Hub:
```bash
pip install datasets
```

```python
from datasets import load_dataset

ds = load_dataset("nvidia/Nemotron-CC-Math-v1", "4plus", streaming=True)
```
🔓 Open-Source Everything
We believe high-quality pretraining data should be open. That's why we will release our full processing pipeline (HTML parsing, cleaning, deduplication, filtering) along with the dataset.
🤝 Citation & Acknowledgment
If you use our dataset in your research, please cite:
@article{karimi2025nemotroncc,
  title  = {Nemotron-cc-math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset},
  author = {Rabeeh Karimi Mahabadi and Sanjeev Satheesh and Shrimai Prabhumoye and Mostofa Patwary and Mohammad Shoeybi and Bryan Catanzaro},
  url    = {https://arxiv.org/abs/2508.15096},
  year   = {2025}
}