📢 NVIDIA Releases Nemotron-CC-Math Pre-Training Dataset: A High-Quality, Web-Scale Math Corpus for Pretraining Large Language Models

Community Article Published August 18, 2025

➡️ Dataset page: Nemotron-CC-Math

📜 License: NVIDIA Open Data License Agreement

🧠 Paper: Nemotron-cc-math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset

Highlights

We’re excited to release Nemotron-CC-Math, a large-scale, high-quality math corpus extracted from Common Crawl. This dataset significantly raises the bar for open-source math pretraining corpora, outperforming prior datasets like FineMath, MegaMath, and OpenWebMath on math benchmarks, while achieving comparable or better results on code and general reasoning.

We included Nemotron-CC-Math in the pretraining mixture for the NVIDIA Nano V2 12B/9B models. To illustrate its impact, we pretrained Nano-sized models using this dataset and compared them to other open models across various math benchmarks: GSM8K CoT, MATH, MATH Level 5, and AIME 2024. Including Nemotron-CC-Math in the pretraining mixture clearly boosts math reasoning performance across all tasks, especially on more challenging datasets like MATH Level 5 and AIME.

[Figure: math benchmark results (GSM8K CoT, MATH, MATH Level 5, AIME 2024) for Nano-sized models pretrained with and without Nemotron-CC-Math]

✨ Why Build a New Math Corpus?

High-quality math datasets are critical for improving reasoning, symbolic understanding, and general intelligence in large language models (LLMs). However, most existing open math corpora suffer from:

  • Brittle extraction pipelines
  • Lossy HTML-to-text conversions
  • Missing or corrupted equations
  • Inconsistent formatting and low data fidelity

Many of the best-performing math-focused LLMs (e.g., Minerva, DeepSeekMath, Qwen-Math) rely on large math corpora that have never been released. To support the open research community, we built Nemotron-CC-Math from scratch using a new domain-agnostic extraction pipeline designed for scientific content.


🔍 What’s Inside the Dataset?

Nemotron-CC-Math comes in two variants, nemotron-cc-math-3plus and nemotron-cc-math-4plus, created by classifying data with the FineMath classifier. In this scheme, 3plus corresponds to samples scoring 3, 4, or 5, while 4plus includes only samples scoring 4 or 5. Our dataset is constructed from 98 Common Crawl snapshots (2014–2024). In total, we process content from over 980,000 unique domains, making it one of the most diverse math corpora available. We also regenerated the Nemotron-MIND dataset using nemotron-cc-math-4plus, our high-quality subset, which yielded consistent gains over the previous Nemotron-MIND release.

| Dataset | # Tokens | # Documents |
|---|---|---|
| nemotron-cc-math-3plus | 133B | 101.15M |
| nemotron-cc-math-4plus | 52B | 45.10M |
| nemotron-mind-v1 | 73B | 88.73M |

🔨 How We Built It

We developed a robust, scalable pipeline tailored to mathematical and scientific content. The key components:

1. Lynx-based Rendering

Instead of relying on brittle DOM parsing, we use the lynx text browser to render HTML into structured text—preserving equations, symbols, and indentation.
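As a rough illustration of this step (not the exact production pipeline), the rendering can be prototyped by shelling out to lynx from Python; the -dump, -nolist, and -width flags below are our assumed invocation, not necessarily the one used to build the dataset.

```python
import subprocess

def render_html_with_lynx(html_path: str, width: int = 120) -> str:
    """Render an HTML file to structured plain text with the lynx text browser.

    -dump     : print the rendered page to stdout instead of browsing interactively
    -nolist   : omit the trailing list of link URLs
    -width=N  : keep wide equations and tables from being wrapped too aggressively
    """
    result = subprocess.run(
        ["lynx", "-dump", "-nolist", f"-width={width}", html_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

if __name__ == "__main__":
    text = render_html_with_lynx("page.html")  # hypothetical input file
    print(text[:500])
```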

2. LLM-based Cleaning

We pass rendered documents through a lightweight language model (Phi-4, 14B) to:

  • Remove boilerplate (headers, footers, navbars, etc.)
  • Normalize mathematical expressions into consistent LaTeX
  • Improve formatting and clarity
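A minimal sketch of how this cleaning pass could be prototyped with the Hugging Face transformers pipeline and the microsoft/phi-4 checkpoint. The prompt wording, generation settings, and truncation behavior are our assumptions for illustration; the paper's actual cleaning prompt is not reproduced here.

```python
from transformers import pipeline

# Hypothetical cleaning prompt; the actual prompt used for the dataset is not shown in this post.
CLEANING_PROMPT = (
    "You are a text-cleaning assistant. Given the rendered text of a web page:\n"
    "1. Remove boilerplate such as headers, footers, and navigation menus.\n"
    "2. Rewrite all mathematical expressions as consistent LaTeX (e.g. $x^2 + 1$).\n"
    "3. Keep the wording of the main content; only fix formatting and clarity.\n\n"
    "Page:\n{page}"
)

# Phi-4 (14B) is the model named in the post; loading it requires a large GPU.
cleaner = pipeline("text-generation", model="microsoft/phi-4",
                   torch_dtype="auto", device_map="auto")

def clean_page(rendered_text: str) -> str:
    messages = [{"role": "user", "content": CLEANING_PROMPT.format(page=rendered_text)}]
    out = cleaner(messages, max_new_tokens=4096, do_sample=False)
    # With chat-style input, the pipeline returns the full conversation;
    # the last message is the model's cleaned rewrite.
    return out[0]["generated_text"][-1]["content"]
```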

3. Quality Filtering

We use a math-specific quality classifier (from FineMath) to assign a quality score from 1–5 to each page.
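A minimal sketch of this scoring step, assuming a sequence-classification checkpoint with a regression-style head that emits one score per document; the checkpoint name below is an assumption (the post only says the classifier comes from FineMath), so swap in the actual classifier you use.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Checkpoint name assumed for illustration; replace with the FineMath quality classifier.
MODEL_ID = "HuggingFaceTB/finemath-classifier"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)

def math_quality_score(text: str) -> int:
    """Return an integer quality score, clipped to the 1-5 range used in the post."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    score = logits.squeeze(-1).item()  # assumes a single regression-style output
    return max(1, min(5, round(score)))

# Keep a page for 3plus if score >= 3, and for 4plus if score >= 4.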

4. Fuzzy Deduplication

We apply MinHash-based Locality Sensitive Hashing (LSH) via NeMo-Curator to remove near-duplicate documents at scale.
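At production scale this runs inside NeMo-Curator; the toy sketch below uses the datasketch library only to illustrate what MinHash-based LSH deduplication does, with shingle size and similarity threshold chosen for illustration.

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    """Build a MinHash signature from character 5-gram shingles."""
    m = MinHash(num_perm=num_perm)
    for i in range(len(text) - 4):
        m.update(text[i:i + 5].encode("utf-8"))
    return m

def deduplicate(docs: dict[str, str], threshold: float = 0.8) -> list[str]:
    """Keep one document per near-duplicate cluster (estimated Jaccard >= threshold)."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for doc_id, text in docs.items():
        sig = minhash_of(text)
        if not lsh.query(sig):        # no near-duplicate has been kept yet
            lsh.insert(doc_id, sig)
            kept.append(doc_id)
    return kept
```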

5. Benchmark Decontamination

To ensure trustworthy evaluation, we apply LLM-based contamination detection against test benchmarks such as MATH, GSM8K, MMLU, and MMLU-Pro, using the same techniques as the LLM Decontaminator.
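As a sketch of the retrieval half of this step (the LLM-judging half is omitted), one can flag training documents whose embeddings are close to a benchmark question and send only those pairs to an LLM judge. The embedding model and thresholds below are illustrative choices, not the ones used for the dataset.

```python
from sentence_transformers import SentenceTransformer, util

# Embedding model chosen for illustration; the LLM Decontaminator pairs this retrieval
# step with an LLM judge that decides whether a candidate is a rephrased test item.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def contamination_candidates(train_docs, test_questions, top_k=3, min_sim=0.7):
    """Return (test_idx, train_idx, similarity) triples worth sending to an LLM judge."""
    train_emb = encoder.encode(train_docs, convert_to_tensor=True, normalize_embeddings=True)
    test_emb = encoder.encode(test_questions, convert_to_tensor=True, normalize_embeddings=True)
    hits = util.semantic_search(test_emb, train_emb, top_k=top_k)
    flagged = []
    for t_idx, matches in enumerate(hits):
        for m in matches:
            if m["score"] >= min_sim:
                flagged.append((t_idx, m["corpus_id"], m["score"]))
    return flagged
```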


📈 Results: How Good Is the Data?

We ran mid-training ablations on 8B-sized models using this corpus and compared against prior math pretraining datasets, including OpenWebMath, MegaMath, and FineMath.

🧮 Math Reasoning

| Dataset | MATH (EM) | GSM8K (EM) |
|---|---|---|
| OpenWebMath | 34.2 | 76.42 |
| FineMath-3+ | 34.6 | 79.45 |
| MegaMath-Web | 31.6 | 78.24 |
| Nemotron-CC-Math-3+ | 44.20 | 80.06 |

💻 Code Generation

| Dataset | HumanEval+ (average@20) | MBPP+ (average@20) |
|---|---|---|
| OpenWebMath | 33.54 | 37.59 |
| FineMath-3+ | 34.18 | 29.19 |
| MegaMath-Web | 32.29 | 38.89 |
| Nemotron-CC-Math-3+ | 37.16 | 43.51 |

🧠 General Knowledge (MMLU)

| Dataset | MMLU (EM) | MMLU-STEM (EM) |
|---|---|---|
| OpenWebMath | 65.20 | 59.20 |
| FineMath-3+ | 67.92 | 62.29 |
| MegaMath-Web | 65.44 | 59.88 |
| Nemotron-CC-Math-3+ | 68.20 | 64.26 |

📊 Nemotron-MIND Improvements

Using Nemotron-CC-Math-4plus to regenerate Nemotron-MIND leads to substantial improvements across math, code, and general reasoning tasks:

| Dataset | # Unique Tokens (B) | MMLU-Pro | MMLU | MMLU-STEM | Code | MATH-500 | GSM8K |
|---|---|---|---|---|---|---|---|
| Nemotron-MIND | 126 | 36.1 | 66.1 | 60 | 43.4 | 33.4 | 80.7 |
| Nemotron-MIND-V1 | 73 | 39.7 | 67.5 | 63.7 | 44.2 | 47.8 | 84.5 |

🔍 Qualitative Examples

We present a side-by-side comparison between our dataset and prior work (MegaMath). The illustrative samples highlight how our pipeline preserves mathematical equations, in contrast to existing approaches where such structures are often lost or distorted.

[Figures: side-by-side extraction samples from Nemotron-CC-Math and MegaMath, showing preserved vs. lost or distorted equations]


📦 Get Started

The dataset is uploaded as three Hugging Face dataset subsets: 3 (documents with quality label 3), 4plus (documents with quality labels 4 and 5), and 4plus_MIND (the MIND method applied to the 4plus subset). To build the 3plus subset, load both the 3 and 4plus subsets.

You can download the dataset directly from the Hugging Face Hub:

```bash
pip install datasets
```

```python
from datasets import load_dataset

ds = load_dataset("nvidia/Nemotron-CC-Math-v1", "4plus", streaming=True)
```
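For instance, assuming the split is named train (check the dataset card), you can stream a few records to inspect the schema, or assemble the 3plus set by concatenating the 3 and 4plus subsets; note that the non-streaming path downloads the full subsets, which are large.

```python
from datasets import load_dataset, concatenate_datasets

# Stream a few records to inspect the fields without downloading the whole subset.
ds = load_dataset("nvidia/Nemotron-CC-Math-v1", "4plus", streaming=True)
for i, example in enumerate(ds["train"]):  # split name assumed to be "train"
    print(example.keys())
    if i == 2:
        break

# Build the 3plus set by combining the "3" and "4plus" subsets (full download).
ds3 = load_dataset("nvidia/Nemotron-CC-Math-v1", "3", split="train")
ds4 = load_dataset("nvidia/Nemotron-CC-Math-v1", "4plus", split="train")
ds_3plus = concatenate_datasets([ds3, ds4])
```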

🔓 Open-Source Everything

We believe high-quality pretraining data should be open. That's why we will release our full processing pipeline (HTML parsing, cleaning, deduplication, filtering) along with the dataset.

🤝 Citation & Acknowledgment

If you use our dataset in your research, please cite:

```bibtex
@article{karimi2025nemotroncc,
  title  = {Nemotron-cc-math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset},
  author = {Rabeeh Karimi Mahabadi and Sanjeev Satheesh and Shrimai Prabhumoye and Mostofa Patwary and Mohammad Shoeybi and Bryan Catanzaro},
  url    = {https://arxiv.org/abs/2508.15096},
  year   = {2025}
}
```
