Sharath Turuvekere Sreenivas committed (verified)
Commit 3c59e56 · 1 Parent(s): edd6a21

Dataset link

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -142,7 +142,7 @@ The integration of foundation and fine-tuned models into AI systems requires add
 
 NVIDIA-Nemotron-Nano-12B-v2-Base is pre-trained on a large corpus of high-quality curated and synthetically-generated data. It is trained in the English language, as well as 15 multilingual languages and 43 programming languages. Our sources cover a variety of document types such as: webpages, dialogue, articles, and other written materials. The corpus spans domains including legal, math, science, finance, and more. We also include a small portion of question-answering, and alignment style data to improve model accuracy. The model was trained for approximately twenty trillion tokens.
 
-Alongside the model, we release our final pretraining data, as outlined in this section. For ease of analysis, there is a sample set that is ungated. For all remaining code, math and multilingual data, gating and approval is required, and the dataset is permissively licensed for model training purposes
+Alongside the model, we release our [final pretraining data](https://huggingface.co/collections/nvidia/nemotron-pre-training-dataset-689d9de36f84279d83786b35), as outlined in this section. For ease of analysis, there is a sample set that is ungated. For all remaining code, math and multilingual data, gating and approval is required, and the dataset is permissively licensed for model training purposes
 
 **Data Modality:** Text **The total size:** 10,648,823,153,919 Tokens **Total number of datasets:** 141 **Dataset partition:** *Training \[100%\], testing \[0%\], validation \[0%\]*
 **Time period for training data collection:** 2013 to May 1, 2025
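
For readers who want a quick look at the released data, here is a minimal sketch using the Hugging Face `datasets` library. The dataset repo ID below is a placeholder assumption; the real IDs are listed in the linked collection, and the gated code, math, and multilingual portions additionally require requesting access on the Hub and authenticating locally (e.g. via `huggingface-cli login`).

```python
# Minimal sketch of inspecting the released pretraining data.
# The repo ID is hypothetical -- substitute a real dataset ID from the
# linked nemotron-pre-training-dataset collection on the Hub.
from datasets import load_dataset

ds = load_dataset(
    "nvidia/nemotron-pretraining-sample",  # placeholder: ungated sample set
    split="train",
    streaming=True,  # stream rather than download the full corpus
)

# Peek at a few records without materializing anything on disk.
for example in ds.take(3):
    print(example)
```

Streaming avoids downloading the multi-trillion-token corpus up front; only the records actually iterated over are fetched.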