FineData
community
AI & ML interests
We release large pre-training datasets to accelerate open LLM development. Part of the Hugging Face Science team (hf.co/science).
Papers
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
-
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
Paper • 2506.20920 • Published • 75 -
HuggingFaceFW/fineweb-2
Viewer • Updated • 4.48B • 60k • 707 -
Scaling FineWeb to 1000+ languages: Step 1: finding signal in 100s of evaluation tasks
📝 85 • Evaluate multilingual models using FineTasks
FineWeb-Edu datasets, classifier and ablation model
-
HuggingFaceFW/fineweb-edu
Viewer • Updated • 3.5B • 300k • 883 -
HuggingFaceFW/fineweb-edu-score-2
Viewer • Updated • 13.9B • 28.5k • 82 -
HuggingFaceFW/fineweb-edu-classifier
Text Classification • 0.1B • Updated • 12.8k • 202 -
HuggingFaceFW/ablation-model-fineweb-edu
Text Generation • 2B • Updated • 120 • 18
Ablation models trained for our data experiments.
-
HuggingFaceFW/ablation-exp-textext-warc_trafilatura-28BT
Text Generation • 2B • Updated • 20 • 1 -
HuggingFaceFW/ablation-exp-textext-wet-28BT
Text Generation • 2B • Updated • 8 -
HuggingFaceFW/ablation-exp-fw-base_filtering-350BT
Text Generation • 2B • Updated • 10 -
HuggingFaceFW/ablation-exp-dedup-global_minhash-350BT
Text Generation • 2B • Updated • 13
-
FineWeb: decanting the web for the finest text data at scale
🍷 1.23k • Generate high-quality text data for LLMs using FineWeb
-
HuggingFaceFW/fineweb
Viewer • Updated • 52.5B • 171k • 2.55k -
HuggingFaceFW/fineweb-edu
Viewer • Updated • 3.5B • 300k • 883 -
HuggingFaceFW/fineweb-edu-score-2
Viewer • Updated • 13.9B • 28.5k • 82
1.8B models trained on 350BT to compare different pretraining datasets
-
HuggingFaceFW/ablation-model-fineweb-edu
Text Generation • 2B • Updated • 120 • 18 -
HuggingFaceFW/ablation-model-fineweb-v1
Text Generation • 2B • Updated • 49 • 14 -
HuggingFaceFW/ablation-model-refinedweb
Text Generation • 2B • Updated • 13 • 3 -
HuggingFaceFW/ablation-model-c4
Text Generation • 2B • Updated • 9 • 4