FineWeb: decanting the web for the finest text data at scale 🍷 • Generate high-quality web text data for LLM training
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale • Paper • 2406.17557 • Published Jun 25, 2024
📀 Dataset comparison models • Collection • 1.8B-parameter models trained on 350B tokens (350BT) to compare different pretraining datasets • 8 items • Updated Jun 12, 2024
🧪 FineWeb v1 data experiments • Collection • Ablation models trained for our data experiments • 22 items • Updated Jun 12, 2024