Running Featured 1.23k FineWeb: decanting the web for the finest text data at scale 🍷 1.23k Generate high-quality text data for LLMs using FineWeb
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale Paper • 2406.17557 • Published Jun 25, 2024 • 99
📀 Dataset comparison models Collection 1.8B models trained on 350BT to compare different pretraining datasets • 8 items • Updated Jun 12, 2024 • 42
🧪 FineWeb v1 data experiments Collection Ablation models trained for our data experiments. • 22 items • Updated Jun 12, 2024 • 8