FineData

community

AI & ML interests

We release large pre-training datasets to accelerate open LLM development. We are part of the Hugging Face Science team (hf.co/science).

Recent Activity

eliebak submitted a paper 6 days ago

SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations

hynky new activity 10 days ago

HuggingFaceFW/finepdfs: Which language detector did you use

hynky new activity 13 days ago

HuggingFaceFW/finepdfs: The "file_path" data field appears to primarily contain cc-index paths rather than WARC paths.

Papers

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

HuggingFaceFW's collections (7)