Article SmolLM3: smol, multilingual, long-context reasoner By loubnabnl and 22 others • Jul 8 • 636
SmolLM3 evaluation datasets Collection Datasets to decontaminate the post-training mixtures against. Use the subset and column values described per entry • 13 items • Updated Jul 8 • 5
SmolLM3 pretraining datasets Collection datasets used in SmolLM3 pretraining • 15 items • Updated 12 days ago • 28
Article Bringing Fusion Down to Earth: ML for Stellarator Optimization By cgeorgiaw • Jul 2 • 73
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language Paper • 2506.20920 • Published Jun 26 • 69
Article Enhance Your Models in 5 Minutes with the Hugging Face Kernel Hub By drbh and 6 others • Jun 12 • 125
Article SmolVLA: Efficient Vision-Language-Action Model trained on Lerobot Community Data By danaaubakirova and 8 others • Jun 3 • 234
Article Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models By loubnabnl and 2 others • Mar 20, 2024 • 102
Article Atlaset Dataset for Moroccan Darija: From Data Collection, Analysis, to Model Trainings By atlasia and 1 other • Mar 6 • 26
Article MiniMax-01 is Now Open-Source: Scaling Lightning Attention for the AI Agent Era By MiniMax-AI • Jan 15 • 47
Article Arabic RAG Leaderboard: A Comprehensive Framework for Evaluating Arabic Language Retrieval Systems By Navid-AI and 1 other • Feb 9 • 13
Article Darija Chatbot Arena: Making LLMs Compete in the Moroccan Dialect By atlasia and 2 others • Feb 10 • 14
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model Paper • 2502.02737 • Published Feb 4 • 242