Efficient Memory Management for Large Language Model Serving with PagedAttention Paper • 2309.06180 • Published Sep 12, 2023 • 34
SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion Paper • 2503.11576 • Published Mar 14, 2025 • 138
Performant local mixture-of-experts CPU inference with GPU acceleration in llama.cpp Article • 5 days ago • 9
LightOnOCR-2-1B: a lightweight high-performance end-to-end OCR model family Article • 15 days ago • 76
Tokenization in Transformers v5: Simpler, Clearer, and More Modular Article • Dec 18, 2025 • 119
Shrinking Giants: The Quantization Mathematics Making LLMs Accessible Article • May 3, 2025 • 2
A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using transformers, accelerate and bitsandbytes Article • Aug 17, 2022 • 123
The Smol Training Playbook 📚: The secrets to building world-class LLMs • 2.95k
The Ultra-Scale Playbook 🌌: The ultimate guide to training LLM on large GPU Clusters • 3.67k