- DataDecide: How to Predict Best Pretraining Data with Small Experiments
  Paper • 2504.11393 • Published • 18
- Rethinking Multilingual Continual Pretraining: Data Mixing for Adapting LLMs Across Languages and Resources
  Paper • 2504.04152 • Published • 1
- BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining
  Paper • 2508.10975 • Published • 60
- Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset
  Paper • 2412.02595 • Published • 6
Avinash Benki
AvinashBenkiGnani