Article BidirLM: Turning Generative LLMs into the Best Open-Source Omnimodal Encoders Nicolas-BZRD • Apr 7 • 28
DenseOn & LateOn Collection A collection of open state-of-the-art single and multi-vector models • 8 items • Updated 3 days ago • 12
TIPSv2 Collection TIPSv2 foundational vision-language models. Webpage: https://gdm-tipsv2.github.io/ • 9 items • Updated Apr 14 • 36
Embarrassingly Simple Self-Distillation Improves Code Generation Paper • 2604.01193 • Published Apr 1 • 56
Article Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers tomaarsen • Apr 16 • 72
UltraData Collection Ultra Scale, Ultra Quality, Ultra Coverage • 11 items • Updated 21 days ago • 98
Article easytranscriber: Speech Recognition with Accurate Timestamps in the HF Ecosystem KBLab • Mar 3 • 5
Article How We Built a Semantic Highlight Model To Save Token Cost for RAG zilliz • Jan 15 • 67
Nemotron-Post-Training-v3 Collection Collection of datasets used in the post-training phase of Nemotron Nano, Super, and Ultra v3. • 50 items • Updated 6 days ago • 158
Common Pile v0.1 Filtered Data Collection An LLM pre-training dataset produced by filtering and deduplicating the raw text collected in the Common Pile v0.1 • 31 items • Updated Jun 6, 2025 • 23
Reward Models 06-2025 Collection Nemotron reward models. For use in RLHF pipelines and LLM-as-a-Judge • 8 items • Updated 6 days ago • 24
SmolVLM: Redefining small and efficient multimodal models Paper • 2504.05299 • Published Apr 7, 2025 • 209
Article Training and Finetuning Reranker Models with Sentence Transformers tomaarsen • Mar 26, 2025 • 195
Analyzing and Improving the Training Dynamics of Diffusion Models Paper • 2312.02696 • Published Dec 5, 2023 • 33