Building on HF

1877 377 803

merve PRO

merve

https://github.com/merveenoyan/smol-vision

mervenoyann
merveenoyan
merve.bsky.social

AI & ML interests

I love this website. VLMs, vision & co.

Recent Activity

liked a model 13 days ago

nvidia/magpie_tts_multilingual_357m

liked a model 13 days ago

zai-org/GLM-4.7

updated a dataset 13 days ago

merve/personal-website

merve's collections 81

Dec 19 Releases

nvidia/NitroGen

Updated 17 days ago • 447
google/gemma-scope-2

Updated 16 days ago • 58
FunAudioLLM/Fun-ASR-MLT-Nano-2512

Updated 13 days ago • 133 • 32
facebook/map-anything-v1

Image-to-3D • 0.6B • Updated 17 days ago • 440 • 19

Real-time Vision Models

A collection of real-time detectors.

PekingU/rtdetr_v2_r50vd

Object Detection • 43M • Updated Feb 6, 2025 • 153k • 26
ustc-community/dfine-xlarge-obj365

Object Detection • 63.4M • Updated May 5, 2025 • 621 • 4
PekingU/rtdetr_v2_r101vd

Object Detection • 76.8M • Updated Feb 6, 2025 • 4.21k • 13
Running on T4

113

RF-DETR

🔥

113

SOTA real-time object detection model

MetaCLIP2 Multilingual

facebook/metaclip-2-worldwide-s16

Zero-Shot Image Classification • 0.4B • Updated Nov 12, 2025 • 83 • 8
facebook/metaclip-2-worldwide-m16

Zero-Shot Image Classification • 0.5B • Updated Nov 12, 2025 • 8 • 3
facebook/metaclip-2-worldwide-l14

Zero-Shot Image Classification • 1B • Updated Nov 12, 2025 • 222 • 12
facebook/metaclip-2-worldwide-b32

Zero-Shot Image Classification • 0.6B • Updated Nov 12, 2025 • 61 • 5

Sep 30 Releases

deepseek-ai/DeepSeek-V3.2-Exp

Text Generation • 685B • Updated Nov 18, 2025 • 71.6k • 930
Qwen3-VL

Collection

37 items • Updated 5 days ago • 555
SDLM

Collection

Sequential Diffusion Language Models • 9 items • Updated Oct 3, 2025 • 8
Ming-V2

Collection

10 items • Updated 12 days ago • 30

Sep 16 Releases

ibm-granite/granite-docling-258M

Image-Text-to-Text • 0.3B • Updated Sep 23, 2025 • 190k • 1.07k
XiaomiMiMo/MiMo-Audio-7B-Base

Any-to-Any • 8B • Updated Sep 23, 2025 • 96 • 46
decart-ai/Lucy-Edit-Dev

Video-to-Video • Updated Nov 20, 2025 • 206 • 317
OpenGVLab/ScaleCUA-3B

Image-Text-to-Text • 4B • Updated Sep 17, 2025 • 457 • 11

Sep 1 Releases

openbmb/MiniCPM4.1-8B

Text Generation • 8B • Updated Oct 24, 2025 • 12.7k • 384
tencent/Hunyuan-MT-7B

Translation • 8B • Updated 6 days ago • 21.3k • 717
google/embeddinggemma-300m

Sentence Similarity • 0.3B • Updated Sep 25, 2025 • 721k • 1.39k
moonshotai/Kimi-K2-Instruct-0905

Text Generation • 1T • Updated Nov 7, 2025 • 31k • 650

Aug 22 Releases

Qwen/Qwen-Image-Edit

Image-to-Image • Updated Aug 25, 2025 • 46.4k • 2.25k
internlm/Intern-S1-mini

Image-Text-to-Text • 9B • Updated Oct 31, 2025 • 2.19k • 103
xai-org/grok-2

Updated Nov 5, 2025 • 1.35k • 1.01k
ByteDance-Seed/Seed-OSS-36B-Instruct

Text Generation • 36B • Updated Aug 26, 2025 • 8.66k • 469

Releases August 2

stepfun-ai/step3

Image-Text-to-Text • 321B • Updated Aug 2, 2025 • 71.1k • 164
nunchaku-tech/nunchaku-flux.1-krea-dev

Text-to-Image • Updated Nov 16, 2025 • 10.2k • 115
fdtn-ai/Foundation-Sec-8B-Instruct

Text Generation • 8B • Updated Aug 26, 2025 • 2.3k • 61
Wan-AI/Wan2.2-TI2V-5B-Diffusers

Text-to-Video • Updated Aug 9, 2025 • 77.5k • 103

Releases July 18

nvidia/OpenReasoning-Nemotron-32B

Text Generation • 33B • Updated Sep 16, 2025 • 296 • 121
ByteDance-Seed/Seed-X-RM-7B

Translation • Updated Jul 31, 2025 • 109 • 30
LGAI-EXAONE/EXAONE-4.0-32B

Text Generation • 32B • Updated Aug 4, 2025 • 12.5k • 268
vidore/colqwen-omni-v0.1

Visual Document Retrieval • Updated Jul 17, 2025 • 8.32k • 92

Releases July 4

apple/DiffuCoder-7B-cpGRPO

8B • Updated 28 days ago • 626 • 316
BAAI/MTVCraft

Text-to-Video • Updated Jul 7, 2025 • 24 • 36
kyutai/tts-1.6b-en_fr

Text-to-Speech • Updated Sep 11, 2025 • 97.9k • 360
apple/DiffuCoder-7B-Base

8B • Updated 28 days ago • 161 • 26

June 20 Releases

moonshotai/Kimi-VL-A3B-Thinking-2506

Image-Text-to-Text • 16B • Updated Aug 18, 2025 • 170k • 331
mistralai/Mistral-Small-3.2-24B-Instruct-2506

24B • Updated 14 days ago • 174k • 535
kyutai/stt-1b-en_fr

Automatic Speech Recognition • Updated Nov 18, 2025 • 104
google/magenta-realtime

Updated Aug 29, 2025 • 139 • 528

Releases June 13

ByteDance/LatentSync-1.6

Updated Jun 12, 2025 • 21.7k • 55
V-JEPA 2

Collection

A frontier video understanding model developed by FAIR, Meta, which extends the pretraining objectives of https://ai.meta.com/blog/v-jepa-yann • 8 items • Updated Jun 13, 2025 • 180
nanonets/Nanonets-OCR-s

Image-Text-to-Text • 4B • Updated Jun 20, 2025 • 31.5k • 1.56k
tencent/Hunyuan3D-2.1

Image-to-3D • Updated Oct 17, 2025 • 20.1k • 792

Releases 30 May

All the releases of the week of 30th May.

deepseek-ai/DeepSeek-R1-0528

Text Generation • 685B • Updated May 29, 2025 • 318k • 2.39k
Running on Zero

Featured

215

BAGEL

🚀

215

Demo for BAGEL
tencent/HunyuanPortrait

Image-to-Video • Updated May 27, 2025 • 73
XiaomiMiMo/MiMo-7B-RL-0530

Text Generation • 8B • Updated Jun 5, 2025 • 244 • 41

May 16 Releases

Qwen/WorldPM-72B

Text Classification • 73B • Updated May 17, 2025 • 91 • 80
Running on Zero

MCP

Featured

1.44k

LTX Video Fast

🎥

1.44k

ultra-fast video model, LTX 0.9.8 13B distilled
BLIP3o/BLIP3o-Pretrain-Long-Caption

Viewer • Updated Jun 26, 2025 • 27.2M • 19.7k • 56
BLIP3o/BLIP3o-Model-8B

14B • Updated Jun 4, 2025 • 698 • 101

Any-to-Any Models, Datasets, Spaces

Running

Featured

79

MMaDA

🌍

79

Demo for MMaDA: Multimodal Large Diffusion Language Models
Running on Zero

Featured

215

BAGEL

🚀

215

Demo for BAGEL
Gen-Verse/MMaDA-8B-Base

Any-to-Any • 8B • Updated May 24, 2025 • 697 • 88
ByteDance-Seed/BAGEL-7B-MoT

Any-to-Any • 15B • Updated 28 days ago • 627 • 1.17k

InternVL3 HF

OpenGVLab/InternVL3-1B-hf

Image-Text-to-Text • 0.9B • Updated Apr 23, 2025 • 80.5k • 10
OpenGVLab/InternVL3-2B-hf

Image-Text-to-Text • 2B • Updated Apr 23, 2025 • 5.35k • 3
OpenGVLab/InternVL3-8B-hf

Image-Text-to-Text • 8B • Updated Apr 23, 2025 • 17.1k • 9
OpenGVLab/InternVL3-14B-hf

Image-Text-to-Text • 15B • Updated Apr 23, 2025 • 3.76k

Multimodal DSE Retrievers

A collection of DSE models for multimodal retrieval

racineai/Flantier-SmolVLM-2B-dse

2B • Updated Jun 18, 2025 • 6 • 11
MrLight/dse-qwen2-2b-mrl-v1

Visual Document Retrieval • Updated Feb 26, 2025 • 9.18k • 65
marco/mcdse-2b-v1

2B • Updated Oct 29, 2024 • 2.9k • 56
llamaindex/vdr-2b-multi-v1

Image-to-Text • 2B • Updated May 21, 2025 • 1.62k • 123

March 28 Releases

deepseek-ai/DeepSeek-V3-0324

Text Generation • 685B • Updated Mar 27, 2025 • 228k • 3.08k
Qwen/Qwen2.5-Omni-7B

Any-to-Any • 11B • Updated Apr 30, 2025 • 156k • 1.84k
google/txgemma-27b-chat

Text Generation • 27B • Updated Apr 10, 2025 • 123 • 56
Running

Featured

364

Qwen2.5 Omni 7B Demo

🏆

364

Generate text and speech responses from text, audio, images, or video input

Türkçe VLMler (Turkish VLMs)

Qwen/Qwen2-VL-7B-Instruct

Image-Text-to-Text • 8B • Updated Feb 6, 2025 • 1.14M • 1.25k
Qwen/Qwen2-VL-2B-Instruct

Image-Text-to-Text • 2B • Updated Jan 12, 2025 • 1.57M • 477
CohereLabs/aya-vision-8b

Image-Text-to-Text • 9B • Updated Oct 30, 2025 • 40.4k • 315
CohereLabs/aya-vision-32b

Image-Text-to-Text • 33B • Updated Oct 30, 2025 • 258 • 217

Feb 7 Releases 🧣

lerobot/pi0_old

Robotics • 4B • Updated Sep 19, 2025 • 506 • 304
kyutai/hibiki-2b-pytorch-bf16

Translation • Updated May 28, 2025 • 2.77k • 56
Alpha-VLLM/Lumina-Image-2.0

Text-to-Image • Updated Mar 30, 2025 • 2.53k • 351
adyen/DABstep

Viewer • Updated 7 days ago • 460 • 6.52k • 38

Models, Jan 27

Running on Zero

266

Qwen2-VL-7B

🔥

266

Generate text from an image and question
Running

65

UI-TARS

🌖

65

Find click coordinates on images based on instructions
Running

95

Qwen2.5-1M Demo

💻

95

Upload documents and ask questions
Qwen/Qwen2.5-14B-Instruct-1M

Text Generation • 15B • Updated Jan 29, 2025 • 5.04k • 331

Jan 17 Releases ❄️

Models and datasets of the second week of Jan 2025.

openbmb/MiniCPM-o-2_6

Any-to-Any • 9B • Updated Oct 5, 2025 • 81.4k • 1.28k
MiniMaxAI/MiniMax-Text-01

Text Generation • 456B • Updated Jul 3, 2025 • 1.84k • 652
OuteAI/OuteTTS-0.3-1B

Text-to-Speech • 1B • Updated Apr 24, 2025 • 277 • 106
NovaSky-AI/Sky-T1_data_17k

Viewer • Updated Jan 14, 2025 • 16.4k • 267 • 186

Dec 6 Releases 🎄

meta-llama/Llama-3.3-70B-Instruct

Text Generation • 71B • Updated Dec 21, 2024 • 312k • 2.61k
Qwen/Qwen2-VL-72B

Image-Text-to-Text • 73B • Updated Dec 6, 2024 • 108 • 80
google/paligemma2-3b-pt-224

Image-Text-to-Text • 3B • Updated Dec 5, 2024 • 65.2k • 160
tencent/HunyuanVideo

Text-to-Video • Updated Mar 6, 2025 • 1.04k • 2.1k

Nov 22 Releases ❄️

mistralai/Pixtral-Large-Instruct-2411

Updated Jul 28, 2025 • 71 • 430
microsoft/orca-agentinstruct-1M-v1

Viewer • Updated Nov 1, 2024 • 1.05M • 905 • 453
Xkev/Llama-3.2V-11B-cot

Image-Text-to-Text • 11B • Updated Nov 16, 2025 • 659 • 158
jinaai/jina-clip-v2

Feature Extraction • 0.9B • Updated Apr 28, 2025 • 39.2k • 299

Nov 1 Releases

Runtime error

86

LongVU

🌖

86

Generate responses to video or image inputs
facebook/MobileLLM-1B

Text Generation • Updated May 5, 2025 • 176 • 121
Vision-CAIR/LongVU_Qwen2_7B

Video-Text-to-Text • 8B • Updated Feb 28, 2025 • 94 • 74
Vision-CAIR/LongVU_Llama3_2_3B_img

Updated Feb 28, 2025 • 10 • 6

October 25 Releases

ibm-granite/granite-3.0-8b-instruct

Text Generation • 8B • Updated Dec 19, 2024 • 16.8k • 204
ibm-granite/granite-3.0-2b-instruct

Text Generation • 3B • Updated Dec 19, 2024 • 3.41k • 46
CohereLabs/aya-expanse-8b

Text Generation • 8B • Updated Sep 11, 2025 • 174k • 418
CohereLabs/aya-expanse-32b

Text Generation • 32B • Updated Sep 11, 2025 • 5.58k • 282

New Depth Models

Recent depth models

Running on Zero

Featured

193

DepthCrafter

🦀

193

a super consistent video depth model
Paused

Featured

223

Depth Pro

🚀

223

Generate an inverse depth map from an image
Runtime error

78

LOTUS Depth

🚀

78

Generate depth maps from images and videos
apple/DepthPro

Depth Estimation • Updated Feb 28, 2025 • 7.44k • 495

Computer Vision Backbones 🧩

Collection of useful computer vision backbones to fine-tune. It also includes large image classification models that can be used as backbones.

microsoft/resnet-50

Image Classification • 25.6M • Updated Feb 13, 2024 • 268k • 471
google/vit-base-patch16-224-in21k

Image Feature Extraction • 86.4M • Updated Feb 5, 2024 • 1.3M • 392
google/vit-base-patch32-224-in21k

Image Feature Extraction • 88M • Updated Dec 8, 2022 • 5.34k • 19
facebook/dinov2-large

Image Feature Extraction • 0.3B • Updated Sep 6, 2023 • 3.02M • 99

Object Detection Models 🥥

facebook/detr-resnet-50

Object Detection • 41.6M • Updated Apr 10, 2024 • 1.43M • 917
facebook/detr-resnet-101-dc5

Object Detection • 60.7M • Updated Sep 6, 2023 • 1.37k • 19
facebook/detr-resnet-50-dc5

Object Detection • 41.6M • Updated Sep 7, 2023 • 1.47k • 6
google/owlvit-base-patch32

Zero-Shot Object Detection • 0.2B • Updated Dec 12, 2023 • 88k • 143

Zero-shot Image Classification Models 🖼️

This is a collection of models that can be used for zero-shot image classification.

openai/clip-vit-large-patch14

Zero-Shot Image Classification • 0.4B • Updated Sep 15, 2023 • 7.71M • 1.94k
openai/clip-vit-base-patch32

Zero-Shot Image Classification • Updated Feb 29, 2024 • 14.3M • 829
laion/CLIP-ViT-bigG-14-laion2B-39B-b160k

Zero-Shot Image Classification • Updated Jan 22, 2025 • 85.7k • 303
kakaobrain/align-base

Zero-Shot Image Classification • Updated Mar 8, 2023 • 9.52k • 28

Video Classification Models 📺

microsoft/xclip-base-patch32

Video Classification • 0.2B • Updated Feb 4, 2024 • 190k • 106
facebook/timesformer-base-finetuned-k400

Video Classification • Updated Jan 2, 2023 • 24.7k • 42
facebook/timesformer-base-finetuned-k600

Video Classification • Updated Dec 12, 2022 • 6.2k • 12
google/vivit-b-16x2

Video Classification • Updated Aug 3, 2023 • 6.48k • 11

Text-to-Image Models 🥑

stabilityai/stable-diffusion-xl-base-1.0

Text-to-Image • Updated Oct 30, 2023 • 1.8M • 7.29k
warp-ai/wuerstchen

Text-to-Image • Updated Mar 12, 2024 • 194 • 176
Deci/DeciDiffusion-v1-0

Text-to-Image • Updated Feb 15, 2024 • 25 • 140
stabilityai/stable-diffusion-xl-refiner-1.0

Image-to-Image • Updated Sep 25, 2023 • 329k • 2.02k

Segment Anything Model

This collection contains models and demos of SAM and its smaller friends.

facebook/sam-vit-huge

Mask Generation • 0.6B • Updated Jan 11, 2024 • 182k • 188
facebook/sam-vit-base

Mask Generation • 93.7M • Updated Jan 11, 2024 • 471k • 159
facebook/sam-vit-large

Mask Generation • 0.3B • Updated Jan 11, 2024 • 115k • 31
Runtime error

43

Grounded SAM

💩

43

SigLIP

A collection dedicated to SigLIP applications

Running on Zero

Featured

72

Draw To Search Art

🐠

72

Draw/upload image and search among WikiART using SigLIP
Running on CPU Upgrade

23

Compare Clip Siglip

🏃

23

Compare strong zero-shot image classification models
Running on Zero

13

Multilingual Zero Shot Image Clf

🏢

13

Comparing powerful multilingual zero-shot image clf models
BAAI/bunny-phi-2-siglip-lora

Text Generation • Updated Mar 28, 2024 • 198 • 48

SegGPT

A collection of everything SegGPT.

Images Speak in Images: A Generalist Painter for In-Context Visual Learning

Paper • 2212.02499 • Published Dec 5, 2022
SegGPT: Segmenting Everything In Context

Paper • 2304.03284 • Published Apr 6, 2023 • 1
BAAI/seggpt-vit-large

0.4B • Updated Feb 22, 2024 • 28.5k • 5
BAAI/SegGPT

Updated Apr 21, 2023 • 19

gvhf/owl

google/owlvit-base-patch32

Zero-Shot Object Detection • 0.2B • Updated Dec 12, 2023 • 88k • 143
google/owlvit-base-patch16

Zero-Shot Object Detection • Updated Dec 12, 2023 • 19.4k • 13
google/owlvit-large-patch14

Zero-Shot Object Detection • Updated Dec 12, 2023 • 8.92k • 29
google/owlv2-base-patch16

Zero-Shot Object Detection • 0.2B • Updated Apr 15, 2024 • 15.3k • 29

merve/owl2

google/owlvit-base-patch32

Zero-Shot Object Detection • 0.2B • Updated Dec 12, 2023 • 88k • 143
google/owlvit-base-patch16

Zero-Shot Object Detection • Updated Dec 12, 2023 • 19.4k • 13
google/owlvit-large-patch14

Zero-Shot Object Detection • Updated Dec 12, 2023 • 8.92k • 29
google/owlv2-base-patch16

Zero-Shot Object Detection • 0.2B • Updated Apr 15, 2024 • 15.3k • 29

Document VLM Papers

VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding

Paper • 2407.12594 • Published Jul 17, 2024 • 19

Video Language Models

A collection of video-language models

Paused

21

Video Llava

🐨

21

Generate descriptions by uploading images or videos
llava-hf/LLaVA-NeXT-Video-7B-hf

Video-Text-to-Text • 7B • Updated Nov 11, 2025 • 55.7k • 121
llava-hf/LLaVA-NeXT-Video-7B-DPO-hf

Video-Text-to-Text • 7B • Updated Nov 11, 2025 • 382 • 11
llava-hf/LLaVA-NeXT-Video-7B-32K-hf

Image-Text-to-Text • 8B • Updated Nov 11, 2025 • 192 • 8

NVEagle

NVEagle/Eagle-X5-13B

Image-Text-to-Text • 15B • Updated Sep 16, 2024 • 19 • 15
NVEagle/Eagle-X5-13B-Chat

Image-Text-to-Text • 15B • Updated Sep 16, 2024 • 33 • 28
NVEagle/Eagle-X5-7B

Image-Text-to-Text • 9B • Updated Sep 16, 2024 • 47 • 26
Runtime error

64

Eagle X5 13B Chat

🚀

64

Combine text and images to generate responses

Zero-shot Segmentation

sam-hq-team/SegInW

Updated Jul 13, 2023 • 1
xdecoder/X-Decoder

Updated Dec 27, 2023 • 5
xdecoder/SEEM

Updated Dec 30, 2023 • 8
Runtime error

Featured

60

OWLSAM2

🏃

60

Dec 12 Releases

openai/circuit-sparsity

Text Generation • 0.4B • Updated 24 days ago • 2.14k • 195
FunAudioLLM/Fun-CosyVoice3-0.5B-2512

Text-to-Speech • Updated 37 minutes ago • 2.66k • 328
DiffSynth-Studio/Qwen-Image-i2L

Updated 20 days ago • 241
Aratako/T5Gemma-TTS-2b-2b

Text-to-Speech • 5B • Updated 13 days ago • 10.3k • 98

SAM3

facebook/sam3

Mask Generation • 0.9B • Updated Nov 20, 2025 • 1.33M • 1.28k
Running on Zero

Featured

97

SAM3 Video Segmentation

🐠

97

Track and label objects in videos using text prompts or clicks
onnx-community/sam3-tracker-ONNX

Mask Generation • Updated Nov 19, 2025 • 2.17k • 23
Running

22

SAM3 Tracker WebGPU

🎯

22

Segment and extract parts from images by clicking

Oct 6 Releases

Kwaipilot/KAT-Dev-72B-Exp

Text Generation • 73B • Updated Oct 13, 2025 • 633 • 160
LiquidAI/LFM2-8B-A1B

Text Generation • 8B • Updated about 1 month ago • 8.82k • 287
yanolja/YanoljaNEXT-Rosetta-12B-2510

Translation • 12B • Updated Nov 2, 2025 • 831 • 29
NeuML/colbert-muvera-femto

Sentence Similarity • 243k • Updated 23 days ago • 72 • 20

Sep 23 Releases

ByteDance/lynx

Image-to-Video • Updated Sep 27, 2025 • 136
tencent/HunyuanImage-3.0

Text-to-Image • 83B • Updated Oct 14, 2025 • 55.7k • 1k
meituan-longcat/LongCat-Flash-Thinking

Text Generation • 562B • Updated Sep 24, 2025 • 46 • 148
Qwen/Qwen3Guard-Gen-4B

Text Generation • 4B • Updated Nov 7, 2025 • 11.2k • 33

Sep 11 Releases

bytedance-research/HuMo

Image-to-Video • Updated Sep 18, 2025 • 576 • 259
facebook/MobileLLM-R1-950M

Text Generation • 0.9B • Updated Sep 30, 2025 • 420 • 353
tencent/POINTS-Reader

Image-Text-to-Text • 4B • Updated Sep 12, 2025 • 162k • 97
baidu/ERNIE-4.5-21B-A3B-Thinking

Text Generation • 22B • Updated Nov 26, 2025 • 492 • 771

August 29 Releases

microsoft/VibeVoice-1.5B

Text-to-Speech • 3B • Updated Sep 1, 2025 • 542k • 2.13k
OpenGVLab/InternVL3_5-GPT-OSS-20B-A4B-Preview

Image-Text-to-Text • 0.4B • Updated Aug 29, 2025 • 34.3k • 82
apple/FastVLM-1.5B

Text Generation • 2B • Updated Sep 3, 2025 • 2.78k • 71
stepfun-ai/Step-Audio-2-mini

Any-to-Any • 8B • Updated Sep 5, 2025 • 1.13k • 241

Releases August 9

openai/gpt-oss-120b

Text Generation • 120B • Updated Aug 26, 2025 • 3.5M • 4.31k
openai/gpt-oss-20b

Text Generation • 22B • Updated Aug 26, 2025 • 6.63M • 4.16k
openai/BrowseCompLongContext

Viewer • Updated Aug 9, 2025 • 295 • 548 • 45
baichuan-inc/Baichuan-M2-32B

Text Generation • 33B • Updated 12 days ago • 156k • 117

Releases July 25

Wan-AI/Wan2.2-I2V-A14B

Image-to-Video • Updated Aug 7, 2025 • 10.2k • 557
allenai/olmOCR-7B-0725

Image-Text-to-Text • 8B • Updated Aug 26, 2025 • 361 • 62
Wan-AI/Wan2.2-T2V-A14B

Text-to-Video • Updated Aug 7, 2025 • 4.2k • 388
Qwen/Qwen3-235B-A22B-Thinking-2507

Text Generation • 235B • Updated Aug 17, 2025 • 23.6k • 391

Releases July 11

HuggingFaceTB/SmolLM3-3B

Text Generation • 3B • Updated Sep 10, 2025 • 74.7k • 863
moonshotai/Kimi-K2-Instruct

Text Generation • 1T • Updated Nov 7, 2025 • 66.5k • 2.29k
fal/Realism-Detailer-Kontext-Dev-LoRA

Image-to-Image • Updated Jul 7, 2025 • 128 • 53
Alibaba-NLP/WebSailor-3B

3B • Updated Jul 10, 2025 • 155 • 74

Releases June 27

nari-labs/Dia-1.6B-0626

Text-to-Speech • 2B • Updated Jul 3, 2025 • 48.2k • 122
google/gemma-3n-E4B-it

Image-Text-to-Text • 8B • Updated Jul 14, 2025 • 124k • 843
ByteDance/XVerse

Text-to-Image • Updated Jul 1, 2025 • 44 • 89
nvidia/llama-nemoretriever-colembed-3b-v1

Visual Document Retrieval • 4B • Updated 13 days ago • 735 • 70

OCR Models & Datasets

opendatalab/OmniDocBench

Viewer • Updated Sep 26, 2025 • 1.36k • 8.19k • 59
nanonets/Nanonets-OCR-s

Image-Text-to-Text • 4B • Updated Jun 20, 2025 • 31.5k • 1.56k
echo840/MonkeyOCR

Image-Text-to-Text • Updated Aug 28, 2025 • 296 • 513
Running on Zero

MCP

Featured

139

Multimodal OCR2

💻

139

nanonets ocr / smoldocling / monkey ocr / typhoon ocr

Releases June 6

Qwen/Qwen3-Reranker-4B

Text Ranking • 4B • Updated Jun 9, 2025 • 81.2k • 110
echo840/MonkeyOCR

Image-Text-to-Text • Updated Aug 28, 2025 • 296 • 513
openbmb/MiniCPM4-8B

Text Generation • 8B • Updated Oct 24, 2025 • 1.25k • 280
arcee-ai/Homunculus

Text Generation • 12B • Updated Jun 3, 2025 • 70 • 99

Releases 23 May

ByteDance-Seed/BAGEL-7B-MoT

Any-to-Any • 15B • Updated 28 days ago • 627 • 1.17k
mistralai/Devstral-Small-2505

24B • Updated Aug 18, 2025 • 15k • 859
ByteDance/Dolphin

Image-Text-to-Text • 0.4B • Updated Jul 16, 2025 • 2.49k • 510
moondream/moondream-2b-2025-04-14-4bit

Image-Text-to-Text • 1B • Updated May 22, 2025 • 3.54k • 60

May 9 Releases

tencent/HunyuanCustom

Image-to-Video • Updated Jun 6, 2025 • 190
stepfun-ai/Step1X-3D

Updated May 13, 2025 • 105
cognition-ai/Kevin-32B

33B • Updated May 6, 2025 • 732 • 160
ServiceNow-AI/Apriel-Nemotron-15b-Thinker

Text Generation • 15B • Updated Nov 10, 2025 • 351 • 123

Releases Apr 21 & May 2

facebook/EdgeTAM

Updated Apr 30, 2025 • 29
nvidia/parakeet-tdt-0.6b-v2

Automatic Speech Recognition • Updated Nov 27, 2025 • 614k • 1.4k
deepseek-ai/DeepSeek-Prover-V2-671B

Text Generation • 685B • Updated Apr 30, 2025 • 604 • 815
Qwen/Qwen2.5-Omni-3B

Any-to-Any • 6B • Updated Apr 30, 2025 • 208k • 314

April 16 Releases

giskardai/realharm

Viewer • Updated Apr 16, 2025 • 136 • 137 • 12
Junfeng5/Liquid_V1_7B

Any-to-Any • 9B • Updated Mar 20, 2025 • 1.98k • 95

April 11 Releases

moonshotai/Kimi-VL-A3B-Thinking

Image-Text-to-Text • 16B • Updated Aug 18, 2025 • 8.56k • 443
agentica-org/DeepCoder-14B-Preview

Text Generation • 15B • Updated May 11, 2025 • 886 • 681
HiDream-ai/HiDream-I1-Full

Text-to-Image • Updated Jul 17, 2025 • 8.89k • 982
OpenGVLab/InternVL3-78B

Image-Text-to-Text • 78B • Updated Sep 11, 2025 • 213k • 226

March 21 Releases

docling-project/SmolDocling-256M-preview

Image-Text-to-Text • 0.3B • Updated Sep 17, 2025 • 51.6k • 1.6k
sesame/csm-1b

Text-to-Speech • Updated Dec 1, 2025 • 25.6k • 2.3k
mistralai/Mistral-Small-3.1-24B-Instruct-2503

24B • Updated 14 days ago • 80.1k • 1.34k
tencent/Hunyuan3D-2mini

Image-to-3D • Updated Oct 17, 2025 • 4k • 105

Feb 14 Releases 💌

OpenGVLab/InternVideo2_5_Chat_8B

Video-Text-to-Text • 8B • Updated Aug 4, 2025 • 17.2k • 87
AIDC-AI/Ovis2-34B

Image-Text-to-Text • 35B • Updated Aug 15, 2025 • 5.32k • 151
open-r1/OpenR1-Qwen-7B

Text Generation • 8B • Updated May 28, 2025 • 28 • 54
nomic-ai/nomic-embed-text-v2-moe

Sentence Similarity • 0.5B • Updated Apr 1, 2025 • 835k • 445

January 31 Releases 🧤

allenai/Llama-3.1-Tulu-3-405B

Text Generation • 406B • Updated Feb 10, 2025 • 128 • 110
Qwen/Qwen2.5-VL-72B-Instruct

Image-Text-to-Text • 73B • Updated Jun 6, 2025 • 62.7k • 578
mistralai/Mistral-Small-24B-Instruct-2501

24B • Updated Jul 28, 2025 • 963k • 949
deepseek-ai/Janus-Pro-7B

Any-to-Any • Updated Feb 1, 2025 • 56k • 3.55k

Jan 24 Releases

ostris/Flex.1-alpha

Text-to-Image • Updated Jan 19, 2025 • 816 • 481
Qwen/Qwen2.5-Math-PRM-72B

Text Classification • 73B • Updated Jan 17, 2025 • 18.7k • 72
HuggingFaceTB/SmolVLM-500M-Instruct

Image-Text-to-Text • 0.5B • Updated Apr 8, 2025 • 39.5k • 183
deepseek-ai/DeepSeek-R1

Text Generation • 685B • Updated Mar 27, 2025 • 443k • 12.9k

Jan 10 Releases 🌨️

vikhyatk/moondream2

Image-Text-to-Text • 2B • Updated Sep 23, 2025 • 2.89M • 1.36k
DAMO-NLP-SG/multimodal_textbook

Updated Mar 17, 2025 • 3.86k • 156
ByteDance/Sa2VA-1B

Image-Text-to-Text • 1B • Updated Sep 8, 2025 • 222 • 29
nvidia/Cosmos-1.0-Autoregressive-4B

Updated Feb 11, 2025 • 51 • 56

Nov 29 Releases 🌲🌲

HuggingFaceTB/SmolVLM-Instruct

Image-Text-to-Text • 2B • Updated Apr 8, 2025 • 25.4k • 571
Qwen/QwQ-32B-Preview

Text Generation • 33B • Updated Jan 12, 2025 • 9.4k • 1.74k
nvidia/Hymba-1.5B-Base

Text Generation • 2B • Updated Nov 26, 2025 • 378 • 155
vidore/colsmolvlm-v0.1

Visual Document Retrieval • Updated Mar 14, 2025 • 31 • 54

Nov 15 Releases 🍂

microsoft/LLM2CLIP-EVA02-L-14-336

Zero-Shot Image Classification • Updated Nov 22, 2024 • 89 • 59
microsoft/LLM2CLIP-EVA02-B-16

Updated Feb 8, 2025 • 130 • 10
PleIAs/common_corpus

Viewer • Updated Jun 10, 2025 • 470M • 41.9k • 321
Qwen/Qwen2.5-Coder-32B-Instruct

Text Generation • 33B • Updated Jan 12, 2025 • 232k • 1.96k

MIT Talk 31/10 Papers

NVLM: Open Frontier-Class Multimodal LLMs

Paper • 2409.11402 • Published Sep 17, 2024 • 74
BRAVE: Broadening the visual encoding of vision-language models

Paper • 2404.07204 • Published Apr 10, 2024 • 19
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

Paper • 2403.18814 • Published Mar 27, 2024 • 47
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models

Paper • 2409.17146 • Published Sep 25, 2024 • 121

LOTUS 🪷

Runtime error

Featured

101

LOTUS Normal

🌍

101

Generate high-quality predictions from images
Runtime error

78

LOTUS Depth

🚀

78

Generate depth maps from images and videos
jingheya/lotus-depth-g-v1-0

Depth Estimation • Updated Oct 5, 2024 • 10.6k • 26
jingheya/lotus-depth-d-v1-0

Depth Estimation • Updated Oct 5, 2024 • 301 • 5

BRAVE Models 🦁

Models mentioned in https://huggingface.co/papers/2404.07204

facebook/dinov2-large

Image Feature Extraction • 0.3B • Updated Sep 6, 2023 • 3.02M • 99
google/flan-t5-xl

3B • Updated Nov 28, 2023 • 134k • 526
google/siglip-large-patch16-384

Zero-Shot Image Classification • 0.7B • Updated Sep 26, 2024 • 14.7k • 11
google/vit-huge-patch14-224-in21k

Image Feature Extraction • 0.6B • Updated Feb 14, 2024 • 25.8k • 22

Image Classification Models 🐶 🐱

facebook/deit-base-distilled-patch16-384

Image Classification • 87.6M • Updated Sep 12, 2023 • 383 • 7
facebook/convnextv2-base-1k-224

Image Classification • 88.7M • Updated Feb 17, 2025 • 766 • 4
facebook/deit-base-distilled-patch16-224

Image Classification • Updated Jul 13, 2022 • 10.1k • 31
google/vit-base-patch32-384

Image Classification • 88.3M • Updated Sep 11, 2023 • 16.6k • 23

Image Segmentation Models 💜

A collection of instance/semantic/panoptic segmentation models.

facebook/maskformer-swin-large-coco

Image Segmentation • 0.2B • Updated Sep 11, 2023 • 1.3k • 27
nvidia/segformer-b0-finetuned-ade-512-512

Image Segmentation • 3.75M • Updated Jan 14, 2024 • 187k • 178
facebook/detr-resnet-50-dc5-panoptic

Image Segmentation • 43M • Updated Sep 11, 2023 • 26 • 3
nvidia/segformer-b5-finetuned-cityscapes-1024-1024

Image Segmentation • Updated Aug 9, 2022 • 86.6k • 36

Image-to-Image Models 🎨

Collection of image to image editing, image enhancement (SR, deblur, brighten) and text-to-image adapter models.

timbrooks/instruct-pix2pix

Image-to-Image • Updated Jul 5, 2023 • 85.1k • 1.16k
TencentARC/t2i-adapter-canny-sdxl-1.0

Image-to-Image • Updated Sep 7, 2023 • 3.1k • 52
TencentARC/t2i-adapter-sketch-sdxl-1.0

Image-to-Image • Updated Sep 8, 2023 • 3.79k • 75
CrucibleAI/ControlNetMediaPipeFace

Image-to-Image • Updated May 19, 2023 • 1.18k • 574

Image-to-Text Models 📝

This collection contains image captioning and OCR models.

Salesforce/blip-image-captioning-large

Image-to-Text • 0.5B • Updated Feb 3, 2025 • 1.48M • 1.44k
Salesforce/blip-image-captioning-base

Image-to-Text • Updated Feb 3, 2025 • 1.62M • 829
microsoft/trocr-base-handwritten

Image-to-Text • 0.3B • Updated Feb 11, 2025 • 133k • 469
microsoft/git-large-coco

Image-to-Text • 0.4B • Updated Jun 26, 2023 • 1.63k • 104

Foundation Models for Vision 🧩

Foundation models for computer vision.

Running

110

Grounding DINO Demo

💻

110

Cutting edge open-vocabulary object detection app
Running

Featured

94

Owlv2

👀

94

State-of-the-art Zero-shot Object Detection
Runtime error

Featured

41

BLIP2 with transformers

🌖

41

BLIP2 (cutting edge image captioning) in 🤗transformers
Build error

Featured

377

IDEFICS Playground

🐨

377

OWL-series 🦉

Models and applications of OWL-ViT and OWLv2.

Running

Featured

94

Owlv2

👀

94

State-of-the-art Zero-shot Object Detection
Runtime error

Featured

64

Owl Tracking

⚡

64

Powerful foundation model for zero-shot object tracking
Running

26

Search and Detect (CLIP/OWL-ViT)

🦉

26

Search and detect objects in images using text queries
Running on Zero

Featured

109

OWLSAM

😻

109

State-of-the-art open-vocabulary image segmentation ⚡️

Awesome Document AI

A collection of open-source document AI 📄 📝 📈

Runtime error

Featured

84

UDOP

🏃

84

Generate text from document images
Runtime error

40

Pix2struct

📚

40

Play with all the pix2struct variants in this d
Running

26

Compare Docvqa Models

🦀

26

Compare different visual question answering
Runtime error

Featured

289

DocQuery — Document Query Engine

🦉

289

Vision Language Models Papers 🖼️💬📝

Papers about vision-language models; the most important ones are at the top of the list.

Improved Baselines with Visual Instruction Tuning

Paper • 2310.03744 • Published Oct 5, 2023 • 39
DeepSeek-VL: Towards Real-World Vision-Language Understanding

Paper • 2403.05525 • Published Mar 8, 2024 • 48
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities

Paper • 2308.12966 • Published Aug 24, 2023 • 11
LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model

Paper • 2404.01331 • Published Mar 29, 2024 • 27

gv-hf/owl

google/owlvit-base-patch32

Zero-Shot Object Detection • 0.2B • Updated Dec 12, 2023 • 88k • 143
google/owlvit-base-patch16

Zero-Shot Object Detection • Updated Dec 12, 2023 • 19.4k • 13
google/owlvit-large-patch14

Zero-Shot Object Detection • Updated Dec 12, 2023 • 8.92k • 29
google/owlv2-base-patch16

Zero-Shot Object Detection • 0.2B • Updated Apr 15, 2024 • 15.3k • 29

Depth Anything v2 Release

A comprehensive collection on DAv2

depth-anything/Depth-Anything-V2-Small

Depth Estimation • Updated Jul 8, 2024 • 12.5k • 75
depth-anything/Depth-Anything-V2-Large

Depth Estimation • Updated Jul 8, 2024 • 181k • 144
Running on Zero

603

Depth Anything V2

🌖

603

Generate depth maps from images
depth-anything/DA-2K

Viewer • Updated Jun 14, 2024 • 1.04k • 413 • 16

Vision Language Leaderboards

This collection has all the vision language leaderboards.

Running

192

Vidore Leaderboard

🥇

192

Browse and compare visual document retrieval models
Running on CPU Upgrade

956

Open VLM Leaderboard

🌎

956

VLMEvalKit Evaluation Results Collection
Running

Featured

558

Vision Arena (Testing VLMs side-by-side)

🖼

558

Display image analysis results
Running

Featured

85

SEED-Bench Leaderboard

🏆

85

Submit model evaluation results to leaderboard

SAM2

All the models and demos for SAM2

merve/sam2-hiera-tiny

Mask Generation • Updated Aug 2, 2024 • 10
merve/sam2-hiera-small

Mask Generation • Updated Aug 2, 2024 • 31 • 2
merve/sam2-hiera-large

Mask Generation • Updated Aug 2, 2024 • 31 • 2
merve/sam2-hiera-base-plus

Mask Generation • Updated Aug 2, 2024 • 77

Multimodal RAG

vidore/colpali-v1.2

Visual Document Retrieval • Updated Mar 14, 2025 • 331k • 112
Qwen/Qwen2-VL-7B-Instruct

Image-Text-to-Text • 8B • Updated Feb 6, 2025 • 1.14M • 1.25k
Qwen/Qwen2-VL-2B-Instruct

Image-Text-to-Text • 2B • Updated Jan 12, 2025 • 1.57M • 477
Qwen/Qwen2-72B-Instruct

Text Generation • 73B • Updated Oct 8, 2024 • 20.4k • 718

Dec 19 Releases

nvidia/NitroGen

Updated 17 days ago • 447
google/gemma-scope-2

Updated 16 days ago • 58
FunAudioLLM/Fun-ASR-MLT-Nano-2512

Updated 13 days ago • 133 • 32
facebook/map-anything-v1

Image-to-3D • 0.6B • Updated 17 days ago • 440 • 19

Dec 12 Releases

openai/circuit-sparsity

Text Generation • 0.4B • Updated 24 days ago • 2.14k • 195
FunAudioLLM/Fun-CosyVoice3-0.5B-2512

Text-to-Speech • Updated 37 minutes ago • 2.66k • 328
DiffSynth-Studio/Qwen-Image-i2L

Updated 20 days ago • 241
Aratako/T5Gemma-TTS-2b-2b

Text-to-Speech • 5B • Updated 13 days ago • 10.3k • 98

Real-time Vision Models

A collection of real-time detectors.

PekingU/rtdetr_v2_r50vd

Object Detection • 43M • Updated Feb 6, 2025 • 153k • 26
ustc-community/dfine-xlarge-obj365

Object Detection • 63.4M • Updated May 5, 2025 • 621 • 4
PekingU/rtdetr_v2_r101vd

Object Detection • 76.8M • Updated Feb 6, 2025 • 4.21k • 13
Running on T4

113

RF-DETR

🔥

113

SOTA real-time object detection model

SAM3

facebook/sam3

Mask Generation • 0.9B • Updated Nov 20, 2025 • 1.33M • 1.28k
Running on Zero

Featured

97

SAM3 Video Segmentation

🐠

97

Track and label objects in videos using text prompts or clicks
onnx-community/sam3-tracker-ONNX

Mask Generation • Updated Nov 19, 2025 • 2.17k • 23
Running

22

SAM3 Tracker WebGPU

🎯

22

Segment and extract parts from images by clicking

Oct 6 Releases

Kwaipilot/KAT-Dev-72B-Exp

Text Generation • 73B • Updated Oct 13, 2025 • 633 • 160
LiquidAI/LFM2-8B-A1B

Text Generation • 8B • Updated about 1 month ago • 8.82k • 287
yanolja/YanoljaNEXT-Rosetta-12B-2510

Translation • 12B • Updated Nov 2, 2025 • 831 • 29
NeuML/colbert-muvera-femto

Sentence Similarity • 243k • Updated 23 days ago • 72 • 20

Sep 23 Releases

ByteDance/lynx

Image-to-Video • Updated Sep 27, 2025 • 136
tencent/HunyuanImage-3.0

Text-to-Image • 83B • Updated Oct 14, 2025 • 55.7k • 1k
meituan-longcat/LongCat-Flash-Thinking

Text Generation • 562B • Updated Sep 24, 2025 • 46 • 148
Qwen/Qwen3Guard-Gen-4B

Text Generation • 4B • Updated Nov 7, 2025 • 11.2k • 33

Sep 16 Releases

ibm-granite/granite-docling-258M

Image-Text-to-Text • 0.3B • Updated Sep 23, 2025 • 190k • 1.07k
XiaomiMiMo/MiMo-Audio-7B-Base

Any-to-Any • 8B • Updated Sep 23, 2025 • 96 • 46
decart-ai/Lucy-Edit-Dev

Video-to-Video • Updated Nov 20, 2025 • 206 • 317
OpenGVLab/ScaleCUA-3B

Image-Text-to-Text • 4B • Updated Sep 17, 2025 • 457 • 11

Sep 11 Releases

bytedance-research/HuMo

Image-to-Video • Updated Sep 18, 2025 • 576 • 259
facebook/MobileLLM-R1-950M

Text Generation • 0.9B • Updated Sep 30, 2025 • 420 • 353
tencent/POINTS-Reader

Image-Text-to-Text • 4B • Updated Sep 12, 2025 • 162k • 97
baidu/ERNIE-4.5-21B-A3B-Thinking

Text Generation • 22B • Updated Nov 26, 2025 • 492 • 771

Sep 1 Releases

openbmb/MiniCPM4.1-8B

Text Generation • 8B • Updated Oct 24, 2025 • 12.7k • 384
tencent/Hunyuan-MT-7B

Translation • 8B • Updated 6 days ago • 21.3k • 717
google/embeddinggemma-300m

Sentence Similarity • 0.3B • Updated Sep 25, 2025 • 721k • 1.39k
moonshotai/Kimi-K2-Instruct-0905

Text Generation • 1T • Updated Nov 7, 2025 • 31k • 650

August 29 Releases

microsoft/VibeVoice-1.5B

Text-to-Speech • 3B • Updated Sep 1, 2025 • 542k • 2.13k
OpenGVLab/InternVL3_5-GPT-OSS-20B-A4B-Preview

Image-Text-to-Text • 0.4B • Updated Aug 29, 2025 • 34.3k • 82
apple/FastVLM-1.5B

Text Generation • 2B • Updated Sep 3, 2025 • 2.78k • 71
stepfun-ai/Step-Audio-2-mini

Any-to-Any • 8B • Updated Sep 5, 2025 • 1.13k • 241

Aug 22 Releases

Qwen/Qwen-Image-Edit

Image-to-Image • Updated Aug 25, 2025 • 46.4k • 2.25k
internlm/Intern-S1-mini

Image-Text-to-Text • 9B • Updated Oct 31, 2025 • 2.19k • 103
xai-org/grok-2

Updated Nov 5, 2025 • 1.35k • 1.01k
ByteDance-Seed/Seed-OSS-36B-Instruct

Text Generation • 36B • Updated Aug 26, 2025 • 8.66k • 469

Releases August 9

openai/gpt-oss-120b

Text Generation • 120B • Updated Aug 26, 2025 • 3.5M • 4.31k
openai/gpt-oss-20b

Text Generation • 22B • Updated Aug 26, 2025 • 6.63M • 4.16k
openai/BrowseCompLongContext

Viewer • Updated Aug 9, 2025 • 295 • 548 • 45
baichuan-inc/Baichuan-M2-32B

Text Generation • 33B • Updated 12 days ago • 156k • 117

Releases August 2

stepfun-ai/step3

Image-Text-to-Text • 321B • Updated Aug 2, 2025 • 71.1k • 164
nunchaku-tech/nunchaku-flux.1-krea-dev

Text-to-Image • Updated Nov 16, 2025 • 10.2k • 115
fdtn-ai/Foundation-Sec-8B-Instruct

Text Generation • 8B • Updated Aug 26, 2025 • 2.3k • 61
Wan-AI/Wan2.2-TI2V-5B-Diffusers

Text-to-Video • Updated Aug 9, 2025 • 77.5k • 103

Releases July 25

Wan-AI/Wan2.2-I2V-A14B

Image-to-Video • Updated Aug 7, 2025 • 10.2k • 557
allenai/olmOCR-7B-0725

Image-Text-to-Text • 8B • Updated Aug 26, 2025 • 361 • 62
Wan-AI/Wan2.2-T2V-A14B

Text-to-Video • Updated Aug 7, 2025 • 4.2k • 388
Qwen/Qwen3-235B-A22B-Thinking-2507

Text Generation • 235B • Updated Aug 17, 2025 • 23.6k • 391

Releases July 18

nvidia/OpenReasoning-Nemotron-32B

Text Generation • 33B • Updated Sep 16, 2025 • 296 • 121
ByteDance-Seed/Seed-X-RM-7B

Translation • Updated Jul 31, 2025 • 109 • 30
LGAI-EXAONE/EXAONE-4.0-32B

Text Generation • 32B • Updated Aug 4, 2025 • 12.5k • 268
vidore/colqwen-omni-v0.1

Visual Document Retrieval • Updated Jul 17, 2025 • 8.32k • 92

Releases July 11

HuggingFaceTB/SmolLM3-3B

Text Generation • 3B • Updated Sep 10, 2025 • 74.7k • 863
moonshotai/Kimi-K2-Instruct

Text Generation • 1T • Updated Nov 7, 2025 • 66.5k • 2.29k
fal/Realism-Detailer-Kontext-Dev-LoRA

Image-to-Image • Updated Jul 7, 2025 • 128 • 53
Alibaba-NLP/WebSailor-3B

3B • Updated Jul 10, 2025 • 155 • 74

Releases July 4

apple/DiffuCoder-7B-cpGRPO

8B • Updated 28 days ago • 626 • 316
BAAI/MTVCraft

Text-to-Video • Updated Jul 7, 2025 • 24 • 36
kyutai/tts-1.6b-en_fr

Text-to-Speech • Updated Sep 11, 2025 • 97.9k • 360
apple/DiffuCoder-7B-Base

8B • Updated 28 days ago • 161 • 26

Releases June 27

nari-labs/Dia-1.6B-0626

Text-to-Speech • 2B • Updated Jul 3, 2025 • 48.2k • 122
google/gemma-3n-E4B-it

Image-Text-to-Text • 8B • Updated Jul 14, 2025 • 124k • 843
ByteDance/XVerse

Text-to-Image • Updated Jul 1, 2025 • 44 • 89
nvidia/llama-nemoretriever-colembed-3b-v1

Visual Document Retrieval • 4B • Updated 13 days ago • 735 • 70

June 20 Releases

moonshotai/Kimi-VL-A3B-Thinking-2506

Image-Text-to-Text • 16B • Updated Aug 18, 2025 • 170k • 331
mistralai/Mistral-Small-3.2-24B-Instruct-2506

24B • Updated 14 days ago • 174k • 535
kyutai/stt-1b-en_fr

Automatic Speech Recognition • Updated Nov 18, 2025 • 104
google/magenta-realtime

Updated Aug 29, 2025 • 139 • 528

OCR Models & Datasets

opendatalab/OmniDocBench

Viewer • Updated Sep 26, 2025 • 1.36k • 8.19k • 59
nanonets/Nanonets-OCR-s

Image-Text-to-Text • 4B • Updated Jun 20, 2025 • 31.5k • 1.56k
echo840/MonkeyOCR

Image-Text-to-Text • Updated Aug 28, 2025 • 296 • 513
Running on Zero

MCP

Featured

139

Multimodal OCR2

💻

139

nanonets ocr / smoldocling / monkey ocr / typhoon ocr

Releases June 13

ByteDance/LatentSync-1.6

Updated Jun 12, 2025 • 21.7k • 55
V-JEPA 2

Collection

A frontier video understanding model developed by FAIR, Meta, which extends the pretraining objectives of https://ai.meta.com/blog/v-jepa-yann • 8 items • Updated Jun 13, 2025 • 180
nanonets/Nanonets-OCR-s

Image-Text-to-Text • 4B • Updated Jun 20, 2025 • 31.5k • 1.56k
tencent/Hunyuan3D-2.1

Image-to-3D • Updated Oct 17, 2025 • 20.1k • 792

Releases June 6

Qwen/Qwen3-Reranker-4B

Text Ranking • 4B • Updated Jun 9, 2025 • 81.2k • 110
echo840/MonkeyOCR

Image-Text-to-Text • Updated Aug 28, 2025 • 296 • 513
openbmb/MiniCPM4-8B

Text Generation • 8B • Updated Oct 24, 2025 • 1.25k • 280
arcee-ai/Homunculus

Text Generation • 12B • Updated Jun 3, 2025 • 70 • 99

Releases 30 May

All the releases of the week of 30th May.

deepseek-ai/DeepSeek-R1-0528

Text Generation • 685B • Updated May 29, 2025 • 318k • 2.39k
Running on Zero

Featured

215

BAGEL

🚀

215

Demo for BAGEL
tencent/HunyuanPortrait

Image-to-Video • Updated May 27, 2025 • 73
XiaomiMiMo/MiMo-7B-RL-0530

Text Generation • 8B • Updated Jun 5, 2025 • 244 • 41

Releases 23 May

ByteDance-Seed/BAGEL-7B-MoT

Any-to-Any • 15B • Updated 28 days ago • 627 • 1.17k
mistralai/Devstral-Small-2505

24B • Updated Aug 18, 2025 • 15k • 859
ByteDance/Dolphin

Image-Text-to-Text • 0.4B • Updated Jul 16, 2025 • 2.49k • 510
moondream/moondream-2b-2025-04-14-4bit

Image-Text-to-Text • 1B • Updated May 22, 2025 • 3.54k • 60

May 16 Releases

Qwen/WorldPM-72B

Text Classification • 73B • Updated May 17, 2025 • 91 • 80
Running on Zero

MCP

Featured

1.44k

LTX Video Fast

🎥

1.44k

ultra-fast video model, LTX 0.9.8 13B distilled
BLIP3o/BLIP3o-Pretrain-Long-Caption

Viewer • Updated Jun 26, 2025 • 27.2M • 19.7k • 56
BLIP3o/BLIP3o-Model-8B

14B • Updated Jun 4, 2025 • 698 • 101

May 9 Releases

tencent/HunyuanCustom

Image-to-Video • Updated Jun 6, 2025 • 190
stepfun-ai/Step1X-3D

Updated May 13, 2025 • 105
cognition-ai/Kevin-32B

33B • Updated May 6, 2025 • 732 • 160
ServiceNow-AI/Apriel-Nemotron-15b-Thinker

Text Generation • 15B • Updated Nov 10, 2025 • 351 • 123

Any-to-Any Models, Datasets, Spaces

Running

Featured

79

MMaDA

🌍

79

Demo for MMaDA: Multimodal Large Diffusion Language Models
Running on Zero

Featured

215

BAGEL

🚀

215

Demo for BAGEL
Gen-Verse/MMaDA-8B-Base

Any-to-Any • 8B • Updated May 24, 2025 • 697 • 88
ByteDance-Seed/BAGEL-7B-MoT

Any-to-Any • 15B • Updated 28 days ago • 627 • 1.17k

Releases Apr 21 & May 2

facebook/EdgeTAM

Updated Apr 30, 2025 • 29
nvidia/parakeet-tdt-0.6b-v2

Automatic Speech Recognition • Updated Nov 27, 2025 • 614k • 1.4k
deepseek-ai/DeepSeek-Prover-V2-671B

Text Generation • 685B • Updated Apr 30, 2025 • 604 • 815
Qwen/Qwen2.5-Omni-3B

Any-to-Any • 6B • Updated Apr 30, 2025 • 208k • 314

InternVL3 HF

OpenGVLab/InternVL3-1B-hf

Image-Text-to-Text • 0.9B • Updated Apr 23, 2025 • 80.5k • 10
OpenGVLab/InternVL3-2B-hf

Image-Text-to-Text • 2B • Updated Apr 23, 2025 • 5.35k • 3
OpenGVLab/InternVL3-8B-hf

Image-Text-to-Text • 8B • Updated Apr 23, 2025 • 17.1k • 9
OpenGVLab/InternVL3-14B-hf

Image-Text-to-Text • 15B • Updated Apr 23, 2025 • 3.76k

April 16 Releases

giskardai/realharm

Viewer • Updated Apr 16, 2025 • 136 • 137 • 12
Junfeng5/Liquid_V1_7B

Any-to-Any • 9B • Updated Mar 20, 2025 • 1.98k • 95

Multimodal DSE Retrievers

A collection of DSE models for multimodal retrieval

racineai/Flantier-SmolVLM-2B-dse

2B • Updated Jun 18, 2025 • 6 • 11
MrLight/dse-qwen2-2b-mrl-v1

Visual Document Retrieval • Updated Feb 26, 2025 • 9.18k • 65
marco/mcdse-2b-v1

2B • Updated Oct 29, 2024 • 2.9k • 56
llamaindex/vdr-2b-multi-v1

Image-to-Text • 2B • Updated May 21, 2025 • 1.62k • 123

April 11 Releases

moonshotai/Kimi-VL-A3B-Thinking

Image-Text-to-Text • 16B • Updated Aug 18, 2025 • 8.56k • 443
agentica-org/DeepCoder-14B-Preview

Text Generation • 15B • Updated May 11, 2025 • 886 • 681
HiDream-ai/HiDream-I1-Full

Text-to-Image • Updated Jul 17, 2025 • 8.89k • 982
OpenGVLab/InternVL3-78B

Image-Text-to-Text • 78B • Updated Sep 11, 2025 • 213k • 226

March 28 Releases

deepseek-ai/DeepSeek-V3-0324

Text Generation • 685B • Updated Mar 27, 2025 • 228k • 3.08k
Qwen/Qwen2.5-Omni-7B

Any-to-Any • 11B • Updated Apr 30, 2025 • 156k • 1.84k
google/txgemma-27b-chat

Text Generation • 27B • Updated Apr 10, 2025 • 123 • 56
Running

Featured

364

Qwen2.5 Omni 7B Demo

🏆

364

Generate text and speech responses from text, audio, images, or video input

March 21 Releases

docling-project/SmolDocling-256M-preview

Image-Text-to-Text • 0.3B • Updated Sep 17, 2025 • 51.6k • 1.6k
sesame/csm-1b

Text-to-Speech • Updated Dec 1, 2025 • 25.6k • 2.3k
mistralai/Mistral-Small-3.1-24B-Instruct-2503

24B • Updated 14 days ago • 80.1k • 1.34k
tencent/Hunyuan3D-2mini

Image-to-3D • Updated Oct 17, 2025 • 4k • 105

Türkçe VLMler

Qwen/Qwen2-VL-7B-Instruct

Image-Text-to-Text • 8B • Updated Feb 6, 2025 • 1.14M • 1.25k
Qwen/Qwen2-VL-2B-Instruct

Image-Text-to-Text • 2B • Updated Jan 12, 2025 • 1.57M • 477
CohereLabs/aya-vision-8b

Image-Text-to-Text • 9B • Updated Oct 30, 2025 • 40.4k • 315
CohereLabs/aya-vision-32b

Image-Text-to-Text • 33B • Updated Oct 30, 2025 • 258 • 217

Feb 14 Releases 💌

OpenGVLab/InternVideo2_5_Chat_8B

Video-Text-to-Text • 8B • Updated Aug 4, 2025 • 17.2k • 87
AIDC-AI/Ovis2-34B

Image-Text-to-Text • 35B • Updated Aug 15, 2025 • 5.32k • 151
open-r1/OpenR1-Qwen-7B

Text Generation • 8B • Updated May 28, 2025 • 28 • 54
nomic-ai/nomic-embed-text-v2-moe

Sentence Similarity • 0.5B • Updated Apr 1, 2025 • 835k • 445

Feb 7 Releases 🧣

lerobot/pi0_old

Robotics • 4B • Updated Sep 19, 2025 • 506 • 304
kyutai/hibiki-2b-pytorch-bf16

Translation • Updated May 28, 2025 • 2.77k • 56
Alpha-VLLM/Lumina-Image-2.0

Text-to-Image • Updated Mar 30, 2025 • 2.53k • 351
adyen/DABstep

Viewer • Updated 7 days ago • 460 • 6.52k • 38

January 31 Releases 🧤

allenai/Llama-3.1-Tulu-3-405B

Text Generation • 406B • Updated Feb 10, 2025 • 128 • 110
Qwen/Qwen2.5-VL-72B-Instruct

Image-Text-to-Text • 73B • Updated Jun 6, 2025 • 62.7k • 578
mistralai/Mistral-Small-24B-Instruct-2501

24B • Updated Jul 28, 2025 • 963k • 949
deepseek-ai/Janus-Pro-7B

Any-to-Any • Updated Feb 1, 2025 • 56k • 3.55k

Models, Jan 27

Running on Zero

266

Qwen2-VL-7B

🔥

266

Generate text from an image and question
Running

65

UI-TARS

🌖

65

Find click coordinates on images based on instructions
Running

95

Qwen2.5-1M Demo

💻

95

Upload documents and ask questions
Qwen/Qwen2.5-14B-Instruct-1M

Text Generation • 15B • Updated Jan 29, 2025 • 5.04k • 331

Jan 24 Releases

ostris/Flex.1-alpha

Text-to-Image • Updated Jan 19, 2025 • 816 • 481
Qwen/Qwen2.5-Math-PRM-72B

Text Classification • 73B • Updated Jan 17, 2025 • 18.7k • 72
HuggingFaceTB/SmolVLM-500M-Instruct

Image-Text-to-Text • 0.5B • Updated Apr 8, 2025 • 39.5k • 183
deepseek-ai/DeepSeek-R1

Text Generation • 685B • Updated Mar 27, 2025 • 443k • 12.9k

Jan 17 Releases ❄️

Models and datasets of the second week of Jan 2025.

openbmb/MiniCPM-o-2_6

Any-to-Any • 9B • Updated Oct 5, 2025 • 81.4k • 1.28k
MiniMaxAI/MiniMax-Text-01

Text Generation • 456B • Updated Jul 3, 2025 • 1.84k • 652
OuteAI/OuteTTS-0.3-1B

Text-to-Speech • 1B • Updated Apr 24, 2025 • 277 • 106
NovaSky-AI/Sky-T1_data_17k

Viewer • Updated Jan 14, 2025 • 16.4k • 267 • 186

Jan 10 Releases 🌨️

vikhyatk/moondream2

Image-Text-to-Text • 2B • Updated Sep 23, 2025 • 2.89M • 1.36k
DAMO-NLP-SG/multimodal_textbook

Updated Mar 17, 2025 • 3.86k • 156
ByteDance/Sa2VA-1B

Image-Text-to-Text • 1B • Updated Sep 8, 2025 • 222 • 29
nvidia/Cosmos-1.0-Autoregressive-4B

Updated Feb 11, 2025 • 51 • 56

Dec 6 Releases 🎄

meta-llama/Llama-3.3-70B-Instruct

Text Generation • 71B • Updated Dec 21, 2024 • 312k • 2.61k
Qwen/Qwen2-VL-72B

Image-Text-to-Text • 73B • Updated Dec 6, 2024 • 108 • 80
google/paligemma2-3b-pt-224

Image-Text-to-Text • 3B • Updated Dec 5, 2024 • 65.2k • 160
tencent/HunyuanVideo

Text-to-Video • Updated Mar 6, 2025 • 1.04k • 2.1k

Nov 29 Releases 🌲🌲

HuggingFaceTB/SmolVLM-Instruct

Image-Text-to-Text • 2B • Updated Apr 8, 2025 • 25.4k • 571
Qwen/QwQ-32B-Preview

Text Generation • 33B • Updated Jan 12, 2025 • 9.4k • 1.74k
nvidia/Hymba-1.5B-Base

Text Generation • 2B • Updated Nov 26, 2025 • 378 • 155
vidore/colsmolvlm-v0.1

Visual Document Retrieval • Updated Mar 14, 2025 • 31 • 54

Nov 22 Releases ❄️

mistralai/Pixtral-Large-Instruct-2411

Updated Jul 28, 2025 • 71 • 430
microsoft/orca-agentinstruct-1M-v1

Viewer • Updated Nov 1, 2024 • 1.05M • 905 • 453
Xkev/Llama-3.2V-11B-cot

Image-Text-to-Text • 11B • Updated Nov 16, 2025 • 659 • 158
jinaai/jina-clip-v2

Feature Extraction • 0.9B • Updated Apr 28, 2025 • 39.2k • 299

Nov 15 Releases 🍂

microsoft/LLM2CLIP-EVA02-L-14-336

Zero-Shot Image Classification • Updated Nov 22, 2024 • 89 • 59
microsoft/LLM2CLIP-EVA02-B-16

Updated Feb 8, 2025 • 130 • 10
PleIAs/common_corpus

Viewer • Updated Jun 10, 2025 • 470M • 41.9k • 321
Qwen/Qwen2.5-Coder-32B-Instruct

Text Generation • 33B • Updated Jan 12, 2025 • 232k • 1.96k

Nov 1 Releases

Runtime error

86

LongVU

🌖

86

Generate responses to video or image inputs
facebook/MobileLLM-1B

Text Generation • Updated May 5, 2025 • 176 • 121
Vision-CAIR/LongVU_Qwen2_7B

Video-Text-to-Text • 8B • Updated Feb 28, 2025 • 94 • 74
Vision-CAIR/LongVU_Llama3_2_3B_img

Updated Feb 28, 2025 • 10 • 6

MIT Talk 31/10 Papers

NVLM: Open Frontier-Class Multimodal LLMs

Paper • 2409.11402 • Published Sep 17, 2024 • 74
BRAVE: Broadening the visual encoding of vision-language models

Paper • 2404.07204 • Published Apr 10, 2024 • 19
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

Paper • 2403.18814 • Published Mar 27, 2024 • 47
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models

Paper • 2409.17146 • Published Sep 25, 2024 • 121

October 25 Releases

ibm-granite/granite-3.0-8b-instruct

Text Generation • 8B • Updated Dec 19, 2024 • 16.8k • 204
ibm-granite/granite-3.0-2b-instruct

Text Generation • 3B • Updated Dec 19, 2024 • 3.41k • 46
CohereLabs/aya-expanse-8b

Text Generation • 8B • Updated Sep 11, 2025 • 174k • 418
CohereLabs/aya-expanse-32b

Text Generation • 32B • Updated Sep 11, 2025 • 5.58k • 282

LOTUS 🪷

Runtime error

Featured

101

LOTUS Normal

🌍

101

Generate high-quality predictions from images
Runtime error

78

LOTUS Depth

🚀

78

Generate depth maps from images and videos
jingheya/lotus-depth-g-v1-0

Depth Estimation • Updated Oct 5, 2024 • 10.6k • 26
jingheya/lotus-depth-d-v1-0

Depth Estimation • Updated Oct 5, 2024 • 301 • 5

New Depth Models

Recent depth models

Running on Zero

Featured

193

DepthCrafter

🦀

193

a super consistent video depth model
Paused

Featured

223

Depth Pro

🚀

223

Generate an inverse depth map from an image
Runtime error

78

LOTUS Depth

🚀

78

Generate depth maps from images and videos
apple/DepthPro

Depth Estimation • Updated Feb 28, 2025 • 7.44k • 495

BRAVE Models 🦁

Models mentioned in https://huggingface.co/papers/2404.07204

facebook/dinov2-large

Image Feature Extraction • 0.3B • Updated Sep 6, 2023 • 3.02M • 99
google/flan-t5-xl

3B • Updated Nov 28, 2023 • 134k • 526
google/siglip-large-patch16-384

Zero-Shot Image Classification • 0.7B • Updated Sep 26, 2024 • 14.7k • 11
google/vit-huge-patch14-224-in21k

Image Feature Extraction • 0.6B • Updated Feb 14, 2024 • 25.8k • 22

Computer Vision Backbones 🧩

A collection of useful computer vision backbones to fine-tune. It also includes large image classification models that can be used as backbones.

microsoft/resnet-50

Image Classification • 25.6M • Updated Feb 13, 2024 • 268k • 471
google/vit-base-patch16-224-in21k

Image Feature Extraction • 86.4M • Updated Feb 5, 2024 • 1.3M • 392
google/vit-base-patch32-224-in21k

Image Feature Extraction • 88M • Updated Dec 8, 2022 • 5.34k • 19
facebook/dinov2-large

Image Feature Extraction • 0.3B • Updated Sep 6, 2023 • 3.02M • 99

Image Classification Models 🐶 🐱

facebook/deit-base-distilled-patch16-384

Image Classification • 87.6M • Updated Sep 12, 2023 • 383 • 7
facebook/convnextv2-base-1k-224

Image Classification • 88.7M • Updated Feb 17, 2025 • 766 • 4
facebook/deit-base-distilled-patch16-224

Image Classification • Updated Jul 13, 2022 • 10.1k • 31
google/vit-base-patch32-384

Image Classification • 88.3M • Updated Sep 11, 2023 • 16.6k • 23

Object Detection Models 🥥

facebook/detr-resnet-50

Object Detection • 41.6M • Updated Apr 10, 2024 • 1.43M • 917
facebook/detr-resnet-101-dc5

Object Detection • 60.7M • Updated Sep 6, 2023 • 1.37k • 19
facebook/detr-resnet-50-dc5

Object Detection • 41.6M • Updated Sep 7, 2023 • 1.47k • 6
google/owlvit-base-patch32

Zero-Shot Object Detection • 0.2B • Updated Dec 12, 2023 • 88k • 143

Image Segmentation Models 💜

A collection of instance/semantic/panoptic segmentation models.

facebook/maskformer-swin-large-coco

Image Segmentation • 0.2B • Updated Sep 11, 2023 • 1.3k • 27
nvidia/segformer-b0-finetuned-ade-512-512

Image Segmentation • 3.75M • Updated Jan 14, 2024 • 187k • 178
facebook/detr-resnet-50-dc5-panoptic

Image Segmentation • 43M • Updated Sep 11, 2023 • 26 • 3
nvidia/segformer-b5-finetuned-cityscapes-1024-1024

Image Segmentation • Updated Aug 9, 2022 • 86.6k • 36

Zero-shot Image Classification Models 🖼️

This is a collection of models that can be used for zero-shot image classification.

openai/clip-vit-large-patch14

Zero-Shot Image Classification • 0.4B • Updated Sep 15, 2023 • 7.71M • 1.94k
openai/clip-vit-base-patch32

Zero-Shot Image Classification • Updated Feb 29, 2024 • 14.3M • 829
laion/CLIP-ViT-bigG-14-laion2B-39B-b160k

Zero-Shot Image Classification • Updated Jan 22, 2025 • 85.7k • 303
kakaobrain/align-base

Zero-Shot Image Classification • Updated Mar 8, 2023 • 9.52k • 28

Image-to-Image Models 🎨

A collection of image-to-image editing, image enhancement (super-resolution, deblurring, brightening), and text-to-image adapter models.

timbrooks/instruct-pix2pix

Image-to-Image • Updated Jul 5, 2023 • 85.1k • 1.16k
TencentARC/t2i-adapter-canny-sdxl-1.0

Image-to-Image • Updated Sep 7, 2023 • 3.1k • 52
TencentARC/t2i-adapter-sketch-sdxl-1.0

Image-to-Image • Updated Sep 8, 2023 • 3.79k • 75
CrucibleAI/ControlNetMediaPipeFace

Image-to-Image • Updated May 19, 2023 • 1.18k • 574

Video Classification Models 📺

microsoft/xclip-base-patch32

Video Classification • 0.2B • Updated Feb 4, 2024 • 190k • 106
facebook/timesformer-base-finetuned-k400

Video Classification • Updated Jan 2, 2023 • 24.7k • 42
facebook/timesformer-base-finetuned-k600

Video Classification • Updated Dec 12, 2022 • 6.2k • 12
google/vivit-b-16x2

Video Classification • Updated Aug 3, 2023 • 6.48k • 11

Image-to-Text Models 📝

This collection contains image captioning and OCR models.

Salesforce/blip-image-captioning-large

Image-to-Text • 0.5B • Updated Feb 3, 2025 • 1.48M • 1.44k
Salesforce/blip-image-captioning-base

Image-to-Text • Updated Feb 3, 2025 • 1.62M • 829
microsoft/trocr-base-handwritten

Image-to-Text • 0.3B • Updated Feb 11, 2025 • 133k • 469
microsoft/git-large-coco

Image-to-Text • 0.4B • Updated Jun 26, 2023 • 1.63k • 104

Text-to-Image Models 🥑

stabilityai/stable-diffusion-xl-base-1.0

Text-to-Image • Updated Oct 30, 2023 • 1.8M • 7.29k
warp-ai/wuerstchen

Text-to-Image • Updated Mar 12, 2024 • 194 • 176
Deci/DeciDiffusion-v1-0

Text-to-Image • Updated Feb 15, 2024 • 25 • 140
stabilityai/stable-diffusion-xl-refiner-1.0

Image-to-Image • Updated Sep 25, 2023 • 329k • 2.02k

Foundation Models for Vision 🧩

Foundation models for computer vision.

Running

110

Grounding DINO Demo

💻

110

Cutting edge open-vocabulary object detection app
Running

Featured

94

Owlv2

👀

94

State-of-the-art Zero-shot Object Detection
Runtime error

Featured

41

BLIP2 with transformers

🌖

41

BLIP2 (cutting edge image captioning) in 🤗transformers
Build error

Featured

377

IDEFICS Playground

🐨

377

Segment Anything Model

This collection contains models and demos of SAM and its smaller friends.

facebook/sam-vit-huge

Mask Generation • 0.6B • Updated Jan 11, 2024 • 182k • 188
facebook/sam-vit-base

Mask Generation • 93.7M • Updated Jan 11, 2024 • 471k • 159
facebook/sam-vit-large

Mask Generation • 0.3B • Updated Jan 11, 2024 • 115k • 31
Runtime error

43

Grounded SAM

💩

43

OWL-series 🦉

Models and applications of OWL-ViT and OWLv2.

Running

Featured

94

Owlv2

👀

94

State-of-the-art Zero-shot Object Detection
Runtime error

Featured

64

Owl Tracking

⚡

64

Powerful foundation model for zero-shot object tracking
Running

26

Search and Detect (CLIP/OWL-ViT)

🦉

26

Search and detect objects in images using text queries
Running on Zero

Featured

109

OWLSAM

😻

109

State-of-the-art open-vocabulary image segmentation ⚡️

SigLIP

A collection dedicated to SigLIP applications

Running on Zero

Featured

72

Draw To Search Art

🐠

72

Draw/upload image and search among WikiART using SigLIP
Running on CPU Upgrade

23

Compare Clip Siglip

🏃

23

Compare strong zero-shot image classification models
Running on Zero

13

Multilingual Zero Shot Image Clf

🏢

13

Comparing powerful multilingual zero-shot image clf models
BAAI/bunny-phi-2-siglip-lora

Text Generation • Updated Mar 28, 2024 • 198 • 48

Awesome Document AI

A collection of open-source document AI 📄 📝 📈

Runtime error

Featured

84

UDOP

🏃

84

Generate text from document images
Runtime error

40

Pix2struct

📚

40

Play with all the pix2struct variants in this d
Running

26

Compare Docvqa Models

🦀

26

Compare different visual question answering
Runtime error

Featured

289

DocQuery — Document Query Engine

🦉

289

SegGPT

A collection of everything SegGPT.

Images Speak in Images: A Generalist Painter for In-Context Visual Learning

Paper • 2212.02499 • Published Dec 5, 2022
SegGPT: Segmenting Everything In Context

Paper • 2304.03284 • Published Apr 6, 2023 • 1
BAAI/seggpt-vit-large

0.4B • Updated Feb 22, 2024 • 28.5k • 5
BAAI/SegGPT

Updated Apr 21, 2023 • 19

Vision Language Models Papers 🖼️💬📝

Papers about vision-language models; the most important ones are at the top of the list.

Improved Baselines with Visual Instruction Tuning

Paper • 2310.03744 • Published Oct 5, 2023 • 39
DeepSeek-VL: Towards Real-World Vision-Language Understanding

Paper • 2403.05525 • Published Mar 8, 2024 • 48
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities

Paper • 2308.12966 • Published Aug 24, 2023 • 11
LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model

Paper • 2404.01331 • Published Mar 29, 2024 • 27

gvhf/owl

google/owlvit-base-patch32

Zero-Shot Object Detection • 0.2B • Updated Dec 12, 2023 • 88k • 143
google/owlvit-base-patch16

Zero-Shot Object Detection • Updated Dec 12, 2023 • 19.4k • 13
google/owlvit-large-patch14

Zero-Shot Object Detection • Updated Dec 12, 2023 • 8.92k • 29
google/owlv2-base-patch16

Zero-Shot Object Detection • 0.2B • Updated Apr 15, 2024 • 15.3k • 29

gv-hf/owl

google/owlvit-base-patch32

Zero-Shot Object Detection • 0.2B • Updated Dec 12, 2023 • 88k • 143
google/owlvit-base-patch16

Zero-Shot Object Detection • Updated Dec 12, 2023 • 19.4k • 13
google/owlvit-large-patch14

Zero-Shot Object Detection • Updated Dec 12, 2023 • 8.92k • 29
google/owlv2-base-patch16

Zero-Shot Object Detection • 0.2B • Updated Apr 15, 2024 • 15.3k • 29

merve/owl2

google/owlvit-base-patch32

Zero-Shot Object Detection • 0.2B • Updated Dec 12, 2023 • 88k • 143
google/owlvit-base-patch16

Zero-Shot Object Detection • Updated Dec 12, 2023 • 19.4k • 13
google/owlvit-large-patch14

Zero-Shot Object Detection • Updated Dec 12, 2023 • 8.92k • 29
google/owlv2-base-patch16

Zero-Shot Object Detection • 0.2B • Updated Apr 15, 2024 • 15.3k • 29

Depth Anything v2 Release

A comprehensive collection on DAv2

depth-anything/Depth-Anything-V2-Small

Depth Estimation • Updated Jul 8, 2024 • 12.5k • 75
depth-anything/Depth-Anything-V2-Large

Depth Estimation • Updated Jul 8, 2024 • 181k • 144
Running on Zero

603

Depth Anything V2

🌖

603

Generate depth maps from images
depth-anything/DA-2K

Viewer • Updated Jun 14, 2024 • 1.04k • 413 • 16

Document VLM Papers

VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding

Paper • 2407.12594 • Published Jul 17, 2024 • 19

Vision Language Leaderboards

This collection has all the vision language leaderboards.

Running

192

Vidore Leaderboard

🥇

192

Browse and compare visual document retrieval models
Running on CPU Upgrade

956

Open VLM Leaderboard

🌎

956

VLMEvalKit Evaluation Results Collection
Running

Featured

558

Vision Arena (Testing VLMs side-by-side)

🖼

558

Display image analysis results
Running

Featured

85

SEED-Bench Leaderboard

🏆

85

Submit model evaluation results to leaderboard

Video Language Models

A collection of video-language models

Paused

21

Video Llava

🐨

21

Generate descriptions by uploading images or videos
llava-hf/LLaVA-NeXT-Video-7B-hf

Video-Text-to-Text • 7B • Updated Nov 11, 2025 • 55.7k • 121
llava-hf/LLaVA-NeXT-Video-7B-DPO-hf

Video-Text-to-Text • 7B • Updated Nov 11, 2025 • 382 • 11
llava-hf/LLaVA-NeXT-Video-7B-32K-hf

Image-Text-to-Text • 8B • Updated Nov 11, 2025 • 192 • 8

NVEagle

NVEagle/Eagle-X5-13B

Image-Text-to-Text • 15B • Updated Sep 16, 2024 • 19 • 15
NVEagle/Eagle-X5-13B-Chat

Image-Text-to-Text • 15B • Updated Sep 16, 2024 • 33 • 28
NVEagle/Eagle-X5-7B

Image-Text-to-Text • 9B • Updated Sep 16, 2024 • 47 • 26
Runtime error

64

Eagle X5 13B Chat

🚀

64

Combine text and images to generate responses

Zero-shot Segmentation

sam-hq-team/SegInW

Updated Jul 13, 2023 • 1
xdecoder/X-Decoder

Updated Dec 27, 2023 • 5
xdecoder/SEEM

Updated Dec 30, 2023 • 8
Runtime error

Featured

60

OWLSAM2

🏃

60
