merve PRO
- moonshotai/Kimi-VL-A3B-Thinking
  Image-Text-to-Text • 16B • Updated • 5.16k • 436
- agentica-org/DeepCoder-14B-Preview
  Text Generation • 15B • Updated • 65.2k • 671
- HiDream-ai/HiDream-I1-Full
  Text-to-Image • Updated • 135k • 961
- OpenGVLab/InternVL3-78B
  Image-Text-to-Text • 78B • Updated • 428k • 217
- NVLM: Open Frontier-Class Multimodal LLMs
  Paper • 2409.11402 • Published • 75
- BRAVE: Broadening the visual encoding of vision-language models
  Paper • 2404.07204 • Published • 19
- Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
  Paper • 2403.18814 • Published • 48
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
  Paper • 2409.17146 • Published • 122
- facebook/dinov2-large
  Image Feature Extraction • 0.3B • Updated • 994k • 92
- google/flan-t5-xl
  3B • Updated • 371k • 517
- google/siglip-large-patch16-384
  Zero-Shot Image Classification • 0.7B • Updated • 18.4k • 8
- google/vit-huge-patch14-224-in21k
  Image Feature Extraction • 0.6B • Updated • 27.9k • 21
- facebook/deit-base-distilled-patch16-384
  Image Classification • 0.1B • Updated • 725 • 5
- facebook/convnextv2-base-1k-224
  Image Classification • 0.1B • Updated • 232 • 3
- facebook/deit-base-distilled-patch16-224
  Image Classification • Updated • 13.2k • 27
- google/vit-base-patch32-384
  Image Classification • 0.1B • Updated • 2.31k • 23
- facebook/maskformer-swin-large-coco
  Image Segmentation • 0.2B • Updated • 2.81k • 26
- nvidia/segformer-b0-finetuned-ade-512-512
  Image Segmentation • 0.0B • Updated • 197k • 164
- facebook/detr-resnet-50-dc5-panoptic
  Image Segmentation • 0.0B • Updated • 43 • 3
- nvidia/segformer-b5-finetuned-cityscapes-1024-1024
  Image Segmentation • Updated • 184k • 30
- Salesforce/blip-image-captioning-large
  Image-to-Text • 0.5B • Updated • 1.34M • 1.39k
- Salesforce/blip-image-captioning-base
  Image-to-Text • Updated • 1.98M • 768
- microsoft/trocr-base-handwritten
  Image-to-Text • 0.3B • Updated • 705k • 429
- microsoft/git-large-coco
  Image-to-Text • 0.4B • Updated • 7.8k • 104
- Owlv2
  Running • 90
  👀 State-of-the-art Zero-shot Object Detection
- Owl Tracking
  Running on Zero • 64
  ⚡ Powerful foundation model for zero-shot object tracking
- Search and Detect (CLIP/OWL-ViT)
  Running • 25
  🦉 Search and detect objects in images using text queries
- OWLSAM
  Running on Zero • 102
  😻 State-of-the-art open-vocabulary image segmentation ⚡️
- Improved Baselines with Visual Instruction Tuning
  Paper • 2310.03744 • Published • 38
- DeepSeek-VL: Towards Real-World Vision-Language Understanding
  Paper • 2403.05525 • Published • 47
- Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
  Paper • 2308.12966 • Published • 9
- LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model
  Paper • 2404.01331 • Published • 28
- google/owlvit-base-patch32
  Zero-Shot Object Detection • 0.2B • Updated • 118k • 137
- google/owlvit-base-patch16
  Zero-Shot Object Detection • Updated • 6.73k • 12
- google/owlvit-large-patch14
  Zero-Shot Object Detection • Updated • 30.1k • 25
- google/owlv2-base-patch16
  Zero-Shot Object Detection • 0.2B • Updated • 19.6k • 27
- Vidore Leaderboard
  Running • 168
  🥇 Explore visual document retrieval benchmark results
- Open VLM Leaderboard
  Running on CPU Upgrade • 862
  🌎 VLMEvalKit Evaluation Results Collection
- Vision Arena (Testing VLMs side-by-side)
  Running • 556
  🖼 Analyze images to detect and label objects
- SEED-Bench Leaderboard
  Running • 85
  🏆
- deepseek-ai/DeepSeek-V3-0324
  Text Generation • 685B • Updated • 402k • 3.05k
- Qwen/Qwen2.5-Omni-7B
  Any-to-Any • 11B • Updated • 146k • 1.76k
- google/txgemma-27b-chat
  Text Generation • 27B • Updated • 861 • 54
- Qwen2.5 Omni 7B Demo
  Running • 340
  🏆 Generate text and speech responses from various inputs
- Qwen/Qwen2-VL-7B-Instruct
  Image-Text-to-Text • 8B • Updated • 530k • 1.22k
- Qwen/Qwen2-VL-2B-Instruct
  Image-Text-to-Text • 2B • Updated • 2.39M • 439
- CohereLabs/aya-vision-8b
  Image-Text-to-Text • 9B • Updated • 49.2k • 307
- CohereLabs/aya-vision-32b
  Image-Text-to-Text • 33B • Updated • 165 • 214
- Qwen2-VL-7B
  Running on Zero • 255
  🔥 Generate text by combining an image and a question
- UI-TARS
  Running • 58
  🌖 Select coordinates on an image based on instructions
- Qwen2.5-1M Demo
  Running • 88
  💻 Upload documents and ask questions
- Qwen/Qwen2.5-14B-Instruct-1M
  Text Generation • 15B • Updated • 15.5k • 318
- ibm-granite/granite-3.0-8b-instruct
  Text Generation • 8B • Updated • 30.7k • 201
- ibm-granite/granite-3.0-2b-instruct
  Text Generation • 3B • Updated • 4.36k • 46
- CohereLabs/aya-expanse-8b
  Text Generation • 8B • Updated • 14.5k • 396
- CohereLabs/aya-expanse-32b
  Text Generation • 32B • Updated • 7.9k • 267
- microsoft/resnet-50
  Image Classification • 0.0B • Updated • 160k • 434
- google/vit-base-patch16-224-in21k
  Image Feature Extraction • 0.1B • Updated • 1.96M • 366
- google/vit-base-patch32-224-in21k
  Image Feature Extraction • 0.1B • Updated • 55.2k • 19
- facebook/dinov2-large
  Image Feature Extraction • 0.3B • Updated • 994k • 92
- facebook/detr-resnet-50
  Object Detection • 0.0B • Updated • 337k • 887
- facebook/detr-resnet-101-dc5
  Object Detection • 0.1B • Updated • 5.6k • 19
- facebook/detr-resnet-50-dc5
  Object Detection • 0.0B • Updated • 1.98k • 6
- google/owlvit-base-patch32
  Zero-Shot Object Detection • 0.2B • Updated • 118k • 137
- openai/clip-vit-large-patch14
  Zero-Shot Image Classification • 0.4B • Updated • 9.32M • 1.84k
- openai/clip-vit-base-patch32
  Zero-Shot Image Classification • Updated • 17.9M • 746
- laion/CLIP-ViT-bigG-14-laion2B-39B-b160k
  Zero-Shot Image Classification • Updated • 881k • 289
- kakaobrain/align-base
  Zero-Shot Image Classification • Updated • 25.3k • 26
- microsoft/xclip-base-patch32
  Video Classification • 0.2B • Updated • 178k • 97
- facebook/timesformer-base-finetuned-k400
  Video Classification • Updated • 22.6k • 42
- facebook/timesformer-base-finetuned-k600
  Video Classification • Updated • 4k • 12
- google/vivit-b-16x2
  Video Classification • Updated • 556 • 11
- Draw To Search Art
  Running on Zero • 71
  🐠 Draw/upload image and search among WikiART using SigLIP
- Compare Clip Siglip
  Running on CPU Upgrade • 22
  🏃 Compare strong zero-shot image classification models
- Multilingual Zero Shot Image Clf
  Running on Zero • 13
  🏢 Comparing powerful multilingual zero-shot image clf models
- BAAI/bunny-phi-2-siglip-lora
  Text Generation • Updated • 40 • 48
- Video Llava
  Running • 21
  🐨 Generate descriptions by uploading images or videos
- llava-hf/LLaVA-NeXT-Video-7B-hf
  Video-Text-to-Text • 7B • Updated • 95.9k • 106
- llava-hf/LLaVA-NeXT-Video-7B-DPO-hf
  Video-Text-to-Text • 7B • Updated • 1.46k • 9
- llava-hf/LLaVA-NeXT-Video-7B-32K-hf
  Image-Text-to-Text • 8B • Updated • 1.08k • 7