merve's Collections
MIT Talk 31/10 Papers
NVLM: Open Frontier-Class Multimodal LLMs
Paper • 2409.11402 • Published • 75
BRAVE: Broadening the visual encoding of vision-language models
Paper • 2404.07204 • Published • 19
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
Paper • 2403.18814 • Published • 48
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
Paper • 2409.17146 • Published • 122
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models
Paper • 2407.07895 • Published • 43
General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
Paper • 2409.01704 • Published • 84
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Paper • 2409.12191 • Published • 78
Unifying Multimodal Retrieval via Document Screenshot Embedding
Paper • 2406.11251 • Published • 10
LLaVA-OneVision: Easy Visual Task Transfer
Paper • 2408.03326 • Published • 61
ColPali: Efficient Document Retrieval with Vision Language Models
Paper • 2407.01449 • Published • 51
Paper • 2410.07073 • Published • 67
Building and better understanding vision-language models: insights and future directions
Paper • 2408.12637 • Published • 132
PaliGemma: A versatile 3B VLM for transfer
Paper • 2407.07726 • Published • 72
Sigmoid Loss for Language Image Pre-Training
Paper • 2303.15343 • Published • 8