---
pipeline_tag: image-text-to-text
datasets:
- openbmb/RLAIF-V-Dataset
library_name: transformers
language:
- multilingual
tags:
- minicpm-v
- vision
- ocr
- multi-image
- video
- custom_code
---

<h1>A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone</h1>

[GitHub](https://github.com/OpenBMB/MiniCPM-o) | [Demo](http://211.93.21.133:8889/)

## MiniCPM-V 4.0

**MiniCPM-V 4.0** is the latest efficient model in the MiniCPM-V series. It is built on SigLIP2-400M and MiniCPM4-3B, with 4.1B parameters in total. It inherits the strong single-image, multi-image and video understanding performance of MiniCPM-V 2.6 while substantially improving efficiency. Notable features of MiniCPM-V 4.0 include:

- 🔥 **Leading Visual Capability.**
  With only 4.1B parameters, MiniCPM-V 4.0 achieves an average score of 69.0 on OpenCompass, a comprehensive evaluation over 8 popular benchmarks, **outperforming GPT-4.1-mini-20250414, MiniCPM-V 2.6 (8.1B params, OpenCompass 65.2) and Qwen2.5-VL-3B-Instruct (3.8B params, OpenCompass 64.5)**. It also performs well on multi-image and video understanding.

- 🚀 **Superior Efficiency.**
  Designed for on-device deployment, MiniCPM-V 4.0 runs smoothly on end devices. For example, it delivers **less than 2s first-token delay and more than 17 tokens/s decoding on iPhone 16 Pro Max**, without heating problems. It also shows superior throughput under concurrent requests.

- 💫 **Easy Usage.**
  MiniCPM-V 4.0 can be used in many ways, including **llama.cpp, Ollama, vLLM, SGLang, LLaMA-Factory and a local web demo**. We also open-source an iOS app that runs on iPhone and iPad. Get started with our well-structured [Cookbook](https://github.com/OpenSQZ/MiniCPM-V-CookBook), which features detailed instructions and practical examples.
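
To put the on-device figures above in perspective, the sketch below turns them into a rough upper bound on response time. It assumes a simple two-phase model (a fixed first-token delay, then decoding at a constant rate); the function name `worst_case_latency_s` and its defaults are illustrative choices, not part of any MiniCPM-V tooling, and real latency depends on prompt length, image resolution and device state.

```python
def worst_case_latency_s(num_tokens: int,
                         first_token_delay_s: float = 2.0,
                         decode_tokens_per_s: float = 17.0) -> float:
    """Estimate seconds to generate `num_tokens` tokens.

    Defaults are the bounds quoted for iPhone 16 Pro Max (under 2 s first-token
    delay, over 17 tokens/s), so the result is a pessimistic estimate.
    """
    if num_tokens < 1:
        raise ValueError("num_tokens must be at least 1")
    # The first token arrives after the initial delay; the rest are decoded
    # at the steady-state rate (an assumption of this simplified model).
    return first_token_delay_s + (num_tokens - 1) / decode_tokens_per_s


if __name__ == "__main__":
    for n in (1, 50, 100):
        print(f"{n:>3} tokens: at most ~{worst_case_latency_s(n):.1f} s")
```

Under these assumptions, a 100-token answer would finish in a little under 8 seconds.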

### Evaluation

<details>
<summary>Click to view single image results on OpenCompass. </summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th nowrap="nowrap" align="left">Model</th>
<th>Size</th>
<th>OpenCompass</th>
<th>OCRBench</th>
<th>MathVista</th>
<th>HallusionBench</th>
<th>MMMU</th>
<th>MMVet</th>
<th>MMBench V1.1</th>
<th>MMStar</th>
<th>AI2D</th>
</tr>
</thead>
<tbody align="center">
<tr>
<td colspan="11" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4v-20240409</td>
<td>-</td>
<td>63.5</td>
<td>656</td>
<td>55.2</td>
<td>43.9</td>
<td>61.7</td>
<td>67.5</td>
<td>79.8</td>
<td>56.0</td>
<td>78.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini-1.5-Pro</td>
<td>-</td>
<td>64.5</td>
<td>754</td>
<td>58.3</td>
<td>45.6</td>
<td>60.6</td>
<td>64.0</td>
<td>73.9</td>
<td>59.1</td>
<td>79.1</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4.1-mini-20250414</td>
<td>-</td>
<td>68.9</td>
<td>840</td>
<td>70.9</td>
<td>49.3</td>
<td>55.0</td>
<td>74.3</td>
<td>80.9</td>
<td>60.9</td>
<td>76.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Claude 3.5 Sonnet-20241022</td>
<td>-</td>
<td>70.6</td>
<td>798</td>
<td>65.3</td>
<td>55.5</td>
<td>66.4</td>
<td>70.1</td>
<td>81.7</td>
<td>65.1</td>
<td>81.2</td>
</tr>
<tr>
<td colspan="11" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-3B-Instruct</td>
<td>3.8B</td>
<td>64.5</td>
<td>828</td>
<td>61.2</td>
<td>46.6</td>
<td>51.2</td>
<td>60.0</td>
<td>76.8</td>
<td>56.3</td>
<td>81.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-4B</td>
<td>3.7B</td>
<td>65.1</td>
<td>820</td>
<td>60.8</td>
<td>46.6</td>
<td>51.8</td>
<td>61.5</td>
<td>78.2</td>
<td>58.7</td>
<td>81.4</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-7B-Instruct</td>
<td>8.3B</td>
<td>70.9</td>
<td>888</td>
<td>68.1</td>
<td>51.9</td>
<td>58.0</td>
<td>69.7</td>
<td>82.2</td>
<td>64.1</td>
<td>84.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
<td>8.1B</td>
<td>68.1</td>
<td>821</td>
<td>64.5</td>
<td>49.0</td>
<td>56.2</td>
<td>62.8</td>
<td>82.5</td>
<td>63.2</td>
<td>84.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-2.6</td>
<td>8.1B</td>
<td>65.2</td>
<td>852</td>
<td>60.8</td>
<td>48.1</td>
<td>49.8</td>
<td>60.0</td>
<td>78.0</td>
<td>57.5</td>
<td>82.1</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o-2.6</td>
<td>8.7B</td>
<td>70.2</td>
<td>889</td>
<td>73.3</td>
<td>51.1</td>
<td>50.9</td>
<td>67.2</td>
<td>80.6</td>
<td>63.3</td>
<td>86.1</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-4.0</td>
<td>4.1B</td>
<td>69.0</td>
<td>894</td>
<td>66.9</td>
<td>50.8</td>
<td>51.2</td>
<td>68.0</td>
<td>79.7</td>
<td>62.8</td>
<td>82.9</td>
</tr>
</tbody>
</table>
</div>
</details>
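
The OpenCompass column in the table above can be reproduced from the eight benchmark columns. The sketch below assumes the aggregate is a plain mean with OCRBench rescaled from its 0-1000 range to 0-100; that rule is inferred from the reported numbers, not taken from the OpenCompass documentation, and the helper name `opencompass_average` is illustrative.

```python
# Scores copied from three rows of the single-image results table.
ROWS = {
    "MiniCPM-V-4.0": {
        "OCRBench": 894, "MathVista": 66.9, "HallusionBench": 50.8, "MMMU": 51.2,
        "MMVet": 68.0, "MMBench V1.1": 79.7, "MMStar": 62.8, "AI2D": 82.9,
    },
    "GPT-4.1-mini-20250414": {
        "OCRBench": 840, "MathVista": 70.9, "HallusionBench": 49.3, "MMMU": 55.0,
        "MMVet": 74.3, "MMBench V1.1": 80.9, "MMStar": 60.9, "AI2D": 76.0,
    },
    "Qwen2.5-VL-3B-Instruct": {
        "OCRBench": 828, "MathVista": 61.2, "HallusionBench": 46.6, "MMMU": 51.2,
        "MMVet": 60.0, "MMBench V1.1": 76.8, "MMStar": 56.3, "AI2D": 81.4,
    },
}


def opencompass_average(scores: dict) -> float:
    """Mean of the eight benchmarks, rounded to one decimal place."""
    # Assumption: OCRBench (0-1000) is divided by 10 to match the 0-100 scale.
    rescaled = [v / 10 if name == "OCRBench" else v for name, v in scores.items()]
    return round(sum(rescaled) / len(rescaled), 1)


if __name__ == "__main__":
    for model_name, scores in ROWS.items():
        print(model_name, opencompass_average(scores))
    # MiniCPM-V-4.0 69.0
    # GPT-4.1-mini-20250414 68.9
    # Qwen2.5-VL-3B-Instruct 64.5
```

The three computed averages match the OpenCompass values listed in the table, which supports the assumed aggregation rule for these rows.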

<details>
<summary>Click to view single image results on ChartQA, MME, RealWorldQA, TextVQA, DocVQA, MathVision, DynaMath, WeMath, Object HalBench and MM Halbench. </summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th nowrap="nowrap" align="left">Model</th>
<th>Size</th>
<th>ChartQA</th>
<th>MME</th>
<th>RealWorldQA</th>
<th>TextVQA</th>
<th>DocVQA</th>
<th>MathVision</th>
<th>DynaMath</th>
<th>WeMath</th>
<th colspan="2">Obj Hal</th>
<th colspan="2">MM Hal</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td>CHAIRs↓</td>
<td>CHAIRi↓</td>
<td nowrap="nowrap">score avg@3↑</td>
<td nowrap="nowrap">hall rate avg@3↓</td>
</tr>
</tbody>
<tbody align="center">
<tr>
<td colspan="14" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4v-20240409</td>
<td>-</td>
<td>78.5</td>
<td>1927</td>
<td>61.4</td>
<td>78.0</td>
<td>88.4</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini-1.5-Pro</td>
<td>-</td>
<td>87.2</td>
<td>-</td>
<td>67.5</td>
<td>78.8</td>
<td>93.1</td>
<td>41.0</td>
<td>31.5</td>
<td>50.5</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4.1-mini-20250414</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>45.3</td>
<td>47.7</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Claude 3.5 Sonnet-20241022</td>
<td>-</td>
<td>90.8</td>
<td>-</td>
<td>60.1</td>
<td>74.1</td>
<td>95.2</td>
<td>35.6</td>
<td>35.7</td>
<td>44.0</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td colspan="14" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-3B-Instruct</td>
<td>3.8B</td>
<td>84.0</td>
<td>2157</td>
<td>65.4</td>
<td>79.3</td>
<td>93.9</td>
<td>21.9</td>
<td>13.2</td>
<td>22.9</td>
<td>18.3</td>
<td>10.8</td>
<td>3.9</td>
<td>33.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-4B</td>
<td>3.7B</td>
<td>84.0</td>
<td>2338</td>
<td>64.3</td>
<td>76.8</td>
<td>91.6</td>
<td>18.4</td>
<td>15.2</td>
<td>21.2</td>
<td>13.7</td>
<td>8.7</td>
<td>3.2</td>
<td>46.5</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-7B-Instruct</td>
<td>8.3B</td>
<td>87.3</td>
<td>2347</td>
<td>68.5</td>
<td>84.9</td>
<td>95.7</td>
<td>25.4</td>
<td>21.8</td>
<td>36.2</td>
<td>13.3</td>
<td>7.9</td>
<td>4.1</td>
<td>31.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
<td>8.1B</td>
<td>84.8</td>
<td>2344</td>
<td>70.1</td>
<td>79.1</td>
<td>93.0</td>
<td>17.0</td>
<td>9.4</td>
<td>23.5</td>
<td>18.3</td>
<td>11.6</td>
<td>3.6</td>
<td>37.2</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-2.6</td>
<td>8.1B</td>
<td>79.4</td>
<td>2348</td>
<td>65.0</td>
<td>80.1</td>
<td>90.8</td>
<td>17.5</td>
<td>9.0</td>
<td>20.4</td>
<td>7.3</td>
<td>4.7</td>
<td>4.0</td>
<td>29.9</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o-2.6</td>
<td>8.7B</td>
<td>86.9</td>
<td>2372</td>
<td>68.1</td>
<td>82.0</td>
<td>93.5</td>
<td>21.7</td>
<td>10.4</td>
<td>25.2</td>
<td>6.3</td>
<td>3.4</td>
<td>4.1</td>
<td>31.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-4.0</td>
<td>4.1B</td>
<td>84.4</td>
<td>2298</td>
<td>68.5</td>
<td>80.8</td>
<td>92.9</td>
<td>20.7</td>
<td>14.2</td>
<td>32.7</td>
<td>6.3</td>
<td>3.5</td>
<td>4.1</td>
<td>29.2</td>
</tr>
</tbody>
</table>
</div>
</details>
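
For readers unfamiliar with the CHAIRs and CHAIRi columns above, the sketch below computes both on toy data using the commonly cited CHAIR definitions: the response-level rate is the share of responses mentioning at least one object absent from the image, and the instance-level rate is the share of all mentioned objects that are absent. The function `chair_scores` and the sample data are hypothetical; the actual Object HalBench pipeline (object extraction, synonym handling, prompts) is not reproduced here.

```python
def chair_scores(responses):
    """Return (CHAIRs, CHAIRi) in percent.

    `responses` is a list of (mentioned_objects, ground_truth_objects) pairs,
    where each element of a pair is a set of object names.
    """
    total_mentions = 0
    hallucinated_mentions = 0
    hallucinated_responses = 0
    for mentioned, truth in responses:
        hallucinated = mentioned - truth  # objects named but not in the image
        total_mentions += len(mentioned)
        hallucinated_mentions += len(hallucinated)
        hallucinated_responses += 1 if hallucinated else 0
    chair_s = 100.0 * hallucinated_responses / len(responses) if responses else 0.0
    chair_i = 100.0 * hallucinated_mentions / total_mentions if total_mentions else 0.0
    return chair_s, chair_i


if __name__ == "__main__":
    toy = [
        ({"dog", "frisbee"}, {"dog", "frisbee", "grass"}),  # no hallucination
        ({"cat", "sofa", "lamp"}, {"cat", "sofa"}),         # "lamp" is hallucinated
    ]
    print(chair_scores(toy))  # (50.0, 20.0)
```

Lower is better for both numbers, which is why the table marks them with a downward arrow.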

<details>
<summary>Click to view multi-image and video understanding results on Mantis, Blink and Video-MME. </summary>
<div align="center">
<table style="margin: 0px auto;">
<thead>
<tr>
<th nowrap="nowrap" align="left">Model</th>
<th>Size</th>
<th>Mantis</th>
<th>Blink</th>
<th nowrap="nowrap" colspan="2" >Video-MME</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>w/o subs</td>
<td>w/ subs</td>
</tr>
</tbody>
<tbody align="center">
<tr>
<td colspan="6" align="left"><strong>Proprietary</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4v-20240409</td>
<td>-</td>
<td>62.7</td>
<td>54.6</td>
<td>59.9</td>
<td>63.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Gemini-1.5-Pro</td>
<td>-</td>
<td>-</td>
<td>59.1</td>
<td>75.0</td>
<td>81.3</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">GPT-4o-20240513</td>
<td>-</td>
<td>-</td>
<td>68.0</td>
<td>71.9</td>
<td>77.2</td>
</tr>
<tr>
<td colspan="6" align="left"><strong>Open-source</strong></td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-3B-Instruct</td>
<td>3.8B</td>
<td>-</td>
<td>47.6</td>
<td>61.5</td>
<td>67.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-4B</td>
<td>3.7B</td>
<td>62.7</td>
<td>50.8</td>
<td>62.3</td>
<td>63.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">Qwen2.5-VL-7B-Instruct</td>
<td>8.3B</td>
<td>-</td>
<td>56.4</td>
<td>65.1</td>
<td>71.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">InternVL2.5-8B</td>
<td>8.1B</td>
<td>67.7</td>
<td>54.8</td>
<td>64.2</td>
<td>66.9</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-2.6</td>
<td>8.1B</td>
<td>69.1</td>
<td>53.0</td>
<td>60.9</td>
<td>63.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o-2.6</td>
<td>8.7B</td>
<td>71.9</td>
<td>56.7</td>
<td>63.9</td>
<td>69.6</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-V-4.0</td>
<td>4.1B</td>
<td>71.4</td>
<td>54.0</td>
<td>61.2</td>
<td>65.8</td>
</tr>
</tbody>
</table>
</div>
</details>

### Examples

<div style="display: flex; flex-direction: column; align-items: center;">
<img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv4/minicpm-v-4-case.png" alt="math" style="margin-bottom: 5px;">
</div>

Run locally on iPhone 16 Pro Max with the [iOS demo](https://github.com/OpenSQZ/MiniCPM-V-CookBook/blob/main/demo/ios_demo/ios.md).

<div align="center">
<img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv4/iphone_en.gif" width="45%" style="display: inline-block; margin: 0 10px;"/>
<img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv4/iphone_en_information_extraction.gif" width="45%" style="display: inline-block; margin: 0 10px;"/>
</div>

<div align="center">
<img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv4/iphone_cn.gif" width="45%" style="display: inline-block; margin: 0 10px;"/>
<img src="https://raw.githubusercontent.com/openbmb/MiniCPM-o/main/assets/minicpmv4/iphone_cn_funny_points.gif" width="45%" style="display: inline-block; margin: 0 10px;"/>
</div>

## Usage

```python
from PIL import Image
import torch
from transformers import AutoModel, AutoTokenizer

model_path = 'openbmb/MiniCPM-V-4'
model = AutoModel.from_pretrained(model_path, trust_remote_code=True,
                                  # sdpa or flash_attention_2, no eager
                                  attn_implementation='sdpa', torch_dtype=torch.bfloat16)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(
    model_path, trust_remote_code=True)

image = Image.open('./assets/single.png').convert('RGB')

# First round chat
question = "What is the landform in the picture?"
msgs = [{'role': 'user', 'content': [image, question]}]

answer = model.chat(
    msgs=msgs,
    image=image,
    tokenizer=tokenizer
)
print(answer)

# Second round chat, pass history context of multi-turn conversation
msgs.append({"role": "assistant", "content": [answer]})
msgs.append({"role": "user", "content": [
    "What should I pay attention to when traveling here?"]})

answer = model.chat(
    msgs=msgs,
    image=None,
    tokenizer=tokenizer
)
print(answer)
```
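
Since the model also targets multi-image understanding, a multi-image prompt can plausibly be built the same way as the single-image one. The helper below is a sketch only: `build_user_message` is a hypothetical name, and it assumes that `model.chat` accepts several PIL images in one `content` list, as documented for MiniCPM-V 2.6; this has not been verified here for MiniCPM-V 4.0.

```python
def build_user_message(images, question):
    """Return a one-turn message list with all images placed before the question."""
    if not question:
        raise ValueError("question must be a non-empty string")
    # Same structure as the single-image example: a list mixing images and text.
    return [{"role": "user", "content": [*images, question]}]


# Usage sketch (assumes `model` and `tokenizer` are loaded as in the example
# above; the image paths are hypothetical):
#
#   from PIL import Image
#   image1 = Image.open("image1.jpg").convert("RGB")
#   image2 = Image.open("image2.jpg").convert("RGB")
#   msgs = build_user_message([image1, image2],
#                             "Compare image 1 and image 2.")
#   answer = model.chat(msgs=msgs, image=None, tokenizer=tokenizer)
#   print(answer)
```

Keeping message construction in a small function makes it easy to check the prompt structure without loading the model.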

## License
#### Model License
* The code in this repo is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
* The usage of MiniCPM-V series model weights must strictly follow [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM-o/blob/main/MiniCPM%20Model%20License.md).
* The models and weights of MiniCPM are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, MiniCPM-V 2.6 weights are also available for free commercial use.

#### Statement
* As an LMM, MiniCPM-V 4.0 generates content by learning from a large amount of multimodal corpora, but it cannot comprehend, express personal opinions or make value judgements. Anything generated by MiniCPM-V 4.0 does not represent the views and positions of the model developers.
* We will not be liable for any problems arising from the use of the MiniCPM-V models, including but not limited to data security issues, risk of public opinion, or any risks and problems arising from the misdirection, misuse or dissemination of the model.


## Key Techniques and Other Multimodal Projects

👏 Welcome to explore key techniques of MiniCPM-V 2.6 and other multimodal projects of our team:

[VisCPM](https://github.com/OpenBMB/VisCPM/tree/main) | [RLHF-V](https://github.com/RLHF-V/RLHF-V) | [LLaVA-UHD](https://github.com/thunlp/LLaVA-UHD) | [RLAIF-V](https://github.com/RLHF-V/RLAIF-V)


## Citation

If you find our work helpful, please consider citing our papers 📝 and liking this project ❤️!

```bib
@article{yao2024minicpm,
  title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
  author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
  journal={Nature Communications},
  volume={16},
  pages={5509},
  year={2025}
}
```