update readme
README.md
CHANGED
@@ -17,7 +17,7 @@ tags:

<h1>A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone</h1>

[GitHub](https://github.com/OpenBMB/MiniCPM-V) | Online Demo [US](https://minicpm-omni-webdemo-us.modelbest.cn)/[CN](https://minicpm-omni-webdemo.modelbest.cn)

## MiniCPM-o 2.6
@@ -40,18 +40,17 @@ Advancing popular visual capabilites from MiniCPM-V series, MiniCPM-o 2.6 can pr
In addition to its friendly size, MiniCPM-o 2.6 also shows **state-of-the-art token density** (i.e., number of pixels encoded into each visual token). **It produces only 640 tokens when processing a 1.8M pixel image, which is 75% fewer than most models**. This directly improves the inference speed, first-token latency, memory usage, and power consumption. As a result, MiniCPM-o 2.6 can efficiently support **multimodal live streaming** on end-side devices such as iPad.

- 💫 **Easy Usage.**
MiniCPM-o 2.6 can be easily used in various ways: (1) [llama.cpp](https://github.com/OpenBMB/llama.cpp/blob/minicpm-omni/examples/llava/README-minicpmo2.6.md) support for efficient CPU inference on local devices, (2) [int4](https://huggingface.co/openbmb/MiniCPM-o-2_6-int4) and [GGUF](https://huggingface.co/openbmb/MiniCPM-o-2_6-gguf) format quantized models in 16 sizes, (3) [vLLM](#efficient-inference-with-llamacpp-ollama-vllm) support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with [LLaMA-Factory](./docs/llamafactory_train.md), (5) quick local WebUI demo setup with [Gradio](#chat-with-our-demo-on-gradio), and (6) online web demo on [CN](https://minicpm-omni-webdemo.modelbest.cn/) server and [US](https://minicpm-omni-webdemo-us.modelbest.cn/) server.
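Of the usage routes above, the plain Hugging Face Transformers path is the most direct for a quick local test. The snippet below is only a minimal loading sketch, not the README's own setup code: the repo id `openbmb/MiniCPM-o-2_6` is inferred from the links above, and any omni-specific init arguments the full README may pass (vision/audio/TTS initialization) are omitted here, so check the model card before relying on it.

```python
# Minimal loading sketch (assumed standard transformers remote-code path; the
# full README's own snippet may pass extra init flags for audio/TTS modules).
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-o-2_6"  # inferred from the quantized-model links above
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,       # the modeling code ships with the checkpoint
    torch_dtype=torch.bfloat16,   # keep the 8B model in half precision
)
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
```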
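As a quick back-of-the-envelope check of the token-density claim quoted above (just arithmetic on the README's own numbers, not code from the repository): 640 tokens for a 1.8M-pixel image works out to roughly 2,800 pixels per visual token, and "75% fewer than most models" implies a typical model would spend about 2,560 tokens on the same image.

```python
# Back-of-the-envelope check of the figures quoted in the README text above.
pixels = 1_800_000                             # 1.8M-pixel input image
minicpm_tokens = 640                           # visual tokens produced by MiniCPM-o 2.6
typical_tokens = minicpm_tokens / (1 - 0.75)   # "75% fewer than most models"

print(pixels / minicpm_tokens)   # ~2812 pixels per visual token (token density)
print(typical_tokens)            # ~2560 tokens for a typical model on the same image
```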
**Model Architecture.**
- **End-to-end Omni-modal Architecture.** Different modality encoders/decoders are connected and trained in an **end-to-end** fashion to fully exploit rich multimodal knowledge.
- **Omni-modal Live Streaming Mechanism.** (1) We change the offline modality encoders/decoders into online ones for **streaming inputs/outputs**. (2) We devise a **time-division multiplexing (TDM) mechanism** for omni-modality streaming processing in the LLM backbone. It divides parallel omni-modality streams into sequential information within small periodic time slices.
- **Configurable Speech Modeling Design.** We devise a multimodal system prompt, including a traditional text system prompt and **a new audio system prompt to determine the assistant's voice**. This enables flexible voice configuration at inference time, and also facilitates voice cloning and description-based voice creation.

<div align="center">
<img src="https://github.com/OpenBMB/MiniCPM-V/blob/main/assets/minicpm-o-26-framework.png" width="80%">
</div>
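To make the TDM idea above concrete, here is a purely illustrative sketch of how parallel modality streams could be serialized into one sequence per time slice. Everything in it (the function name, the chunking scheme, the toy tokens) is hypothetical and is not taken from the model's actual implementation.

```python
# Illustrative sketch of time-division multiplexing (TDM) over modality streams.
# All names and the slicing scheme are hypothetical, for intuition only.
from typing import Dict, List


def tdm_interleave(streams: Dict[str, List[list]]) -> list:
    """Merge per-modality token chunks into one sequence, time slice by time slice.

    `streams` maps a modality name (e.g. 'video', 'audio') to a list of token
    chunks, one chunk per periodic time slice.
    """
    num_slices = max(len(chunks) for chunks in streams.values())
    merged = []
    for t in range(num_slices):
        # Within a time slice, append every modality's chunk in a fixed order,
        # so parallel streams become one sequential stream for the LLM backbone.
        for chunks in streams.values():
            if t < len(chunks):
                merged.extend(chunks[t])
    return merged


# Two time slices of toy "tokens" from a video stream and an audio stream.
streams = {
    "video": [["v0_a", "v0_b"], ["v1_a", "v1_b"]],
    "audio": [["a0"], ["a1"]],
}
print(tdm_interleave(streams))
# ['v0_a', 'v0_b', 'a0', 'v1_a', 'v1_b', 'a1']
```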
### Evaluation <!-- omit in toc -->

@@ -593,7 +592,7 @@ Note: For proprietary models, we calculate token density based on the image enco

<td>-</td>
<td>-</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>8B</td>
<td><strong>1.6</strong></td>
@@ -714,7 +713,7 @@ Note: For proprietary models, we calculate token density based on the image enco

<td>3.4</td>
<td>10.0</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>8B</td>
<td><u>61.0</u></td>
@@ -768,7 +767,7 @@ All results are from AudioEvals, and the evaluation methods along with further d

<td>63</td>
<td>46</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>57</td>
<td>47</td>
@@ -899,7 +898,7 @@ Note: Mimick Task: Takes audio input, and outputs both an ASR transcription and

<td>33.4</td>
<td>57.7</td>
</tr>
<tr>
<td nowrap="nowrap" align="left">MiniCPM-o 2.6</td>
<td>8B</td>
<td><strong>79.9</strong></td>
@@ -919,9 +918,9 @@ We deploy MiniCPM-o 2.6 on end devices. The demo video is the raw screen recordi

<div style="display: flex; flex-direction: column; align-items: center;">
<img src="https://github.com/OpenBMB/MiniCPM-V/blob/main/assets/minicpmo2_6/minicpmo2_6_math_intersect.png" alt="math" style="margin-bottom: 5px;">
<img src="https://github.com/OpenBMB/MiniCPM-V/blob/main/assets/minicpmo2_6/minicpmo2_6_diagram_train_NN.png" alt="diagram" style="margin-bottom: 5px;">
<img src="https://github.com/OpenBMB/MiniCPM-V/blob/main/assets/minicpmo2_6/minicpmo2_6_multi-image_bike.png" alt="bike" style="margin-bottom: 5px;">
</div>
@@ -979,7 +978,7 @@ model.tts.float()

### Omni mode
We provide two inference modes: chat and streaming.

#### Chat inference
```python
import math
import numpy as np
```
@@ -1044,7 +1043,7 @@ res = model.chat(

```python
)
print(res)
```

#### Streaming inference
```python
# A new conversation needs to reset the session first; this clears the KV cache.
model.reset_session()
```
@@ -1238,7 +1237,7 @@ res = model.chat(

`MiniCPM-o-2_6` has the same inference methods as `MiniCPM-V-2_6`.

#### Chat with single image
```python
# test.py
image = Image.open('xx.jpg').convert('RGB')
```
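The hunk above only shows the first lines of the single-image example. Since the README states that `MiniCPM-o-2_6` shares `MiniCPM-V-2_6`'s inference methods, a plausible continuation follows the MiniCPM-V-style `model.chat` call sketched below; the prompt text and argument layout are assumptions to be checked against the full README, and `model`/`tokenizer` are the objects from the loading sketch earlier.

```python
# Illustrative continuation of the single-image example, assuming the
# MiniCPM-V-2_6-style chat API the README refers to (not the README's own code).
from PIL import Image

image = Image.open('xx.jpg').convert('RGB')
question = 'What is in the image?'  # hypothetical prompt
msgs = [{'role': 'user', 'content': [image, question]}]

# `model` and `tokenizer` as loaded in the earlier sketch.
res = model.chat(
    image=None,      # the image is passed inside `msgs` for this model family
    msgs=msgs,
    tokenizer=tokenizer,
)
print(res)
```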