---
license: mit
base_model:
- inclusionAI/Ming-Lite-Omni
pipeline_tag: any-to-any
---

# Ming-Lite-Omni v1.5

<p align="center">📑 <a href="https://arxiv.org/abs/2506.09344">Technical Report</a> | 📖 <a href="https://lucaria-academy.github.io/Ming-Omni/">Project Page</a> | 🤗 <a href="https://huggingface.co/inclusionAI/Ming-Lite-Omni-1.5">Hugging Face</a> | 🤖 <a href="https://www.modelscope.cn/models/inclusionAI/Ming-Lite-Omni-1.5">ModelScope</a></p>

## Introduction

Ming-lite-omni v1.5 is a comprehensive upgrade to the full-modal capabilities of [Ming-lite-omni](https://github.com/inclusionAI/Ming/tree/v1.0). It significantly improves performance across tasks including image-text understanding, document understanding, video understanding, speech understanding and synthesis, and image generation and editing. Built upon [Ling-lite-1.5](https://github.com/inclusionAI/Ling), Ming-lite-omni v1.5 has 20.3 billion parameters in total, with 3 billion active parameters in its MoE (Mixture-of-Experts) section, and achieves highly competitive results on benchmarks across all modalities compared to industry-leading models.

## 📌 Updates

* [2025.07.15] 🔥 We release [Ming-lite-omni v1.5](https://inclusionai.github.io/blog/ming-lite-omni-1_5/) with significant improvements across all modalities.
* [2025.06.12] 🔥 Our [Technical Report](https://arxiv.org/abs/2506.09344) is now publicly available on arXiv.
* [2025.05.28] 🔥 The official version of [Ming-lite-omni v1](https://github.com/inclusionAI/Ming/tree/v1.0) is released, with better performance and image generation support.
* [2025.05.04] 🔥 We release the test version of Ming-lite-omni: [Ming-lite-omni-Preview](https://github.com/inclusionAI/Ming/tree/Ming-Lite-Omni-Preview).

## Key Features

Compared to [Ming-lite-omni](https://github.com/inclusionAI/Ming/tree/v1.0), Ming-lite-omni v1.5 features key optimizations in three areas:
- **Enhanced video understanding (MRoPE & curriculum learning)**: Ming-lite-omni v1.5 significantly improves video understanding through MRoPE's 3D spatiotemporal position encoding and a curriculum learning strategy for handling long videos, enabling precise comprehension of complex visual sequences.
- **Refined multi-modal generation (consistency & perception control)**: Ming-lite-omni v1.5 offers superior generation, featuring dual-branch image generation with an ID & Scene Consistency Loss for coherent editing, and perception enhancement for detailed visual control. Its new audio decoder and BPE encoding also deliver high-quality, real-time speech synthesis.
- **Comprehensive data upgrades (broadened and refined fine-grained data)**: Ming-lite-omni v1.5's capabilities are built on extensive data upgrades, including new structured text data, expanded high-quality product information, and refined fine-grained visual and speech perception data (including dialects). This provides a richer, more accurate foundation for all modalities.

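To make the MRoPE idea above concrete, here is a minimal, illustrative sketch (not Ming's actual implementation) of 3D spatiotemporal position indexing: text tokens share one index across all three axes, which reduces to ordinary 1D RoPE, while each video patch is indexed by its (frame, row, column) coordinates. The function name and the offset scheme are assumptions for illustration only.

```python
def mrope_position_ids(num_text, t, h, w):
    """Illustrative 3D (temporal, height, width) position ids, MRoPE-style.

    A sketch of the indexing idea only, not Ming's actual code.
    """
    # Text tokens: the same index on every axis, i.e. plain 1D RoPE behaviour.
    text = [[i, i, i] for i in range(num_text)]
    # Video patches: (frame, row, col) coordinates, offset to follow the text.
    vision = [[num_text + ti, num_text + hi, num_text + wi]
              for ti in range(t) for hi in range(h) for wi in range(w)]
    return text + vision  # one [t, h, w] triple per token

ids = mrope_position_ids(num_text=4, t=2, h=3, w=3)
print(len(ids))  # 4 text tokens + 2*3*3 video patches = 22
```

Giving temporal, height, and width their own rotary axes lets attention distinguish "same position, later frame" from "same frame, different location", which is what the curriculum over longer videos exploits.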
## Evaluation

In benchmarks across all modalities, Ming-lite-omni v1.5 demonstrates highly competitive results compared to industry-leading models of similar scale.

### Image-text Understanding

Ming-lite-omni v1.5 shows significant improvements in general image-text understanding, visual object localization, and universal object recognition, providing a more powerful base model for a wide range of visual applications.

| Task Type | Dataset | Qwen2.5-VL-7B | Ming-lite-omni v1.5 |
|------------------|---------------------------------------------------------------------------------------------------|---------------|----------------|
| **OpenCompass** | AI2D | 84.36 | **84.91** |
| | HallusionBench | 55.77 | 54.59 |
| | MMBench_TEST_V11 | 82.75 | 80.73 |
| | MMMU | 56.56 | 54.33 |
| | MMStar | 65.27 | 65.07 |
| | MMVet | 71.61 | **73.99** |
| | MathVista | 68.10 | **72.00** |
| | OCRBench | 87.80 | **88.90** |
| | Average | 71.5 | **71.8** |
| **Localization** | RefCOCO_val/testA/testB | 90.00/92.5/85.4 | **91.40**/**93.2**/**87.1** |
| | RefCOCO+_val/testA/testB | 84.20/89.1/76.9 | **86.30**/**90.5**/**79.2** |
| | RefCOCOg_val/test | **87.2**/87.2 | 87.1/**87.6** |
| **Recognition** | General Recognition | 92.42 | **92.53** |
| | Vertical domains for natural encyclopedias (animals, plants, ingredients, vehicles, dishes, etc.) | 47.79 | **54.27** |

### Document Understanding

Ming-lite-omni v1.5 generally performs on par with Qwen2.5-VL-7B in complex document understanding tasks. Notably, it achieves SOTA results among models under 10B parameters on OCRBench, which focuses on text-visual understanding, and on ChartQA, which requires in-depth chart analysis and logical reasoning.

| Task Type | Dataset | Qwen2.5-VL-7B | Ming-lite-omni v1 | Ming-lite-omni v1.5 |
|:---------------------------------------------------|:--------------------| :------------ | :----------- | :------------- |
| OCR Understanding | ChartQA_test | 87.24 | 85.1 | **88.84** |
| | DocVQA_test | **95.57** | 93.0 | 93.68 |
| | TextVQA_val | **85.06** | 82.8 | 82.27 |
| | OCRBench | 87.8 | 88.4 | **88.90** |
| | Average | **88.91** | 87.32 | 88.42 |
| Document Analysis | OmniDocBench↓ en/zh | **30.8**/39.8 | 34.0/34.4 | 34.9/**34.9** |
| OCR Comprehensive Capability | OCRBenchV2 en/zh | 56.3/57.2 | 53.3/52.0 | 52.1/55.2 |

### Video Understanding

Ming-lite-omni v1.5 achieves a leading position among models of its size in video understanding tasks.

| Benchmark | Qwen2.5-VL-7B | Qwen2.5-Omni-7B | InternVL3-8B | **Ming-lite-omni v1.5** |
|:-----------------------| :------------: | :--------------: | :----------: | :----------------: |
| **VideoMME (w/o subs)** | 65.10 | 64.30 | 66.30 | **67.07** |
| **VideoMME (w/ subs)** | 71.60 | 72.40 | 68.90 | **72.59** |
| **VideoMME (avg)** | 68.35 | 68.35 | 67.60 | **69.83** |
| **MVBench** | 69.60 | 70.30 | **75.40** | 69.43 |
| **LongVideoBench** | 56.00 | 54.82 | 58.80 | **59.54** |
| **OvOBench** | 51.10 | 50.46 | 51.91 | **52.17** |

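The VideoMME average row is the arithmetic mean of the with-subtitles and without-subtitles scores, which can be checked directly against the table:

```python
# VideoMME scores from the table above: (w/o subs, w/ subs, reported avg).
videomme = {
    "Qwen2.5-VL-7B":       (65.10, 71.60, 68.35),
    "Qwen2.5-Omni-7B":     (64.30, 72.40, 68.35),
    "InternVL3-8B":        (66.30, 68.90, 67.60),
    "Ming-lite-omni v1.5": (67.07, 72.59, 69.83),
}
for name, (wo_subs, w_subs, avg) in videomme.items():
    # Each reported average equals the mean of the two settings.
    assert abs((wo_subs + w_subs) / 2 - avg) < 0.005, name
print("all VideoMME averages check out")
```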
### Speech Understanding

Ming-lite-omni v1.5 further improves upon Ming-lite-omni in speech understanding. It supports English, Mandarin, Cantonese, Sichuanese, Shanghainese, Minnan, and other dialects, maintaining an industry-leading position among open-source models in English and Mandarin ASR (Automatic Speech Recognition) and Audio QA (Question Answering) tasks.

- ASR Task

| Model | Average on All/Open-source Benchmarks (↓) | aishell1 | aishell2_test_android | aishell2_test_ios | cv15_zh | fleurs_zh | wenetspeech_testmeeting | wenetspeech_testnet | librispeech_test_clean | librispeech_test_other | multilingual_librispeech | cv15_en | fleurs_en | voxpopuli_v1.0_en | speechio_leaderboard | dialect_hunan | dialect_minnan | dialect_guangyue | dialect_chuanyu | dialect_shanghai | noisy_jrgj | zxb_chat | zxb_govern | zxb_health | zxb_knowledge | zxb_local_live |
|:-------------------|:-----------------------------------------| :------- | :-------------------- | :---------------- | :------ | :-------- | :---------------------- | :------------------ | :--------------------- | :--------------------- | :----------------------- | :------ | :-------- | :---------------- | :------------------- | :------------ | :------------- | :--------------- | :-------------- | :--------------- | :--------- | :------- | :--------- | :--------- | :------------ | :------------- |
| Ming-lite-omni v1.5 | 4.67(+0.15)/3.83(+0.05) | 1.3 | 2.47 | 2.46 | 5.66 | 2.87 | 6.19 | 5.24 | 1.25 | 2.61 | 4.14 | 6.95 | 3.28 | 6.43 | 2.81 | 6.96 | 12.74 | 3.7 | 3.8 | 9.95 | 10.9 | 2.6 | 1.77 | 2.97 | 3.41 | 1.88 |
| Ming-lite-omni | 4.82/3.88 | 1.47 | 2.55 | 2.52 | 6.31 | 2.96 | 5.95 | 5.46 | 1.44 | 2.80 | 4.15 | 6.89 | 3.39 | 5.80 | 2.65 | 7.88 | 13.84 | 4.36 | 4.33 | 10.49 | 11.62 | 2.34 | 1.77 | 3.31 | 3.69 | 2.44 |
| Qwen2.5-Omni | 8.81/4.37 | 1.18 | 2.75 | 2.63 | 5.2 | 3.0 | 5.9 | 7.7 | 1.8 | 3.4 | 7.56 | 7.6 | 4.1 | 5.8 | 2.54 | 29.31 | 53.43 | 10.39 | 7.61 | 32.05 | 11.11 | 3.68 | 2.23 | 4.02 | 3.17 | 2.03 |
| Qwen2-Audio | 12.34/5.41 | 1.53 | 2.92 | 2.92 | 6.9 | 7.5 | 7.16 | 8.42 | 1.6 | 3.6 | 5.40 | 8.6 | 6.90 | 6.84 | - | 25.88 | 123.78 | 7.59 | 7.77 | 31.73 | - | 4.29 | 2.70 | 4.18 | 3.33 | 2.34 |
| Kimi-Audio | 12.75/4.42 | 0.60 | 2.64 | 2.56 | 7.21 | 2.69 | 6.28 | 5.37 | 1.28 | 2.42 | 5.88 | 10.31 | 4.44 | 7.97 | 2.23 | 31.93 | 80.28 | 41.49 | 6.69 | 60.64 | 24.40 | 2.96 | 2.03 | 2.38 | 1.98 | 2.05 |

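The ASR numbers above are word error rates (lower is better). As a reference for how such scores are computed, here is a minimal WER implementation: the word-level edit distance between hypothesis and reference, divided by the reference length. Real evaluations add text normalization, and Chinese test sets are typically scored per character (CER); this sketch shows only the core metric.

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])  # substitution (free if equal)
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)  # vs. deletion/insertion
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 edits / 6 words ≈ 0.333
```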
- Speech QA Task

| Model | Average (Open-ended QA) | AlpacaEval | CommonEval | SD-QA | MMSU | OpenBookQA | IFEval | AdvBench |
|:--------------------------|:-----------------------| :--------- | :--------- | :---- | :---- | :--------- | :----- | :-------- |
| Ming-lite-omni v1.5 [omni] | 4.474(+0.134) | 4.648 | 4.3 | 61.16 | 45.77 | 65.934 | 55.599 | 98.076 |
| Ming-lite-omni [omni] | 4.34 | 4.63 | 4.06 | 58.84 | 47.53 | 61.98 | 58.36 | 99.04 |
| MiniCPM-o [omni] | 4.285 | 4.42 | 4.15 | 50.72 | 54.78 | 78.02 | 49.25 | 97.69 |
| Kimi-Audio [audio] | 4.215 | 4.46 | 3.97 | 63.12 | 62.17 | 83.52 | 61.10 | 100.00 |
| Qwen2.5-Omni [omni] | 4.21 | 4.49 | 3.93 | 55.71 | 61.32 | 81.10 | 52.87 | 99.42 |
| GLM-4-Voice [audio] | 3.77 | 4.06 | 3.48 | 43.31 | 40.11 | 52.97 | 24.91 | 88.08 |
| Qwen2-Audio-chat [audio] | 3.545 | 3.69 | 3.40 | 35.35 | 35.43 | 49.01 | 22.57 | 98.85 |
| Step-Audio-chat [audio] | 3.49 | 3.99 | 2.99 | 46.84 | 31.87 | 29.19 | 65.77 | 86.73 |

### Speech Generation

Ming-lite-omni v1.5 improves markedly over Ming-lite-omni on English voice cloning and remains comparable on Mandarin.

| Model | seed-tts-eval-zh_wer | seed-tts-eval-zh_sim | seed-tts-eval-en_wer | seed-tts-eval-en_sim |
|:---------------:|:--------------------:|:--------------------:|:--------------------:|:--------------------:|
| Seed-TTS | 1.11 | 0.796 | 2.24 | 0.762 |
| MaskGCT | 2.27 | 0.774 | 2.62 | 0.714 |
| E2 TTS | 1.97 | 0.730 | 2.19 | 0.710 |
| F5-TTS | 1.56 | 0.741 | 1.83 | 0.647 |
| CosyVoice 2 | 1.45 | 0.748 | 2.57 | 0.652 |
| Qwen2.5-Omni-7B | 1.70 | 0.752 | 2.72 | 0.632 |
| Ming-lite-omni | 1.69 | 0.68 | 4.31 | 0.509 |
| Ming-lite-omni v1.5 | 1.93 | 0.68 | 3.75 | 0.54 |

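In the table above, `wer` is the word error rate of the cloned speech (lower is better) and `sim` is speaker similarity between the cloned and reference voices, conventionally computed as the cosine similarity of speaker embeddings from a speaker-verification model. A minimal sketch of the similarity metric, assuming the embeddings have already been extracted:

```python
import math

def speaker_similarity(emb_a, emb_b):
    """Cosine similarity between two speaker-embedding vectors (range -1..1)."""
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(a * a for a in emb_a))
    norm_b = math.sqrt(sum(b * b for b in emb_b))
    return dot / (norm_a * norm_b)

# Identical embeddings score 1.0; orthogonal (unrelated) ones score 0.0.
print(speaker_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
```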

### Image Generation

Ming-lite-omni v1.5 demonstrates significant advantages in maintaining scene and person ID consistency during human image editing. It also expands its support for perception tasks such as generative segmentation, depth prediction, object detection, and edge contour generation.

| GenEval | 1-Obj | 2-Obj | Counting | Colors | Position | Color Attr | Avg. |
|---------------------| :---: | :---: | :---: | :---: |:---: |:---: |:---: |
| Ming-lite-omni | 0.99 | 0.77 | 0.68 | 0.78 | 0.46 | 0.42 | 0.64 |
| Ming-lite-omni v1.5 | 0.99 | 0.93 | 0.86 | 0.87 | 0.90 | 0.66 | 0.87 |


- Human image editing

| Prompt | Ours | Qwen-VLo |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| --- | --- |
| Make the person in the image smile slightly without altering the original structure ![](https://github.com/Biao-Gong/static/blob/main/gen/1752147843685-5b097f6b-b2aa-4baf-abe4-f1abd89265e8.png?raw=true) | ![](https://raw.githubusercontent.com/Biao-Gong/static/refs/heads/main/gen/1752147837185-62077f0c-e7ec-415f-bd34-1c8453253949.webp) | ![](https://raw.githubusercontent.com/Biao-Gong/static/refs/heads/main/gen/1752147953713-703c31c8-2fd1-4c2d-b4bc-6e0f52e70017.webp) |

- Generative segmentation

| Input | Referring Segmentation | Semantic Segmentation | Panoptic Segmentation |
| --- | --- | --- | --- |
| ![](https://raw.githubusercontent.com/Biao-Gong/static/refs/heads/main/gen/1752115158022-12254e69-e8c0-43fb-a725-f6730cda22d8.webp) | ![](https://raw.githubusercontent.com/Biao-Gong/static/refs/heads/main/gen/1752115142775-3975827c-4110-445b-af53-e20201d1043a.webp)<br/>prompt: Given the following instructions: little girl, pink, your monitors colors off friend p pink shirt girl; please perform referring segmentation on this image. | ![](https://raw.githubusercontent.com/Biao-Gong/static/refs/heads/main/gen/1752116495974-7708ba3a-5909-46df-82f5-a1bfa1519d4d.webp)<br/>prompt: Please segment different **classes** in this image. | ![](https://raw.githubusercontent.com/Biao-Gong/static/refs/heads/main/gen/1752115151406-c4780a97-5f1c-46cd-9a45-d4ef600d0897.webp)<br/>prompt: Please segment different **instances** in this image. |

- Edge contour generation

| Input | Depth Map | Detection Box | Contour |
|---|--------------------------------------------------------------------------------------------------------------------------------------| --- | --- |
| ![](https://raw.githubusercontent.com/Biao-Gong/static/refs/heads/main/gen/1752466889319-bd19acce-c07d-4664-9890-41e4dff1ba8d.webp) | ![](https://raw.githubusercontent.com/Biao-Gong/static/refs/heads/main/gen/1752466903529-996bcd35-a9a0-484b-98bf-2f2468f4df42.webp) | ![](https://raw.githubusercontent.com/Biao-Gong/static/refs/heads/main/gen/1752466895795-1955ead5-6d94-4142-8d7b-e265352d2bcb.webp) | ![](https://raw.githubusercontent.com/Biao-Gong/static/refs/heads/main/gen/1752467020122-ad8b436c-bb33-4ef0-85b8-cf45ae8c9be1.webp) |

## Model Downloads

You can download our latest model from both Hugging Face and ModelScope. For previous versions such as [Ming-Lite-Omni v1](https://github.com/inclusionAI/Ming/tree/v1.0), please refer to this [link](https://github.com/inclusionAI/Ming/tree/v1.0?tab=readme-ov-file#model-downloads).

<div align="center">

| **Model** | **Input modality** | **Output modality** | **Download** |
|:-------------------|:----------------------:| :---------------: |:------------------------------------------------------------------------------------------------------------------------------------------------------------:|
| Ming-Lite-Omni-1.5 | Image, text, video, audio | Image, text, audio | [🤗 HuggingFace](https://huggingface.co/inclusionAI/Ming-Lite-Omni-1.5) <br>[🤖 ModelScope](https://www.modelscope.cn/models/inclusionAI/Ming-Lite-Omni-1.5) |
</div>

If you are in mainland China, we strongly recommend downloading our model from 🤖 <a href="https://www.modelscope.cn/models/inclusionAI/Ming-Lite-Omni-1.5">ModelScope</a>:

```shell
pip install modelscope
modelscope download --model inclusionAI/Ming-Lite-Omni-1.5 --local_dir inclusionAI/Ming-Lite-Omni-1.5 --revision master
```

Note: Depending on your network conditions, the download may take from several minutes to several hours.

## Use Cases

Additional demonstration cases are available on our project [page](https://lucaria-academy.github.io/Ming-Omni/).

## Environment Preparation

### Installation with pip

```shell
pip install -r requirements.txt
# for Python 3.10
pip install data/matcha_tts-0.0.5.1-cp310-cp310-linux_x86_64.whl
# for Python 3.8
# pip install data/matcha_tts-0.0.5.1-cp38-cp38-linux_x86_64.whl
pip install diffusers==0.33.0
pip install nvidia-cublas-cu12==12.4.5.8  # for H20 GPU
```

### Installation with Docker

You can also set up the environment by building a Docker image. First, clone this repository:
```shell
git clone --depth 1 https://github.com/inclusionAI/Ming.git
cd Ming
```
Then build the Docker image with the Dockerfile provided in `docker/docker-py310-cu121`. This step might take a while:
```shell
docker build -t ming:py310-cu121 docker/docker-py310-cu121
```
Finally, start the container with the current repo directory mounted:
```shell
docker run -it --gpus all -v "$(pwd)":/workspace/Ming ming:py310-cu121 /bin/bash
```
You can then run the model through the Python interface. Either download the Hugging Face model into the repo directory first (`.../Ming/`) or mount the downloaded model path when starting the container.

## Example Usage

We provide a step-by-step running example:

Step 1 - Download the source code
```shell
git clone https://github.com/inclusionAI/Ming.git
cd Ming
```

Step 2 - Download the model weights and create a soft link to the source code directory

Download our model following [Model Downloads](#model-downloads), then link it into the repo:

```shell
mkdir inclusionAI
ln -s /path/to/inclusionAI/Ming-Lite-Omni-1.5 inclusionAI/Ming-Lite-Omni
```

Step 3 - Open the cookbook notebook in the code directory to run the Ming-Lite-Omni model:
```shell
jupyter notebook cookbook.ipynb
```

We also provide a simple usage example below. For detailed usage, please refer to [cookbook.ipynb](https://github.com/inclusionAI/Ming/blob/main/cookbook.ipynb).

```python
import torch
from transformers import AutoProcessor, GenerationConfig
from modeling_bailingmm import BailingMMNativeForConditionalGeneration

# Load the model
model = BailingMMNativeForConditionalGeneration.from_pretrained(
    "inclusionAI/Ming-Lite-Omni",
    torch_dtype=torch.bfloat16,        # use bfloat16 for memory efficiency
    attn_implementation="flash_attention_2",
    load_image_gen=True,
    low_cpu_mem_usage=True,            # minimize CPU memory during loading
).to("cuda")

# Build the processor (from the local model directory)
processor = AutoProcessor.from_pretrained(".", trust_remote_code=True)

# QA example ("Please describe parrots' living habits in detail.")
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "text", "text": "请详细介绍鹦鹉的生活习性。"}
        ],
    },
]

# 1. Format inputs using the chat template
text = processor.apply_chat_template(messages, add_generation_prompt=True)

# 2. Extract vision/audio data
image_inputs, video_inputs, audio_inputs = processor.process_vision_info(messages)

# 3. Prepare tensor inputs
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    audios=audio_inputs,
    return_tensors="pt",
)
inputs = inputs.to(model.device)
for k in inputs.keys():
    if k in ("pixel_values", "pixel_values_videos", "audio_feats"):
        inputs[k] = inputs[k].to(dtype=torch.bfloat16)

# 4. Configure generation
generation_config = GenerationConfig.from_dict({"no_repeat_ngram_size": 10})
generated_ids = model.generate(
    **inputs,
    max_new_tokens=512,
    use_cache=True,
    eos_token_id=processor.gen_terminator,
    generation_config=generation_config,
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

# 5. Decode the output
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(output_text)
# Example output (in Chinese):
# 鹦鹉是一种非常聪明和社交性强的鸟类,它们的生活习性非常丰富和有趣。以下是一些关于鹦鹉生活习性的详细介绍:
# ### 1. **栖息地**
# 鹦鹉主要分布在热带和亚热带地区,包括非洲、亚洲、澳大利亚和南美洲。它们通常生活在森林、草原、沙漠和城市环境中。不同种类的鹦鹉对栖息地的要求有所不同,但大多数鹦鹉喜欢有丰富植被和水源的地方。
# ### 2. **饮食**
# 鹦鹉是杂食性动物,它们的饮食非常多样化。它们的食物包括种子、坚果、水果、蔬菜、花蜜和昆虫。鹦鹉的喙非常强壮,能够轻松地打开坚硬的果壳和坚果。一些鹦鹉还会吃泥土或沙子,以帮助消化和补充矿物质。
# ......
```

Note: We tested the examples on NVIDIA H800-80GB and H20-96GB hardware with CUDA 12.4. Loading inclusionAI/Ming-Lite-Omni-1.5 in bfloat16 takes about 42 GB of GPU memory.

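The ~42 GB figure is consistent with a back-of-envelope estimate: in a MoE model, all 20.3B parameters stay resident on the GPU (only 3B are active per token), at 2 bytes each in bfloat16; the remainder goes to activations, the KV cache, and CUDA overhead. The parameter count is from this README; the breakdown is an estimate.

```python
total_params = 20.3e9   # total parameters (all MoE experts stay resident)
bytes_per_param = 2     # bfloat16
weights_gib = total_params * bytes_per_param / 1024**3
print(f"weights alone: {weights_gib:.1f} GiB")  # ~37.8 GiB; the rest is cache/activations
```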
## Gradio Demo

We provide a Gradio-based graphical user interface to make Ming-lite-omni easier to use.

1. Install the Gradio dependencies

```shell
pip install gradio
pip install gradio_client
```

2. Start the Gradio server

```shell
python gradio_demo.py
```


## License and Legal Disclaimer

This code repository is licensed under the [MIT License](./LICENSE); the legal disclaimer is located in the [LEGAL.md](./LEGAL.md) file under the project's root directory.

## Citation

If you find our work helpful, please consider citing it:

```bibtex
@misc{Mingomni2025,
    title = {Ming-Omni: A Unified Multimodal Model for Perception and Generation},
    author = {Inclusion AI},
    year = {2025},
    eprint = {2506.09344},
    archivePrefix = {arXiv},
    url = {https://arxiv.org/abs/2506.09344}
}
```