# Ming-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning

📖 [Technical Report]() | 🤗 [Hugging Face](https://huggingface.co/inclusionAI/Ming-Reasoning) | 🤖 [ModelScope](https://www.modelscope.cn/models/inclusionAI/Ming-Reasoning)

## Introduction

We introduce Ming-Reasoning-7B, a model designed to excel at both general and spatial reasoning. Our approach integrates two key innovations: (1) a novel data pipeline that generates 294.2K high-quality samples (168K for cold-start fine-tuning and 126.2K for RLVR), which feature logically coherent reasoning trajectories and have undergone comprehensive quality assessment; and (2) a dynamic multi-task training strategy with step-wise optimization to mitigate conflicts among the training data, and task-specific rewards that deliver tailored incentive signals. This combination of curated data and advanced training allows Ming-Reasoning-7B to set a new state of the art (SOTA) across eight benchmarks, demonstrating superior performance in both general and spatial reasoning.

![](assets/teaser.png)

## 📌 Updates

<!-- - [2025.07.08] 🔥 Our technical report is publicly available on arXiv. -->
- [2025.07.07] 🔥 We release Ming-Reasoning on 🤗 [Hugging Face](https://huggingface.co/inclusionAI/Ming-Reasoning) and 🤖 [ModelScope](https://www.modelscope.cn/models/inclusionAI/Ming-Reasoning).

## Key Features

- Unified Omni-Modality Perception: Built on Ling, an MoE-architecture LLM, Ming-lite-omni resolves task conflicts and ensures coherent integration of tokens from different modalities through modality-specific routers.
- Unified Perception and Generation: Ming-lite-omni achieves unified understanding and generation, enabling the model to interpret multimodal instructions and user intent during generation, which enhances generation quality and improves usability across multiple tasks.
- Innovative Generation Capabilities: Ming-lite-omni can perceive all modalities and generate high-quality text, real-time speech, and vivid images simultaneously, delivering exceptional cross-modal performance across diverse tasks including image perception, audio-visual interaction, and image generation.

## Evaluation

We conduct a comprehensive evaluation of our models across two key domains: general and spatial reasoning. Our evaluation uses a diverse set of public benchmarks, grouped by the primary capability they measure:

- General Reasoning (Mathematical & Logical): To evaluate this capability, we employ six benchmarks: MathVista, MathVision, MathVerse, DynaMath, WeMath, and LogicVista.

| Models | MathVista | MathVision | MathVerse | DynaMath | WeMath | LogicVista | Avg. (Δ) |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ***Base-Scale General Models*** | | | | | | | |
| InternVL3-8B | 70.5 | 30.0 | 38.5 | 25.7 | 39.5 | 44.5 | 41.4 |
| InternVL3-9B | 69.0 | 29.3 | 37.9 | 25.1 | 34.8 | 49.0 | 40.8 |
| Qwen2.5-VL-7B | 68.1 | 25.4 | 41.1 | 21.8 | 36.2 | 47.9 | 40.1 |
| MUG-U-7B | 74.8 | 26.1 | 35.4 | 17.2 | 26.5 | 39.8 | 36.6 |
| SAIL-VL-1.6-8B | 74.2 | 23.2 | 33.4 | 14.0 | 29.6 | 41.4 | 36.0 |
| ***Base-Scale Reasoning Models*** | | | | | | | |
| WeThink-VL-7B | 71.6 | 26.0 | 44.2 | 24.8 | **48.0** | **51.2** | 44.3 (+4.2) |
| Taichu-VLR-7B | 72.3 | 27.1 | 46.7 | 23.0 | 44.0 | 48.3 | 43.6 |
| VLAA-Thinker-7B | 68.0 | 26.4 | **48.2** | 22.4 | 41.5 | 48.5 | 42.5 (+2.4) |
| URSA-8B-PS-GRPO | 67.8 | **31.8** | 41.5 | 22.4 | 38.3 | 44.7 | 41.1 (+8.2) |
| Ovis2-8B | 71.8 | 25.9 | 42.3 | 20.4 | 27.2 | 39.4 | 37.8 |
| ***Our Models*** | | | | | | | |
| Base Model | 70.2 | 25.9 | 30.5 | 20.2 | 27.2 | 37.8 | 35.5 |
| Ming-Reasoning-CI-7B | 71.7 | 29.2 | 42.1 | 25.0 | 42.8 | 46.8 | 42.9 (+7.4) |
| Ming-Reasoning-7B | **75.0** | 31.5 | 44.7 | **26.8** | 41.8 | 50.0 | **45.0 (+9.5)** |

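As a reading aid, Avg. appears to be the unweighted mean of the six benchmark scores, and Δ the gain over the corresponding base model (for our models, the Base Model row). The short sketch below, with the Ming-Reasoning-7B row hard-coded purely as an illustration, reproduces that arithmetic under this assumption:

```python
# Minimal sketch: reproduce the Avg. (Δ) entry for one row of the table above,
# assuming Avg. is the unweighted mean of the six benchmark scores and Δ is the
# gain over the corresponding base model (the "Base Model" row for our models).
ming_reasoning_7b = [75.0, 31.5, 44.7, 26.8, 41.8, 50.0]
base_model_avg = 35.5  # Avg. reported for the Base Model row

avg = sum(ming_reasoning_7b) / len(ming_reasoning_7b)
delta = avg - base_model_avg
print(f"Avg. = {avg:.1f} (Δ = +{delta:.1f})")  # -> Avg. = 45.0 (Δ = +9.5)
```
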
- Spatial Reasoning: We assess this skill using two benchmarks: CV-Bench and VSI-Bench.
- CV-Bench:

| Models | Count | Relation | Depth | Distance | Avg. |
| :--- | :---: | :---: | :---: | :---: | :---: |
| ***Large-Scale Models*** | | | | | |
| GPT-4o | 65.9 | 85.7 | 87.8 | 78.2 | 78.9 |
| Gemini-1.5-pro | 70.4 | 85.2 | 82.4 | 72.8 | 77.4 |
| ***Base-Scale Models*** | | | | | |
| InternVL3-8B | **74.0** | 90.6 | 84.3 | 81.0 | 82.0 |
| Qwen2.5-VL-7B-Instruct | 65.2 | 86.6 | 70.6 | 79.8 | 75.0 |
| LLaVA-NeXT-Video-7B | 59.3 | 77.0 | 71.3 | 54.7 | 65.2 |
| ***Our Models*** | | | | | |
| Ming-Reasoning-7B | 66.6 | **92.8** | **89.3** | **84.3** | **82.3** |

- VSI-Bench (column abbreviations: OC = Object Count, AD = Absolute Distance, OS = Object Size, RS = Room Size, RDs = Relative Distance, RDr = Relative Direction, RP = Route Plan, AO = Appearance Order):

| Models | OC | AD | OS | RS | RDs | RDr | RP | AO | Avg. |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ***Large-Scale Models*** | | | | | | | | | |
| Gemini-1.5-pro | 56.2 | 30.9 | 64.1 | 43.6 | 51.3 | 46.3 | 36.0 | 34.6 | 45.4 |
| GPT-4o | 46.2 | 5.3 | 43.8 | 38.2 | 37.0 | 41.3 | 31.5 | 28.5 | 34.0 |
| ***Base-Scale Models*** | | | | | | | | | |
| InternVL3-8B | **68.1** | **39.0** | 48.4 | 33.6 | **48.3** | 36.4 | 27.3 | **35.4** | 42.1 |
| Video-R1-7B | - | - | - | - | - | - | - | - | 37.1 |
| Qwen2.5-VL-7B-Instruct | 37.7 | 20.1 | 49.7 | 37.4 | 38.5 | 40.4 | 31.4 | 32.0 | 35.9 |
| LLaVA-NeXT-Video-7B | 48.5 | 14.0 | 47.8 | 24.2 | 43.5 | 42.4 | **34.0** | 30.6 | 35.6 |
| ***Our Models*** | | | | | | | | | |
| Ming-Reasoning-7B | 41.0 | 34.0 | **60.9** | **55.4** | 40.7 | **47.3** | 29.9 | 28.8 | **42.3** |

## Installation

Please download our model following the Model Downloads instructions; then you can refer to the following code to run the Ming-Reasoning model.
The basic environment is `python=3.10`, `torch=2.6.0+cu124`, and `transformers=4.49.0`.
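
As a quick sanity check before running the example below, you can confirm the installed versions with a minimal sketch (the expected values simply restate the environment listed above):

```python
# Minimal environment sanity check; expected values restate the environment above.
import sys
import torch
import transformers

print("python:", sys.version.split()[0])           # expect 3.10.x
print("torch:", torch.__version__)                  # expect 2.6.0+cu124
print("transformers:", transformers.__version__)    # expect 4.49.0
print("cuda available:", torch.cuda.is_available())
```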

## Example Usage

We provide a small example of how to use this repo below.

```python
import os
import torch

from transformers import (
    AutoProcessor,
    AutoTokenizer,
)

import warnings
import argparse
from modeling_bailing_qwen2_5 import Bailing_qwen2_5NativeForConditionalGeneration
from processing_bailing_qwen2_5 import Bailing_qwen2_5Processor

warnings.filterwarnings("ignore")


# Thin wrapper around the Ming-Reasoning (Bailing) model, tokenizer, and processor.
class BailingMMInfer:
    def __init__(self,
                 model_name_or_path,
                 device="cuda",
                 max_pixels=None,
                 min_pixels=None,
                 video_max_pixels=768 * 28 * 28,
                 video_min_pixels=128 * 28 * 28,
                 generation_config=None
                 ):
        super().__init__()
        self.model_name_or_path = model_name_or_path
        self.device = device
        self.device_map = device

        self.video_max_pixels = video_max_pixels if video_max_pixels is not None else 768 * 28 * 28
        self.video_min_pixels = video_min_pixels if video_min_pixels is not None else 128 * 28 * 28

        self.model, self.tokenizer, self.processor = self.load_model_processor()
        if max_pixels is not None:
            self.processor.max_pixels = max_pixels
        if min_pixels is not None:
            self.processor.min_pixels = min_pixels
        if generation_config is None:
            generation_config = {
                "num_beams": 1,
                "do_sample": True,
                "temperature": 0.9
            }

        self.generation_config = generation_config

    def load_model_processor(self):
        model = Bailing_qwen2_5NativeForConditionalGeneration.from_pretrained(
            self.model_name_or_path,
            torch_dtype=torch.bfloat16,
            device_map=self.device_map,
            _attn_implementation="flash_attention_2"
        ).eval()

        tokenizer = AutoTokenizer.from_pretrained(self.model_name_or_path, add_bos_token=True, trust_remote_code=True)
        processor = Bailing_qwen2_5Processor.from_pretrained(self.model_name_or_path, trust_remote_code=True)

        return model, tokenizer, processor

    def generate(self, messages, max_new_tokens=512):
        text = self.processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True, use_system=True
        )

        image_inputs, video_inputs = self.processor.process_vision_info(messages)

        inputs = self.processor(
            text=[text],
            images=image_inputs,
            videos=video_inputs,
            return_tensors="pt",
        )
        # print(inputs)
        # Print the fully rendered prompt for inspection.
        print(self.tokenizer.decode(inputs['input_ids'][0]))

        inputs = inputs.to(self.device)

        # Cast vision features to bfloat16 to match the model weights.
        for k in inputs.keys():
            if k == "pixel_values" or k == "pixel_values_videos":
                inputs[k] = inputs[k].to(dtype=torch.bfloat16)

        with torch.no_grad():
            generated_ids = self.model.generate(
                inputs,
                max_new_tokens=max_new_tokens,
                eos_token_id=self.processor.tokenizer.eos_token_id,
                **self.generation_config,
            )

        # Drop the prompt tokens so only the newly generated tokens are decoded.
        generated_ids_trimmed = [
            out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]

        output_text = self.processor.batch_decode(
            generated_ids_trimmed, skip_special_tokens=False, clean_up_tokenization_spaces=False
        )[0]

        return output_text


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_name_or_path', type=str, default="inclusionAI/Ming-Reasoning")
    parser.add_argument('--max_pixels', type=int, default=401408)
    parser.add_argument('--min_pixels', type=int, default=401408)
    parser.add_argument('--max_new_tokens', type=int, default=4096)

    args = parser.parse_args()

    device = "cuda" if torch.cuda.is_available() else "cpu"
    # model_name_or_path = os.path.join(args.input_dir, args.model_name_or_path)
    bailing2 = BailingMMInfer(
        args.model_name_or_path,
        device=device,
        max_pixels=args.max_pixels,
        min_pixels=args.min_pixels
    )

    # The system prompt asks for <think>...</think> reasoning followed by an
    # <answer>...</answer> whose key result is wrapped in \boxed{}.
    messages = [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are a helpful assistant. When the user asks a question, your response must include two parts: first, the reasoning process enclosed in <think>...</think> tags, then the final answer enclosed in <answer>...</answer> tags. The critical answer or key result should be placed within \\boxed{}."}]},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "./assets/example1.png"},
                {"type": "text", "text": "\nQuestion:\n\nRhombus $QRST$ has an area of 137.9 square meters. If $RT$ is 12.2 meters, find $QS$.\nA. 11.3\nB. 22.4\nC. 22.6\nD. 25.6"},
            ],
        },
    ]
    output_text = bailing2.generate(messages, max_new_tokens=args.max_new_tokens)
    print(output_text)


'''
[Output]:

<think>
To find the length of \( QS \) in the rhombus \( QRST \), we can use the formula for the area of a rhombus, which is given by:

\[
\text{Area} = \frac{1}{2} \times d_1 \times d_2
\]

where \( d_1 \) and \( d_2 \) are the lengths of the diagonals. In this problem, we are given:
- The area of the rhombus is 137.9 square meters.
- One of the diagonals, \( RT \), is 12.2 meters.

We need to find the length of the other diagonal, \( QS \).

Let's denote:
- \( d_1 = RT = 12.2 \) meters
- \( d_2 = QS \)

Substitute the known values into the area formula:

\[
137.9 = \frac{1}{2} \times 12.2 \times QS
\]

To solve for \( QS \), first multiply both sides by 2 to eliminate the fraction:

\[
275.8 = 12.2 \times QS
\]

Next, divide both sides by 12.2:

\[
QS = \frac{275.8}{12.2}
\]

Now, perform the division:

\[
QS \approx 22.6
\]

So, the length of \( QS \) is approximately 22.6 meters.

Looking at the options provided:
A. 11.3
B. 22.4
C. 22.6
D. 25.6

The correct answer is C. 22.6.
</think>
<answer>
\boxed{C. 22.6}
</answer><|im_end|>
'''
```

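The system prompt in the example asks the model to emit its reasoning inside `<think>...</think>` and its final answer inside `<answer>...</answer>`, with the key result in `\boxed{}`. If you want to post-process the raw output, a minimal parsing sketch (an illustrative helper of our own, not part of the released code) could look like this:

```python
import re

def parse_response(output_text: str) -> dict:
    """Split a response into its <think> block, <answer> block, and boxed result.

    Illustrative helper only; it simply follows the output format requested by
    the system prompt in the example above.
    """
    def first_match(pattern: str):
        m = re.search(pattern, output_text, flags=re.DOTALL)
        return m.group(1).strip() if m else None

    return {
        "think": first_match(r"<think>(.*?)</think>"),
        "answer": first_match(r"<answer>(.*?)</answer>"),
        "boxed": first_match(r"\\boxed\{(.*?)\}"),
    }

# With the sample output above: parse_response(output_text)["boxed"] -> 'C. 22.6'
```
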
## License and Legal Disclaimer

This code repository is licensed under the MIT License, and the legal disclaimer is located in the LEGAL.md file under the project's root directory.

## Citation

If you find our work helpful, feel free to cite us:

```bibtex
@misc{Mingreasoning2025,
  title  = {Ming-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning},
  author = {Inclusion AI},
  year   = {2025},
  archivePrefix = {arXiv},
}
```