Update README.md
README.md
CHANGED
@@ -45,7 +45,7 @@ Baichuan-M2 incorporates three core technical innovations: First, through the **
 
 ### General Performance
 
-| Benchmark | Baichuan-M2-32B | Qwen3-32B |
+| Benchmark | Baichuan-M2-32B | Qwen3-32B (Thinking) |
 |-----------|-----------------|-----------|
 | AIME24 | 83.4 | 81.4 |
 | AIME25 | 72.9 | 72.9 |
@@ -75,10 +75,19 @@ For deployment, you can use `sglang>=0.4.6.post1` or `vllm>=0.9.0` or to create
 ```shell
 python -m sglang.launch_server --model-path baichuan-inc/Baichuan-M2-32B-GPTQ-Int4 --reasoning-parser qwen3
 ```
+To turn on kv cache FP8 quantization:
+```shell
+python -m sglang.launch_server --model-path baichuan-inc/Baichuan-M2-32B-GPTQ-Int4 --reasoning-parser qwen3 --kv-cache-dtype fp8_e4m3 --attention-backend flashinfer
+```
+
 - vLLM:
 ```shell
 vllm serve baichuan-inc/Baichuan-M2-32B-GPTQ-Int4 --reasoning-parser qwen3
 ```
+To turn on kv cache FP8 quantization:
+```shell
+vllm serve baichuan-inc/Baichuan-M2-32B-GPTQ-Int4 --reasoning-parser qwen3 --kv_cache_dtype fp8_e4m3
+```
 
 ## MTP inference with SGLang
 
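Both launch commands in the second hunk expose an OpenAI-compatible HTTP API. As a minimal sketch of calling it, assuming defaults not stated in this diff (the base URL and port: SGLang typically listens on 30000, vLLM on 8000), a chat-completions request can be built like this:

```python
# Sketch: build a request for the OpenAI-compatible /v1/chat/completions
# endpoint that `sglang.launch_server` / `vllm serve` expose.
# Base URL and port are assumptions (SGLang defaults to :30000, vLLM to :8000).
import json
import urllib.request

MODEL = "baichuan-inc/Baichuan-M2-32B-GPTQ-Int4"

def build_chat_request(base_url: str, prompt: str) -> urllib.request.Request:
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("http://localhost:8000", "Hello")
print(req.full_url)  # http://localhost:8000/v1/chat/completions
# To actually send it, the server must be running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```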