Jayok6 committed (verified)
Commit f3ed44d · 1 Parent(s): d20c3ca

Update README.md

Files changed (1): README.md (+10 -1)
README.md CHANGED
@@ -45,7 +45,7 @@ Baichuan-M2 incorporates three core technical innovations: First, through the **
 
 ### General Performance
 
-| Benchmark | Baichuan-M2-32B | Qwen3-32B |
+| Benchmark | Baichuan-M2-32B | Qwen3-32B (Thinking) |
 |-----------|-----------------|-----------|
 | AIME24 | 83.4 | 81.4 |
 | AIME25 | 72.9 | 72.9 |
@@ -75,10 +75,19 @@ For deployment, you can use `sglang>=0.4.6.post1` or `vllm>=0.9.0` or to create
 ```shell
 python -m sglang.launch_server --model-path baichuan-inc/Baichuan-M2-32B-GPTQ-Int4 --reasoning-parser qwen3
 ```
+To turn on KV cache FP8 quantization:
+```shell
+python -m sglang.launch_server --model-path baichuan-inc/Baichuan-M2-32B-GPTQ-Int4 --reasoning-parser qwen3 --kv-cache-dtype fp8_e4m3 --attention-backend flashinfer
+```
+
 - vLLM:
 ```shell
 vllm serve baichuan-inc/Baichuan-M2-32B-GPTQ-Int4 --reasoning-parser qwen3
 ```
+To turn on KV cache FP8 quantization:
+```shell
+vllm serve baichuan-inc/Baichuan-M2-32B-GPTQ-Int4 --reasoning-parser qwen3 --kv_cache_dtype fp8_e4m3
+```
 
 ## MTP inference with SGLang
 
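
Both launch commands added above expose an OpenAI-compatible API, so the quantized deployment (with or without the FP8 KV cache) can be smoke-tested with a single `curl` request. A minimal sketch, assuming vLLM's default port 8000 (substitute SGLang's default port 30000 for `sglang.launch_server`) and the repo id as the served model name:

```shell
# Query the OpenAI-compatible endpoint started by `vllm serve`
# (vLLM listens on port 8000 by default; SGLang on 30000).
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "baichuan-inc/Baichuan-M2-32B-GPTQ-Int4",
        "messages": [{"role": "user", "content": "List the first-line diagnostic steps for acute chest pain."}],
        "max_tokens": 512
      }'
```

Note that `fp8_e4m3` stores each KV cache element in 8 bits instead of bf16's 16, roughly halving KV cache memory and leaving room for longer contexts or more concurrent requests, at some cost in numerical precision.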