Jayok6 committed (verified)
Commit f3ed44d · 1 Parent(s): d20c3ca

Update README.md

Files changed (1): README.md (+10 -1)
README.md CHANGED
@@ -45,7 +45,7 @@ Baichuan-M2 incorporates three core technical innovations: First, through the **
 
 ### General Performance
 
-| Benchmark | Baichuan-M2-32B | Qwen3-32B |
+| Benchmark | Baichuan-M2-32B | Qwen3-32B (Thinking) |
 |-----------|-----------------|-----------|
 | AIME24 | 83.4 | 81.4 |
 | AIME25 | 72.9 | 72.9 |
@@ -75,10 +75,19 @@ For deployment, you can use `sglang>=0.4.6.post1` or `vllm>=0.9.0` or to create
 ```shell
 python -m sglang.launch_server --model-path baichuan-inc/Baichuan-M2-32B-GPTQ-Int4 --reasoning-parser qwen3
 ```
+To turn on KV cache FP8 quantization:
+```shell
+python -m sglang.launch_server --model-path baichuan-inc/Baichuan-M2-32B-GPTQ-Int4 --reasoning-parser qwen3 --kv-cache-dtype fp8_e4m3 --attention-backend flashinfer
+```
+
 - vLLM:
 ```shell
 vllm serve baichuan-inc/Baichuan-M2-32B-GPTQ-Int4 --reasoning-parser qwen3
 ```
+To turn on KV cache FP8 quantization:
+```shell
+vllm serve baichuan-inc/Baichuan-M2-32B-GPTQ-Int4 --reasoning-parser qwen3 --kv_cache_dtype fp8_e4m3
+```
 
 ## MTP inference with SGLang
 
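
Both launch commands added above expose an OpenAI-compatible API, so the quantized deployment (with or without the FP8 KV cache) can be smoke-tested with a single `curl` request. A minimal sketch, assuming vLLM's default port 8000 (substitute SGLang's default port 30000 for `sglang.launch_server`) and the repo id as the served model name:

```shell
# Query the OpenAI-compatible endpoint started by `vllm serve`
# (vLLM listens on port 8000 by default; SGLang on 30000).
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "baichuan-inc/Baichuan-M2-32B-GPTQ-Int4",
        "messages": [{"role": "user", "content": "List the first-line diagnostic steps for acute chest pain."}],
        "max_tokens": 512
      }'
```

Note that `fp8_e4m3` stores each KV cache element in 8 bits instead of bf16's 16, roughly halving KV cache memory and leaving room for longer contexts or more concurrent requests, at some cost in numerical precision.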