Update README.md
README.md CHANGED

@@ -236,6 +236,8 @@ print(f"Peak Memory Usage: {mem:.02f} GB")
 | latency (batch_size=1) | 8.93s | 5.16s (1.73x speedup) |
 | latency (batch_size=256) | 33.85s | 16.15s (2.10x speedup) |
 
+Note: tested with `fbgemm-gpu-genai` installed.
+
 <details>
 <summary> Reproduce latency benchmarks </summary>
 
@@ -245,8 +247,13 @@ git clone git@github.com:vllm-project/vllm.git
 cd vllm
 VLLM_USE_PRECOMPILED=1 pip install --editable .
 ```
-
+To use fbgemm kernels:
+```Shell
+pip install fbgemm-gpu-genai
+```
 **2. Latency benchmarking**
+
+
 ```Shell
 export MODEL=Qwen/Qwen3-32B # or pytorch/Qwen3-32B-FP8
 VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model $MODEL --batch-size 1
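The added note makes the FP8 results conditional on `fbgemm-gpu-genai` being present. Not part of the original patch, but a quick way to confirm the optional package is installed before running the FP8 benchmark is a plain `pip show`:

```Shell
# Optional sanity check (not in the original diff): confirm the fbgemm
# kernels package is installed before benchmarking the FP8 checkpoint.
pip show fbgemm-gpu-genai
```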
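The hunk only shows the `--batch-size 1` invocation; the `batch_size=256` row in the results table presumably comes from the same script with a different batch size. A minimal sketch, assuming the remaining flags are unchanged:

```Shell
# Hypothetical command for the batch_size=256 row, assuming the same
# input/output lengths as the batch_size=1 run shown in the diff.
export MODEL=Qwen/Qwen3-32B   # or pytorch/Qwen3-32B-FP8
VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py \
  --input-len 256 --output-len 256 --model $MODEL --batch-size 256
```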