Update README.md
README.md
@@ -13,7 +13,7 @@ base_model:
 pipeline_tag: text-generation
 ---

-[Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) model quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) float8 dynamic activation and float8 weight quantization (per row granularity), by PyTorch team. Use it directly, or serve using [vLLM](https://docs.vllm.ai/en/latest/) with 47% VRAM reduction (34.54 GB needed), around 1.
+[Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) model quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) float8 dynamic activation and float8 weight quantization (per-row granularity), by the PyTorch team. Use it directly, or serve it with [vLLM](https://docs.vllm.ai/en/latest/) for a 47% VRAM reduction (34.54 GB needed), around a 1.7x speedup, and little to no accuracy impact on H100.

 # Inference with vLLM
 ```Shell

@@ -164,8 +164,8 @@ lm_eval --model hf --model_args pretrained=$MODEL --tasks mmlu --device cuda:0 -

 | Memory (tested on H100)    |           |                            |
 |----------------------------|-----------|----------------------------|
-|                            | Qwen3-32B | Qwen3-32B-FP8
-| Peak Memory                | 65.
+|                            | Qwen3-32B | Qwen3-32B-FP8              |
+| Peak Memory                | 65.63 GB  | 34.71 GB (47.1% reduction) |

 <details>
 <summary> Reproduce Peak Memory Usage Results </summary>

@@ -232,9 +232,9 @@ print(f"Peak Memory Usage: {mem:.02f} GB")

 | Benchmark (Tested on H100) |           |                            |
 |----------------------------|-----------|----------------------------|
-|                            | Qwen3-32B | Qwen3-32B-FP8
-| latency (batch_size=1)     |
-| latency (batch_size=
+|                            | Qwen3-32B | Qwen3-32B-FP8              |
+| latency (batch_size=1)     | 8.93s     | 5.16s (1.73x speedup)      |
+| latency (batch_size=256)   | 33.85s    | 16.15s (2.10x speedup)     |

 <details>
 <summary> Reproduce latency benchmarks </summary>
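The derived percentages in the updated table rows follow directly from the raw numbers. A quick sanity check of the reduction and speedup figures:

```python
# Raw numbers from the README tables (measured on H100).
baseline_mem, fp8_mem = 65.63, 34.71           # GB peak memory
lat_bf16_bs1, lat_fp8_bs1 = 8.93, 5.16         # seconds, batch_size=1
lat_bf16_bs256, lat_fp8_bs256 = 33.85, 16.15   # seconds, batch_size=256

reduction = (baseline_mem - fp8_mem) / baseline_mem
print(f"memory reduction: {reduction:.1%}")                        # → 47.1%
print(f"speedup @ bs=1:   {lat_bf16_bs1 / lat_fp8_bs1:.2f}x")      # → 1.73x
print(f"speedup @ bs=256: {lat_bf16_bs256 / lat_fp8_bs256:.2f}x")  # → 2.10x
```

The larger speedup at batch_size=256 is consistent with the workload becoming more compute-bound at high batch sizes, where FP8 matmul throughput helps most.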
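For context on the scheme the intro paragraph names: in recent torchao releases, float8 dynamic activation + float8 weight quantization with per-row granularity is expressed as a config passed through Transformers' `TorchAoConfig`. The sketch below is an assumption about how this checkpoint could have been produced, not the repo's actual recipe, and the exact config class names vary across torchao versions:

```python
# Sketch: quantizing Qwen3-32B with torchao FP8 dynamic activation /
# FP8 weight, per-row granularity (API names assume a recent torchao).
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow
from transformers import AutoModelForCausalLM, TorchAoConfig

quant_config = Float8DynamicActivationFloat8WeightConfig(granularity=PerRow())
quantization_config = TorchAoConfig(quant_type=quant_config)

# Loading with quantization_config quantizes the weights on the fly;
# activations are quantized dynamically at inference time.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-32B",
    torch_dtype="bfloat16",
    device_map="auto",
    quantization_config=quantization_config,
)
```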