jerryzh168 committed
Commit e3c16b0 · verified · 1 Parent(s): 8004dfc

Update README.md

Files changed (1): README.md (+12 -6)
README.md CHANGED
@@ -307,12 +307,14 @@ Run the benchmarks under `vllm` root folder:
  
  ### baseline
  ```Shell
- python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model microsoft/Phi-4-mini-instruct --batch-size 1
+ export MODEL=Qwen/Qwen3-8B
+ python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model $MODEL --batch-size 1
  ```
  
  ### int4wo-hqq
  ```Shell
- VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-int4wo-hqq --batch-size 1
+ export MODEL=pytorch/Qwen3-8B-int4wo-hqq
+ VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model $MODEL --batch-size 1
  ```
  
  ## benchmark_serving
@@ -334,23 +336,27 @@ Note: you can change the number of prompts to be benchmarked with `--num-prompts`
  ### baseline
  Server:
  ```Shell
- vllm serve microsoft/Phi-4-mini-instruct --tokenizer microsoft/Phi-4-mini-instruct -O3
+ export MODEL=Qwen/Qwen3-8B
+ vllm serve $MODEL --tokenizer microsoft/Phi-4-mini-instruct -O3
  ```
  
  Client:
  ```Shell
- python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1
+ export MODEL=Qwen/Qwen3-8B
+ python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer $MODEL --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model $MODEL --num-prompts 1
  ```
  
  ### int4wo-hqq
  Server:
  ```Shell
- VLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch/Phi-4-mini-instruct-int4wo-hqq --tokenizer microsoft/Phi-4-mini-instruct -O3 --pt-load-map-location cuda:0
+ export MODEL=pytorch/Qwen3-8B-int4wo-hqq
+ VLLM_DISABLE_COMPILE_CACHE=1 vllm serve $MODEL --tokenizer microsoft/Phi-4-mini-instruct -O3 --pt-load-map-location cuda:0
  ```
  
  Client:
  ```Shell
- python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-int4wo-hqq --num-prompts 1
+ export MODEL=pytorch/Qwen3-8B-int4wo-hqq
+ python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer $MODEL --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model $MODEL --num-prompts 1
  ```
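
One thing to watch in the updated `vllm serve` lines: they still pass `--tokenizer microsoft/Phi-4-mini-instruct`, which looks like a leftover from the earlier Phi-4 instructions. If the intent is to benchmark the Qwen3-8B checkpoints, the tokenizer presumably needs to track the model, following the same pattern the old commands used (quantized checkpoint plus the base model's tokenizer). A hedged sketch of how the server command would then look:

```Shell
# Assumption: the tokenizer should point at the base Qwen3-8B repo rather than Phi-4;
# adjust if the quantized repo intentionally reuses a different tokenizer.
export MODEL=pytorch/Qwen3-8B-int4wo-hqq
VLLM_DISABLE_COMPILE_CACHE=1 vllm serve $MODEL --tokenizer Qwen/Qwen3-8B -O3 --pt-load-map-location cuda:0
```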
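
The `benchmark_serving.py` client commands expect `ShareGPT_V3_unfiltered_cleaned_split.json` in the working directory. A minimal sketch for fetching it, assuming the commonly used mirror on the Hugging Face Hub:

```Shell
# Download the ShareGPT dataset that --dataset-path points at (run from the vllm root folder).
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```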
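
Before launching the client, it can help to confirm the server is actually up. A quick check against vLLM's OpenAI-compatible completions endpoint, assuming the default port 8000 and the int4wo-hqq checkpoint from above:

```Shell
# Send one short completion request to the running server as a sanity check.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "pytorch/Qwen3-8B-int4wo-hqq", "prompt": "Hello, my name is", "max_tokens": 16}'
```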