jerryzh168 committed
Commit 7a964a6 · verified · 1 Parent(s): e3c16b0

Update README.md

Files changed (1)
  1. README.md +7 -6
README.md CHANGED
@@ -42,7 +42,7 @@ sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
 
 if __name__ == '__main__':
     # Create an LLM.
-    llm = LLM(model="pytorch/Phi-4-mini-instruct-int4wo-hqq")
+    llm = LLM(model="pytorch/Qwen3-8B-int4wo-hqq")
     # Generate texts from the prompts.
     # The output is a list of RequestOutput objects
     # that contain the prompt, generated text, and other information.
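For context, the surrounding README example finishes the generation loop roughly as follows; this is a sketch of vLLM's offline-inference API with illustrative prompts, not the exact code in the file:

```Python
from vllm import LLM, SamplingParams

# Illustrative prompts; the README defines its own list.
prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

if __name__ == '__main__':
    # Create an LLM from the quantized checkpoint.
    llm = LLM(model="pytorch/Qwen3-8B-int4wo-hqq")
    # Generate texts; each RequestOutput carries the prompt, the generated
    # text, and other information.
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        print(f"Prompt: {output.prompt!r}, Generated: {output.outputs[0].text!r}")
```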
@@ -63,7 +63,8 @@ this is expected be resolved in pytorch 2.8.
 ## Serving
 Then we can serve with the following command:
 ```Shell
-vllm serve pytorch/Phi-4-mini-instruct-int4wo-hqq --tokenizer microsoft/Phi-4-mini-instruct -O3
+export MODEL=pytorch/Qwen3-8B-int4wo-hqq
+vllm serve $MODEL --tokenizer $MODEL -O3
 ```
 
 
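Once the server is running it exposes vLLM's OpenAI-compatible API (port 8000 by default), so a quick smoke test could look like the sketch below; the prompt and max_tokens value are illustrative:

```Shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "pytorch/Qwen3-8B-int4wo-hqq",
    "messages": [{"role": "user", "content": "Give me a short introduction to large language models."}],
    "max_tokens": 128
  }'
```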
@@ -84,7 +85,7 @@ from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
 
 torch.random.manual_seed(0)
 
-model_path = "pytorch/Phi-4-mini-instruct-int4wo-hqq"
+model_path = "pytorch/Qwen3-8B-int4wo-hqq"
 
 model = AutoModelForCausalLM.from_pretrained(
     model_path,
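For reference, the rest of the README's transformers example amounts to roughly the following; kwargs such as device_map and torch_dtype, and the prompt, are assumptions here and the file's exact arguments may differ:

```Python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

torch.random.manual_seed(0)

model_path = "pytorch/Qwen3-8B-int4wo-hqq"

# Load the quantized checkpoint; device_map/torch_dtype are illustrative choices.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Generate with a text-generation pipeline built from the loaded model.
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
output = pipe("Give me a short introduction to large language models.", max_new_tokens=128)
print(output[0]["generated_text"])
```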
@@ -282,9 +283,9 @@ Our int4wo is only optimized for batch size 1, so expect some slowdown with larg
 ## Results (A100 machine)
 | Benchmark (Latency)     |                |                          |
 |-------------------------|----------------|--------------------------|
-|                         | Phi-4 mini-Ins | phi4-mini-int4wo-hqq     |
-| latency (batch_size=1)  | 2.46s          | 2.2s (12% speedup)       |
-| serving (num_prompts=1) | 0.87 req/s     | 1.05 req/s (20% speedup) |
+|                         | Qwen3-8B       | Qwen3-8B-int4wo-hqq      |
+| latency (batch_size=1)  | TODOs          | TODOs (TODO% speedup)    |
+| serving (num_prompts=1) | TODO req/s     | TODO req/s (20% speedup) |
 
 Note the result of latency (benchmark_latency) is in seconds, and serving (benchmark_serving) is in number of requests per second.
 Int4 weight only is optimized for batch size 1 and short input and output token length, please stay tuned for models optimized for larger batch sizes or longer token length.
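The latency and serving rows refer to vLLM's benchmark_latency and benchmark_serving scripts; invocations along the lines of the sketch below would fill in the TODOs, though flag names can vary across vLLM versions:

```Shell
# Latency benchmark: result is reported in seconds (batch size 1).
python benchmarks/benchmark_latency.py --model pytorch/Qwen3-8B-int4wo-hqq --batch-size 1

# Serving benchmark: result is reported in requests per second
# (run against a live `vllm serve` instance started as above).
python benchmarks/benchmark_serving.py --backend vllm --model pytorch/Qwen3-8B-int4wo-hqq \
  --dataset-name random --num-prompts 1
```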
 