jerryzh168 committed
Commit 2b436a2 · verified · 1 Parent(s): b1328fe

Update README.md

Files changed (1)
  1. README.md +28 -19
README.md CHANGED
@@ -186,22 +186,6 @@ and use a token with write access, from https://huggingface.co/settings/tokens
  # Model Quality
  We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.

- Need to install lm-eval from source:
- https://github.com/EleutherAI/lm-evaluation-harness#install
-
- ## baseline
- ```Shell
- lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
- ```
-
- ## int4 weight only quantization with hqq (int4wo-hqq)
- ```Shell
- export MODEL=pytorch/Qwen3-8B-int4wo-hqq
- # or
- # export MODEL=Qwen/Qwen3-8B
- lm_eval --model hf --model_args pretrained=$MODEL --tasks hellaswag --device cuda:0 --batch_size 8
- ```
-
  | Benchmark | | |
  |----------------------------------|----------------|---------------------------|
  | | Qwen3-8B | Qwen3-8B-int4wo |
@@ -217,7 +201,26 @@ lm_eval --model hf --model_args pretrained=$MODEL --tasks hellaswag --device cud
  | gsm8k | 87.79 | 86.28 |
  | leaderboard_math_hard (v3) | 53.7 | 46.83 |
  | **Overall** | 60.02 | 56.33 |
-
+
+ <details>
+ <summary> Reproduce Model Quality Results </summary>
+
+ Need to install lm-eval from source:
+ https://github.com/EleutherAI/lm-evaluation-harness#install
+
+ ## baseline
+ ```Shell
+ lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
+ ```
+
+ ## int4 weight only quantization with hqq (int4wo-hqq)
+ ```Shell
+ export MODEL=pytorch/Qwen3-8B-int4wo-hqq
+ # or
+ # export MODEL=Qwen/Qwen3-8B
+ lm_eval --model hf --model_args pretrained=$MODEL --tasks hellaswag --device cuda:0 --batch_size 8
+ ```
+ </details>

  # Peak Memory Usage

@@ -229,7 +232,8 @@ lm_eval --model hf --model_args pretrained=$MODEL --tasks hellaswag --device cud
  | Peak Memory (GB) | 16.47 | 6.27 (62% reduction) |


- ## Code Example
+ <details>
+ <summary> Reproduce Peak Memory Usage Results </summary>

  We can use the following code to get a sense of peak memory usage during inference:

@@ -273,6 +277,8 @@ mem = torch.cuda.max_memory_reserved() / 1e9
  print(f"Peak Memory Usage: {mem:.02f} GB")
  ```

+ </details>
+
  # Model Performance

  Our int4wo is only optimized for batch size 1, so expect some slowdown with larger batch sizes. We expect this to be used in local server deployments for a single user or a few users, where decode tokens per second matter more than time to first token.
@@ -287,6 +293,9 @@ Our int4wo is only optimized for batch size 1, so expect some slowdown with larg
  Note that the latency results (benchmark_latency) are in seconds, and the serving results (benchmark_serving) are in requests per second.
  Int4 weight only is optimized for batch size 1 and short input and output token lengths; please stay tuned for models optimized for larger batch sizes or longer token lengths.

+ <details>
+ <summary> Reproduce Model Performance Results </summary>
+
  ## Setup

  Get vllm source code:
@@ -356,7 +365,7 @@ Client:
  export MODEL=pytorch/Qwen3-8B-int4wo-hqq
  python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer $MODEL --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model $MODEL --num-prompts 1
  ```
-
+ </details>

  # Disclaimer
  PyTorch has not performed safety evaluations or red teamed the quantized models. Performance characteristics, outputs, and behaviors may differ from the original models. Users are solely responsible for selecting appropriate use cases, evaluating and mitigating for accuracy, safety, and fairness, ensuring security, and complying with all applicable laws and regulations.
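
The peak-memory measurement referenced in the diff is only partially visible (the hunk shows its last two lines). For reference, here is a minimal sketch of that kind of measurement, not the README's exact snippet: the model id, prompt, and generation length are illustrative, and it assumes `torch`, `transformers`, `accelerate`, and `torchao` are installed so the quantized checkpoint can load on GPU.

```Python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; the README benchmarks pytorch/Qwen3-8B-int4wo-hqq.
model_id = "pytorch/Qwen3-8B-int4wo-hqq"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # requires accelerate
)

# Reset allocator stats so the peak reflects this inference run only.
torch.cuda.reset_peak_memory_stats()

inputs = tokenizer("What are we having for dinner?", return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))

# Peak memory reserved by the CUDA caching allocator, in GB, matching the
# max_memory_reserved() call shown in the diff's hunk header.
mem = torch.cuda.max_memory_reserved() / 1e9
print(f"Peak Memory Usage: {mem:.02f} GB")
```

Note that `torch.cuda.max_memory_reserved()` reports the caching allocator's high-water mark, so it can read slightly higher than the memory actually allocated to tensors.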