jerryzh168 committed
Commit 7e49f57 · verified · 1 Parent(s): 2b436a2

Update README.md

Files changed (1)
  1. README.md +9 -0
README.md CHANGED
````diff
@@ -222,6 +222,8 @@ lm_eval --model hf --model_args pretrained=$MODEL --tasks hellaswag --device cud
 ```
 </details>
 
+
+
 # Peak Memory Usage
 
 ## Results
@@ -232,6 +234,7 @@ lm_eval --model hf --model_args pretrained=$MODEL --tasks hellaswag --device cud
 | Peak Memory (GB) | 16.47 | 6.27 (62% reduction) |
 
 
+
 <details>
 <summary> Reproduce Peak Memory Usage Results </summary>
 
@@ -279,6 +282,8 @@ print(f"Peak Memory Usage: {mem:.02f} GB")
 
 </details>
 
+
+
 # Model Performance
 
 Our int4wo is optimized only for batch size 1, so expect some slowdown with larger batch sizes. We expect this to be used in local server deployments for a single user or a few users, where decode tokens per second matters more than time to first token.
@@ -293,6 +298,8 @@ Our int4wo is optimized only for batch size 1, so expect some slowdown with larg
 Note that the latency result (benchmark_latency) is in seconds, and the serving result (benchmark_serving) is in requests per second.
 Int4 weight only is optimized for batch size 1 and short input and output token lengths; please stay tuned for models optimized for larger batch sizes or longer token lengths.
 
+
+
 <details>
 <summary> Reproduce Model Performance Results </summary>
 
@@ -367,6 +374,8 @@ python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --
 ```
 </details>
 
+
+
 # Disclaimer
 PyTorch has not performed safety evaluations or red teamed the quantized models. Performance characteristics, outputs, and behaviors may differ from the original models. Users are solely responsible for selecting appropriate use cases, evaluating and mitigating for accuracy, safety, and fairness, ensuring security, and complying with all applicable laws and regulations.
 
````
 
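The README sections touched by this commit appear above only as diff context; the "Reproduce Peak Memory Usage Results" block, for example, is elided except for its final `print(f"Peak Memory Usage: {mem:.02f} GB")` line in a hunk header. As a rough companion, here is a minimal sketch of how such a peak-memory figure is commonly measured with torch.cuda memory accounting; the model id is a hypothetical placeholder and the use of `max_memory_reserved` is an assumption, not necessarily the README's exact script.

```python
# Minimal sketch, assuming torch.cuda accounting; the README's own
# reproduction script is elided from this diff.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-int4wo-model"  # hypothetical placeholder for $MODEL

torch.cuda.reset_peak_memory_stats()  # start peak tracking from a clean slate

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto").to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("What are we having for dinner?", return_tensors="pt").to("cuda")
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=128)

# High-water mark of CUDA memory reserved by the caching allocator, in GB
mem = torch.cuda.max_memory_reserved() / 1e9
print(f"Peak Memory Usage: {mem:.02f} GB")
```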
 
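The Model Performance context above distinguishes decode tokens per second from time to first token for batch-size-1 use. Below is a hedged sketch of a coarse batch-size-1 throughput check with plain transformers generation; the model id, prompt, and timing method are assumptions, and the README's official numbers come from vLLM's benchmark_latency and benchmark_serving scripts rather than this code.

```python
# Rough sketch of a batch-size-1 throughput check (assumed approach; the
# README's official results come from vLLM's benchmark scripts, not this code).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-int4wo-model"  # hypothetical placeholder
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto").to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Write a short story about a robot.", return_tensors="pt").to("cuda")

max_new_tokens = 256
torch.cuda.synchronize()
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
# This averages prefill and decode, so it is only a coarse proxy for
# decode tokens per second; it ignores time to first token entirely.
print(f"{new_tokens / elapsed:.1f} generated tokens/sec at batch size 1")
```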
 
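Finally, since the README describes an int4 weight-only ("int4wo") checkpoint, here is an illustrative sketch of producing one with torchao's weight-only quantization entry point; the exact function names and the group size are assumptions that vary across torchao versions, so treat this as a sketch rather than the recipe used for this repository.

```python
# Illustrative int4 weight-only quantization with torchao (assumed API names;
# group_size=128 is an assumption, not necessarily what this checkpoint used).
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_, int4_weight_only

base_id = "your-org/your-bf16-base-model"  # hypothetical placeholder
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16).to("cuda")

# Swap linear layer weights for int4 weight-only quantized tensors in place.
quantize_(model, int4_weight_only(group_size=128))

# Quantized tensor subclasses are typically saved without safetensors.
model.save_pretrained("model-int4wo", safe_serialization=False)
```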