Update README.md
README.md (CHANGED)

# Peak Memory Usage

## Results

|                  | Original model | int4 weight-only (int4wo) model |
|------------------|----------------|---------------------------------|
| Peak Memory (GB) | 16.47          | 6.27 (62% reduction)            |

<details>
<summary> Reproduce Peak Memory Usage Results </summary>

</details>
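
The collapsed block above contains the full reproduction script. A minimal sketch of the same kind of measurement, assuming the checkpoint loads through Hugging Face `transformers` (the checkpoint id, prompt, and generation length below are placeholders, not values from this README): peak CUDA memory can be read back with `torch.cuda.max_memory_reserved` after loading the model and running a short generation.

```python
# Sketch only: measure peak CUDA memory for a causal LM checkpoint.
# MODEL is a placeholder -- substitute the quantized (or original) repo id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "<checkpoint-repo-id>"  # placeholder

torch.cuda.reset_peak_memory_stats()  # start peak tracking from a clean slate

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype="auto",   # keep the dtype stored in the checkpoint
    device_map="cuda",
)

inputs = tokenizer("What are we having for dinner?", return_tensors="pt").to("cuda")
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=128)

mem = torch.cuda.max_memory_reserved() / 1e9  # bytes -> GB
print(f"Peak Memory Usage: {mem:.02f} GB")
```

Running the same measurement against the original and the int4wo checkpoint would produce the two columns of the table above; 6.27 GB versus 16.47 GB is roughly a 62% reduction (1 - 6.27/16.47 ≈ 0.62).
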
# Model Performance

Our int4wo model is only optimized for batch size 1, so expect some slowdown with larger batch sizes. We expect it to be used in local server deployments for a single user or a few users, where decode tokens per second matter more than time to first token.

Note that the latency results (benchmark_latency) are reported in seconds, and the serving results (benchmark_serving) in requests per second.
Int4 weight-only quantization is optimized for batch size 1 and short input and output token lengths; please stay tuned for models optimized for larger batch sizes and longer token lengths.

<details>
<summary> Reproduce Model Performance Results </summary>
</details>
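
The collapsed block above holds the actual benchmark commands (benchmark_latency and benchmark_serving, per the notes earlier). As a minimal sketch only, assuming the checkpoint can be served by vLLM's offline `LLM` API and using a placeholder model id and prompt, batch-size-1 generation throughput can be approximated like this:

```python
# Sketch only: rough tokens/sec at batch size 1 with vLLM's offline API.
# This is not the benchmark_latency/benchmark_serving setup referenced above.
import time

from vllm import LLM, SamplingParams

MODEL = "<checkpoint-repo-id>"  # placeholder

llm = LLM(model=MODEL)
params = SamplingParams(temperature=0.0, max_tokens=128)

start = time.perf_counter()
# A single prompt corresponds to batch size 1.
outputs = llm.generate(["Write a short note about quantization."], params)
elapsed = time.perf_counter() - start

generated = len(outputs[0].outputs[0].token_ids)
# Elapsed time includes prefill, so this slightly understates pure decode speed.
print(f"{generated} tokens in {elapsed:.2f}s -> {generated / elapsed:.1f} tok/s")
```

For the reported numbers, rely on the benchmark_latency and benchmark_serving scripts in the collapsed section above.
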

# Disclaimer

PyTorch has not performed safety evaluations or red teamed the quantized models. Performance characteristics, outputs, and behaviors may differ from the original models. Users are solely responsible for selecting appropriate use cases, evaluating and mitigating for accuracy, safety, and fairness, ensuring security, and complying with all applicable laws and regulations.