linzhao-amd committed (verified)
Commit 00fb3f3 · 1 Parent(s): 6ee9f1c

Update README.md

Files changed (1)
  1. README.md +5 -5
README.md CHANGED
````diff
@@ -19,17 +19,17 @@ base_model:
 - **KV cache quantization:** OCP FP8
 - **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)
 
-The model is the quantized version of the [Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct) model, which is an auto-regressive language model that uses an optimized transformer architecture. For more information, please check [here](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct). The MXFP4 model is quantized with [AMD-Quark](https://quark.docs.amd.com/latest/index.html).
+The model is the quantized version of the [Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct) model, which is an auto-regressive language model that uses an optimized transformer architecture. For more information, please check [here](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct). The MXFP4 model is quantized with [AMD-Quark](https://quark.docs.amd.com/latest/index.html).
 
 
 # Model Quantization
 
-This model was obtained by quantizing [Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct)'s weights and activations to MXFP4 and KV caches to FP8, using AutoSmoothQuant algorithm in [AMD-Quark](https://quark.docs.amd.com/latest/index.html).
+This model was obtained by quantizing [Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct)'s weights and activations to MXFP4 and KV caches to FP8, using AutoSmoothQuant algorithm in [AMD-Quark](https://quark.docs.amd.com/latest/index.html).
 
 **Quantization scripts:**
 ```
 cd Quark/examples/torch/language_modeling/llm_ptq/
-python3 quantize_quark.py --model_dir "meta-llama/Meta-Llama-3.1-405B-Instruct" \
+python3 quantize_quark.py --model_dir "meta-llama/Llama-3.1-405B-Instruct" \
 --model_attn_implementation "sdpa" \
 --quant_scheme w_mxfp4_a_mxfp4 \
 --kv_cache_dtype fp8 \
@@ -56,9 +56,9 @@ Evaluation was conducted using the framework [lm-evaluation-harness](https://git
 <tr>
 <td><strong>Benchmark</strong>
 </td>
-<td><strong>Meta-Llama-3.1-405B-Instruct </strong>
+<td><strong>Llama-3.1-405B-Instruct </strong>
 </td>
-<td><strong>Meta-Llama-3.1-405B-Instruct-MXFP4(this model)</strong>
+<td><strong>Llama-3.1-405B-Instruct-MXFP4(this model)</strong>
 </td>
 <td><strong>Recovery</strong>
 </td>
````
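For context on the format named throughout the README text in this diff, MXFP4 is the OCP microscaling FP4 format: elements are stored as FP4 (E2M1) values in blocks of 32 that share a single power-of-two (E8M0) scale. The sketch below is a minimal NumPy illustration of that block quantization under the OCP MX conventions; it is an assumption-based example, not AMD-Quark's or this repository's implementation, and the function name is made up.

```python
import numpy as np

# Illustrative MXFP4 block quantization (hypothetical, not AMD-Quark code):
# 32 elements share one power-of-two (E8M0) scale, each element is FP4 E2M1.
FP4_E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK_SIZE = 32
E2M1_EMAX = 2  # exponent of the largest E2M1 magnitude (6.0 = 1.5 * 2**2)

def quantize_mxfp4_block(block: np.ndarray) -> tuple[float, np.ndarray]:
    """Quantize one block of 32 floats to (shared scale, FP4 element values)."""
    max_abs = np.max(np.abs(block))
    if max_abs == 0:
        return 1.0, np.zeros_like(block)
    # Shared scale is a power of two chosen so the largest element lands
    # near the top of the E2M1 range.
    shared_exp = int(np.floor(np.log2(max_abs))) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    scaled = block / scale
    # Round each scaled element to the nearest representable E2M1 magnitude.
    signs = np.sign(scaled)
    idx = np.argmin(np.abs(np.abs(scaled)[:, None] - FP4_E2M1_GRID[None, :]), axis=1)
    fp4 = signs * FP4_E2M1_GRID[idx]
    return scale, fp4

# Example: quantize one random block and check the reconstruction error.
rng = np.random.default_rng(0)
x = rng.standard_normal(BLOCK_SIZE).astype(np.float32)
scale, q = quantize_mxfp4_block(x)
x_hat = scale * q
print("max abs error:", np.max(np.abs(x - x_hat)))
```

Dequantization is simply `scale * fp4`; because the per-block scale is a power of two, applying it amounts to an exponent adjustment, which is part of what makes the format inexpensive to handle.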