linzhao-amd committed (verified)
Commit 00fb3f3 · 1 Parent(s): 6ee9f1c

Update README.md

Files changed (1)
  1. README.md +5 -5
README.md CHANGED
````diff
@@ -19,17 +19,17 @@ base_model:
 - **KV cache quantization:** OCP FP8
 - **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)
 
-The model is the quantized version of the [Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct) model, which is an auto-regressive language model that uses an optimized transformer architecture. For more information, please check [here](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct). The MXFP4 model is quantized with [AMD-Quark](https://quark.docs.amd.com/latest/index.html).
+The model is the quantized version of the [Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct) model, which is an auto-regressive language model that uses an optimized transformer architecture. For more information, please check [here](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct). The MXFP4 model is quantized with [AMD-Quark](https://quark.docs.amd.com/latest/index.html).
 
 
 # Model Quantization
 
-This model was obtained by quantizing [Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct)'s weights and activations to MXFP4 and KV caches to FP8, using AutoSmoothQuant algorithm in [AMD-Quark](https://quark.docs.amd.com/latest/index.html).
+This model was obtained by quantizing [Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct)'s weights and activations to MXFP4 and KV caches to FP8, using AutoSmoothQuant algorithm in [AMD-Quark](https://quark.docs.amd.com/latest/index.html).
 
 **Quantization scripts:**
 ```
 cd Quark/examples/torch/language_modeling/llm_ptq/
-python3 quantize_quark.py --model_dir "meta-llama/Meta-Llama-3.1-405B-Instruct" \
+python3 quantize_quark.py --model_dir "meta-llama/Llama-3.1-405B-Instruct" \
 --model_attn_implementation "sdpa" \
 --quant_scheme w_mxfp4_a_mxfp4 \
 --kv_cache_dtype fp8 \
@@ -56,9 +56,9 @@ Evaluation was conducted using the framework [lm-evaluation-harness](https://git
 <tr>
 <td><strong>Benchmark</strong>
 </td>
-<td><strong>Meta-Llama-3.1-405B-Instruct </strong>
+<td><strong>Llama-3.1-405B-Instruct </strong>
 </td>
-<td><strong>Meta-Llama-3.1-405B-Instruct-MXFP4(this model)</strong>
+<td><strong>Llama-3.1-405B-Instruct-MXFP4(this model)</strong>
 </td>
 <td><strong>Recovery</strong>
 </td>
````
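For context on the format named throughout the README text in this diff, MXFP4 is the OCP microscaling FP4 format: elements are stored as FP4 (E2M1) values in blocks of 32 that share a single power-of-two (E8M0) scale. The sketch below is a minimal NumPy illustration of that block quantization under the OCP MX conventions; it is an assumption-based example, not AMD-Quark's or this repository's implementation, and the function name is made up.

```python
import numpy as np

# Illustrative MXFP4 block quantization (hypothetical, not AMD-Quark code):
# 32 elements share one power-of-two (E8M0) scale, each element is FP4 E2M1.
FP4_E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK_SIZE = 32
E2M1_EMAX = 2  # exponent of the largest E2M1 magnitude (6.0 = 1.5 * 2**2)

def quantize_mxfp4_block(block: np.ndarray) -> tuple[float, np.ndarray]:
    """Quantize one block of 32 floats to (shared scale, FP4 element values)."""
    max_abs = np.max(np.abs(block))
    if max_abs == 0:
        return 1.0, np.zeros_like(block)
    # Shared scale is a power of two chosen so the largest element lands
    # near the top of the E2M1 range.
    shared_exp = int(np.floor(np.log2(max_abs))) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    scaled = block / scale
    # Round each scaled element to the nearest representable E2M1 magnitude.
    signs = np.sign(scaled)
    idx = np.argmin(np.abs(np.abs(scaled)[:, None] - FP4_E2M1_GRID[None, :]), axis=1)
    fp4 = signs * FP4_E2M1_GRID[idx]
    return scale, fp4

# Example: quantize one random block and check the reconstruction error.
rng = np.random.default_rng(0)
x = rng.standard_normal(BLOCK_SIZE).astype(np.float32)
scale, q = quantize_mxfp4_block(x)
x_hat = scale * q
print("max abs error:", np.max(np.abs(x - x_hat)))
```

Dequantization is simply `scale * fp4`; because the per-block scale is a power of two, applying it amounts to an exponent adjustment, which is part of what makes the format inexpensive to handle.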