Update README.md
- **KV cache quantization:** OCP FP8
- **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)

This model is a quantized version of [meta-llama/Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct), optimized using the [AMD-Quark](https://quark.docs.amd.com/latest/index.html) framework with MXFP4 quantization.

# Model Quantization
The model was quantized from [meta-llama/Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). Weights and activations were quantized to MXFP4, and KV caches were quantized to FP8. The AutoSmoothQuant algorithm was applied to enhance accuracy during quantization.
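To make the MXFP4 format concrete, here is a toy, spec-level sketch of what quantizing one 32-element block involves: the block shares a single power-of-two (E8M0) scale, and each element is rounded to the nearest FP4 (E2M1) value. This is an illustration only, not the AMD-Quark implementation; all function names below are made up for the example, and plain numpy floats stand in for the packed 4-bit codes.

```python
import numpy as np

# Magnitudes representable by FP4 E2M1 (per the OCP Microscaling spec)
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4_block(block):
    """Quantize a 32-element block: one shared power-of-two (E8M0) scale
    plus per-element FP4 (E2M1) values. Returns (scale, quantized values)."""
    amax = np.max(np.abs(block))
    if amax == 0.0:
        return 1.0, np.zeros_like(block)
    # Shared scale: 2^(floor(log2(amax)) - 2), so the largest element
    # lands within FP4's dynamic range (max magnitude 6.0).
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2)
    scaled = block / scale
    # Round each magnitude to the nearest FP4 grid point (clamps at 6.0).
    idx = np.argmin(np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]), axis=1)
    return scale, np.sign(scaled) * FP4_GRID[idx]

def dequantize_block(scale, q):
    """Reconstruct approximate values from the shared scale and FP4 codes."""
    return scale * q

rng = np.random.default_rng(0)
x = rng.normal(size=32)
scale, q = quantize_mxfp4_block(x)
x_hat = dequantize_block(scale, q)
```

In a real MXFP4 tensor the per-element values are stored as packed 4-bit codes and the per-block scale as an 8-bit exponent; the sketch keeps everything as floats for readability.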

**Quantization scripts:**

```