---
license: llama3.1
base_model:
- meta-llama/Llama-3.1-405B-Instruct
---
# Model Overview
- **Model Architecture:** Llama-3.1
- **Input:** Text
- **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm:** 7.0
- **Preferred Operating System(s):** Linux
- **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html)
- **Weight quantization:** OCP MXFP4, Static
- **Activation quantization:** OCP MXFP4, Dynamic
- **KV cache quantization:** OCP FP8, Static
- **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)

This model was built with Meta Llama by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html) for MXFP4 quantization.
# Model Quantization
The model was quantized from [meta-llama/Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). Weights and activations were quantized to MXFP4, and KV caches were quantized to FP8. The AutoSmoothQuant algorithm was applied to enhance accuracy during quantization.
**Quantization scripts:**
```
# Run from the Quark examples directory
cd Quark/examples/torch/language_modeling/llm_ptq/

# MXFP4 weights and activations (group size 32), static FP8 KV cache,
# and the AutoSmoothQuant algorithm; the quantized checkpoint is
# exported in Hugging Face format to --output_dir.
python3 quantize_quark.py --model_dir "meta-llama/Llama-3.1-405B-Instruct" \
    --model_attn_implementation "sdpa" \
    --quant_scheme w_mxfp4_a_mxfp4 \
    --group_size 32 \
    --kv_cache_dtype fp8 \
    --quant_algo autosmoothquant \
    --min_kv_scale 1.0 \
    --model_export hf_format \
    --output_dir amd/Llama-3.1-405B-Instruct-MXFP4 \
    --multi_gpu
```
# Deployment
## Use with vLLM
This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.
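As a minimal serving sketch (assuming this repository's model id and the tensor-parallel and KV-cache settings used in the evaluation commands below):

```
vllm serve amd/Llama-3.1-405B-Instruct-MXFP4 \
    --tensor-parallel-size 8 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.85
```

The server exposes an OpenAI-compatible API, which can then be queried, for example:

```
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "amd/Llama-3.1-405B-Instruct-MXFP4", "messages": [{"role": "user", "content": "Explain MXFP4 quantization in one sentence."}]}'
```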
# Evaluation
The model was evaluated on MMLU and GSM8K_COT. Evaluation was conducted with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) framework and the vLLM engine.
## Accuracy
| Benchmark | Llama-3.1-405B-Instruct | Llama-3.1-405B-Instruct-MXFP4 (this model) | Recovery |
|---|---|---|---|
| MMLU (5-shot) | 87.63 | 86.62 | 98.85% |
| GSM8K_COT (8-shot, strict-match) | 96.51 | 96.06 | 99.53% |

Recovery is the quantized model's score as a percentage of the baseline score (e.g., 86.62 / 87.63 ≈ 98.85%).
## Reproduction
The results were obtained using the following commands:
### MMLU
```
# 5-shot MMLU (Llama task variant) with the chat template applied
lm_eval \
    --model vllm \
    --model_args pretrained="amd/Llama-3.1-405B-Instruct-MXFP4-Preview",gpu_memory_utilization=0.85,tensor_parallel_size=8,kv_cache_dtype='fp8' \
    --tasks mmlu_llama \
    --fewshot_as_multiturn \
    --apply_chat_template \
    --num_fewshot 5 \
    --batch_size auto
```
### GSM8K_COT
```
# 8-shot GSM8K (chain-of-thought, Llama task variant) with the chat template applied
lm_eval \
    --model vllm \
    --model_args pretrained="amd/Llama-3.1-405B-Instruct-MXFP4-Preview",gpu_memory_utilization=0.85,tensor_parallel_size=8,kv_cache_dtype='fp8' \
    --tasks gsm8k_llama \
    --fewshot_as_multiturn \
    --apply_chat_template \
    --num_fewshot 8 \
    --batch_size auto
```
# License
Modifications copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved.