---
license: llama3.1
base_model:
- meta-llama/Llama-3.1-405B-Instruct
---

# Model Overview

- **Model Architecture:** Llama-3.1
  - **Input:** Text
  - **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm:** 7.0
- **Preferred Operating System(s):** Linux
- **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html)
  - **Weight quantization:** OCP MXFP4, Static
  - **Activation quantization:** OCP MXFP4, Dynamic
  - **KV cache quantization:** OCP FP8, Static 
- **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)

This model was built with Meta Llama by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html) for MXFP4 quantization.

# Model Quantization

The model was quantized from [meta-llama/Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). Weights and activations were quantized to MXFP4, and KV caches were quantized to FP8. The AutoSmoothQuant algorithm was applied to enhance accuracy during quantization.
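For intuition, the sketch below fake-quantizes a single group of values to MXFP4: in the OCP microscaling format, each group of 32 elements shares one power-of-two (E8M0) scale, and each element is stored as a 4-bit FP4 (E2M1) value. This is an illustrative model of the numerics only, not Quark's implementation; the helper name and the use of NumPy are assumptions.

```
# Illustrative OCP MXFP4 numerics (not AMD-Quark internals).
import numpy as np

# Representable FP4 E2M1 magnitudes; the sign is handled separately.
FP4_E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quantize_mxfp4(x):
    """Fake-quantize one group of 32 values to MXFP4, then dequantize."""
    assert x.size == 32
    amax = np.abs(x).max()
    if amax == 0.0:
        return np.zeros_like(x)
    # Shared E8M0 scale: a power of two sized so the block maximum lands
    # within E2M1's representable range (maximum magnitude 6.0).
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2)
    scaled = x / scale
    # Round each element to the nearest representable E2M1 magnitude
    # (anything beyond 6.0 saturates to 6.0).
    idx = np.abs(np.abs(scaled)[:, None] - FP4_E2M1[None, :]).argmin(axis=1)
    return np.sign(scaled) * FP4_E2M1[idx] * scale

block = np.random.randn(32)
print("max abs quantization error:", np.abs(block - fake_quantize_mxfp4(block)).max())
```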

**Quantization script:**
```
cd Quark/examples/torch/language_modeling/llm_ptq/
python3 quantize_quark.py --model_dir "meta-llama/Llama-3.1-405B-Instruct" \
                          --model_attn_implementation "sdpa" \
                          --quant_scheme w_mxfp4_a_mxfp4 \
                          --group_size 32 \
                          --kv_cache_dtype fp8 \
                          --quant_algo autosmoothquant \
                          --min_kv_scale 1.0 \
                          --model_export hf_format \
                          --output_dir amd/Llama-3.1-405B-Instruct-MXFP4 \
                          --multi_gpu
```
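Note that `quantize_quark.py` lives in AMD-Quark's examples tree, so the `cd` above assumes the Quark examples have already been downloaded or cloned locally; see the Quark documentation for setup.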

# Deployment
## Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.
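A minimal serving example (illustrative; the tensor-parallel size and KV-cache dtype mirror the evaluation settings below and should be adapted to your hardware):

```
vllm serve amd/Llama-3.1-405B-Instruct-MXFP4-Preview \
    --tensor-parallel-size 8 \
    --kv-cache-dtype fp8
```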

# Evaluation

The model was evaluated on MMLU and GSM8K_COT.
Evaluations were conducted using the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) framework with the vLLM engine.

## Accuracy

| Benchmark | Llama-3.1-405B-Instruct | Llama-3.1-405B-Instruct-MXFP4 (this model) | Recovery |
|---|---|---|---|
| MMLU (5-shot) | 87.63 | 86.62 | 98.85% |
| GSM8K_COT (8-shot, strict-match) | 96.51 | 96.06 | 99.53% |
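Recovery is the quantized model's score expressed as a percentage of the unquantized baseline; for example, MMLU recovery is 86.62 / 87.63 ≈ 98.85%.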


## Reproduction

The results were obtained using the following commands:

### MMLU
```
lm_eval \
    --model vllm \
    --model_args pretrained="amd/Llama-3.1-405B-Instruct-MXFP4-Preview",gpu_memory_utilization=0.85,tensor_parallel_size=8,kv_cache_dtype='fp8' \
    --tasks mmlu_llama \
    --fewshot_as_multiturn \
    --apply_chat_template \
    --num_fewshot 5 \
    --batch_size auto
```

### GSM8K_COT
```
lm_eval \
    --model vllm \
    --model_args pretrained="amd/Llama-3.1-405B-Instruct-MXFP4-Preview",gpu_memory_utilization=0.85,tensor_parallel_size=8,kv_cache_dtype='fp8' \
    --tasks gsm8k_llama \
    --fewshot_as_multiturn \
    --apply_chat_template \
    --num_fewshot 8 \
    --batch_size auto
```

# License

Modifications Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved.