---
license: llama3.1
base_model:
- meta-llama/Llama-3.1-405B-Instruct
---

# Model Overview

- **Model Architecture:** Meta-Llama-3.1
- **Input:** Text
- **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI300/MI350
- **Preferred Operating System(s):** Linux
- **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html)
- **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)

This model is the quantized version of [Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct), an auto-regressive language model that uses an optimized transformer architecture; see the original model card for more information. The MXFP4 model is quantized with [AMD-Quark](https://quark.docs.amd.com/latest/index.html).
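For intuition, MXFP4 (per the OCP Microscaling specification) stores tensors in blocks of 32 four-bit E2M1 elements that share a single power-of-two scale. The toy NumPy sketch below illustrates that idea only; it is not the Quark implementation, and `quantize_block` is a hypothetical helper written for this card.

```python
import numpy as np

# Magnitudes representable by FP4 (E2M1): sign bit, 2 exponent bits, 1 mantissa bit.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(x):
    """Toy MXFP4-style quantization of one 32-element block (illustration only)."""
    # Shared power-of-two scale, chosen so the block's max magnitude maps near 6.0.
    amax = np.max(np.abs(x))
    scale = 2.0 ** np.floor(np.log2(amax / FP4_GRID[-1])) if amax > 0 else 1.0
    # Round each scaled element to the nearest representable FP4 magnitude.
    idx = np.abs(np.abs(x / scale)[:, None] - FP4_GRID).argmin(axis=1)
    return np.sign(x) * FP4_GRID[idx] * scale

block = np.random.randn(32).astype(np.float32)
print("max abs error:", np.abs(block - quantize_block(block)).max())
```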

# Model Quantization

This model was obtained by quantizing the weights and activations of [Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct) to MXFP4 and the KV cache to FP8, using the AutoSmoothQuant algorithm in AMD-Quark.

**Quantization script:**
```
cd Quark/examples/torch/language_modeling/llm_ptq/
python3 quantize_quark.py --model_dir "meta-llama/Meta-Llama-3.1-405B-Instruct" \
                          --model_attn_implementation "sdpa" \
                          --quant_scheme w_mxfp4_a_mxfp4 \
                          --kv_cache_dtype fp8 \
                          --quant_algo autosmoothquant \
                          --min_kv_scale 1.0 \
                          --model_export hf_format \
                          --output_dir $output_path \
                          --multi_gpu
```
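
After export, the `hf_format` output directory should contain a Hugging Face-style `config.json`. A quick sanity check of the recorded quantization settings might look like the sketch below; the `quantization_config` field name is an assumption that may vary across Quark releases, and the path is a placeholder for the `--output_dir` used above.

```python
import json
import os

# Placeholder: substitute the --output_dir passed to quantize_quark.py above.
output_path = "./llama-3.1-405b-instruct-mxfp4"

with open(os.path.join(output_path, "config.json")) as f:
    config = json.load(f)

# Print the recorded quantization settings (field name assumed; it may
# differ between Quark releases).
print(json.dumps(config.get("quantization_config", {}), indent=2))
```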

# Deployment

## Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "amd/Llama-3.1-405B-Instruct-MXFP4-Preview"
number_gpus = 8

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus, max_model_len=4096)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```
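
The model can also be exposed through vLLM's OpenAI-compatible server and queried with any OpenAI client. A minimal sketch, assuming the server is running locally on the default port 8000:

```python
# Start the server first, e.g.:
#   vllm serve amd/Llama-3.1-405B-Instruct-MXFP4-Preview \
#       --tensor-parallel-size 8 --max-model-len 4096
from openai import OpenAI

# vLLM's server does not check the API key by default; any string works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="amd/Llama-3.1-405B-Instruct-MXFP4-Preview",
    messages=[
        {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
        {"role": "user", "content": "Who are you?"},
    ],
    temperature=0.6,
    top_p=0.9,
    max_tokens=256,
)
print(response.choices[0].message.content)
```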

## Evaluation

The model was evaluated on MMLU and GSM8K-CoT.
Evaluation was conducted using the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) framework and the vLLM engine.

### Accuracy

#### Open LLM Leaderboard evaluation scores

| Benchmark | Meta-Llama-3.1-405B-Instruct | Meta-Llama-3.1-405B-Instruct-MXFP4 (this model) | Recovery |
|---|---|---|---|
| MMLU (5-shot) | 87.63 | 86.62 | 98.85% |
| GSM8K-CoT (8-shot, strict match) | 96.51 | 96.06 | 99.53% |
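
Recovery is the quantized model's score expressed as a percentage of the baseline score, e.g. for MMLU:

```python
# Recovery = quantized score / baseline score, as a percentage.
baseline, quantized = 87.63, 86.62  # MMLU (5-shot)
print(f"Recovery: {100 * quantized / baseline:.2f}%")  # -> Recovery: 98.85%
```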

### Reproduction

The results were obtained using the following commands:

#### MMLU
```
lm_eval \
  --model vllm \
  --model_args pretrained="amd/Llama-3.1-405B-Instruct-MXFP4-Preview",gpu_memory_utilization=0.85,tensor_parallel_size=8,kv_cache_dtype='fp8' \
  --tasks mmlu_llama \
  --fewshot_as_multiturn \
  --apply_chat_template \
  --num_fewshot 5 \
  --batch_size auto
```

#### GSM8K-CoT
```
lm_eval \
  --model vllm \
  --model_args pretrained="amd/Llama-3.1-405B-Instruct-MXFP4-Preview",gpu_memory_utilization=0.85,tensor_parallel_size=8,kv_cache_dtype='fp8' \
  --tasks gsm8k_llama \
  --fewshot_as_multiturn \
  --apply_chat_template \
  --num_fewshot 8 \
  --batch_size auto
```

# License

Modifications copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.