---
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen2.5-7B/blob/main/LICENSE
language:
- en
pipeline_tag: text-generation
base_model: Qwen/Qwen2.5-7B-Instruct
tags:
- chat
- neuralmagic
- llmcompressor
- int8
---

# Qwen2.5-7B-Instruct-quantized.w8a8

## Model Overview
- **Model Architecture:** Qwen2
- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
  - **Activation quantization:** INT8
  - **Weight quantization:** INT8
- **Intended Use Cases:** Intended for commercial and research use in multiple languages. Similarly to [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B), this model is intended for assistant-like chat.
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws).
- **Release Date:** 10/09/2024
- **Version:** 1.0
- **License(s):** [apache-2.0](https://huggingface.co/Qwen/Qwen2.5-7B/blob/main/LICENSE)
- **Model Developers:** Neural Magic

### Model Optimizations

This model was obtained by quantizing the activations and weights of [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) to the INT8 data type.
This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x).
Weight quantization also reduces disk size requirements by approximately 50%.

Only the weights and activations of the linear operators within transformer blocks are quantized.
Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme.
A combination of the [SmoothQuant](https://arxiv.org/abs/2211.10438) and [GPTQ](https://arxiv.org/abs/2210.17323) algorithms is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.

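To make the two schemes concrete, the sketch below applies the same per-channel and per-token arithmetic to toy tensors. It is a simplified illustration rather than the llm-compressor implementation, and the function names are hypothetical.

```python
import torch

def quantize_weights_per_channel(w: torch.Tensor):
    # Symmetric static per-channel scheme: one scale per output channel (row),
    # computed once from the weights and fixed at inference time.
    scales = w.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(w / scales), -128, 127).to(torch.int8)
    return q, scales

def quantize_activations_per_token(x: torch.Tensor):
    # Symmetric dynamic per-token scheme: one scale per token (row),
    # recomputed on the fly for every activation tensor.
    scales = x.abs().amax(dim=-1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(x / scales), -128, 127).to(torch.int8)
    return q, scales

# Dequantization recovers an approximation of the original tensor.
w = torch.randn(8, 16)
q, s = quantize_weights_per_channel(w)
w_hat = q.float() * s
print((w - w_hat).abs().max())  # per-channel quantization error
```
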
## Deployment

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Qwen2.5-7B-Instruct-quantized.w8a8"
number_gpus = 1
max_model_len = 8192

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "user", "content": "Give me a short introduction to large language models."},
]

# Render the chat template into a plain prompt string, appending the
# assistant turn so the model knows to respond.
prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus, max_model_len=max_model_len)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.

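For example, after launching a server with `vllm serve RedHatAI/Qwen2.5-7B-Instruct-quantized.w8a8`, the endpoint can be queried with the OpenAI Python client. A minimal sketch, assuming the server's default local address (`http://localhost:8000/v1`):

```python
from openai import OpenAI

# Point the client at the local vLLM server; the API key is unused by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="RedHatAI/Qwen2.5-7B-Instruct-quantized.w8a8",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    temperature=0.7,
    top_p=0.8,
    max_tokens=256,
)
print(completion.choices[0].message.content)
```
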
## Creation

<details>
  <summary>Creation details</summary>
This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.


```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot
from datasets import load_dataset

# Load model
model_stub = "Qwen/Qwen2.5-7B-Instruct"
model_name = model_stub.split("/")[-1]

num_samples = 512
max_seq_len = 8192

tokenizer = AutoTokenizer.from_pretrained(model_stub)

model = AutoModelForCausalLM.from_pretrained(
    model_stub,
    device_map="auto",
    torch_dtype="auto",
)

def preprocess_fn(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}

ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.map(preprocess_fn)

# Configure the quantization algorithm and scheme
recipe = [
    SmoothQuantModifier(
        smoothing_strength=0.8,
        mappings=[
            [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
            [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"],
            [["re:.*down_proj"], "re:.*up_proj"],
        ],
    ),
    GPTQModifier(
        ignore=["lm_head"],
        sequential_targets=["Qwen2DecoderLayer"],
        dampening_frac=0.01,
        targets="Linear",
        scheme="W8A8",
    ),
]

# Apply quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-quantized.w8a8"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
```
</details>

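A quick way to sanity-check the result is to reload the saved checkpoint, which vLLM reads directly in compressed-tensors format. A minimal sketch, reusing the `save_path` produced by the snippet above:

```python
from vllm import LLM, SamplingParams

# Load the quantized checkpoint written by save_pretrained above.
llm = LLM(model="Qwen2.5-7B-Instruct-quantized.w8a8")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```
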
## Evaluation

The model was evaluated on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) leaderboard tasks (version 1) with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/387bbd54bc621086e05aa1b030d8d4d5635b25e6) (commit 387bbd54bc621086e05aa1b030d8d4d5635b25e6) and the [vLLM](https://docs.vllm.ai/en/stable/) engine, using the following command:
```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/Qwen2.5-7B-Instruct-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.5,max_model_len=4096,enable_chunked_prefill=True,tensor_parallel_size=1 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --tasks openllm \
  --batch_size auto
```

### Accuracy

#### Open LLM Leaderboard evaluation scores

| Benchmark | Qwen2.5-7B-Instruct | Qwen2.5-7B-Instruct-quantized.w8a8<br>(this model) | Recovery |
| --- | --- | --- | --- |
| MMLU (5-shot) | 74.24 | 73.87 | 99.5% |
| ARC Challenge (25-shot) | 63.40 | 63.23 | 99.7% |
| GSM-8K (5-shot, strict-match) | 80.36 | 80.74 | 100.5% |
| HellaSwag (10-shot) | 81.52 | 81.06 | 99.4% |
| Winogrande (5-shot) | 74.66 | 74.82 | 100.2% |
| TruthfulQA (0-shot, mc2) | 64.76 | 64.58 | 99.7% |
| **Average** | **73.16** | **73.05** | **99.8%** |
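
Recovery is the quantized model's score expressed as a fraction of the baseline model's score; for example, for the MMLU row:

```python
# Recovery for the MMLU row: quantized score over baseline score.
print(f"{73.87 / 74.24:.1%}")  # 99.5%
```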