nm-research committed (verified)
Commit 5034416 · Parent(s): 76cd16b

Update README.md

Files changed (1): README.md (+141 -36)
README.md CHANGED
@@ -43,7 +43,7 @@ from transformers import AutoTokenizer
  from vllm import LLM, SamplingParams
  
  max_model_len, tp_size = 4096, 1
- model_name = "neuralmagic-ent/granite-3.1-8b-instruct-quantized.w4a16"
+ model_name = "neuralmagic/granite-3.1-8b-instruct-quantized.w4a16"
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True)
  sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])
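The hunk above covers only the middle of the README's vLLM deployment snippet; the chat-prompt construction and the generate call sit outside the diff context. A minimal sketch of how the surrounding example presumably runs end to end (the prompt text is illustrative, not taken from the model card):

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Same setup as the diff context above, with the renamed repo id.
max_model_len, tp_size = 4096, 1
model_name = "neuralmagic/granite-3.1-8b-instruct-quantized.w4a16"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name, tensor_parallel_size=tp_size,
          max_model_len=max_model_len, trust_remote_code=True)
sampling_params = SamplingParams(temperature=0.3, max_tokens=256,
                                 stop_token_ids=[tokenizer.eos_token_id])

# Illustrative prompt; the actual example lives in the unchanged part of the README.
messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```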
@@ -66,6 +66,8 @@ vLLM also supports OpenAI-compatible serving. See the [documentation](https://do
  
  This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.
  
+ <details>
+ <summary>Model Creation Code</summary>
  
  ```bash
  python quantize.py --model_path ibm-granite/granite-3.1-8b-instruct --quant_path "output_dir/granite-3.1-8b-instruct-quantized.w4a16" --calib_size 1024 --dampening_frac 0.1 --observer mse --actorder static
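The hunk header above references the README's note that vLLM also supports OpenAI-compatible serving. As a minimal sketch that is not part of the diff, and assuming the model is served locally with `vllm serve neuralmagic/granite-3.1-8b-instruct-quantized.w4a16` on the default port, a request could look like:

```python
from openai import OpenAI

# Assumes a vLLM OpenAI-compatible server is already running at localhost:8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="neuralmagic/granite-3.1-8b-instruct-quantized.w4a16",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    temperature=0.3,
    max_tokens=256,
)
print(response.choices[0].message.content)
```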
@@ -146,16 +148,20 @@ oneshot(
  model.save_pretrained(SAVE_DIR, save_compressed=True)
  tokenizer.save_pretrained(SAVE_DIR)
  ```
+ </details>
  
  ## Evaluation
  
- The model was evaluated on OpenLLM Leaderboard [V1](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard) and on [HumanEval](https://github.com/neuralmagic/evalplus), using the following commands:
+ The model was evaluated on OpenLLM Leaderboard [V1](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard), OpenLLM Leaderboard [V2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/) and on [HumanEval](https://github.com/neuralmagic/evalplus), using the following commands:
  
+ <details>
+ <summary>Evaluation Commands</summary>
+
  OpenLLM Leaderboard V1:
  ```
  lm_eval \
  --model vllm \
- --model_args pretrained="neuralmagic-ent/granite-3.1-8b-instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
+ --model_args pretrained="neuralmagic/granite-3.1-8b-instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks openllm \
  --write_out \
  --batch_size auto \
@@ -163,11 +169,23 @@ lm_eval \
  --show_config
  ```
  
+ OpenLLM Leaderboard V2:
+ ```
+ lm_eval \
+ --model vllm \
+ --model_args pretrained="neuralmagic/granite-3.1-8b-instruct-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
+ --tasks leaderboard \
+ --write_out \
+ --batch_size auto \
+ --output_path output_dir \
+ --show_config
+ ```
+
  #### HumanEval
  ##### Generation
  ```
  python3 codegen/generate.py \
- --model neuralmagic-ent/granite-3.1-8b-instruct-quantized.w4a16 \
+ --model neuralmagic/granite-3.1-8b-instruct-quantized.w4a16 \
  --bs 16 \
  --temperature 0.2 \
  --n_samples 50 \
@@ -177,47 +195,125 @@ python3 codegen/generate.py \
  ##### Sanitization
  ```
  python3 evalplus/sanitize.py \
- humaneval/neuralmagic-ent--granite-3.1-8b-instruct-quantized.w4a16_vllm_temp_0.2
+ humaneval/neuralmagic--granite-3.1-8b-instruct-quantized.w4a16_vllm_temp_0.2
  ```
  ##### Evaluation
  ```
  evalplus.evaluate \
  --dataset humaneval \
- --samples humaneval/neuralmagic-ent--granite-3.1-8b-instruct-quantized.w4a16_vllm_temp_0.2-sanitized
+ --samples humaneval/neuralmagic--granite-3.1-8b-instruct-quantized.w4a16_vllm_temp_0.2-sanitized
  ```
+ </details>
  
  ### Accuracy
  
- #### OpenLLM Leaderboard V1 evaluation scores
-
- | Metric | ibm-granite/granite-3.1-8b-instruct | neuralmagic-ent/granite-3.1-8b-instruct-quantized.w4a16 |
- |-----------------------------------------|:---------------------------------:|:-------------------------------------------:|
- | ARC-Challenge (Acc-Norm, 25-shot) | 66.81 | 66.81 |
- | GSM8K (Strict-Match, 5-shot) | 64.52 | 65.66 |
- | HellaSwag (Acc-Norm, 10-shot) | 84.18 | 83.62 |
- | MMLU (Acc, 5-shot) | 65.52 | 64.25 |
- | TruthfulQA (MC2, 0-shot) | 60.57 | 60.17 |
- | Winogrande (Acc, 5-shot) | 80.19 | 78.37 |
- | **Average Score** | **70.30** | **69.81** |
- | **Recovery** | **100.00** | **99.31** |
-
- #### OpenLLM Leaderboard V2 evaluation scores
-
- | Metric | ibm-granite/granite-3.1-8b-instruct | neuralmagic-ent/granite-3.1-8b-instruct-quantized.w4a16 |
- |-----------------------------------------|:---------------------------------:|:-------------------------------------------:|
- | IFEval (Inst Level Strict Acc, 0-shot)| 74.01 | 73.14 |
- | BBH (Acc-Norm, 3-shot) | 53.19 | 51.52 |
- | Math-Hard (Exact-Match, 4-shot) | 14.77 | 16.66 |
- | GPQA (Acc-Norm, 0-shot) | 31.76 | 29.91 |
- | MUSR (Acc-Norm, 0-shot) | 46.01 | 45.75 |
- | MMLU-Pro (Acc, 5-shot) | 35.81 | 34.23 |
- | **Average Score** | **42.61** | **41.87** |
- | **Recovery** | **100.00** | **98.26** |
-
- #### HumanEval pass@1 scores
- | Metric | ibm-granite/granite-3.1-8b-instruct | neuralmagic-ent/granite-3.1-8b-instruct-quantized.w4a16 |
- |-----------------------------------------|:---------------------------------:|:-------------------------------------------:|
- | HumanEval Pass@1 | 71.00 | 70.50 |
+ <table>
+ <thead>
+ <tr>
+ <th>Category</th>
+ <th>Metric</th>
+ <th>ibm-granite/granite-3.1-8b-instruct</th>
+ <th>neuralmagic/granite-3.1-8b-instruct-quantized.w4a16</th>
+ <th>Recovery (%)</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td rowspan="7"><b>OpenLLM Leaderboard V1</b></td>
+ <td>ARC-Challenge (Acc-Norm, 25-shot)</td>
+ <td>66.81</td>
+ <td>66.81</td>
+ <td>100.00</td>
+ </tr>
+ <tr>
+ <td>GSM8K (Strict-Match, 5-shot)</td>
+ <td>64.52</td>
+ <td>65.66</td>
+ <td>101.77</td>
+ </tr>
+ <tr>
+ <td>HellaSwag (Acc-Norm, 10-shot)</td>
+ <td>84.18</td>
+ <td>83.62</td>
+ <td>99.33</td>
+ </tr>
+ <tr>
+ <td>MMLU (Acc, 5-shot)</td>
+ <td>65.52</td>
+ <td>64.25</td>
+ <td>98.06</td>
+ </tr>
+ <tr>
+ <td>TruthfulQA (MC2, 0-shot)</td>
+ <td>60.57</td>
+ <td>60.17</td>
+ <td>99.34</td>
+ </tr>
+ <tr>
+ <td>Winogrande (Acc, 5-shot)</td>
+ <td>80.19</td>
+ <td>78.37</td>
+ <td>97.73</td>
+ </tr>
+ <tr>
+ <td><b>Average Score</b></td>
+ <td><b>70.30</b></td>
+ <td><b>69.81</b></td>
+ <td><b>99.31</b></td>
+ </tr>
+ <tr>
+ <td rowspan="7"><b>OpenLLM Leaderboard V2</b></td>
+ <td>IFEval (Inst Level Strict Acc, 0-shot)</td>
+ <td>74.01</td>
+ <td>73.14</td>
+ <td>98.82</td>
+ </tr>
+ <tr>
+ <td>BBH (Acc-Norm, 3-shot)</td>
+ <td>53.19</td>
+ <td>51.52</td>
+ <td>96.86</td>
+ </tr>
+ <tr>
+ <td>Math-Hard (Exact-Match, 4-shot)</td>
+ <td>14.77</td>
+ <td>16.66</td>
+ <td>112.81</td>
+ </tr>
+ <tr>
+ <td>GPQA (Acc-Norm, 0-shot)</td>
+ <td>31.76</td>
+ <td>29.91</td>
+ <td>94.17</td>
+ </tr>
+ <tr>
+ <td>MUSR (Acc-Norm, 0-shot)</td>
+ <td>46.01</td>
+ <td>45.75</td>
+ <td>99.44</td>
+ </tr>
+ <tr>
+ <td>MMLU-Pro (Acc, 5-shot)</td>
+ <td>35.81</td>
+ <td>34.23</td>
+ <td>95.59</td>
+ </tr>
+ <tr>
+ <td><b>Average Score</b></td>
+ <td><b>42.61</b></td>
+ <td><b>41.87</b></td>
+ <td><b>98.26</b></td>
+ </tr>
+ <tr>
+ <td rowspan="2"><b>HumanEval</b></td>
+ <td>HumanEval Pass@1</td>
+ <td>71.00</td>
+ <td>70.50</td>
+ <td><b>99.30</b></td>
+ </tr>
+ </tbody>
+ </table>
+
  
  
  ## Inference Performance
@@ -226,6 +322,15 @@ evalplus.evaluate \
  This model achieves up to 2.7x speedup in single-stream deployment and up to 1.5x speedup in multi-stream asynchronous deployment, depending on hardware and use-case scenario.
  The following performance benchmarks were conducted with [vLLM](https://docs.vllm.ai/en/latest/) version 0.6.6.post1, and [GuideLLM](https://github.com/neuralmagic/guidellm).
  
+ <details>
+ <summary>Benchmarking Command</summary>
+
+ ```
+ guidellm --model neuralmagic/granite-3.1-8b-instruct-quantized.w4a16 --target "http://localhost:8000/v1" --data-type emulated --data "prompt_tokens=<prompt_tokens>,generated_tokens=<generated_tokens>" --max seconds 360 --backend aiohttp_server
+ ```
+
+ </details>
+
  ### Single-stream performance (measured with vLLM version 0.6.6.post1)
  <table>
  <tr>
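A closing note on the accuracy table this commit introduces: the new Recovery (%) column is simply the quantized model's score expressed as a percentage of the baseline score. A quick sanity check against the GSM8K row (the numbers come from the table above; the snippet itself is not part of the commit):

```python
# Recovery (%) = quantized score / baseline score * 100
baseline, quantized = 64.52, 65.66  # GSM8K values from the accuracy table
print(f"{quantized / baseline * 100:.2f}")  # 101.77, matching the table
```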