---
language:
- en
- fr
- de
- es
- pt
- it
- ja
- ko
- ru
- zh
- ar
- fa
- id
- ms
- ne
- pl
- ro
- sr
- sv
- tr
- uk
- vi
- hi
- bn
license: apache-2.0
library_name: vllm
base_model:
- mistralai/Mistral-Small-3.1-24B-Instruct-2503
pipeline_tag: image-text-to-text
tags:
- neuralmagic
- redhat
- llmcompressor
- quantized
- int4
---

# Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16

## Model Overview
- **Model Architecture:** Mistral3ForConditionalGeneration
  - **Input:** Text / Image
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** INT4
- **Intended Use Cases:** This model is ideal for:
  - Fast-response conversational agents.
  - Low-latency function calling.
  - Subject matter experts via fine-tuning.
  - Local inference for hobbyists and organizations handling sensitive data.
  - Programming and math reasoning.
  - Long document understanding.
  - Visual understanding.
- **Out-of-scope:** This model is not specifically designed or evaluated for all downstream purposes; thus:
  1. Developers should consider common limitations of language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios.
  2. Developers should be aware of and adhere to applicable laws and regulations (including privacy and trade compliance laws) that are relevant to their use case, including the model's supported languages.
  3. Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under.
- **Release Date:** 04/15/2025
- **Version:** 1.0
- **Model Developers:** Red Hat (Neural Magic)

### Model Optimizations

This model was obtained by quantizing the weights of [Mistral-Small-3.1-24B-Instruct-2503](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503) to the INT4 data type.
This optimization reduces the number of bits per parameter from 16 to 4, cutting the disk size and GPU memory requirements by approximately 75%.

Only the weights of the linear operators within transformer blocks are quantized.
Weights are quantized using a symmetric per-group scheme, with group size 128.
The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
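
As a back-of-the-envelope check on that 75% figure (an illustrative sketch only; it ignores the unquantized embeddings and vision tower, and assumes one 16-bit scale per group of 128 weights):

```python
params = 24e9            # roughly 24B weights in the quantized linear layers
bf16_bytes = params * 2  # 16 bits per weight

group_size = 128
int4_bytes = params * 0.5              # 4 bits per weight
scale_bytes = params / group_size * 2  # one 16-bit scale per group of 128

quantized_bytes = int4_bytes + scale_bytes
print(f"BF16 weights:  {bf16_bytes / 1e9:.0f} GB")       # ~48 GB
print(f"INT4 + scales: {quantized_bytes / 1e9:.1f} GB")  # ~12.4 GB
print(f"Reduction:     {1 - quantized_bytes / bf16_bytes:.0%}")  # ~74%
```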

## Deployment

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16"
number_gpus = 1

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]

# Render the chat template before generation, since this is an instruction-tuned model
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
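
For example, once a server is running (e.g. via `vllm serve RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16`), any OpenAI client can query it. The sketch below assumes the server's default local endpoint and uses a placeholder image URL to exercise the vision input:

```python
from openai import OpenAI

# Assumes a vLLM OpenAI-compatible server is already running locally on the
# default port; the API key is a dummy value the local server does not check.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            # Placeholder URL: substitute an image reachable from the server
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)
```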

## Creation

<details>
<summary>Creation details</summary>

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

```python
import io

from datasets import load_dataset, interleave_datasets
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot
from llmcompressor.transformers.tracing import TraceableMistral3ForConditionalGeneration
from PIL import Image
from transformers import AutoProcessor

# Load model
model_stub = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
model_name = model_stub.split("/")[-1]

num_text_samples = 1024
num_vision_samples = 1024
max_seq_len = 8192

processor = AutoProcessor.from_pretrained(model_stub)

model = TraceableMistral3ForConditionalGeneration.from_pretrained(
    model_stub,
    device_map="auto",
    torch_dtype="auto",
)

# Text-only data subset
def preprocess_text(example):
    input = {
        "text": processor.apply_chat_template(
            example["messages"],
            add_generation_prompt=False,
        ),
        "images": None,
    }
    tokenized_input = processor(**input, max_length=max_seq_len, truncation=True)
    tokenized_input["pixel_values"] = tokenized_input.get("pixel_values", None)
    tokenized_input["image_sizes"] = tokenized_input.get("image_sizes", None)
    return tokenized_input

dst = load_dataset("neuralmagic/calibration", name="LLM", split="train").select(range(num_text_samples))
dst = dst.map(preprocess_text, remove_columns=dst.column_names)

# Text + vision data subset
def preprocess_vision(example):
    messages = []
    image = None
    for message in example["messages"]:
        message_content = []
        for content in message["content"]:
            if content["type"] == "text":
                message_content.append({"type": "text", "text": content["text"]})
            else:
                message_content.append({"type": "image"})
                image = Image.open(io.BytesIO(content["image"]))

        messages.append(
            {
                "role": message["role"],
                "content": message_content,
            }
        )

    input = {
        "text": processor.apply_chat_template(
            messages,
            add_generation_prompt=False,
        ),
        "images": image,
    }
    tokenized_input = processor(**input, max_length=max_seq_len, truncation=True)
    tokenized_input["pixel_values"] = tokenized_input.get("pixel_values", None)
    tokenized_input["image_sizes"] = tokenized_input.get("image_sizes", None)
    return tokenized_input

dsv = load_dataset("neuralmagic/calibration", name="VLLM", split="train").select(range(num_vision_samples))
dsv = dsv.map(preprocess_vision, remove_columns=dsv.column_names)

# Interleave the text-only and vision calibration subsets
ds = interleave_datasets([dsv, dst])

# Configure the quantization algorithm and scheme
recipe = GPTQModifier(
    ignore=["language_model.lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
    sequential_targets=["MistralDecoderLayer"],
    dampening_frac=0.01,
    targets="Linear",
    scheme="W4A16",
)

# Define data collator (single-sample batches from the interleaved dataset)
def data_collator(batch):
    import torch
    assert len(batch) == 1
    collated = {}
    for k, v in batch[0].items():
        if v is None:
            continue
        if k == "input_ids":
            collated[k] = torch.LongTensor(v)
        elif k == "pixel_values":
            collated[k] = torch.tensor(v, dtype=torch.bfloat16)
        else:
            collated[k] = torch.tensor(v)
    return collated

# Apply quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    data_collator=data_collator,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-quantized.w4a16"
model.save_pretrained(save_path)
processor.save_pretrained(save_path)
print(f"Model and processor saved to: {save_path}")
```
</details>
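
A quick way to sanity-check the saved checkpoint is to inspect the `quantization_config` that llm-compressor writes into `config.json`. The sketch below assumes the compressed-tensors field layout used by these checkpoints:

```python
import json
import os

# Local path produced by the creation script above
save_path = "Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16"

with open(os.path.join(save_path, "config.json")) as f:
    config = json.load(f)

# Per-group weight quantization parameters of the first config group
weights_cfg = config["quantization_config"]["config_groups"]["group_0"]["weights"]
print(weights_cfg["num_bits"], weights_cfg["group_size"], weights_cfg["symmetric"])
# Expected for this model: 4 128 True
```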

## Evaluation

The model was evaluated on the OpenLLM leaderboard tasks (version 1), MMLU-Pro, GPQA, HumanEval, and MBPP.
Non-coding tasks were evaluated with [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), whereas coding tasks were evaluated with a fork of [evalplus](https://github.com/neuralmagic/evalplus).
[vLLM](https://docs.vllm.ai/en/stable/) was used as the engine in all cases.

<details>
<summary>Evaluation details</summary>

**MMLU**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks mmlu \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**ARC Challenge**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks arc_challenge \
  --num_fewshot 25 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**GSM8k**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16",dtype=auto,gpu_memory_utilization=0.9,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks gsm8k \
  --num_fewshot 8 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**Hellaswag**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks hellaswag \
  --num_fewshot 10 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**Winogrande**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks winogrande \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**TruthfulQA**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks truthfulqa \
  --num_fewshot 0 \
  --apply_chat_template \
  --batch_size auto
```

**MMLU-Pro**
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16",dtype=auto,gpu_memory_utilization=0.5,max_model_len=8192,enable_chunked_prefill=True,tensor_parallel_size=2 \
  --tasks mmlu_pro \
  --num_fewshot 5 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --batch_size auto
```

**Coding**

The commands below can be used for MBPP by simply replacing the dataset name.

*Generation*
```
python3 codegen/generate.py \
  --model RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16 \
  --bs 16 \
  --temperature 0.2 \
  --n_samples 50 \
  --root "." \
  --dataset humaneval
```

*Sanitization*
```
python3 evalplus/sanitize.py \
  humaneval/RedHatAI--Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16_vllm_temp_0.2
```

*Evaluation*
```
evalplus.evaluate \
  --dataset humaneval \
  --samples humaneval/RedHatAI--Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16_vllm_temp_0.2-sanitized
```
</details>

### Accuracy

#### Open LLM Leaderboard evaluation scores
<table>
  <tr>
    <td><strong>Benchmark</strong></td>
    <td><strong>Mistral-Small-3.1-24B-Instruct-2503</strong></td>
    <td><strong>Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16<br>(this model)</strong></td>
    <td><strong>Recovery</strong></td>
  </tr>
  <tr>
    <td>MMLU (5-shot)</td>
    <td>80.67</td>
    <td>79.74</td>
    <td>98.9%</td>
  </tr>
  <tr>
    <td>ARC Challenge (25-shot)</td>
    <td>72.78</td>
    <td>72.18</td>
    <td>99.2%</td>
  </tr>
  <tr>
    <td>GSM-8K (5-shot, strict-match)</td>
    <td>65.35</td>
    <td>66.34</td>
    <td>101.5%</td>
  </tr>
  <tr>
    <td>Hellaswag (10-shot)</td>
    <td>83.70</td>
    <td>83.25</td>
    <td>99.5%</td>
  </tr>
  <tr>
    <td>Winogrande (5-shot)</td>
    <td>83.74</td>
    <td>83.43</td>
    <td>99.6%</td>
  </tr>
  <tr>
    <td>TruthfulQA (0-shot, mc2)</td>
    <td>70.62</td>
    <td>69.56</td>
    <td>98.5%</td>
  </tr>
  <tr>
    <td><strong>Average</strong></td>
    <td><strong>76.14</strong></td>
    <td><strong>75.75</strong></td>
    <td><strong>99.5%</strong></td>
  </tr>
</table>
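
Recovery is the quantized model's score as a percentage of the unquantized baseline's score. As a worked example using the ARC Challenge row (rounded scores from the table):

```python
baseline = 72.78   # Mistral-Small-3.1-24B-Instruct-2503, ARC Challenge (25-shot)
quantized = 72.18  # this model

recovery = 100 * quantized / baseline
print(f"{recovery:.1f}%")  # 99.2%, matching the table
```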