---
library_name: vllm
license: apache-2.0
language:
- en
- fr
- es
- it
- pt
- zh
- ar
- ru
base_model:
- HuggingFaceTB/SmolLM3-3B
tags:
- neuralmagic
- redhat
- llmcompressor
- fp8
- quantized
---

## Model Overview
- **Model Architecture:** SmolLM3-3B
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP8
  - **Activation quantization:** FP8
- **Release Date:** 07/28/2025
- **Version:** 1.0
- **License(s):** Apache-2.0
- **Model Developers:** Red Hat (Neural Magic)

### Model Optimizations

This model was obtained by quantizing the weights and activations of [SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) to the FP8 data type.
This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x).
Weight quantization also reduces disk size requirements by approximately 50%.

Only the weights and activations of the linear operators within transformer blocks are quantized.
Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme.
The [llm-compressor](https://github.com/vllm-project/llm-compressor) library is used for quantization.

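To make the scheme concrete, the snippet below is a minimal, illustrative sketch of the computation, not the llm-compressor implementation: each weight row gets one fixed (static) scale per output channel, while activations get one scale per token computed on the fly. It assumes PyTorch's `float8_e4m3fn` type (maximum finite value 448).

```python
import torch

FP8_MAX = 448.0  # largest finite value of torch.float8_e4m3fn

def quantize_weight_per_channel(w: torch.Tensor):
    """Static, symmetric, per-output-channel weight quantization to FP8."""
    # One scale per output channel (row of the weight matrix), fixed at export time.
    scale = (w.abs().amax(dim=1, keepdim=True) / FP8_MAX).clamp(min=1e-12)
    w_fp8 = (w / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return w_fp8, scale

def quantize_activation_per_token(x: torch.Tensor):
    """Dynamic, symmetric, per-token activation quantization to FP8."""
    # One scale per token (row of the activation matrix), recomputed every forward pass.
    scale = (x.abs().amax(dim=-1, keepdim=True) / FP8_MAX).clamp(min=1e-12)
    x_fp8 = (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale

# Toy linear layer: quantize weights and input, then approximate the original matmul.
w = torch.randn(8, 16)   # [out_features, in_features]
x = torch.randn(4, 16)   # [tokens, in_features]
w_fp8, w_scale = quantize_weight_per_channel(w)
x_fp8, x_scale = quantize_activation_per_token(x)
y_approx = (x_fp8.float() * x_scale) @ (w_fp8.float() * w_scale).T  # ~ x @ w.T
```
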
## Deployment

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/SmolLM3-3B-FP8-dynamic"
number_gpus = 1

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Render the chat template to a plain-text prompt, ending with the assistant turn marker
prompts = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.

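As a minimal serving sketch (the port and `api_key` placeholder below are illustrative assumptions, not taken from this model card), the server can be started with `vllm serve RedHatAI/SmolLM3-3B-FP8-dynamic` and queried with any OpenAI-compatible client:

```python
from openai import OpenAI

# Assumes a local server started with:  vllm serve RedHatAI/SmolLM3-3B-FP8-dynamic
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/SmolLM3-3B-FP8-dynamic",
    messages=[
        {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
        {"role": "user", "content": "Who are you?"},
    ],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)
```
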
## Creation

<details>
<summary>Creation details</summary>

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

# Load model
model_stub = "HuggingFaceTB/SmolLM3-3B"
model_name = model_stub.split("/")[-1]

tokenizer = AutoTokenizer.from_pretrained(model_stub)

model = AutoModelForCausalLM.from_pretrained(
    model_stub,
    device_map="auto",
    torch_dtype="auto",
)

# Configure the quantization algorithm and scheme:
# FP8 weights (static, per-channel) and FP8 activations (dynamic, per-token),
# applied to all Linear layers except the lm_head
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_dynamic",
    ignore=["lm_head"],
)

# Apply quantization
oneshot(
    model=model,
    recipe=recipe,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-FP8-dynamic"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
```
</details>

## Evaluation

This model was evaluated on the well-known reasoning tasks AIME24, MATH-500, and GPQA-Diamond.
In all cases, model outputs were generated with the [vLLM](https://docs.vllm.ai/en/stable/) engine, and evaluations were collected with the [LightEval](https://github.com/huggingface/lighteval) library.

<details>
<summary>Evaluation details</summary>

```bash
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export MODEL="RedHatAI/SmolLM3-3B-FP8-dynamic"
export MODEL_ARGS="model_name=$MODEL,dtype=auto,max_model_length=65536,gpu_memory_utilization=0.9,tensor_parallel_size=1,add_special_tokens=False,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"

export TASK=aime24  # {aime24, math_500, gpqa:diamond}

lighteval vllm $MODEL_ARGS "lighteval|${TASK}|0|0" \
  --use-chat-template \
  --output-dir out_dir
```
</details>

### Accuracy

<table>
  <tr>
    <th>Category</th>
    <th>Benchmark</th>
    <th>HuggingFaceTB/SmolLM3-3B</th>
    <th>RedHatAI/SmolLM3-3B-FP8-dynamic<br>(this model)</th>
    <th>Recovery</th>
  </tr>
  <tr>
    <td rowspan="5"><strong>Reasoning</strong></td>
    <td>AIME24 (pass@1:64)</td>
    <td>45.31</td>
    <td>47.50</td>
    <td>104.83%</td>
  </tr>
  <tr>
    <td>MATH-500 (pass@1:4)</td>
    <td>89.30</td>
    <td>88.30</td>
    <td>98.88%</td>
  </tr>
  <tr>
    <td>GPQA-Diamond (pass@1:8)</td>
    <td>41.22</td>
    <td>40.91</td>
    <td>99.25%</td>
  </tr>
  <tr>
    <td>GSM-8K (CoT, 8-shot, strict-match)</td>
    <td>94.16</td>
    <td>94.92</td>
    <td>100.81%</td>
  </tr>
  <tr>
    <td><strong>Average</strong></td>
    <td><strong>58.61</strong></td>
    <td><strong>58.90</strong></td>
    <td><strong>100.50%</strong></td>
  </tr>
</table>

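Recovery is the quantized model's score expressed as a percentage of the baseline score, and the Average row matches the mean over the three LightEval reasoning benchmarks (AIME24, MATH-500, GPQA-Diamond). A small sketch of the arithmetic, using the values from the table above:

```python
# Recovery = quantized score / baseline score, expressed as a percentage.
baseline  = {"AIME24": 45.31, "MATH-500": 89.30, "GPQA-Diamond": 41.22, "GSM-8K": 94.16}
quantized = {"AIME24": 47.50, "MATH-500": 88.30, "GPQA-Diamond": 40.91, "GSM-8K": 94.92}

for task in baseline:
    print(f"{task}: {100 * quantized[task] / baseline[task]:.2f}%")

# Average over the three reasoning benchmarks (AIME24, MATH-500, GPQA-Diamond)
reasoning = ["AIME24", "MATH-500", "GPQA-Diamond"]
base_avg  = sum(baseline[t] for t in reasoning) / len(reasoning)   # ~58.61
quant_avg = sum(quantized[t] for t in reasoning) / len(reasoning)  # ~58.90
print(f"Average recovery: {100 * quant_avg / base_avg:.1f}%")      # ~100.5%
```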