update evaluation section and remove extra space
README.md
CHANGED
@@ -164,7 +164,7 @@ You can specify custom instruction through the system prompt while controlling w
 ```python
 prompt = "Give me a brief explanation of gravity in simple terms."
 messages = [
-{"role": "system", "content": "Speak like a pirate
+{"role": "system", "content": "Speak like a pirate./think"},
 {"role": "user", "content": prompt}
 ]
 
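As an illustration of how the snippet in the hunk above is typically run end to end with the standard `transformers` chat-template API. This is a sketch only: the model id and generation settings below are assumptions for illustration, not taken from the README or from this change.

```python
# Illustrative sketch (not part of the diff): run the README's messages through
# the standard transformers chat-template API. Model id and sampling settings
# are assumptions for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM3-3B"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Give me a brief explanation of gravity in simple terms."
messages = [
    # "/think" in the system prompt toggles extended thinking, per the README section above
    {"role": "system", "content": "Speak like a pirate./think"},
    {"role": "user", "content": prompt},
]

# Render the chat template and generate a completion
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```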
@@ -179,11 +179,42 @@ For local inference, you can use `llama.cpp`, `ONNX`, `MLX` and `MLC`. You can f
 
 ## Evaluation
 
-In this section, we report the evaluation results of SmolLM3 base model. All evaluations are zero-shot unless stated otherwise, and we use [lighteval](https://github.com/huggingface/lighteval) to run them.
+In this section, we report the evaluation results of SmolLM3 base model. All evaluations are zero-shot unless stated otherwise, and we use [lighteval](https://github.com/huggingface/lighteval) to run them.
 
 We highlight the best score in bold and underline the second-best score.
 
+## Instruction Model
+
+### No Extended Thinking
+Evaluation results of non reasoning models and reasoning models in no thinking mode. We highlight the best and second-best scores in bold.
+| Category | Metric | SmoLLM3-3B | Qwen2.5-3B | Llama3.1-3B | Qwen3-1.7B | Qwen3-4B |
+|---------|--------|------------|------------|-------------|------------|----------|
+| High school math competition | AIME 2025 | <u>9.3</u> | 2.9 | 0.3 | 8.0 | **17.1** |
+| Math problem-solving | GSM-Plus | 72.8 | <u>74.1</u> | 59.2 | 68.3 | **82.1** |
+| Competitive programming | LiveCodeBench v4 | <u>15.2</u> | 10.5 | 3.4 | 15.0 | **24.9** |
+| Graduate-level reasoning | GPQA Diamond | <u>35.7</u> | 32.2 | 29.4 | 31.8 | **44.4** |
+| Instruction following | IFEval | **76.7** | 65.6 | 71.6 | <u>74.0</u> | 68.9 |
+| Alignment | MixEval Hard | 26.9 | <u>27.6</u> | 24.9 | 24.3 | **31.6** |
+| Tool Calling | BFCL| <u>92.3</u> | - | <u>92.3</u> * | 89.5 | **95.0** |
+| Multilingual Q&A | Global MMLU | <u>53.5</u> | 50.54 | 46.8 | 49.5 | **65.1** |
+(*): this is a tool calling finetune
+
+### Extended Thinking
+Evaluation results in reasoning mode for SmolLM3 and Qwen3 models:
+| Category | Metric | SmoLLM3-3B | Qwen3-1.7B | Qwen3-4B |
+|---------|--------|------------|------------|----------|
+| High school math competition | AIME 2025 | <u>36.7</u> | 30.7 | **58.8** |
+| Math problem-solving | GSM-Plus | <u>83.4</u> | 79.4 | **88.2** |
+| Competitive programming | LiveCodeBench v4 | 30.0 | <u>34.4</u> | **52.9** |
+| Graduate-level reasoning | GPQA Diamond | <u>41.7</u> | 39.9 | **55.3** |
+| Instruction following | IFEval | 71.2 | <u>74.2</u> | **85.4** |
+| Alignment | MixEval Hard | 30.8 | <u>33.9</u> | **38.0** |
+| Tool Calling | BFCL | <u>88.8</u> | <u>88.8</u> | **95.5** |
+| Multilingual Q&A | Global MMLU | <u>64.1</u> | 62.3 | **73.3** |
+
+
 ## Base Pre-Trained Model
+For Ruler 64k evaluation, we apply YaRN to the Qwen models with 32k context to extrapolate the context length.
 
 ### English benchmarks
 Note: All evaluations are zero-shot unless stated otherwise.
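The new note added above mentions applying YaRN to the 32k-context Qwen models for the Ruler 64k evaluation. As a rough illustration only (the exact evaluation setup is not described in this diff), rope scaling of this kind can be requested when loading a model with `transformers`; the model id, scaling factor, and context target below are assumptions.

```python
# Rough illustration of YaRN-style context extrapolation when loading a model
# with transformers. The model id and scaling factor are assumptions; the actual
# Ruler 64k evaluation setup is not part of this change.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-1.7B-Base",            # assumed 32k-context baseline
    rope_scaling={
        "rope_type": "yarn",           # YaRN rotary-position scaling
        "factor": 2.0,                 # 32k * 2 = 64k target context (assumption)
        "original_max_position_embeddings": 32768,
    },
    max_position_embeddings=65536,     # let evaluation code use the longer window
)
```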
@@ -212,7 +243,6 @@ Note: All evaluations are zero-shot unless stated otherwise.
 ### Multilingual benchmarks
 
 
-
 | Category | Metric | SmolLM3 3B Base | Qwen2.5-3B | Llama3.2 3B | Qwen3 1.7B Base | Qwen3 4B Base |
 |---------|--------|---------------------|------------|--------------|------------------|---------------|
 | Main supported languages | | | | | | | |
@@ -251,35 +281,6 @@ The model has also been trained on Arabic (standard), Chinese and Russian data,
 | | Global MMLU (CF) | <u>36.51</u> | 32.47 | 34.52 | 34.83 | **38.80** |
 | | Flores-200 (5-shot) | 47.13 | 48.74 | 50.74 | <u>54.70</u> | **60.53** |
 
-
-## Instruction Model
-
-### No Extended Thinking
-Evaluation results of non reasoning models and reasoning models in no thinking mode. We highlight the best and second-best scores in bold.
-| Category | Metric | SmoLLM3-3B | Qwen2.5-3B | Llama3.1-3B | Qwen3-1.7B | Qwen3-4B |
-|---------|--------|------------|------------|-------------|------------|----------|
-| High school math competition | AIME 2025 | <u>9.3</u> | 2.9 | 0.3 | 8.0 | **17.1** |
-| Math problem-solving | GSM-Plus | 72.8 | <u>74.1</u> | 59.2 | 68.3 | **82.1** |
-| Competitive programming | LiveCodeBench v4 | <u>15.2</u> | 10.5 | 3.4 | 15.0 | **24.9** |
-| Graduate-level reasoning | GPQA Diamond | <u>35.7</u> | 32.2 | 29.4 | 31.8 | **44.4** |
-| Instruction following | IFEval | **76.7** | 65.6 | 71.6 | <u>74.0</u> | 68.9 |
-| Alignment | MixEval Hard | 26.9 | <u>27.6</u> | 24.9 | 24.3 | **31.6** |
-| Knowledge | MMLU-Pro | 45.0 | 41.9 | 36.6 | <u>45.6</u> | **60.9** |
-| Multilingual Q&A | Global MMLU | <u>53.5</u> | 50.54 | 46.8 | 49.5 | **65.1** |
-
-### Extended Thinking
-Evaluation results in reasoning mode for SmolLM3 and Qwen3 models:
-| Category | Metric | SmoLLM3-3B | Qwen3-1.7B | Qwen3-4B |
-|---------|--------|------------|------------|----------|
-| High school math competition | AIME 2025 | <u>36.7</u> | 30.7 | **58.8** |
-| Math problem-solving | GSM-Plus | <u>83.4</u> | 79.4 | **88.2** |
-| Competitive programming | LiveCodeBench v4 | 30.0 | <u>34.4</u> | **52.9** |
-| Graduate-level reasoning | GPQA Diamond | <u>41.7</u> | 39.9 | **55.3** |
-| Instruction following | IFEval | 71.2 | <u>74.2</u> | **85.4** |
-| Alignment | MixEval Hard | 30.8 | <u>33.9</u> | **38.0** |
-| Knowledge | MMLU-Pro | <u>58.4</u> | 57.8 | **70.2** |
-| Multilingual Q&A | Global MMLU | <u>64.1</u> | 62.3 | **73.3** |
-
 ## Training
 
 ### Model
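For context on how zero-shot numbers such as the Global MMLU (CF) row above are typically produced: lighteval (linked in the Evaluation intro) scores cloze-form multiple-choice benchmarks by comparing the log-likelihood the model assigns to each candidate answer, with no in-context examples. The sketch below re-implements that idea for illustration only; it is not lighteval's API, and the model id and toy question are assumptions.

```python
# Illustrative sketch of zero-shot cloze-form (CF) multiple-choice scoring:
# pick the answer whose continuation the model assigns the highest log-likelihood.
# This mimics the idea behind such evaluations; it is NOT lighteval's API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM3-3B"   # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

question = "Question: What force pulls objects toward the Earth?\nAnswer:"  # toy example
choices = [" gravity", " magnetism", " friction", " inertia"]

def continuation_logprob(context: str, continuation: str) -> float:
    """Sum of log-probabilities the model assigns to `continuation` given `context`.

    Assumes the tokens of `context` are a prefix of the tokens of
    `context + continuation`, which holds for typical tokenizations.
    """
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probability of each token given the preceding tokens
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    cont_ids = full_ids[:, ctx_ids.shape[1]:]
    cont_logprobs = logprobs[0, ctx_ids.shape[1] - 1:, :].gather(-1, cont_ids[0].unsqueeze(-1))
    return cont_logprobs.sum().item()

scores = {c: continuation_logprob(question, c) for c in choices}
print(max(scores, key=scores.get))  # predicted answer, zero-shot (no few-shot examples)
```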