loubnabnl (HF Staff) committed
Commit de0f86a · verified · 1 Parent(s): c48d82f

update evaluation section and remove extra space

Files changed (1): README.md (+33, -32)
README.md CHANGED
@@ -164,7 +164,7 @@ You can specify custom instruction through the system prompt while controlling w
 ```python
 prompt = "Give me a brief explanation of gravity in simple terms."
 messages = [
-    {"role": "system", "content": "Speak like a pirate. /think"},
+    {"role": "system", "content": "Speak like a pirate./think"},
     {"role": "user", "content": prompt}
 ]
 
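For context on the hunk above: the `/think` suffix in the system prompt is what toggles SmolLM3's extended thinking mode, and the message list is normally passed through the tokenizer's chat template before generation. The sketch below shows one way that usually looks with `transformers`; the checkpoint name, generation settings, and the `/no_think` alternative are assumptions drawn from typical usage, not part of this commit.

```python
# Minimal usage sketch (illustrative, not part of this commit).
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM3-3B"  # assumed instruct checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prompt = "Give me a brief explanation of gravity in simple terms."
messages = [
    # The trailing /think flag requests extended thinking; /no_think disables it.
    {"role": "system", "content": "Speak like a pirate./think"},
    {"role": "user", "content": prompt},
]

# Render the chat template, then generate and decode only the new tokens.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```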
 
@@ -179,11 +179,42 @@ For local inference, you can use `llama.cpp`, `ONNX`, `MLX` and `MLC`. You can f
 
 ## Evaluation
 
-In this section, we report the evaluation results of the SmolLM3 base model. All evaluations are zero-shot unless stated otherwise, and we use [lighteval](https://github.com/huggingface/lighteval) to run them. For the Ruler 64k evaluation, we apply YaRN to the Qwen models with 32k context to extrapolate the context length.
+In this section, we report the evaluation results of the SmolLM3 base model. All evaluations are zero-shot unless stated otherwise, and we use [lighteval](https://github.com/huggingface/lighteval) to run them.
 
 We highlight the best score in bold and underline the second-best score.
 
+## Instruction Model
+
+### No Extended Thinking
+Evaluation results of non-reasoning models and of reasoning models run in no-thinking mode. We highlight the best score in bold and underline the second-best score.
+| Category | Metric | SmolLM3-3B | Qwen2.5-3B | Llama3.2-3B | Qwen3-1.7B | Qwen3-4B |
+|---------|--------|------------|------------|-------------|------------|----------|
+| High school math competition | AIME 2025 | <u>9.3</u> | 2.9 | 0.3 | 8.0 | **17.1** |
+| Math problem-solving | GSM-Plus | 72.8 | <u>74.1</u> | 59.2 | 68.3 | **82.1** |
+| Competitive programming | LiveCodeBench v4 | <u>15.2</u> | 10.5 | 3.4 | 15.0 | **24.9** |
+| Graduate-level reasoning | GPQA Diamond | <u>35.7</u> | 32.2 | 29.4 | 31.8 | **44.4** |
+| Instruction following | IFEval | **76.7** | 65.6 | 71.6 | <u>74.0</u> | 68.9 |
+| Alignment | MixEval Hard | 26.9 | <u>27.6</u> | 24.9 | 24.3 | **31.6** |
+| Tool calling | BFCL | <u>92.3</u> | - | <u>92.3</u>* | 89.5 | **95.0** |
+| Multilingual Q&A | Global MMLU | <u>53.5</u> | 50.54 | 46.8 | 49.5 | **65.1** |
+(*): this model is a tool-calling fine-tune
+
+### Extended Thinking
+Evaluation results in reasoning mode for SmolLM3 and Qwen3 models:
+| Category | Metric | SmolLM3-3B | Qwen3-1.7B | Qwen3-4B |
+|---------|--------|------------|------------|----------|
+| High school math competition | AIME 2025 | <u>36.7</u> | 30.7 | **58.8** |
+| Math problem-solving | GSM-Plus | <u>83.4</u> | 79.4 | **88.2** |
+| Competitive programming | LiveCodeBench v4 | 30.0 | <u>34.4</u> | **52.9** |
+| Graduate-level reasoning | GPQA Diamond | <u>41.7</u> | 39.9 | **55.3** |
+| Instruction following | IFEval | 71.2 | <u>74.2</u> | **85.4** |
+| Alignment | MixEval Hard | 30.8 | <u>33.9</u> | **38.0** |
+| Tool calling | BFCL | <u>88.8</u> | <u>88.8</u> | **95.5** |
+| Multilingual Q&A | Global MMLU | <u>64.1</u> | 62.3 | **73.3** |
+
+
 ## Base Pre-Trained Model
+For the Ruler 64k evaluation, we apply YaRN to the Qwen models with 32k context to extrapolate the context length.
 
 ### English benchmarks
 Note: All evaluations are zero-shot unless stated otherwise.
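Since the base-model results that follow include a Ruler 64k run where YaRN is applied to the 32k-context Qwen baselines, here is a minimal sketch of how that kind of context extrapolation is typically configured in `transformers`. The checkpoint name, scaling factor, and exact `rope_scaling` keys are assumptions for illustration (they vary across library versions), not the precise setup behind these numbers.

```python
# Illustrative sketch: stretch a 32k-context Qwen baseline to ~64k with YaRN.
# Assumed checkpoint and config keys; not the exact evaluation configuration.
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "Qwen/Qwen2.5-3B"  # example baseline with a native 32k window

config = AutoConfig.from_pretrained(model_id)
# Scale the native window by 2x so 64k-token Ruler prompts fit in context.
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 2.0,
    "original_max_position_embeddings": 32768,
}
config.max_position_embeddings = 65536

model = AutoModelForCausalLM.from_pretrained(model_id, config=config)
```

Static YaRN scaling like this applies to every input, so it is usually enabled only for long-context evaluations rather than for short prompts.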
@@ -212,7 +243,6 @@ Note: All evaluations are zero-shot unless stated otherwise.
 ### Multilingual benchmarks
 
 
-
 | Category | Metric | SmolLM3 3B Base | Qwen2.5-3B | Llama3.2 3B | Qwen3 1.7B Base | Qwen3 4B Base |
 |---------|--------|---------------------|------------|--------------|------------------|---------------|
 | Main supported languages | | | | | | |
@@ -251,35 +281,6 @@ The model has also been trained on Arabic (standard), Chinese and Russian data,
 | | Global MMLU (CF) | <u>36.51</u> | 32.47 | 34.52 | 34.83 | **38.80** |
 | | Flores-200 (5-shot) | 47.13 | 48.74 | 50.74 | <u>54.70</u> | **60.53** |
 
-
-## Instruction Model
-
-### No Extended Thinking
-Evaluation results of non-reasoning models and of reasoning models run in no-thinking mode. We highlight the best score in bold and underline the second-best score.
-| Category | Metric | SmolLM3-3B | Qwen2.5-3B | Llama3.2-3B | Qwen3-1.7B | Qwen3-4B |
-|---------|--------|------------|------------|-------------|------------|----------|
-| High school math competition | AIME 2025 | <u>9.3</u> | 2.9 | 0.3 | 8.0 | **17.1** |
-| Math problem-solving | GSM-Plus | 72.8 | <u>74.1</u> | 59.2 | 68.3 | **82.1** |
-| Competitive programming | LiveCodeBench v4 | <u>15.2</u> | 10.5 | 3.4 | 15.0 | **24.9** |
-| Graduate-level reasoning | GPQA Diamond | <u>35.7</u> | 32.2 | 29.4 | 31.8 | **44.4** |
-| Instruction following | IFEval | **76.7** | 65.6 | 71.6 | <u>74.0</u> | 68.9 |
-| Alignment | MixEval Hard | 26.9 | <u>27.6</u> | 24.9 | 24.3 | **31.6** |
-| Knowledge | MMLU-Pro | 45.0 | 41.9 | 36.6 | <u>45.6</u> | **60.9** |
-| Multilingual Q&A | Global MMLU | <u>53.5</u> | 50.54 | 46.8 | 49.5 | **65.1** |
-
-### Extended Thinking
-Evaluation results in reasoning mode for SmolLM3 and Qwen3 models:
-| Category | Metric | SmolLM3-3B | Qwen3-1.7B | Qwen3-4B |
-|---------|--------|------------|------------|----------|
-| High school math competition | AIME 2025 | <u>36.7</u> | 30.7 | **58.8** |
-| Math problem-solving | GSM-Plus | <u>83.4</u> | 79.4 | **88.2** |
-| Competitive programming | LiveCodeBench v4 | 30.0 | <u>34.4</u> | **52.9** |
-| Graduate-level reasoning | GPQA Diamond | <u>41.7</u> | 39.9 | **55.3** |
-| Instruction following | IFEval | 71.2 | <u>74.2</u> | **85.4** |
-| Alignment | MixEval Hard | 30.8 | <u>33.9</u> | **38.0** |
-| Knowledge | MMLU-Pro | <u>58.4</u> | 57.8 | **70.2** |
-| Multilingual Q&A | Global MMLU | <u>64.1</u> | 62.3 | **73.3** |
-
 ## Training
 
 ### Model
 