loubnabnl (HF Staff) committed
Commit de0f86a · verified · 1 Parent(s): c48d82f

update evaluation section and remove extra space

Files changed (1): README.md (+33, -32)
README.md CHANGED
@@ -164,7 +164,7 @@ You can specify custom instruction through the system prompt while controlling w
 ```python
 prompt = "Give me a brief explanation of gravity in simple terms."
 messages = [
-    {"role": "system", "content": "Speak like a pirate. /think"},
+    {"role": "system", "content": "Speak like a pirate./think"},
     {"role": "user", "content": prompt}
 ]
 
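For context on the hunk above: the `/think` suffix in the system prompt is what toggles SmolLM3's extended thinking mode, and the message list is normally passed through the tokenizer's chat template before generation. The sketch below shows one way that usually looks with `transformers`; the checkpoint name, generation settings, and the `/no_think` alternative are assumptions drawn from typical usage, not part of this commit.

```python
# Minimal usage sketch (illustrative, not part of this commit).
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM3-3B"  # assumed instruct checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prompt = "Give me a brief explanation of gravity in simple terms."
messages = [
    # The trailing /think flag requests extended thinking; /no_think disables it.
    {"role": "system", "content": "Speak like a pirate./think"},
    {"role": "user", "content": prompt},
]

# Render the chat template, then generate and decode only the new tokens.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```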
 
@@ -179,11 +179,42 @@ For local inference, you can use `llama.cpp`, `ONNX`, `MLX` and `MLC`. You can f
 
 ## Evaluation
 
-In this section, we report the evaluation results of the SmolLM3 base model. All evaluations are zero-shot unless stated otherwise, and we use [lighteval](https://github.com/huggingface/lighteval) to run them. For the Ruler 64k evaluation, we apply YaRN to the Qwen models with 32k context to extrapolate the context length.
+In this section, we report the evaluation results of the SmolLM3 base model. All evaluations are zero-shot unless stated otherwise, and we use [lighteval](https://github.com/huggingface/lighteval) to run them.
 
 We highlight the best score in bold and underline the second-best score.
 
+## Instruction Model
+
+### No Extended Thinking
+Evaluation results of non-reasoning models and of reasoning models run in no-thinking mode. We highlight the best score in bold and underline the second-best score.
+| Category | Metric | SmolLM3-3B | Qwen2.5-3B | Llama3.2-3B | Qwen3-1.7B | Qwen3-4B |
+|---------|--------|------------|------------|-------------|------------|----------|
+| High school math competition | AIME 2025 | <u>9.3</u> | 2.9 | 0.3 | 8.0 | **17.1** |
+| Math problem-solving | GSM-Plus | 72.8 | <u>74.1</u> | 59.2 | 68.3 | **82.1** |
+| Competitive programming | LiveCodeBench v4 | <u>15.2</u> | 10.5 | 3.4 | 15.0 | **24.9** |
+| Graduate-level reasoning | GPQA Diamond | <u>35.7</u> | 32.2 | 29.4 | 31.8 | **44.4** |
+| Instruction following | IFEval | **76.7** | 65.6 | 71.6 | <u>74.0</u> | 68.9 |
+| Alignment | MixEval Hard | 26.9 | <u>27.6</u> | 24.9 | 24.3 | **31.6** |
+| Tool calling | BFCL | <u>92.3</u> | - | <u>92.3</u>* | 89.5 | **95.0** |
+| Multilingual Q&A | Global MMLU | <u>53.5</u> | 50.54 | 46.8 | 49.5 | **65.1** |
+(*): this model is a tool-calling fine-tune
+
+### Extended Thinking
+Evaluation results in reasoning mode for SmolLM3 and Qwen3 models:
+| Category | Metric | SmolLM3-3B | Qwen3-1.7B | Qwen3-4B |
+|---------|--------|------------|------------|----------|
+| High school math competition | AIME 2025 | <u>36.7</u> | 30.7 | **58.8** |
+| Math problem-solving | GSM-Plus | <u>83.4</u> | 79.4 | **88.2** |
+| Competitive programming | LiveCodeBench v4 | 30.0 | <u>34.4</u> | **52.9** |
+| Graduate-level reasoning | GPQA Diamond | <u>41.7</u> | 39.9 | **55.3** |
+| Instruction following | IFEval | 71.2 | <u>74.2</u> | **85.4** |
+| Alignment | MixEval Hard | 30.8 | <u>33.9</u> | **38.0** |
+| Tool calling | BFCL | <u>88.8</u> | <u>88.8</u> | **95.5** |
+| Multilingual Q&A | Global MMLU | <u>64.1</u> | 62.3 | **73.3** |
+
+
 ## Base Pre-Trained Model
+For the Ruler 64k evaluation, we apply YaRN to the Qwen models with 32k context to extrapolate the context length.
 
 ### English benchmarks
 Note: All evaluations are zero-shot unless stated otherwise.
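Since the base-model results that follow include a Ruler 64k run where YaRN is applied to the 32k-context Qwen baselines, here is a minimal sketch of how that kind of context extrapolation is typically configured in `transformers`. The checkpoint name, scaling factor, and exact `rope_scaling` keys are assumptions for illustration (they vary across library versions), not the precise setup behind these numbers.

```python
# Illustrative sketch: stretch a 32k-context Qwen baseline to ~64k with YaRN.
# Assumed checkpoint and config keys; not the exact evaluation configuration.
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "Qwen/Qwen2.5-3B"  # example baseline with a native 32k window

config = AutoConfig.from_pretrained(model_id)
# Scale the native window by 2x so 64k-token Ruler prompts fit in context.
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 2.0,
    "original_max_position_embeddings": 32768,
}
config.max_position_embeddings = 65536

model = AutoModelForCausalLM.from_pretrained(model_id, config=config)
```

Static YaRN scaling like this applies to every input, so it is usually enabled only for long-context evaluations rather than for short prompts.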
@@ -212,7 +243,6 @@ Note: All evaluations are zero-shot unless stated otherwise.
 ### Multilingual benchmarks
 
 
-
 | Category | Metric | SmolLM3 3B Base | Qwen2.5-3B | Llama3.2 3B | Qwen3 1.7B Base | Qwen3 4B Base |
 |---------|--------|---------------------|------------|--------------|------------------|---------------|
 | Main supported languages | | | | | | |
@@ -251,35 +281,6 @@ The model has also been trained on Arabic (standard), Chinese and Russian data,
 | | Global MMLU (CF) | <u>36.51</u> | 32.47 | 34.52 | 34.83 | **38.80** |
 | | Flores-200 (5-shot) | 47.13 | 48.74 | 50.74 | <u>54.70</u> | **60.53** |
 
-
-## Instruction Model
-
-### No Extended Thinking
-Evaluation results of non-reasoning models and of reasoning models run in no-thinking mode. We highlight the best score in bold and underline the second-best score.
-| Category | Metric | SmolLM3-3B | Qwen2.5-3B | Llama3.2-3B | Qwen3-1.7B | Qwen3-4B |
-|---------|--------|------------|------------|-------------|------------|----------|
-| High school math competition | AIME 2025 | <u>9.3</u> | 2.9 | 0.3 | 8.0 | **17.1** |
-| Math problem-solving | GSM-Plus | 72.8 | <u>74.1</u> | 59.2 | 68.3 | **82.1** |
-| Competitive programming | LiveCodeBench v4 | <u>15.2</u> | 10.5 | 3.4 | 15.0 | **24.9** |
-| Graduate-level reasoning | GPQA Diamond | <u>35.7</u> | 32.2 | 29.4 | 31.8 | **44.4** |
-| Instruction following | IFEval | **76.7** | 65.6 | 71.6 | <u>74.0</u> | 68.9 |
-| Alignment | MixEval Hard | 26.9 | <u>27.6</u> | 24.9 | 24.3 | **31.6** |
-| Knowledge | MMLU-Pro | 45.0 | 41.9 | 36.6 | <u>45.6</u> | **60.9** |
-| Multilingual Q&A | Global MMLU | <u>53.5</u> | 50.54 | 46.8 | 49.5 | **65.1** |
-
-### Extended Thinking
-Evaluation results in reasoning mode for SmolLM3 and Qwen3 models:
-| Category | Metric | SmolLM3-3B | Qwen3-1.7B | Qwen3-4B |
-|---------|--------|------------|------------|----------|
-| High school math competition | AIME 2025 | <u>36.7</u> | 30.7 | **58.8** |
-| Math problem-solving | GSM-Plus | <u>83.4</u> | 79.4 | **88.2** |
-| Competitive programming | LiveCodeBench v4 | 30.0 | <u>34.4</u> | **52.9** |
-| Graduate-level reasoning | GPQA Diamond | <u>41.7</u> | 39.9 | **55.3** |
-| Instruction following | IFEval | 71.2 | <u>74.2</u> | **85.4** |
-| Alignment | MixEval Hard | 30.8 | <u>33.9</u> | **38.0** |
-| Knowledge | MMLU-Pro | <u>58.4</u> | 57.8 | **70.2** |
-| Multilingual Q&A | Global MMLU | <u>64.1</u> | 62.3 | **73.3** |
-
 ## Training
 
 ### Model
 