---
license: other
license_link: LICENSE
library_name: transformers
pipeline_tag: text-generation
datasets:
  - nvidia/OpenMathInstruct-2
  - a-m-team/AM-DeepSeek-R1-Distilled-1.4M
  - SynthLabsAI/Big-Math-RL-Verified
  - zwhe99/DeepMath-103K
  - agentica-org/DeepScaleR-Preview-Dataset
language:
  - en
base_model:
  - amd/Instella-3B-Instruct
---
<div align="center">
  <br>
  <br>
  <h1>Instella-Math✨: Fully Open Language Model with Reasoning Capability</h1>
<a href='https://huggingface.co/amd/Instella-3B-Math'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue'></a>
<a href='https://rocm.blogs.amd.com/artificial-intelligence/instella-math-language/README.html'><img src='https://img.shields.io/badge/Technical-Blog-red'></a> 
</div>

AMD is thrilled to introduce [Instella-Math](https://huggingface.co/amd/Instella-3B-Math), a reasoning-focused language model that marks a major milestone for AMD: as far as we know, it's **the first language model trained with long chain-of-thought reinforcement learning entirely on AMD GPUs**. Starting from [Instella-3B-Instruct](https://huggingface.co/amd/Instella-3B-Instruct), we extended the model’s capabilities through a multi-stage training pipeline, featuring two stages of supervised fine-tuning and three stages of reinforcement learning using the [VERL framework](https://rocm.blogs.amd.com/artificial-intelligence/verl-large-scale/README.html), executed entirely on AMD Instinct™ MI300X GPUs.

# Key Takeaways

- Introducing Instella-Math, the first reasoning-centric 3-billion-parameter language model from AMD, fully trained on 32 AMD Instinct MI300X GPUs.
- Built on the AMD ROCm software stack, Instella-3B-Math leverages efficient distributed training techniques, including reinforcement learning across 4 MI300X nodes (8 GPUs each), demonstrating the scalability and performance of AMD hardware for cutting-edge AI workloads.
- Instella-Math is an open language model whose architecture, training code, weights, and datasets are publicly available, allowing anyone to inspect, use, modify, or build upon the model.

# Instella-Math
Derived from [Instella-3B-Instruct](https://huggingface.co/amd/Instella-3B-Instruct) with an identical architecture, Instella-3B-Math is optimized for logical reasoning, mathematical problem-solving, and chain-of-thought tasks. The training pipeline features two stages of supervised fine-tuning followed by three reinforcement learning stages using the GRPO algorithm, as shown in Figure 1.

<div align="center">
<img src="instella_math_pipeline.png" style="object-fit: contain;"/>
<em><b>Figure 1:</b> Instella-Math Training Steps</em>
</div>

## Example Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "amd/Instella-3B-Math"

# trust_remote_code=True allows custom code shipped with the checkpoint to be loaded.
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", trust_remote_code=True)

# A single-turn chat; the prompt asks for the final answer inside \boxed{}.
prompt = [{"role": "user", "content": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? Let's think step by step and output the final answer within \\boxed{}."}]
# Apply the chat template and tokenize in one step.
inputs = tokenizer.apply_chat_template(
    prompt,
    add_generation_prompt=True,
    return_tensors='pt'
)

# Sample up to 1024 new tokens at temperature 0.8.
tokens = model.generate(
    inputs.to(model.device),
    max_new_tokens=1024,
    temperature=0.8,
    do_sample=True
)

# Decode the prompt plus the completion, keeping special tokens visible.
print(tokenizer.decode(tokens[0], skip_special_tokens=False))
```
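
Because the prompt above asks for the final answer inside `\boxed{}`, the decoded text usually needs a small post-processing step to recover that answer. The helper below is a minimal sketch of one way to do this; `extract_boxed` and `sample_output` are hypothetical names introduced for illustration and are not part of the released code.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are already inside the opening brace of \boxed{
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None  # unbalanced braces: no complete answer found


# Hypothetical model output used only to exercise the helper.
sample_output = "April: 48 clips. May: 48 / 2 = 24 clips. Total: \\boxed{72}"
print(extract_boxed(sample_output))
```

Brace counting is used instead of a simple regular expression so that nested LaTeX such as `\boxed{\frac{1}{2}}` is returned whole.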

# Supervised Finetuning (SFT) 

We perform a two-stage supervised fine-tuning process to gradually enhance the reasoning capabilities of the Instella-3B-Instruct model. The first stage uses instruction tuning to broaden mathematical coverage. The second stage enables the model to generate in-depth analyses and structured reasoning steps, which are crucial for tackling complex problems like Olympiad-level math questions.

## Stage 1: Instruction Tuning with OpenMathInstruct-2 for Mathematical Coverage

In the first stage of SFT, we begin with instruction tuning, which teaches the model to follow instructions or prompts properly, especially in a question-answer or problem-solution format. We use the [OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2) dataset, which consists of 14 million problem-solution pairs generated from the GSM8K and MATH training sets. The model is trained to follow mathematical prompts covering a diverse range of topics from arithmetic and algebra to probability and calculus.
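
As an illustration of the problem-solution format, the sketch below turns one record into chat messages that a chat template could consume. The field names `problem` and `generated_solution` are assumptions about the dataset schema and should be checked against the dataset card; `to_chat_messages` and the toy record are hypothetical and do not come from the Instella-Math training code.

```python
from typing import Dict, List


def to_chat_messages(example: Dict[str, str]) -> List[Dict[str, str]]:
    """Convert one problem-solution pair into user/assistant chat messages."""
    return [
        # Assumed field name for the question text.
        {"role": "user", "content": example["problem"]},
        # Assumed field name for the worked solution.
        {"role": "assistant", "content": example["generated_solution"]},
    ]


# Toy record invented for illustration; not taken from the dataset.
record = {
    "problem": "What is 2 + 3?",
    "generated_solution": "Adding the two numbers gives 2 + 3 = \\boxed{5}.",
}
messages = to_chat_messages(record)
for message in messages:
    print(f"{message['role']}: {message['content']}")
```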

## Stage 2: Deep Reasoning with Long-Context Training on AM-DeepSeek-R1-Distilled

In the second SFT stage, we further improve the model’s reasoning capability by training on [AM-DeepSeek-R1-Distilled-1.4M](https://huggingface.co/datasets/a-m-team/AM-DeepSeek-R1-Distilled-1.4M), which is a large-scale general reasoning task dataset with high-quality and challenging reasoning problems. In this stage, we increase the context length of the model from 4K to 32K to allow the model to learn from the long chain-of-thought responses distilled from large reasoning models such as DeepSeek-R1. 

# Reinforcement Learning (GRPO)

## Stage 1: GRPO with 8 Rollouts and 8K Output Contexts

**Training:** In the first stage of reinforcement learning, we apply the Group Relative Policy Optimization (GRPO) algorithm to train the model on [Big-Math-RL-Verified](https://huggingface.co/datasets/SynthLabsAI/Big-Math-RL-Verified), a curated set of complex multi-step math problems. We generate **8 rollouts per prompt**, each allowing up to **8K output tokens**, to explore diverse reasoning trajectories. The model is trained for **1,200 GRPO steps**, using rule-based reward signals designed by Prime-RL that favor correct solutions in the desired format. Training is distributed over **16 MI300X GPUs across 2 nodes**, with VERL and vLLM enabling stable and efficient rollout collection, reward evaluation, and policy updates.
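
To make the "group relative" part of GRPO concrete, the sketch below scores the 8 rollouts of a single prompt against each other by normalizing each reward with the group's mean and standard deviation. It is a simplified illustration under stated assumptions, not the VERL implementation: the binary rewards are made up, `group_relative_advantages` is a hypothetical helper, and the population standard deviation and `eps` value are choices that real implementations may make differently.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward against its own group's mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rule-based rewards for the 8 rollouts of one prompt:
# 1.0 = correct answer in the desired format, 0.0 = otherwise.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.0f}  advantage={a:+.3f}")
```

Rollouts that beat the group average receive a positive advantage and the rest a negative one; when every rollout in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal.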

## Stage 2: GRPO with Extended 16 Rollouts and 16K Output Contexts on DeepMath

**Training:** To push the limits of long-form reasoning, we conduct a second GRPO stage on [DeepMath](https://huggingface.co/datasets/zwhe99/DeepMath-103K) using **16 rollouts per prompt** with up to **16K output tokens**. This stage is designed to maximize the model's capacity for deep mathematical reasoning, enabling it to solve problems that require extended derivations, multiple nested logical steps, or structured proof-like outputs. In this stage, training is distributed over **32 MI300X GPUs across 4 nodes**, and the model is trained for **600 GRPO steps**.

## Stage 3: GRPO with Extended 16 Rollouts and 16K Output Contexts on DeepScaleR

**Training:** To further improve performance on Olympiad-level math questions, we conduct a third GRPO stage on [DeepScaleR](https://huggingface.co/datasets/agentica-org/DeepScaleR-Preview-Dataset), which contains original questions from real Olympiad math competitions such as AIME (1984-2023) and AMC (prior to 2023). As in Stage 2, Stage 3 training uses **16 rollouts per prompt** with up to **16K output tokens**. In this stage, training is distributed over **32 MI300X GPUs across 4 nodes**, and the model is trained for **740 GRPO steps**.
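
The three GRPO stages can be summarized side by side. The sketch below simply restates the numbers given in this section as data and derives a few totals from them; the `GRPOStage` dataclass is a hypothetical container for illustration and does not mirror any configuration format used by VERL.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GRPOStage:
    name: str
    dataset: str
    rollouts_per_prompt: int
    max_output_k_tokens: int  # "8K" / "16K" as written in this card
    gpus: int
    nodes: int
    steps: int


# Values copied from the three stage descriptions above.
STAGES: List[GRPOStage] = [
    GRPOStage("Stage 1", "Big-Math-RL-Verified", 8, 8, 16, 2, 1200),
    GRPOStage("Stage 2", "DeepMath", 16, 16, 32, 4, 600),
    GRPOStage("Stage 3", "DeepScaleR", 16, 16, 32, 4, 740),
]

for s in STAGES:
    # Upper bound on generated tokens for one prompt's group of rollouts.
    budget_k = s.rollouts_per_prompt * s.max_output_k_tokens
    print(f"{s.name} ({s.dataset}): {s.gpus // s.nodes} GPUs/node, "
          f"up to {budget_k}K output tokens per prompt, {s.steps} steps")

total_steps = sum(s.steps for s in STAGES)
print(f"Total GRPO steps: {total_steps}")  # 1200 + 600 + 740 = 2540
```

Moving from Stage 1 to Stages 2 and 3 quadruples the per-prompt generation budget (from 64K to 256K output tokens at most), which is consistent with the doubling of GPUs from 16 to 32.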

# Results 
<div class="table-wrapper" align="center">

<table>
    <thead>
        <tr>
            <th></th>
            <th>Size</th>
            <th>MATH 500</th>
            <th>GSM8K</th>
            <th>GPQA-D</th>
            <th>AIME 2024</th>
            <th>AIME 2025</th>
            <th>AMC</th>
            <th>Minerva</th>
            <th>OlympiadBench</th>
            <th>Average</th>
        </tr>
    </thead>
    <tbody>
        <tr>
          <th colspan="11">Open Weight Models</th>
        </tr>
        <tr>
            <th>Qwen2.5-Math-1.5B</th>
            <td>1.5B</td>
            <td>57.81</td>
            <td>66.31</td>
            <td>15.40</td>
            <td>7.71</td>
            <td>3.96</td>
            <td>35.77</td>
            <td>15.72</td>
            <td>25.98</td>
            <td>28.58</td>
        </tr>
        <tr>
            <th>DeepSeek-R1-Distill-Qwen-1.5B</th>
            <td>1.5B</td>
            <td>82.58</td>
            <td>84.06</td>
            <td>16.48</td>
            <td>27.50</td>
            <td>22.50</td>
            <td>63.48</td>
            <td>26.52</td>
            <td>43.00</td>
            <td>45.76</td>
        </tr>
        <tr>
            <th>STILL-3-1.5B-preview</th>
            <td>1.5B</td>
            <td>84.59</td>
            <td>86.57</td>
            <td>19.48</td>
            <td>30.63</td>
            <td>25.21</td>
            <td>66.72</td>
            <td>28.58</td>
            <td>45.29</td>
            <td>48.38</td>
        </tr>
        <tr>
            <th>DeepScaleR-1.5B-Preview</th>
            <td>1.5B</td>
            <td>87.43</td>
            <td>87.34</td>
            <td>16.45</td>
            <td>40.63</td>
            <td>30.83</td>
            <td>73.19</td>
            <td>30.06</td>
            <td>49.89</td>
            <td>51.98</td>
        </tr>
        <tr>
          <th colspan="11">Fully Open Models</th>
        </tr>
        <tr>
            <th>SmolLM3-3B</th>
            <td>3B</td>
            <td>90.16</td>
            <td>92.26</td>
            <td>44.85</td>
            <td>52.50</td>
            <td>35.83</td>
            <td>78.69</td>
            <td>31.76</td>
            <td>55.35</td>
            <td>60.18</td>
        </tr>
        <tr>
            <th>OLMo-2-1124-7B-Instruct</th>
            <td>7B</td>
            <td>32.5</td>
            <td>80.86</td>
            <td>11.14</td>
            <td>1.25</td>
            <td>0.21</td>
            <td>12.27</td>
            <td>10.30</td>
            <td>8.48</td>
            <td>19.63</td>
        </tr>
        <tr>
            <th>Instella-Math SFT</th>
            <td>3B</td>
            <td>77.55</td>
            <td>88.03</td>
            <td>23.36</td>
            <td>20.00</td>
            <td>18.96</td>
            <td>53.92</td>
            <td>18.82</td>
            <td>43.27</td>
            <td>42.99</td>
        </tr>
        <tr>
            <th>Instella-Math RL Stage 1</th>
            <td>3B</td>
            <td>82.16</td>
            <td>90.90</td>
            <td>34.15</td>
            <td>27.92</td>
            <td>22.50</td>
            <td>58.81</td>
            <td>25.05</td>
            <td>49.23</td>
            <td>48.84</td>
        </tr>
        <tr>
            <th>Instella-Math RL Stage 2</th>
            <td>3B</td>
            <td>85.84</td>
            <td>91.72</td>
            <td>37.37</td>
            <td>29.58</td>
            <td>22.92</td>
            <td>66.72</td>
            <td>27.53</td>
            <td>52.67</td>
            <td>51.79</td>
        </tr>
        <tr>
            <th>Instella-Math RL Stage 3</th>
            <td>3B</td>
            <td>86.49</td>
            <td>92.48</td>
            <td>37.63</td>
            <td>35.63</td>
            <td>27.71</td>
            <td>69.73</td>
            <td>27.67</td>
            <td>53.11</td>
            <td>53.80</td>
        </tr>
    </tbody>
</table>
<em><b>Table 1:</b> Instella-Math evaluation results (<i>Pass@1</i>).</em>
</div>

<div class="table-wrapper" align="center">

<table>
    <thead>
        <tr>
            <th></th>
            <th>oTTT</th>
            <th>dTTT</th>
            <th>cTTT</th>
            <th>sTTT</th>
            <th>Average</th>
        </tr>
    </thead>
    <tbody>
        <tr>
          <th colspan="6">Open Weight Models</th>
        </tr>
        <tr>
            <th>Qwen2.5-Math-1.5B</th>
            <td>12.5</td>
            <td>10.00</td>
            <td>18.89</td>
            <td>7.50</td>
            <td>12.22</td>
        </tr>
        <tr> 
            <th>DeepSeek-R1-Distill-Qwen-1.5B</th>
            <td>22.92</td>
            <td>10.06</td>
            <td>18.19</td>
            <td>3.49</td>
            <td>13.67</td>
        </tr>
        <tr>
            <th>STILL-3-1.5B-preview</th>
            <td>24.51</td>
            <td>12.25</td>
            <td>19.79</td>
            <td>3.18</td>
            <td>14.93</td>
        </tr>
        <tr>
            <th>DeepScaleR-1.5B-Preview</th>
            <td>23.04</td>
            <td>16.50</td>
            <td>22.99</td>
            <td>8.18</td>
            <td>17.68</td>
        </tr>
        <tr>
          <th colspan="6">Fully Open Models</th>
        </tr>
        <tr>
            <th>SmolLM3-3B</th>
            <td>51.22</td>
            <td>40.06</td>
            <td>41.32</td>
            <td>42.34</td>
            <td>43.74</td>
        </tr>
        <tr>
            <th>Instella-Math RL Stage 1</th>
            <td>56.31</td>
            <td>31.37</td>
            <td>39.65</td>
            <td>41.93</td>
            <td>42.32</td>
        </tr>
        <tr>
            <th>Instella-Math RL Stage 2</th>
            <td>66.2</td>
            <td>37.31</td>
            <td>39.17</td>
            <td>44.48</td>
            <td>46.79</td>
        </tr>
        <tr>
            <th>Instella-Math RL Stage 3</th>
            <td>70.25</td>
            <td>39.56</td>
            <td>40.28</td>
            <td>48.96</td>
            <td>49.76</td>
        </tr>
      </tbody>
</table>
<em><b>Table 2:</b> Instella-Math evaluation results on TTT-Bench. We report <i>Pass@1</i>, computed from 16 responses per question.</em>
</div>

- Following the same evaluation setting as DeepScaleR-1.5B, we report Pass@1 accuracy averaged over 16 responses.
- Instella-Math delivers competitive performance when compared to leading small-scale open-weight models such as DeepSeek-R1-Distill-Qwen-1.5B, STILL-3-1.5B-preview, DeepScaleR-1.5B-Preview, and SmolLM3-3B.
- Beyond achieving competitive average performance across all benchmarks, Instella-Math demonstrates the effectiveness of our RL training recipe: it improves over its supervised fine-tuned variant (Instella-Math-SFT) by 10.81 points on average, compared to a 6.22-point improvement of DeepScaleR over its base model (DeepSeek-R1-Distill-Qwen-1.5B).
- Additionally, we test Instella-Math on [TTT-Bench](https://arxiv.org/abs/2506.10209), a new benchmark targeting strategic, spatial, and logical reasoning. Remarkably, without any exposure to TTT-Bench–style or similar strategic gaming data during any stage of training, Instella-Math achieves the best average performance among all evaluated models.
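
The Pass@1 metric and the improvement figures quoted above can be reproduced with a few lines of arithmetic. The sketch below assumes Pass@1 is computed as the fraction of correct responses per question, averaged over questions; `pass_at_1` and the `graded` data are hypothetical and invented for illustration, while the four average scores are copied from Table 1.

```python
from typing import List


def pass_at_1(correct_per_question: List[List[bool]]) -> float:
    """Mean over questions of the fraction of sampled responses that are correct, in percent."""
    per_question = [sum(flags) / len(flags) for flags in correct_per_question]
    return 100.0 * sum(per_question) / len(per_question)


# Hypothetical grading results: 2 questions x 16 sampled responses each.
graded = [
    [True] * 12 + [False] * 4,   # 12/16 correct
    [True] * 4 + [False] * 12,   # 4/16 correct
]
print(f"Pass@1 = {pass_at_1(graded):.2f}")  # (0.75 + 0.25) / 2 -> 50.00

# Average scores copied from Table 1.
instella_sft, instella_rl3 = 42.99, 53.80
r1_distill, deepscaler = 45.76, 51.98
print(f"Instella-Math RL gain: {instella_rl3 - instella_sft:.2f}")  # 10.81
print(f"DeepScaleR gain:       {deepscaler - r1_distill:.2f}")      # 6.22
```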

# Conclusion

The release of the Instella-Math model marks a major step forward in open-source AI, showcasing the potential of reasoning-focused language models and the scalability of AMD hardware for reinforcement learning and fine-tuning. To our knowledge, Instella-Math is the first fully open math reasoning model trained on AMD GPUs. As part of AMD's commitment to open innovation, we’re sharing the full model weights, training setup, codebase, and datasets to foster collaboration, transparency, and progress across the AI community.

We invite researchers, educators, and developers to explore Instella-Math, build on its foundation, and collaborate with us in shaping the next generation of open, interpretable, and high-reasoning language models. 

# Additional Resources 
- Blog: [Introducing Instella-Math: Fully Open Language Model with Reasoning Capability](https://rocm.blogs.amd.com/artificial-intelligence/instella-math-language/README.html)
- Code: [https://github.com/AMD-AIG-AIMA/Instella-Math](https://github.com/AMD-AIG-AIMA/Instella-Math)
- Models:
  - [https://huggingface.co/amd/Instella-3B-Math](https://huggingface.co/amd/Instella-3B-Math)
  - [https://huggingface.co/amd/Instella-3B-Math-SFT](https://huggingface.co/amd/Instella-3B-Math-SFT)


Please refer to the following blogs to get started with using these techniques on AMD GPUs:  

- [Reinforcement Learning from Human Feedback on AMD GPUs with verl and ROCm Integration](https://rocm.blogs.amd.com/artificial-intelligence/verl-large-scale/README.html)
- [PyTorch Fully Sharded Data Parallel (FSDP) on AMD GPUs with ROCm™](https://rocm.blogs.amd.com/artificial-intelligence/fsdp-training-pytorch/README.html)
- [Accelerating Large Language Models with Flash Attention on AMD GPUs](https://rocm.blogs.amd.com/artificial-intelligence/flash-attention/README.html)
- [Accelerate PyTorch Models using torch.compile on AMD GPUs with ROCm™](https://rocm.blogs.amd.com/artificial-intelligence/torch_compile/README.html)

# Bias, Risks, and Limitations 

- The models are being released for research purposes only and are not intended for use cases that require high levels of factuality, safety-critical situations, health or medical applications, generating false information, or facilitating toxic conversations.
- Model checkpoints are made accessible without any safety promises. It is crucial for users to conduct comprehensive evaluations and implement safety filtering mechanisms as per their respective use cases.
- It may be possible to prompt the model to generate content that may be factually inaccurate, harmful, violent, toxic, biased, or otherwise objectionable. Such content may also be generated by prompts that were not intended to produce it. Users are thus requested to be aware of this and exercise caution and responsible thinking when using the model.
- The multilingual abilities of the models have not been tested; the models may therefore misunderstand prompts and generate erroneous responses in languages other than English.

# License 

The [Instella-Math](https://huggingface.co/amd/Instella-3B-Math) model is licensed for academic and research purposes under a ResearchRAIL license. Refer to the [LICENSE](./LICENSE) and [NOTICE](./NOTICE) files for more information.

# Citations

Feel free to cite our Instella models:

```text
@misc{Instella,
    title = {Instella: Fully Open Language Models with Stellar Performance},
    url = {https://huggingface.co/amd/Instella-3B},
    author = {Jiang Liu and Jialian Wu and Xiaodong Yu and Prakamya Mishra and Sudhanshu Ranjan and Zicheng Liu and Chaitanya Manem and Yusheng Su and Pratik Prabhanjan Brahma and Gowtham Ramesh and Ximeng Sun and Ze Wang and Emad Barsoum},
    month = {March},
    year = {2025}
}
```