---
license: other
license_link: LICENSE
library_name: transformers
pipeline_tag: text-generation
datasets:
  - nvidia/OpenMathInstruct-2
  - a-m-team/AM-DeepSeek-R1-Distilled-1.4M
  - SynthLabsAI/Big-Math-RL-Verified
  - zwhe99/DeepMath-103K
  - agentica-org/DeepScaleR-Preview-Dataset
language:
  - en
base_model:
  - amd/Instella-3B-Instruct
---
<div align="center">
  <br>
  <br>
  <h1>Instella-Math✨: Fully Open Language Model with Reasoning Capability</h1>
<a href='https://huggingface.co/amd/Instella-3B-Math'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue'></a>
<a href='https://rocm.blogs.amd.com/artificial-intelligence/instella-math-language/README.html'><img src='https://img.shields.io/badge/Technical-Blog-red'></a> 
</div>

AMD is thrilled to introduce [Instella-Math](https://huggingface.co/amd/Instella-3B-Math), a reasoning-focused language model that marks a major milestone for AMD: as far as we know, it's **the first language model trained with long chain-of-thought reinforcement learning entirely on AMD GPUs**. Starting from [Instella-3B-Instruct](https://huggingface.co/amd/Instella-3B-Instruct), we extended the model’s capabilities through a multi-stage training pipeline, with two stages of supervised fine-tuning and three stages of reinforcement learning using the [VERL framework](https://rocm.blogs.amd.com/artificial-intelligence/verl-large-scale/README.html), executed entirely on AMD Instinct™ MI300X GPUs.

# Key Takeaways

- Introducing Instella-Math, the first reasoning-centric language model with 3 billion parameters from AMD, fully trained on 32 AMD Instinct MI300X GPUs.
- Built on the AMD ROCm software stack, Instella-3B-Math leverages efficient distributed training techniques, including reinforcement learning across 4 MI300X nodes (8 GPUs each), demonstrating the scalability and performance of AMD hardware for cutting-edge AI workloads.
- Instella-Math is an open language model whose architecture, training code, weights, and datasets are publicly available, allowing anyone to inspect, use, modify, or build upon the model.

# Instella-Math
Derived from [Instella-3B-Instruct](https://huggingface.co/amd/Instella-3B-Instruct) with an identical architecture, Instella-3B-Math is optimized for logical reasoning, mathematical problem-solving, and chain-of-thought tasks. The training pipeline features two stages of supervised fine-tuning followed by three reinforcement learning stages using the GRPO algorithm, as shown in Figure 1.

<div align="center">
<img src="instella_math_pipeline.png" style="object-fit: contain;"/>
<em><b>Figure 1:</b> Instella-Math Training Steps</em>
</div>

## Example Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "amd/Instella-3B-Math"

# trust_remote_code=True allows the model's custom code from the Hub to run.
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", trust_remote_code=True)

# Ask for step-by-step reasoning and a final answer wrapped in \boxed{}.
prompt = [{"role": "user", "content": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? Let's think step by step and output the final answer within \\boxed{}."}]

# Render the conversation with the model's chat template and tokenize it.
inputs = tokenizer.apply_chat_template(
    prompt,
    add_generation_prompt=True,
    return_tensors='pt'
)

# Sample a response; raise max_new_tokens for longer chains of thought.
tokens = model.generate(
    inputs.to(model.device),
    max_new_tokens=1024,
    temperature=0.8,
    do_sample=True
)

# Decode the prompt plus the response, keeping special tokens visible.
print(tokenizer.decode(tokens[0], skip_special_tokens=False))
```
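Because the prompt asks for the final answer inside `\boxed{}`, it is often convenient to pull that answer out of the decoded text. The helper below is a minimal sketch, not part of the Instella-Math codebase; the function name `extract_boxed_answer` is hypothetical, and it assumes the last `\boxed{...}` in the response holds the final answer.

```python
from typing import Optional


def extract_boxed_answer(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    begin = start + len(marker)
    depth = 1  # we are already inside the opening brace of \boxed{
    for i in range(begin, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[begin:i]
    return None  # braces never closed, e.g. a truncated generation


# Hypothetical model output for the prompt above (48 + 24 = 72).
sample = "April: 48 clips, May: 24 clips, so the total is $\\boxed{72}$."
print(extract_boxed_answer(sample))  # prints 72
```

Tracking brace depth lets the helper handle answers such as `\boxed{\frac{1}{2}}`, and returning `None` for unbalanced braces flags responses that were cut off by `max_new_tokens`.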

# Supervised Finetuning (SFT) 

We perform a two-stage supervised fine-tuning process to gradually enhance the reasoning capabilities of the Instella-3B-Instruct model. In the first stage, we use instruction tuning for mathematical coverage. The second stage enables the model to generate in-depth analyses and structured reasoning steps, which are crucial for tackling complex problems like Olympiad-level math questions.

## Stage 1: Instruction Tuning with OpenMathInstruct-2 for Mathematical Coverage

In the first stage of SFT, we begin with instruction tuning, which teaches the model to follow instructions or prompts properly, especially in a question-answer or problem-solution format. We use the [OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2) dataset, which consists of 14 million problem-solution pairs generated from the GSM8K and MATH training sets. The model is trained to follow mathematical prompts covering a diverse range of topics, from arithmetic and algebra to probability and calculus.
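To make the problem-solution format concrete, the sketch below turns one record into a two-turn chat example of the kind used for instruction tuning. It is illustrative only: the field names `problem` and `generated_solution` are assumptions that should be checked against the dataset card, the record is made up, and `to_chat_example` is a hypothetical helper rather than the preprocessing used for Instella-Math.

```python
def to_chat_example(record: dict) -> list:
    """Map a problem-solution record to user/assistant chat messages."""
    return [
        {"role": "user", "content": record["problem"]},
        {"role": "assistant", "content": record["generated_solution"]},
    ]


# Made-up record; field names are assumed, not taken from the source.
record = {
    "problem": "What is 12 * 7?",
    "generated_solution": "12 * 7 = 84, so the answer is \\boxed{84}.",
}

messages = to_chat_example(record)
for message in messages:
    print(message["role"], "->", message["content"])
```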

## Stage 2: Deep Reasoning with Long-Context Training on AM-DeepSeek-R1-Distilled

In the second SFT stage, we further improve the model’s reasoning capability by training on [AM-DeepSeek-R1-Distilled-1.4M](https://huggingface.co/datasets/a-m-team/AM-DeepSeek-R1-Distilled-1.4M), which is a large-scale general reasoning task dataset with high-quality and challenging reasoning problems. In this stage, we increase the context length of the model from 4K to 32K to allow the model to learn from the long chain-of-thought responses distilled from large reasoning models such as DeepSeek-R1. 

# Reinforcement Learning (GRPO)

## Stage 1: GRPO with 8 Rollouts and 8K Output Contexts

**Training:** In the first stage of reinforcement learning, we apply the Group Relative Policy Optimization (GRPO) algorithm to train the model on [Big-Math-RL-Verified](https://huggingface.co/datasets/SynthLabsAI/Big-Math-RL-Verified), a curated set of complex multi-step math problems. We generate **8 rollouts per prompt**, each allowing up to **8K output tokens**, to explore diverse reasoning trajectories. The model is trained for **1,200 GRPO steps**, using rule-based reward signals designed by Prime-RL that favor correct solutions in the desired format. Training is distributed over **16 MI300X GPUs across 2 nodes**, with VERL and vLLM enabling stable and efficient rollout collection, reward evaluation, and policy updates.
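The core idea of GRPO is that each rollout is scored relative to the other rollouts sampled for the same prompt, so no separate value model is needed. The sketch below shows the commonly described form of that group-relative advantage, reward minus group mean divided by group standard deviation. It is a simplified illustration, not the VERL implementation: the function name, the `eps` constant, the use of the population standard deviation, and the binary 0/1 rewards are all assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rule-based rewards for 8 rollouts of one prompt:
# 1.0 = correct answer in the expected format, 0.0 = otherwise.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f}  advantage={advantage:+.3f}")
```

Correct rollouts receive positive advantages and incorrect ones negative advantages. When every rollout in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal, which is one reason sampling several rollouts per prompt matters.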

## Stage 2: GRPO with Extended 16 Rollouts and 16K Output Contexts on DeepMath

**Training:** To push the limits of long-form reasoning, we conduct a second GRPO stage on [DeepMath](https://huggingface.co/datasets/zwhe99/DeepMath-103K) using **16 rollouts per prompt** with up to **16K output tokens**. This stage is designed to maximize the model's capacity for deep mathematical reasoning, enabling it to solve problems that require extended derivations, multiple nested logical steps, or structured proof-like outputs. In this stage, training is distributed over **32 MI300X GPUs across 4 nodes**, and the model is trained for **600 GRPO steps**.

## Stage 3: GRPO with Extended 16 Rollouts and 16K Output Contexts on DeepScaleR

**Training:** To further improve performance on Olympiad-level math questions, we conduct a third GRPO stage on [DeepScaleR](https://huggingface.co/datasets/agentica-org/DeepScaleR-Preview-Dataset), which contains original questions from real Olympiad math competitions like AIME (1984-2023) and AMC (prior to 2023). As in Stage 2, Stage 3 training uses **16 rollouts per prompt** with up to **16K output tokens**. In this stage, training is distributed over **32 MI300X GPUs across 4 nodes**, and the model is trained for **740 GRPO steps**.
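For a side-by-side view, the three GRPO stages can be collected into a small data structure. The values below are taken from the stage descriptions above; the `GRPOStage` class is a hypothetical container for illustration, and reading "8K" and "16K" as 8 × 1024 and 16 × 1024 tokens is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GRPOStage:
    name: str
    dataset: str
    rollouts_per_prompt: int
    max_output_tokens: int  # assumes "K" means 1024 tokens
    gpus: int
    nodes: int
    steps: int


STAGES = [
    GRPOStage("Stage 1", "SynthLabsAI/Big-Math-RL-Verified", 8, 8 * 1024, 16, 2, 1200),
    GRPOStage("Stage 2", "zwhe99/DeepMath-103K", 16, 16 * 1024, 32, 4, 600),
    GRPOStage("Stage 3", "agentica-org/DeepScaleR-Preview-Dataset", 16, 16 * 1024, 32, 4, 740),
]

for stage in STAGES:
    print(f"{stage.name}: {stage.rollouts_per_prompt} rollouts, "
          f"{stage.max_output_tokens} tokens, "
          f"{stage.gpus} GPUs on {stage.nodes} nodes, {stage.steps} steps")

total_steps = sum(stage.steps for stage in STAGES)
print("Total GRPO steps:", total_steps)  # 1200 + 600 + 740 = 2540
```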

# Results 
<div class="table-wrapper" align="center">

<table>
    <thead>
        <tr>
            <th></th>
            <th>Size</th>
            <th>MATH 500</th>
            <th>GSM8K</th>
            <th>GPQA-D</th>
            <th>AIME 2024</th>
            <th>AIME 2025</th>
            <th>AMC</th>
            <th>Minerva</th>
            <th>OlympiadBench</th>
            <th>Average</th>
        </tr>
    </thead>
    <tbody>
        <tr>
          <th colspan="11">Open Weight Models</th>
        </tr>
        <tr>
            <th>Qwen2.5-Math-1.5B</th>
            <td>1.5B</td>
            <td>57.81</td>
            <td>66.31</td>
            <td>15.40</td>
            <td>7.71</td>
            <td>3.96</td>
            <td>35.77</td>
            <td>15.72</td>
            <td>25.98</td>
            <td>28.58</td>
        </tr>
        <tr>
            <th>DeepSeek-R1-Distill-Qwen-1.5B</th>
            <td>1.5B</td>
            <td>82.58</td>
            <td>84.06</td>
            <td>16.48</td>
            <td>27.50</td>
            <td>22.50</td>
            <td>63.48</td>
            <td>26.52</td>
            <td>43.00</td>
            <td>45.76</td>
        </tr>
        <tr>
            <th>STILL-3-1.5B-preview</th>
            <td>1.5B</td>
            <td>84.59</td>
            <td>86.57</td>
            <td>19.48</td>
            <td>30.63</td>
            <td>25.21</td>
            <td>66.72</td>
            <td>28.58</td>
            <td>45.29</td>
            <td>48.38</td>
        </tr>
        <tr>
            <th>DeepScaleR-1.5B-Preview</th>
            <td>1.5B</td>
            <td>87.43</td>
            <td>87.34</td>
            <td>16.45</td>
            <td>40.63</td>
            <td>30.83</td>
            <td>73.19</td>
            <td>30.06</td>
            <td>49.89</td>
            <td>51.98</td>
        </tr>
        <tr>
          <th colspan="11">Fully Open Models</th>
        </tr>
        <tr>
            <th>SmolLM3-3B</th>
            <td>3B</td>
            <td>90.16</td>
            <td>92.26</td>
            <td>44.85</td>
            <td>52.50</td>
            <td>35.83</td>
            <td>78.69</td>
            <td>31.76</td>
            <td>55.35</td>
            <td>60.18</td>
        </tr>
        <tr>
            <th>OLMo-2-1124-7B-Instruct</th>
            <td>7B</td>
            <td>32.5</td>
            <td>80.86</td>
            <td>11.14</td>
            <td>1.25</td>
            <td>0.21</td>
            <td>12.27</td>
            <td>10.30</td>
            <td>8.48</td>
            <td>19.63</td>
        </tr>
        <tr>
            <th>Instella-Math SFT</th>
            <td>3B</td>
            <td>77.55</td>
            <td>88.03</td>
            <td>23.36</td>
            <td>20.00</td>
            <td>18.96</td>
            <td>53.92</td>
            <td>18.82</td>
            <td>43.27</td>
            <td>42.99</td>
        </tr>
        <tr>
            <th>Instella-Math RL Stage 1</th>
            <td>3B</td>
            <td>82.16</td>
            <td>90.90</td>
            <td>34.15</td>
            <td>27.92</td>
            <td>22.50</td>
            <td>58.81</td>
            <td>25.05</td>
            <td>49.23</td>
            <td>48.84</td>
        </tr>
        <tr>
            <th>Instella-Math RL Stage 2</th>
            <td>3B</td>
            <td>85.84</td>
            <td>91.72</td>
            <td>37.37</td>
            <td>29.58</td>
            <td>22.92</td>
            <td>66.72</td>
            <td>27.53</td>
            <td>52.67</td>
            <td>51.79</td>
        </tr>
        <tr>
            <th>Instella-Math RL Stage 3</th>
            <td>3B</td>
            <td>86.49</td>
            <td>92.48</td>
            <td>37.63</td>
            <td>35.63</td>
            <td>27.71</td>
            <td>69.73</td>
            <td>27.67</td>
            <td>53.11</td>
            <td>53.80</td>
        </tr>
    </tbody>
</table>
<em><b>Table 1:</b> Instella-Math evaluation results (<i>Pass@1</i>).</em>
</div>

<div class="table-wrapper" align="center">

<table>
    <thead>
        <tr>
            <th></th>
            <th>oTTT</th>
            <th>dTTT</th>
            <th>cTTT</th>
            <th>sTTT</th>
            <th>Average</th>
        </tr>
    </thead>
    <tbody>
        <tr>
          <th colspan="6">Open Weight Models</th>
        </tr>
        <tr>
            <th>Qwen2.5-Math-1.5B</th>
            <td>12.5</td>
            <td>10.00</td>
            <td>18.89</td>
            <td>7.50</td>
            <td>12.22</td>
        </tr>
        <tr> 
            <th>DeepSeek-R1-Distill-Qwen-1.5B</th>
            <td>22.92</td>
            <td>10.06</td>
            <td>18.19</td>
            <td>3.49</td>
            <td>13.67</td>
        </tr>
        <tr>
            <th>STILL-3-1.5B-preview</th>
            <td>24.51</td>
            <td>12.25</td>
            <td>19.79</td>
            <td>3.18</td>
            <td>14.93</td>
        </tr>
        <tr>
            <th>DeepScaleR-1.5B-Preview</th>
            <td>23.04</td>
            <td>16.50</td>
            <td>22.99</td>
            <td>8.18</td>
            <td>17.68</td>
        </tr>
        <tr>
          <th colspan="6">Fully Open Models</th>
        </tr>
        <tr>
            <th>SmolLM3-3B</th>
            <td>51.22</td>
            <td>40.06</td>
            <td>41.32</td>
            <td>42.34</td>
            <td>43.74</td>
        </tr>
        <tr>
            <th>Instella-Math RL Stage 1</th>
            <td>56.31</td>
            <td>31.37</td>
            <td>39.65</td>
            <td>41.93</td>
            <td>42.32</td>
        </tr>
        <tr>
            <th>Instella-Math RL Stage 2</th>
            <td>66.2</td>
            <td>37.31</td>
            <td>39.17</td>
            <td>44.48</td>
            <td>46.79</td>
        </tr>
        <tr>
            <th>Instella-Math RL Stage 3</th>
            <td>70.25</td>
            <td>39.56</td>
            <td>40.28</td>
            <td>48.96</td>
            <td>49.76</td>
        </tr>
      </tbody>
</table>
<em><b>Table 2:</b> Instella-Math evaluation results on TTT-Bench. We report <i>Pass@1</i> averaged over 16 responses per question.</em>
</div>

- Following the same evaluation setting as DeepScaleR-1.5B, we report Pass@1 accuracy averaged over 16 responses.
- Instella-Math delivers competitive performance when compared to leading small-scale open-weight models such as DeepSeek-R1-Distill-Qwen-1.5B, STILL-3-1.5B-preview, DeepScaleR-1.5B-Preview, and SmolLM3-3B.
- Beyond achieving competitive average performance across all benchmarks, Instella-Math demonstrates the effectiveness of our RL training recipe: it improves over its supervised fine-tuned variant (Instella-Math-SFT) by 10.81 points on the Table 1 average (42.99 to 53.80), compared to a 6.22-point improvement for DeepScaleR-1.5B-Preview over its base model, DeepSeek-R1-Distill-Qwen-1.5B (45.76 to 51.98).
- Additionally, we test Instella-Math on [TTT-Bench](https://arxiv.org/abs/2506.10209), a new benchmark targeting strategic, spatial, and logical reasoning. Remarkably, without any exposure to TTT-Bench–style or similar strategic gaming data during any stage of training, Instella-Math achieves the best average score among all evaluated models (49.76), leading on oTTT and sTTT.
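As a rough illustration of the metric, Pass@1 averaged over several responses can be computed as the fraction of correct responses per question, averaged across questions. The sketch below uses made-up correctness flags and a hypothetical `pass_at_1` helper; it is not the evaluation harness used for the tables. It also re-derives the two improvement figures quoted above from the Table 1 averages.

```python
def pass_at_1(correct_per_question):
    """Mean per-question accuracy over sampled responses, as a percentage."""
    per_question = [sum(flags) / len(flags) for flags in correct_per_question]
    return 100 * sum(per_question) / len(per_question)


# Made-up results: 2 questions, 16 sampled responses each (True = correct).
question_a = [True] * 12 + [False] * 4   # 12/16 correct
question_b = [True] * 4 + [False] * 12   # 4/16 correct
score = pass_at_1([question_a, question_b])
print(score)  # prints 50.0

# Improvements computed from the "Average" column of Table 1.
instella_gain = round(53.80 - 42.99, 2)    # RL Stage 3 vs. Instella-Math SFT
deepscaler_gain = round(51.98 - 45.76, 2)  # DeepScaleR-1.5B-Preview vs. its base model
print(instella_gain, deepscaler_gain)  # prints 10.81 6.22
```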

# Conclusion

The release of the Instella-Math model marks a major step forward in open-source AI, showcasing the potential of reasoning-focused language models and the scalability of AMD hardware for reinforcement learning and fine-tuning. To our knowledge, Instella-Math is the first fully open math reasoning model trained on AMD GPUs. As part of AMD's commitment to open innovation, we’re sharing the full model weights, training setup, codebase, and datasets to foster collaboration, transparency, and progress across the AI community.

We invite researchers, educators, and developers to explore Instella-Math, build on its foundation, and collaborate with us in shaping the next generation of open, interpretable, and high-reasoning language models. 

# Additional Resources 
- Blog: [Introducing Instella-Math: Fully Open Language Model with Reasoning Capability](https://rocm.blogs.amd.com/artificial-intelligence/instella-math-language/README.html)
- Code: [https://github.com/AMD-AIG-AIMA/Instella-Math](https://github.com/AMD-AIG-AIMA/Instella-Math)
- Models:
  - [https://huggingface.co/amd/Instella-3B-Math](https://huggingface.co/amd/Instella-3B-Math)
  - [https://huggingface.co/amd/Instella-3B-Math-SFT](https://huggingface.co/amd/Instella-3B-Math-SFT)


Please refer to the following blogs to get started with using these techniques on AMD GPUs:  

- [Reinforcement Learning from Human Feedback on AMD GPUs with verl and ROCm Integration](https://rocm.blogs.amd.com/artificial-intelligence/verl-large-scale/README.html)
- [PyTorch Fully Sharded Data Parallel (FSDP) on AMD GPUs with ROCm™](https://rocm.blogs.amd.com/artificial-intelligence/fsdp-training-pytorch/README.html)
- [Accelerating Large Language Models with Flash Attention on AMD GPUs](https://rocm.blogs.amd.com/artificial-intelligence/flash-attention/README.html)
- [Accelerate PyTorch Models using torch.compile on AMD GPUs with ROCm™](https://rocm.blogs.amd.com/artificial-intelligence/torch_compile/README.html)

# Bias, Risks, and Limitations 

- The models are being released for research purposes only. They are not intended for use cases that require high levels of factuality, for safety-critical situations, or for health or medical applications, and they must not be used to generate false information or facilitate toxic conversations.
- Model checkpoints are made accessible without any safety promises. It is crucial for users to conduct comprehensive evaluations and implement safety filtering mechanisms as per their respective use cases. 
- It may be possible to prompt the model to generate content that is factually inaccurate, harmful, violent, toxic, biased, or otherwise objectionable. Such content may also be generated by prompts that were not intended to produce it. Users are requested to be aware of this and to exercise caution and responsible thinking when using the model.
- The multilingual abilities of the models have not been tested, so the models may misunderstand prompts and generate erroneous responses in languages other than English.

# License 

The [Instella-Math](https://huggingface.co/amd/Instella-3B-Math) model is licensed for academic and research purposes under a ResearchRAIL license. Refer to the [LICENSE](./LICENSE) and [NOTICE](./NOTICE) files for more information.

# Citations

Feel free to cite our Instella models:

```text
@misc{Instella,
    title = {Instella: Fully Open Language Models with Stellar Performance},
    url = {https://huggingface.co/amd/Instella-3B},
    author = {Jiang Liu, Jialian Wu, Xiaodong Yu, Prakamya Mishra, Sudhanshu Ranjan, Zicheng Liu, Chaitanya Manem, Yusheng Su, Pratik Prabhanjan Brahma, Gowtham Ramesh, Ximeng Sun, Ze Wang, Emad Barsoum},
    month = {March},
    year = {2025}
}
```