amd
/

Instella-3B-Math-SFT

@@ -1,199 +1,366 @@
 ---
 library_name: transformers
-tags: []
 ---
-# Model Card for Model ID
-<!-- Provide a quick summary of what the model is/does. -->
-## Model Details
-### Model Description
-<!-- Provide a longer summary of what this model is. -->
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-### Model Sources [optional]
-<!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
-## Uses
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-### Direct Use
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
-### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
-## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
-### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
-## Training Details
-### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
-[More Information Needed]
-#### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
-## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
-### Results
-[More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
-## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
-#### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
-## Model Card Contact
-[More Information Needed]

 ---
+license: other
+license_link: LICENSE
 library_name: transformers
+pipeline_tag: text-generation
+datasets:
+  - nvidia/OpenMathInstruct-2
+  - a-m-team/AM-DeepSeek-R1-Distilled-1.4M
+  - SynthLabsAI/Big-Math-RL-Verified
+  - zwhe99/DeepMath-103K
+  - agentica-org/DeepScaleR-Preview-Dataset
+language:
+  - en
+base_model:
+  - amd/Instella-3B-Instruct
 ---
+<div align="center">
+  <br>
+  <br>
+  <h1>Instella-Math✨: Fully Open Language Model with Reasoning Capability</h1>
+<a href='https://huggingface.co/amd/Instella-3B-Math'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue'></a>
+<a href='https://rocm.blogs.amd.com/artificial-intelligence/instella-math-language/README.html'><img src='https://img.shields.io/badge/Technical-Blog-red'></a>
+</div>
+AMD is thrilled to introduce [Instella-Math](https://huggingface.co/amd/Instella-3B-Math), a reasoning-focused language model that marks a major milestone for AMD: as far as we know, it's **the first language model trained with long chain-of-thought reinforcement learning entirely on AMD GPUs**. Starting from [Instella-3B-Instruct](https://huggingface.co/amd/Instella-3B-Instruct), we extended the model’s capabilities through a multi-stage training pipeline—featuring two stages of supervised fine-tuning and three stages of reinforcement learning using the [VERL framework](https://rocm.blogs.amd.com/artificial-intelligence/verl-large-scale/README.html) —executed entirely on AMD Instinct™ MI300X GPUs.
+# Key Takeaways
+- Introducing Instella-Math — first reasoning-centric language model with 3 billion parameters from AMD, fully trained on 32 AMD Instinct MI300X GPUs.
+- Built on the AMD ROCm software stack, Instella-3B-Math leverages efficient distributed training techniques, including reinforcement learning across 4 MI300X nodes (8 GPUs each), demonstrating the scalability and performance of AMD hardware for cutting-edge AI workloads.
+- Instella-Math is an open language model whose architecture, training code, weights, and datasets are publicly available, allowing anyone to inspect, use, modify, or build upon the model.
+# Instella-Math
+Derived from [Instella-3B-Instruct](https://huggingface.co/amd/Instella-3B-Instruct) with an identical architecture, Instella-3B-Math is optimized for logical reasoning, mathematical problem-solving, and chain-of-thought tasks. The training pipeline features two stages of supervised fine-tuning followed by three reinforcement learning stages using the GRPO algorithm, as shown in figure 1.
+<div align="center">
+<img src="instella_math_pipeline.png" style="object-fit: contain;"/>
+<em><b>Figure 1:</b> Instella-Math Training Steps</em>
+</div>
+# Supervised Finetuning (SFT)
+We perform a two-stage supervised fine-tuning process to gradually enhance the reasoning capabilities of the Instella-3B-Instruct model. The first stage we use instruction tuning for mathematical coverage. The second stage enables the model to generate in-depth analyses and structured reasoning steps, which are crucial for tackling complex problems like Olympiad-level math questions.
+## Stage 1: Instruction Tuning with OpenMathInstruct-2 for Mathematical Coverage
+In the first stage of SFT, we begin with instruction tuning, following instructions or prompts properly, especially in a question-answer or problem-solution format. Using the [OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2) dataset, which consists of 14 million problem-solution pairs generated from the GSM8K and MATH training sets. The model is trained to follow mathematical prompts covering a diverse range of topics from arithmetic and algebra to probability and calculus.
+## Stage 2: Deep Reasoning with Long-Context Training on AM-DeepSeek-R1-Distilled
+In the second SFT stage, we further improve the model’s reasoning capability by training on [AM-DeepSeek-R1-Distilled-1.4M](https://huggingface.co/datasets/a-m-team/AM-DeepSeek-R1-Distilled-1.4M), which is a large-scale general reasoning task dataset with high-quality and challenging reasoning problems. In this stage, we increase the context length of the model from 4K to 32K to allow the model to learn from the long chain-of-thought responses distilled from large reasoning models such as DeepSeek-R1.
+# Reinforcement Learning (GRPO)
+## Stage 1: GRPO with 8 Rollouts and 8K Output Contexts
+**Training:** In the first stage of reinforcement learning, we apply the Group Relative Policy Optimization (GRPO) algorithm to train the model on [Big-Math-RL-Verified](https://huggingface.co/datasets/SynthLabsAI/Big-Math-RL-Verified), a curated set of complex multi-step math problems. We generate **8 rollouts per prompt**, each allowing up to **8K output tokens**, to explore diverse reasoning trajectories. The model is trained for **1,200 GRPO steps**, using ruled-based reward signals designed by Prime-RL that favor correctness of solutions in the desired format. Training is distributed over **16 MI300X GPUs across 2 nodes**, with VERL and VLLM enabling stable and efficient rollout collection, reward evaluation, and policy updates.
+## Stage 2: GRPO with Extended 16 Rollouts and 16K Output Contexts on DeepMath
+**Training:** To push the limits of long-form reasoning, we conduct a second GRPO stage on [DeepMath](https://huggingface.co/datasets/zwhe99/DeepMath-103K) using **16 rollouts per prompt** with up to **16K output tokens**. This stage is designed to maximize the model's capacity for deep mathematical reasoning, enabling it to solve problems that require extended derivations, multiple nested logical steps, or structured proof-like outputs. In this stage, training is distributed over **32 MI300X GPUs across 4 nodes**, and the model is trained for **600 GRPO steps**.
+## Stage 3: GRPO with Extended 16 Rollouts and 16K Output Contexts on DeepScaleR
+Training: To further improve the performance on Olympiad-level math questions, we conduct a third GRPO stage on [DeepScaleR](https://huggingface.co/datasets/agentica-org/DeepScaleR-Preview-Dataset), which contains original questions from real Olympiad math competitions like AIME (1984-2023) and AMC (prior to 2023). Same as Stage 2, Stage 3 training uses **16 rollouts per prompt** with up to **16K output tokens**.  In this stage, training is distributed over **32 MI300X GPUs across 4 nodes**, and the model is trained for **740 GRPO steps**.
+# Results
+<div class="table-wrapper" align="center">
+<table>
+    <thead>
+        <tr>
+            <th></th>
+            <th>Size</th>
+            <th>MATH 500</th>
+            <th>GSM8K</th>
+            <th>GPQA-D</th>
+            <th>AIME 2024</th>
+            <th>AIME 2025</th>
+            <th>AMC</th>
+            <th>Minerva</th>
+            <th>OlympiadBench</th>
+            <th>Average</th>
+        </tr>
+    </thead>
+    <tbody>
+        <tr>
+          <th colspan="11">Open Weight Models</th>
+        </tr>
+        <tr>
+            <th>Qwen2.5-Math-1.5B</th>
+            <td>1.5B</td>
+            <td>57.81</td>
+            <td>66.31</td>
+            <td>15.40</td>
+            <td>7.71</td>
+            <td>3.96</td>
+            <td>35.77</td>
+            <td>15.72</td>
+            <td>25.98</td>
+            <td>28.58</td>
+        </tr>
+        <tr>
+            <th>DeepSeek-R1-Distill-Qwen-1.5B</th>
+            <td>1.5B</td>
+            <td>82.58</td>
+            <td>84.06</td>
+            <td>16.48</td>
+            <td>27.50</td>
+            <td>22.50</td>
+            <td>63.48</td>
+            <td>26.52</td>
+            <td>43.00</td>
+            <td>45.76</td>
+        </tr>
+        <tr>
+            <th>STILL-3-1.5B-preview</th>
+            <td>1.5B</td>
+            <td>84.59</td>
+            <td>86.57</td>
+            <td>19.48</td>
+            <td>30.63</td>
+            <td>25.21</td>
+            <td>66.72</td>
+            <td>28.58</td>
+            <td>45.29</td>
+            <td>48.38</td>
+        </tr>
+        <tr>
+            <th>DeepScaleR-1.5B-Preview</th>
+            <td>1.5B</td>
+            <td>87.43</td>
+            <td>87.34</td>
+            <td>16.45</td>
+            <td>40.63</td>
+            <td>30.83</td>
+            <td>73.19</td>
+            <td>30.06</td>
+            <td>49.89</td>
+            <td>51.98</td>
+        </tr>
+        <tr>
+          <th colspan="11">Fully Open Models</th>
+        </tr>
+        <tr>
+            <th>SmolLM3-3B</th>
+            <td>3B</td>
+            <td>90.16</td>
+            <td>92.26</td>
+            <td>44.85</td>
+            <td>52.50</td>
+            <td>35.83</td>
+            <td>78.69</td>
+            <td>31.76</td>
+            <td>55.35</td>
+            <td>60.18</td>
+        </tr>
+        <tr>
+            <th>OLMo-2-1124-7B-Instruct</th>
+            <td>7B</td>
+            <td>32.5</td>
+            <td>80.86</td>
+            <td>11.14</td>
+            <td>1.25</td>
+            <td>0.21</td>
+            <td>12.27</td>
+            <td>10.30</td>
+            <td>8.48</td>
+            <td>19.63</td>
+        </tr>
+        <tr>
+            <th>Instella-Math SFT</th>
+            <td>3B</td>
+            <td>77.55</td>
+            <td>88.03</td>
+            <td>23.36</td>
+            <td>20.00</td>
+            <td>18.96</td>
+            <td>53.92</td>
+            <td>18.82</td>
+            <td>43.27</td>
+            <td>42.99</td>
+        </tr>
+        <tr>
+            <th>Instella-Math RL Stage 1</th>
+            <td>3B</td>
+            <td>82.16</td>
+            <td>90.90</td>
+            <td>34.15</td>
+            <td>27.92</td>
+            <td>22.50</td>
+            <td>58.81</td>
+            <td>25.05</td>
+            <td>49.23</td>
+            <td>48.84</td>
+        </tr>
+        <tr>
+            <th>Instella-Math RL Stage 2</th>
+            <td>3B</td>
+            <td>85.84</td>
+            <td>91.72</td>
+            <td>37.37</td>
+            <td>29.58</td>
+            <td>22.92</td>
+            <td>66.72</td>
+            <td>27.53</td>
+            <td>52.67</td>
+            <td>51.79</td>
+        </tr>
+        <tr>
+            <th>Instella-Math RL Stage 3</th>
+            <td>3B</td>
+            <td>86.49</td>
+            <td>92.48</td>
+            <td>37.63</td>
+            <td>35.63</td>
+            <td>27.71</td>
+            <td>69.73</td>
+            <td>27.67</td>
+            <td>53.11</td>
+            <td>53.80</td>
+        </tr>
+    </tbody>
+</table>
+<em><b>Table 1:</b> Instella-Math evaluation results (<i>Pass@1</i>).</em>
+</div>
+<div class="table-wrapper" align="center">
+<table>
+    <thead>
+        <tr>
+            <th></th>
+            <th>oTTT</th>
+            <th>dTTT</th>
+            <th>cTTT</th>
+            <th>sTTT</th>
+            <th>Average</th>
+        </tr>
+    </thead>
+    <tbody>
+        <tr>
+          <th colspan="11">Open Weight Models</th>
+        </tr>
+        <tr>
+            <th>Qwen2.5-Math-1.5B</th>
+            <td>12.5</td>
+            <td>10.00</td>
+            <td>18.89</td>
+            <td>7.50</td>
+            <td>12.22</td>
+        </tr>
+        <tr>
+            <th>DeepSeek-R1-Distill-Qwen-1.5B</th>
+            <td>22.92</td>
+            <td>10.06</td>
+            <td>18.19</td>
+            <td>3.49</td>
+            <td>13.67</td>
+        </tr>
+        <tr>
+            <th>STILL-3-1.5B-preview</th>
+            <td>24.51</td>
+            <td>12.25</td>
+            <td>19.79</td>
+            <td>3.18</td>
+            <td>14.93</td>
+        </tr>
+        <tr>
+            <th>DeepScaleR-1.5B-Preview</th>
+            <td>23.04</td>
+            <td>16.50</td>
+            <td>22.99</td>
+            <td>8.18</td>
+            <td>17.68</td>
+        </tr>
+        <tr>
+          <th colspan="11">Fully Open Models</th>
+        </tr>
+        <tr>
+            <th>SmolLM3-3B</th>
+            <td>51.22</td>
+            <td>40.06</td>
+            <td>41.32</td>
+            <td>42.34</td>
+            <td>43.74 </td>
+        </tr>
+        <tr>
+            <th>Instella-Math RL Stage 1</th>
+            <td>56.31</td>
+            <td>31.37</td>
+            <td>39.65</td>
+            <td>41.93</td>
+            <td>42.32 </td>
+        </tr>
+        <tr>
+            <th>Instella-Math RL Stage 2</th>
+            <td>66.2</td>
+            <td>37.31</td>
+            <td>39.17</td>
+            <td>44.48</td>
+            <td>46.79</td>
+        </tr>
+        <tr>
+            <th>Instella-Math RL Stage 3</th>
+            <td>70.25</td>
+            <td>39.56</td>
+            <td>40.28</td>
+            <td>48.96</td>
+            <td>49.76</td>
+        </tr>
+      </tbody>
+</table>
+<em><b>Table 2:</b> Instella-Math evaluation results on TTT-Bench. Here, we report <i>Pass@1</i> that is calculated based on 16 responses per question.</em>
+</div>
+- Following the same evaluation setting as DeepScaleR-1.5B, we report Pass@1 accuracy averaged over 16 responses.
+- Instella-Math delivers competitive performance when compared to leading small-scale open-weight models such as Deepseek-R1-Distilled-Qwen-1.5B, Still-3-1.5B, DeepScaleR-1.5B, and SmolLM3-3B.
+- Beyond achieving competitive average performance across all benchmarks, Instella-Math demonstrates the effectiveness of our RL training recipe—improving over its supervised finetuned variant (Instella-Math-SFT) by 10.81 points, compared to a 6.22-point improvement seen in DeepScaleR over its base model (Deepseek-R1-Distilled-Qwen-1.5B).
+- Additionally, we test Instella-Math on [TTT-Bench](https://arxiv.org/abs/2506.10209), a new benchmark targeting strategic, spatial, and logical reasoning. Remarkably, without any exposure to TTT-Bench–style or similar strategic gaming data during any stage of training, Instella-Math achieves the best performance among all evaluated models.
+# Conclusion
+The release of the Instella-Math model marks a major step forward in open-source AI, showcasing the potential of reasoning-focused language models and the scalability of AMD hardware for reinforcement learning and fine-tuning. To our knowledge, Instella-Math is the fully open math reasoning model that is trained on AMD GPUs. As part of AMD's commitment to open innovation, we’re sharing the full model weights, training setup, codebase, and datasets to foster collaboration, transparency, and progress across the AI community.
+We invite researchers, educators, and developers to explore Instella-Math, build on its foundation, and collaborate with us in shaping the next generation of open, interpretable, and high-reasoning language models.
+# Additional Resources
+- Blog: [Introducing Instella-Math: Fully Open Language Model with Reasoning Capability](https://rocm.blogs.amd.com/artificial-intelligence/instella-math-language/README.html)
+- Code: [https://github.com/AMD-AIG-AIMA/Instella-Math](https://github.com/AMD-AIG-AIMA/Instella-Math)
+- Models:
+  - [https://huggingface.co/amd/Instella-3B-Math](https://huggingface.co/amd/Instella-3B-Math)
+  - [https://huggingface.co/amd/Instella-3B-Math-SFT](https://huggingface.co/amd/Instella-3B-Math-SFT)
+Please refer to the following blogs to get started with using these techniques on AMD GPUs:
+- [Reinforcement Learning from Human Feedback on AMD GPUs with verl and ROCm Integration](https://rocm.blogs.amd.com/artificial-intelligence/verl-large-scale/README.html)
+- [PyTorch Fully Sharded Data Parallel (FSDP) on AMD GPUs with ROCm™](https://rocm.blogs.amd.com/artificial-intelligence/fsdp-training-pytorch/README.html)
+- [Accelerating Large Language Models with Flash Attention on AMD GPUs](https://rocm.blogs.amd.com/artificial-intelligence/flash-attention/README.html)
+- [Accelerate PyTorch Models using torch.compile on AMD GPUs with ROCm™](https://rocm.blogs.amd.com/artificial-intelligence/torch_compile/README.html)
+# Bias, Risks, and Limitations
+- The models are being released for research purposes only and are not intended for use cases that require high levels of factuality, safety critical situations, health, or medical applications, generating false information, facilitating toxic conversations.
+- Model checkpoints are made accessible without any safety promises. It is crucial for users to conduct comprehensive evaluations and implement safety filtering mechanisms as per their respective use cases.
+- It may be possible to prompt the model to generate content that may be factually inaccurate, harmful, violent, toxic, biased, or otherwise objectionable. Such content may also get generated by prompts that did not intend to produce output as such. Users are thus requested to be aware of this and exercise caution and responsible thinking when using the model.
+- Multi-lingual abilities of the models have not been tested and thus may misunderstand and generate erroneous responses across different languages.
+# License
+The [Instella-Math](https://huggingface.co/amd/Instella-3B-Math) model is licensed for academic and research purposes under a ResearchRAIL license. Refer to the [LICENSE](./LICENSE) and [NOTICE](./NOTICE) files for more information.
+## Citations
+Feel free to cite our Instella models:
+```text
+@misc{Instella,
+    title = {Instella: Fully Open Language Models with Stellar Performance},
+    url = {https://huggingface.co/amd/Instella-3B},
+    author = {Jiang Liu, Jialian Wu, Xiaodong Yu, Prakamya Mishra, Sudhanshu Ranjan, Zicheng Liu, Chaitanya Manem, Yusheng Su, Pratik Prabhanjan Brahma, Gowtham Ramesh, Ximeng Sun, Ze Wang, Emad Barsoum},
+    month = {March},
+    year = {2025}
+}
+```