---
license: apache-2.0
---

<div align="center">
|
<img src="./figures/xbai.png" alt="Logo" width="130" height="130"><br> |
|
</div>
|
|
## News
|
+ **2025.08.01**: We release **XBai o4**, where *o* stands for *open* and *o4* denotes our fourth-generation open-source large-model technology. **XBai o4** excels at complex reasoning and, in Medium mode, now surpasses OpenAI-o3-mini on every benchmark we report. See [GitHub](https://github.com/MetaStone-AI/XBai-o4/) for more information!
|
|
|
## Introduction
|
<img src="./figures/performance.png" alt="Performance compared with OpenAI-o3-mini" width="800"> |
|
|
|
**XBai o4** is trained with our proposed **reflective generative form**, which combines “Long-CoT Reinforcement Learning” and “Process Reward Learning” into a unified training form.

This form enables a single model to achieve deep reasoning and high-quality reasoning-trajectory selection at the same time.

By sharing the backbone network between the process reward model (PRM) and the policy model, XBai o4 cuts the inference cost of the PRM by 99%, resulting in faster and higher-quality responses.
|
|
|
<img src="./figures/intro.jpg" alt="Introduction" width="800"> |
|
|
|
For full details, please refer to our [paper](https://arxiv.org/abs/2507.01951).
|
|
|
|
|
## Performance
|
|
|
| Model | AIME24 | AIME25 | LiveCodeBench v5 | C-EVAL |
|------------------------------|--------|--------|------------------|--------|
| s1-32B | 56.7 | 50.0 | - | - |
| QwQ-32B | 79.5 | 69.5 | 62.7 | 88.4 |
| R1-Distill-Qwen-32B | 72.6 | 49.6 | 54.5 | 82.2 |
| GLM-Z1-32B-0414 | 80.8 | 63.6 | - | - |
| DeepSeek-R1-671B-0120 | 79.8 | 70.0 | 64.3 | **91.8** |
| Claude-3.5-Sonnet-1022 | 16.0 | 7.4 | 40.2 | 76.7 |
| GPT-4o-0513 | 9.3 | 11.6 | 32.3 | - |
| OpenAI-o1-mini | 63.6 | 50.7 | 49.4 | 68.9 |
| OpenAI-o1-1217 | 79.2 | - | 63.9 | - |
| OpenAI-o3-mini-medium | 79.6 | 74.8 | 66.3 | 75.9 |
| Claude Opus 4 | 75.7 | 75.5 | 61.3 | - |
| Qwen3-32B | 81.4 | 72.9 | 65.7 | 87.3 |
| **XBai o4-low** | 82.4 | 74.8 | 66.6 | 89.4 |
| **XBai o4-medium** | <ins>85.4</ins> | <ins>77.6</ins> | <ins>67.0</ins> | 89.5 |
| **XBai o4-high** | **86.5** | **77.9** | **67.2** | <ins>89.7</ins> |

**Bold** marks the best result on each benchmark and <ins>underline</ins> the second best.
|
|
|
## Model
|
|
|
We save the parameters of the policy model and the SPRM head in two files, as sketched in the loading example below:

- "model.safetensors" is the checkpoint of the policy model.
- "score_module.pt" is the checkpoint of the SPRM head.
|
|
|
|
|
You can download XBai o4 below:
|
|
|
| Model | Transformers (HF) | ModelScope |
|---------------|---------|---------|
| XBai o4 | [XBai o4](https://huggingface.co/MetaStoneTec/XBai-o4) | [XBai o4](https://modelscope.cn/models/MetaStoneTec/XBai-o4) |
|
|
|
|
|
## Training & Evaluation
|
Since Hugging Face `transformers` does not directly support SPRM inference, please refer to our [GitHub repository](https://github.com/MetaStone-AI/XBai-o4) for the detailed training and evaluation pipeline.
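
The policy model can still be used on its own as an ordinary causal LM. A minimal generation sketch, assuming the checkpoint ships a standard chat template; score-guided (SPRM) decoding requires the pipeline from the repository:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("MetaStoneTec/XBai-o4")
model = AutoModelForCausalLM.from_pretrained(
    "MetaStoneTec/XBai-o4", torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "How many prime numbers are below 100?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Plain sampling only; trajectory selection with the SPRM head is handled by
# the scripts in the GitHub repository, not by `generate` itself.
output = model.generate(input_ids, max_new_tokens=4096)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```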
|
|
|
|
|
## Citation
|
If you find our work helpful, feel free to cite us:
|
```
@misc{wang2025testtimescalingreflectivegenerative,
      title={Test-Time Scaling with Reflective Generative Model},
      author={Zixiao Wang and Yuxin Wang and Xiaorui Wang and Mengting Xing and Jie Gao and Jianjun Xu and Guangcan Liu and Chenhui Jin and Zhuo Wang and Shengzhuo Zhang and Hongtao Xie},
      year={2025},
      eprint={2507.01951},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.01951},
}
```