Update README.md
README.md CHANGED
@@ -11,7 +11,7 @@

 [💻 Github](https://github.com/InternLM/POLAR) |
-[📜 Paper](https://arxiv.org/abs/
+[📜 Paper](https://arxiv.org/abs/2507.05197)<br>

 [English](./README.md) |
 [简体中文](./README_zh-CN.md)
@@ -37,7 +37,7 @@ POLAR represents a significant breakthrough in scalar-based reward models achiev

 **POLAR-1.8B-Base** refers to the pre-trained-only checkpoint, ideal for customized fine-tuning according to specific preferences. The "ready-to-use" checkpoint **POLAR-1.8B** has already been fine-tuned on general preference data, making it suitable for immediate use in most scenarios.

-We conducted a comprehensive evaluation of POLAR-1.8B via the Proximal Policy Optimization (PPO) algorithm. We evaluate the downstream RL performances of four different policy models using [OpenCompass](https://github.com/internLM/OpenCompass/). More details are available in our [Paper](https://arxiv.org/abs/
+We conducted a comprehensive evaluation of POLAR-1.8B via the Proximal Policy Optimization (PPO) algorithm. We evaluated the downstream RL performance of four different policy models using [OpenCompass](https://github.com/internLM/OpenCompass/). More details are available in our [Paper](https://arxiv.org/abs/2507.05197).

 <img src="./misc/result.png"/><br>
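For readers unfamiliar with how a scalar reward model drives PPO fine-tuning, here is a minimal conceptual sketch, not code from this repository: each RL prompt carries a reference trajectory, the policy samples a candidate response, and POLAR's scalar score for the candidate (judged against the reference) becomes that sample's PPO reward. The names `ppo_rewards`, `generate`, and `score` are hypothetical placeholders.

```python
from typing import Callable, Iterable, List, Tuple

def ppo_rewards(
    generate: Callable[[str], str],            # hypothetical: samples a response from the current policy
    score: Callable[[str, str, str], float],   # hypothetical POLAR-style scorer: (prompt, reference, candidate) -> scalar
    batch: Iterable[Tuple[str, str]],          # (prompt, reference trajectory) pairs
) -> List[float]:
    """Compute one scalar reward per sample for a PPO rollout batch."""
    rewards = []
    for prompt, reference in batch:
        candidate = generate(prompt)                          # policy rollout
        rewards.append(score(prompt, reference, candidate))   # scalar reward: candidate judged against the reference
    return rewards
```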
@@ -225,7 +225,7 @@ Unlike traditional reward models, POLAR requires an additional reference traject

 ### Training steps

-- **Step 0:** Prepare the config. We provide examplar ready-to-use configs [here](
+- **Step 0:** Prepare the config. We provide exemplar ready-to-use configs [here](https://github.com/InternLM/POLAR/blob/main/examples/xtuner_configs/POLAR_1_8B_full_varlenattn_custom_dataset.py). If the provided configs do not meet your requirements, copy one and modify it following the [xtuner guideline](https://github.com/InternLM/xtuner/blob/main/docs/en/get_started/quickstart.md). For more details on reward model training settings, see the xtuner [reward model guideline](https://github.com/InternLM/xtuner/blob/main/docs/en/reward_model/modify_settings.md).
 - **Step 1:** Start fine-tuning.
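If you prefer to script the Step 0 copy-and-modify workflow rather than editing the config by hand, something like the following works with mmengine's `Config` utilities, which xtuner configs are built on. This is only a sketch under assumptions: the field names (`pretrained_model_name_or_path`, `max_length`) and the checkpoint path are illustrative, so verify them against the provided example config before use.

```python
# Sketch of Step 0: load the provided xtuner config, adjust a few fields, and write a modified copy.
# Field names and paths below are assumptions; check them against the example config.
from mmengine.config import Config

cfg = Config.fromfile("POLAR_1_8B_full_varlenattn_custom_dataset.py")  # the provided example config
cfg.pretrained_model_name_or_path = "/path/to/POLAR-1_8B-base"         # assumed field: checkpoint to fine-tune
cfg.max_length = 16384                                                  # assumed field: tune for your preference data
cfg.dump("my_polar_reward_config.py")                                   # modified copy used for Step 1
```

Step 1 then typically launches training on the resulting config via xtuner's CLI (e.g. `xtuner train my_polar_reward_config.py`).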
@@ -368,5 +368,10 @@ Code and model weights are licensed under Apache-2.0.

 # Citation

 ```
-
+@article{dou2025pretrained,
+  title={Pre-Trained Policy Discriminators are General Reward Models},
+  author={Dou, Shihan and Liu, Shichun and Yang, Yuming and Zou, Yicheng and Zhou, Yunhua and Xing, Shuhao and Huang, Chenhao and Ge, Qiming and Song, Demin and Lv, Haijun and others},
+  journal={arXiv preprint arXiv:2507.05197},
+  year={2025}
+}
 ```