Update README.md
README.md CHANGED
@@ -11,7 +11,7 @@
 
 
 [💻 Github](https://github.com/InternLM/POLAR) |
-[📜 Paper](https://arxiv.org/abs/
+[📜 Paper](https://arxiv.org/abs/2507.05197)<br>
 
 [English](./README.md) |
 [简体中文](./README_zh-CN.md)
@@ -37,7 +37,7 @@ POLAR represents a significant breakthrough in scalar-based reward models achiev
 
 **POLAR-1.8B-Base** refers to the pre-trained-only checkpoint, ideal for customized fine-tuning according to specific preferences. The "ready-to-use" checkpoint **POLAR-1.8B** has already been fine-tuned on general preference data, making it suitable for immediate use in most scenarios.
 
-We conducted a comprehensive evaluation of POLAR-1.8B via the Proximal Policy Optimization (PPO) algorithm. We evaluated the downstream RL performance of four different policy models using [OpenCompass](https://github.com/internLM/OpenCompass/). More details are available in our [Paper](https://arxiv.org/abs/
+We conducted a comprehensive evaluation of POLAR-1.8B via the Proximal Policy Optimization (PPO) algorithm. We evaluated the downstream RL performance of four different policy models using [OpenCompass](https://github.com/internLM/OpenCompass/). More details are available in our [Paper](https://arxiv.org/abs/2507.05197).
 
 <img src="./misc/result.png"/><br>
 
@@ -368,5 +368,10 @@ Code and model weights are licensed under Apache-2.0.
 # Citation
 
 ```
-
+@article{dou2025pretrained,
+  title={Pre-Trained Policy Discriminators are General Reward Models},
+  author={Dou, Shihan and Liu, Shichun and Yang, Yuming and Zou, Yicheng and Zhou, Yunhua and Xing, Shuhao and Huang, Chenhao and Ge, Qiming and Song, Demin and Lv, Haijun and others},
+  journal={arXiv preprint arXiv:2507.05197},
+  year={2025}
+}
 ```