Update README.md
README.md CHANGED
@@ -11,7 +11,7 @@
 
 
 [💻 Github](https://github.com/InternLM/POLAR) |
-[📜 Paper](https://arxiv.org/abs/
+[📜 Paper](https://arxiv.org/abs/2507.05197)<br>
 
 [English](./README.md) |
 [简体中文](./README_zh-CN.md)
@@ -37,7 +37,7 @@ POLAR represents a significant breakthrough in scalar-based reward models achiev
 
 **POLAR-1.8B-Base** refers to the pre-trained-only checkpoint, ideal for customized fine-tuning according to specific preferences. The "ready-to-use" checkpoint **POLAR-1.8B** has already been fine-tuned on general preference data, making it suitable for immediate use in most scenarios.
 
-We conducted a comprehensive evaluation of POLAR-1.8B via the Proximal Policy Optimization (PPO) algorithm. We evaluated the downstream RL performance of four different policy models using [OpenCompass](https://github.com/internLM/OpenCompass/). More details are available in our [Paper](https://arxiv.org/abs/
+We conducted a comprehensive evaluation of POLAR-1.8B via the Proximal Policy Optimization (PPO) algorithm. We evaluated the downstream RL performance of four different policy models using [OpenCompass](https://github.com/internLM/OpenCompass/). More details are available in our [Paper](https://arxiv.org/abs/2507.05197).
 
 <img src="./misc/result.png"/><br>
 
@@ -368,5 +368,10 @@ Code and model weights are licensed under Apache-2.0.
 # Citation
 
 ```
-
+@article{dou2025pretrained,
+  title={Pre-Trained Policy Discriminators are General Reward Models},
+  author={Dou, Shihan and Liu, Shichun and Yang, Yuming and Zou, Yicheng and Zhou, Yunhua and Xing, Shuhao and Huang, Chenhao and Ge, Qiming and Song, Demin and Lv, Haijun and others},
+  journal={arXiv preprint arXiv:2507.05197},
+  year={2025}
+}
 ```