Update README.md
README.md CHANGED
@@ -11,7 +11,7 @@

 [💻 Github](https://github.com/InternLM/POLAR) |
-[📜 Paper](https://arxiv.org/abs/
+[📜 Paper](https://arxiv.org/abs/2507.05197)<br>

 [English](./README.md) |
 [简体中文](./README_zh-CN.md)
@@ -37,7 +37,7 @@ POLAR represents a significant breakthrough in scalar-based reward models achiev

 **POLAR-1.8B-Base** refers to the pre-trained-only checkpoint, ideal for customized fine-tuning according to specific preferences. The "ready-to-use" checkpoint **POLAR-1.8B** has already been fine-tuned on general preference data, making it suitable for immediate use in most scenarios.

-We conducted a comprehensive evaluation of POLAR-1.8B via the Proximal Policy Optimization (PPO) algorithm. We evaluate the downstream RL performances of four different policy models using [OpenCompass](https://github.com/internLM/OpenCompass/). More details are available in our [Paper](https://arxiv.org/abs/
+We conducted a comprehensive evaluation of POLAR-1.8B via the Proximal Policy Optimization (PPO) algorithm. We evaluated the downstream RL performance of four different policy models using [OpenCompass](https://github.com/internLM/OpenCompass/). More details are available in our [Paper](https://arxiv.org/abs/2507.05197).

 <img src="./misc/result.png"/><br>
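For readers unfamiliar with how a scalar reward model drives PPO fine-tuning, here is a minimal conceptual sketch, not code from this repository: each RL prompt carries a reference trajectory, the policy samples a candidate response, and POLAR's scalar score for the candidate (judged against the reference) becomes that sample's PPO reward. The names `ppo_rewards`, `generate`, and `score` are hypothetical placeholders.

```python
from typing import Callable, Iterable, List, Tuple

def ppo_rewards(
    generate: Callable[[str], str],            # hypothetical: samples a response from the current policy
    score: Callable[[str, str, str], float],   # hypothetical POLAR-style scorer: (prompt, reference, candidate) -> scalar
    batch: Iterable[Tuple[str, str]],          # (prompt, reference trajectory) pairs
) -> List[float]:
    """Compute one scalar reward per sample for a PPO rollout batch."""
    rewards = []
    for prompt, reference in batch:
        candidate = generate(prompt)                          # policy rollout
        rewards.append(score(prompt, reference, candidate))   # scalar reward: candidate judged against the reference
    return rewards
```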
@@ -225,7 +225,7 @@ Unlike traditional reward models, POLAR requires an additional reference traject

 ### Training steps

-- **Step 0:** Prepare the config. We provide examplar ready-to-use configs [here](
+- **Step 0:** Prepare the config. We provide exemplar ready-to-use configs [here](https://github.com/InternLM/POLAR/blob/main/examples/xtuner_configs/POLAR_1_8B_full_varlenattn_custom_dataset.py). If the provided configs do not meet your requirements, copy one and modify it following the [xtuner guideline](https://github.com/InternLM/xtuner/blob/main/docs/en/get_started/quickstart.md). For more details on reward model training settings, see the xtuner [reward model guideline](https://github.com/InternLM/xtuner/blob/main/docs/en/reward_model/modify_settings.md).
 - **Step 1:** Start fine-tuning.
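If you prefer to script the Step 0 copy-and-modify workflow rather than editing the config by hand, something like the following works with mmengine's `Config` utilities, which xtuner configs are built on. This is only a sketch under assumptions: the field names (`pretrained_model_name_or_path`, `max_length`) and the checkpoint path are illustrative, so verify them against the provided example config before use.

```python
# Sketch of Step 0: load the provided xtuner config, adjust a few fields, and write a modified copy.
# Field names and paths below are assumptions; check them against the example config.
from mmengine.config import Config

cfg = Config.fromfile("POLAR_1_8B_full_varlenattn_custom_dataset.py")  # the provided example config
cfg.pretrained_model_name_or_path = "/path/to/POLAR-1_8B-base"         # assumed field: checkpoint to fine-tune
cfg.max_length = 16384                                                  # assumed field: tune for your preference data
cfg.dump("my_polar_reward_config.py")                                   # modified copy used for Step 1
```

Step 1 then typically launches training on the resulting config via xtuner's CLI (e.g. `xtuner train my_polar_reward_config.py`).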
@@ -368,5 +368,10 @@ Code and model weights are licensed under Apache-2.0.

 # Citation

 ```
-
+@article{dou2025pretrained,
+  title={Pre-Trained Policy Discriminators are General Reward Models},
+  author={Dou, Shihan and Liu, Shichun and Yang, Yuming and Zou, Yicheng and Zhou, Yunhua and Xing, Shuhao and Huang, Chenhao and Ge, Qiming and Song, Demin and Lv, Haijun and others},
+  journal={arXiv preprint arXiv:2507.05197},
+  year={2025}
+}
 ```