RowitZou committed
Commit bc0f4b3 · verified · 1 Parent(s): 8a36648

Update README.md
Files changed (1)
  1. README.md +9 -4
README.md CHANGED
@@ -11,7 +11,7 @@


 [💻 Github](https://github.com/InternLM/POLAR) |
- [📜 Paper](https://arxiv.org/abs/xxxxxx)<br>
+ [📜 Paper](https://arxiv.org/abs/2507.05197)<br>

 [English](./README.md) |
 [简体中文](./README_zh-CN.md)
@@ -37,7 +37,7 @@ POLAR represents a significant breakthrough in scalar-based reward models achiev

 **POLAR-1.8B-Base** refers to the pre-trained-only checkpoint, ideal for customized fine-tuning according to specific preferences. The "ready-to-use" checkpoint **POLAR-1.8B** has already been fine-tuned on general preference data, making it suitable for immediate use in most scenarios.

- We conducted a comprehensive evaluation of POLAR-1.8B via the Proximal Policy Optimization (PPO) algorithm. We evaluate the downstream RL performances of four different policy models using [OpenCompass](https://github.com/internLM/OpenCompass/). More details are available in our [Paper](https://arxiv.org/abs/xxxxxx).
+ We conducted a comprehensive evaluation of POLAR-1.8B via the Proximal Policy Optimization (PPO) algorithm, evaluating the downstream RL performance of four different policy models with [OpenCompass](https://github.com/internLM/OpenCompass/). More details are available in our [Paper](https://arxiv.org/abs/2507.05197).

 <img src="./misc/result.png"/><br>

@@ -225,7 +225,7 @@ Unlike traditional reward models, POLAR requires an additional reference traject

 ### Training steps

- - **Step 0:** Prepare the config. We provide examplar ready-to-use configs [here](./examples/xtuner_configs/POLAR_1_8B_full_varlenattn_custom_dataset.py). If the provided configs cannot meet the requirements, please copy the provided config and do modification following the [xtuner guideline](https://github.com/InternLM/xtuner/blob/main/docs/en/get_started/quickstart.md). For more details of reward model training settings, please see the xtuner [reward model guideline](https://github.com/InternLM/xtuner/blob/main/docs/en/reward_model/modify_settings.md).
+ - **Step 0:** Prepare the config. We provide an example ready-to-use config [here](https://github.com/InternLM/POLAR/blob/main/examples/xtuner_configs/POLAR_1_8B_full_varlenattn_custom_dataset.py). If the provided config does not meet your requirements, copy it and modify it following the [xtuner guideline](https://github.com/InternLM/xtuner/blob/main/docs/en/get_started/quickstart.md). For more details on reward model training settings, see the xtuner [reward model guideline](https://github.com/InternLM/xtuner/blob/main/docs/en/reward_model/modify_settings.md).

 - **Step 1:** Start fine-tuning.

@@ -368,5 +368,10 @@ Code and model weights are licensed under Apache-2.0.
 # Citation

 ```
- TBC
+ @article{dou2025pretrained,
+ title={Pre-Trained Policy Discriminators are General Reward Models},
+ author={Dou, Shihan and Liu, Shichun and Yang, Yuming and Zou, Yicheng and Zhou, Yunhua and Xing, Shuhao and Huang, Chenhao and Ge, Qiming and Song, Demin and Lv, Haijun and others},
+ journal={arXiv preprint arXiv:2507.05197},
+ year={2025}
+ }
 ```
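
For the **Step 0**/**Step 1** workflow touched by the third hunk above, the following is a minimal launch sketch, not part of the commit. It assumes xtuner is installed, the POLAR repo root is the working directory, and xtuner's standard `xtuner train` entry point (per the linked quickstart); the GPU count and DeepSpeed stage are placeholders to adapt to your hardware.

```bash
# Illustrative only: fine-tune POLAR-1.8B-Base with the example xtuner config from the diff.
# NPROC_PER_NODE sets the number of GPUs; --deepspeed selects one of xtuner's built-in DeepSpeed configs.
NPROC_PER_NODE=8 xtuner train \
    ./examples/xtuner_configs/POLAR_1_8B_full_varlenattn_custom_dataset.py \
    --deepspeed deepspeed_zero2
```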