Add pipeline_tag, library_name, and paper abstract to model card
This PR improves the model card by adding `pipeline_tag: text-ranking` and `library_name: transformers` to the metadata. This will enhance discoverability on the Hugging Face Hub (e.g., at https://huggingface.co/models?pipeline_tag=text-ranking) and enable the "how to use" widget for Transformers models.
Additionally, the paper abstract has been added to provide more comprehensive information about the model's purpose and methodology.
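
For context on the discoverability point, the snippet below is a minimal sketch of filtering the Hub by the new tag programmatically. It assumes a recent `huggingface_hub` release in which `list_models()` accepts a `pipeline_tag` argument; the `search` string is only an example filter.

```python
# Sketch: list Hub models carrying the new pipeline tag.
# Assumes a recent huggingface_hub release where list_models() accepts
# pipeline_tag directly; adjust to your installed version if needed.
from huggingface_hub import HfApi

api = HfApi()
for model in api.list_models(pipeline_tag="text-ranking", search="POLAR", limit=10):
    print(model.id, model.pipeline_tag)
```
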
README.md (CHANGED)

@@ -1,16 +1,19 @@

---
base_model:
- internlm/internlm2_5-1_8b
language:
- en
- zh
license: apache-2.0
tags:
- Reward
- RL
- RFT
- Reward Model
pipeline_tag: text-ranking
library_name: transformers
---

<div align="center">

<img src="./misc/logo.png" width="400"/><br>

@@ -31,17 +34,21 @@ tags:

</div>

# Abstract

We offer a novel perspective on reward modeling by formulating it as a policy discriminator, which quantifies the difference between two policies to generate a reward signal, guiding the training policy towards a target policy with desired behaviors. Based on this conceptual insight, we propose a scalable pre-training method named Policy Discriminative Learning (POLAR), which trains a reward model (RM) to discern identical policies and discriminate different ones. Unlike traditional reward modeling methods relying on absolute preferences, POLAR captures the relative difference between one policy and an arbitrary target policy, which is a scalable, high-level optimization objective suitable for modeling generic ranking relationships. Leveraging the POLAR pre-training paradigm, we present a series of RMs with parameter scales from 1.8B to 7B. Empirical results show that POLAR substantially outperforms traditional non-pre-trained methods, significantly enhancing RM performance. For instance, POLAR-7B could improve preference accuracy from 54.8% to 81.0% on STEM tasks and from 57.9% to 85.5% on creative writing tasks compared to SOTA baselines. POLAR also shows robust generalization capabilities in RLHF using Reinforcement Fine-tuning (RFT), providing reliable reward signals and markedly enhancing policy performance, improving LLaMa3.1-8B from an average of 47.36% to 56.33% and Qwen2.5-32B from 64.49% to 70.47% on 20 benchmarks. Moreover, scaling experiments reveal a clear power-law relationship between computation and performance, supported by linear correlation coefficients approaching 0.99. The impressive performance, strong generalization, and scaling properties suggest that POLAR is a promising direction for developing general and strong reward models.

# Introduction

POLAR represents a significant breakthrough in scalar-based reward models achieved through large-scale pre-training. It leverages the innovative **POL**icy Discrimin**A**tive Lea**R**ning (**POLAR**) paradigm, a scalable, high-level optimization objective, to effectively discriminate between policies using large-scale synthetic corpora. Following pre-training, POLAR RMs are fine-tuned with minimal preference data, rapidly aligning with human preferences. Key features of POLAR include:

* **Innovative Pre-training Paradigm:** POLAR trains a reward model to discern identical policies and discriminate different ones. Unlike traditional reward modeling methods relying on absolute preferences, POLAR captures the relative difference between two policies, which is a scalable, high-level optimization objective suitable for modeling generic ranking relationships (a toy sketch of this objective follows the figure below).

* **Tailored for Reinforcement Fine-tuning:** POLAR assigns rewards to LLM trajectories based on given references, perfectly aligning with the Reinforcement Fine-tuning (RFT) framework. POLAR provides a promising solution for applying RFT in generic scenarios.

* **Superior Performance and Generalization:** POLAR achieves state-of-the-art results on downstream reinforcement learning tasks, consistently delivering accurate and reliable reward signals that generalize effectively to unseen scenarios and significantly reduce reward hacking.

* **Easy to Customize:** Pre-trained checkpoints of POLAR are available, enabling researchers to conveniently fine-tune the RM for various customized scenarios, thus facilitating straightforward adaptation and expansion tailored to specific applications and experimental requirements.

<img src="./misc/intro.jpeg"/><br>

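To make the pre-training idea above concrete, here is a self-contained toy sketch of a policy-discriminative training step: a reward model scores a candidate trajectory against a reference, and a Bradley-Terry-style loss pushes the score of a sample drawn from the same policy as the reference above that of a sample drawn from a different policy. `polar_style_loss` and `ToyScorer` are illustrative placeholders, not POLAR's actual architecture or training pipeline (see the paper and the xtuner configs for those).

```python
# Illustrative only: a toy policy-discriminative objective in the spirit of POLAR.
# ToyScorer is a stand-in scalar scorer; POLAR's real model, tokenization, and
# pre-training pipeline differ.
import torch
import torch.nn.functional as F


def polar_style_loss(reward_model, reference, same_policy_sample, other_policy_sample):
    """Bradley-Terry-style objective: a trajectory drawn from the same policy as
    the reference should score higher than one drawn from a different policy."""
    r_pos = reward_model(reference, same_policy_sample)
    r_neg = reward_model(reference, other_policy_sample)
    return -F.logsigmoid(r_pos - r_neg).mean()


class ToyScorer(torch.nn.Module):
    """Toy scalar scorer over token ids, just to make the sketch runnable."""

    def __init__(self, vocab_size=100, dim=16):
        super().__init__()
        self.emb = torch.nn.EmbeddingBag(vocab_size, dim)
        self.head = torch.nn.Linear(2 * dim, 1)

    def forward(self, ref_ids, cand_ids):
        feats = torch.cat([self.emb(ref_ids), self.emb(cand_ids)], dim=-1)
        return self.head(feats).squeeze(-1)


scorer = ToyScorer()
ref = torch.randint(0, 100, (1, 12))  # reference trajectory (token ids)
pos = torch.randint(0, 100, (1, 12))  # sampled from the same policy as the reference
neg = torch.randint(0, 100, (1, 12))  # sampled from a different policy
loss = polar_style_loss(scorer, ref, pos, neg)
loss.backward()
print(float(loss))
```
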
@@ -60,18 +67,18 @@ We conducted a comprehensive evaluation of POLAR-1.8B via the Proximal Policy Op

You can use the latest [xtuner](https://github.com/InternLM/xtuner) to fine-tune and use POLAR. Xtuner is an efficient, flexible, and full-featured toolkit for fine-tuning LLMs.

- It is recommended to build a Python 3.10 virtual environment using conda:

```bash
conda create --name xtuner-env python=3.10 -y
conda activate xtuner-env
```

- Install xtuner via pip (a quick import check to confirm the install follows these steps):

```shell
pip install 'git+https://github.com/InternLM/xtuner.git@main#egg=xtuner[deepspeed]'
```

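After installation, a quick import check can confirm the environment is set up (a minimal sketch; it assumes the package exposes a standard `__version__` attribute, as most releases do):

```python
# Minimal install check: confirm xtuner imports in the new environment.
# Assumes xtuner exposes __version__.
import xtuner

print(xtuner.__version__)
```
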
## Inference

@@ -128,7 +135,7 @@ print(rewards)

#### Requirements

- lmdeploy >= 0.9.1

#### Server Launch

@@ -158,7 +165,7 @@ print(rewards)

#### Requirements

- 0.4.3.post4 <= sglang <= 0.4.4.post1

#### Server Launch

@@ -189,7 +196,7 @@ print(rewards)

#### Requirements

- vllm >= 0.8.0

#### Server Launch

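The three inference backends above pin different version ranges. Before launching a server, a small illustrative check of what is installed locally can save a failed start; this uses only the standard library and the package names as published on PyPI:

```python
# Illustrative check of the backend versions pinned in the Requirements above.
# Reads installed metadata only; the heavy packages are not imported.
from importlib.metadata import PackageNotFoundError, version

requirements = {
    "lmdeploy": ">= 0.9.1",
    "sglang": ">= 0.4.3.post4, <= 0.4.4.post1",
    "vllm": ">= 0.8.0",
}

for pkg, constraint in requirements.items():
    try:
        print(f"{pkg} {version(pkg)} (card requires {constraint})")
    except PackageNotFoundError:
        print(f"{pkg} not installed (card requires {constraint})")
```
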
@@ -220,8 +227,8 @@ print(rewards)

### Requirements

- flash_attn
- tensorboard

### Data format

@@ -238,31 +245,31 @@ Unlike traditional reward models, POLAR requires an additional reference traject

### Training steps

- **Step 0:** Prepare the config. We provide example ready-to-use configs [here](https://github.com/InternLM/POLAR/blob/main/examples/xtuner_configs/POLAR_1_8B_full_varlenattn_custom_dataset.py). If the provided configs do not meet your requirements, copy one of them and modify it following the [xtuner guideline](https://github.com/InternLM/xtuner/blob/main/docs/en/get_started/quickstart.md). For more details on reward model training settings, please see the xtuner [reward model guideline](https://github.com/InternLM/xtuner/blob/main/docs/en/reward_model/modify_settings.md).

- **Step 1:** Start fine-tuning.

```shell
xtuner train ${CONFIG_FILE_PATH}
```

For example, you can start fine-tuning POLAR-1_8B-Base as follows:

```shell
# On a single GPU
xtuner train /path/to/POLAR_1_8B_full_varlenattn_custom_dataset.py --deepspeed deepspeed_zero2

# On multiple GPUs
NPROC_PER_NODE=${GPU_NUM} xtuner train /path/to/POLAR_1_8B_full_varlenattn_custom_dataset.py --deepspeed deepspeed_zero2
```

Here, `--deepspeed` enables [DeepSpeed](https://github.com/microsoft/DeepSpeed) to optimize the training. Xtuner comes with several integrated strategies, including ZeRO-1, ZeRO-2, and ZeRO-3. If you wish to disable this feature, simply remove this argument.

- **Step 2:** Convert the saved PTH model (if using DeepSpeed, it will be a directory) to a Hugging Face model (a small load check is sketched after these steps):

```shell
xtuner convert pth_to_hf ${CONFIG_FILE_PATH} ${PTH} ${SAVE_PATH}
```

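After conversion, it can be worth confirming that the exported directory loads cleanly before using it. The sketch below is only a load check and makes assumptions: it relies on `transformers` `AutoModel`/`AutoTokenizer` with `trust_remote_code=True` (in line with the card's `library_name: transformers`), and `/path/to/SAVE_PATH` is a hypothetical stand-in for the `${SAVE_PATH}` passed to `xtuner convert pth_to_hf`; see the POLAR repository for the actual reward-scoring API.

```python
# Sanity-check sketch: load the checkpoint produced by the convert step above.
# Assumptions: the exported folder keeps the model's remote-code files, and
# "/path/to/SAVE_PATH" stands in for your ${SAVE_PATH}.
from transformers import AutoModel, AutoTokenizer

save_path = "/path/to/SAVE_PATH"  # hypothetical path; replace with your own
tokenizer = AutoTokenizer.from_pretrained(save_path, trust_remote_code=True)
model = AutoModel.from_pretrained(save_path, trust_remote_code=True)
print(type(model).__name__, sum(p.numel() for p in model.parameters()), "parameters")
```
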
# Examples

@@ -297,7 +304,9 @@ rewards = client(data)

sorted_res = sorted(zip(outputs, rewards), key=lambda x: x[1], reverse=True)

for output, reward in sorted_res:
    print(f"Output: {output}\nReward: {reward}\n")
```

```txt
@@ -351,7 +360,9 @@ rewards = client(data)

sorted_res = sorted(zip(outputs, rewards), key=lambda x: x[1], reverse=True)

for output, reward in sorted_res:
    print(f"Output: {output}\nReward: {reward}\n")
```

```txt