nielsr (HF Staff) committed
Commit ac176f3 · verified · 1 Parent(s): 863c4a5

Add pipeline_tag, library_name, and paper abstract to model card


This PR improves the model card by adding `pipeline_tag: text-ranking` and `library_name: transformers` to the metadata. This enhances discoverability on the Hugging Face Hub (e.g., at https://huggingface.co/models?pipeline_tag=text-ranking) and enables the "how to use" widget for Transformers models.

Additionally, the paper abstract has been added to provide more comprehensive information about the model's purpose and methodology.
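
For reference, a minimal sketch of how the new tag surfaces programmatically, assuming a recent `huggingface_hub` release whose `list_models` helper accepts a `pipeline_tag` filter (the helper and its arguments are not part of this PR):

```python
# Sketch only: query the Hub for models exposing the text-ranking pipeline tag.
# Assumes a recent huggingface_hub; adjust if list_models lacks the pipeline_tag argument.
from huggingface_hub import list_models

for model in list_models(pipeline_tag="text-ranking", limit=5):
    print(model.id)  # this repo should appear here once the metadata is merged
```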

Files changed (1)
  1. README.md +48 -37
README.md CHANGED
@@ -1,16 +1,19 @@
  ---
- license: apache-2.0
+ base_model:
+ - internlm/internlm2_5-1_8b
  language:
  - en
  - zh
- base_model:
- - internlm/internlm2_5-1_8b
+ license: apache-2.0
  tags:
  - Reward
  - RL
  - RFT
  - Reward Model
+ pipeline_tag: text-ranking
+ library_name: transformers
  ---
+
  <div align="center">

  <img src="./misc/logo.png" width="400"/><br>
@@ -31,17 +34,21 @@ tags:

  </div>

+ # Abstract
+
+ We offer a novel perspective on reward modeling by formulating it as a policy discriminator, which quantifies the difference between two policies to generate a reward signal, guiding the training policy towards a target policy with desired behaviors. Based on this conceptual insight, we propose a scalable pre-training method named Policy Discriminative Learning (POLAR), which trains a reward model (RM) to discern identical policies and discriminate different ones. Unlike traditional reward modeling methods relying on absolute preferences, POLAR captures the relative difference between one policy and an arbitrary target policy, which is a scalable, high-level optimization objective suitable for modeling generic ranking relationships. Leveraging the POLAR pre-training paradigm, we present a series of RMs with parameter scales from 1.8B to 7B. Empirical results show that POLAR substantially outperforms traditional non-pre-trained methods, significantly enhancing RM performance. For instance, POLAR-7B could improve preference accuracy from 54.8% to 81.0% on STEM tasks and from 57.9% to 85.5% on creative writing tasks compared to SOTA baselines. POLAR also shows robust generalization capabilities in RLHF using Reinforcement Fine-tuning (RFT), providing reliable reward signals and markedly enhancing policy performance--improving LLaMa3.1-8B from an average of 47.36% to 56.33% and Qwen2.5-32B from 64.49% to 70.47% on 20 benchmarks. Moreover, scaling experiments reveal a clear power-law relationship between computation and performance, supported by linear correlation coefficients approaching 0.99. The impressive performance, strong generalization, and scaling properties suggest that POLAR is a promising direction for developing general and strong reward models.
+
  # Introduction

  POLAR represents a significant breakthrough in scalar-based reward models achieved through large-scale pre-training. It leverages the innovative **POL**icy Discrimin**A**tive Lea**R**ning (**POLAR**) paradigm——a scalable, high-level optimization objective——to effectively discriminate between policies using a large-scale synthetic corpora. Following pre-training, POLAR RMs are fine-tuned with minimal preference data, rapidly aligning with human preferences. Key features of POLAR include:

- * **Innovative Pre-training Paradigm:** POLAR trains a reward model to discern identical policies and discriminate different ones. Unlike traditional reward modeling methods relying on absolute preferences, POLAR captures the relative difference between two policies, which is a scalable, high-level optimization objective suitable for modeling generic ranking relationships.
+ * **Innovative Pre-training Paradigm:** POLAR trains a reward model to discern identical policies and discriminate different ones. Unlike traditional reward modeling methods relying on absolute preferences, POLAR captures the relative difference between two policies, which is a scalable, high-level optimization objective suitable for modeling generic ranking relationships.

- * **Tailored for Reinforcement Fine-tuning:** POLAR assigns rewards to LLM trajectories based on given references, perfectly aligning with the Reinforcement Fine-tuning (RFT) framework. POLAR provides a promising solution for applying RFT in generic scenarios.
+ * **Tailored for Reinforcement Fine-tuning:** POLAR assigns rewards to LLM trajectories based on given references, perfectly aligning with the Reinforcement Fine-tuning (RFT) framework. POLAR provides a promising solution for applying RFT in generic scenarios.

- * **Superior Performance and Generalization:** POLAR achieves state-of-the-art results on downstream reinforcement learning tasks, consistently delivering accurate and reliable reward signals that generalize effectively to unseen scenarios and significantly reducing reward hacking.
+ * **Superior Performance and Generalization:** POLAR achieves state-of-the-art results on downstream reinforcement learning tasks, consistently delivering accurate and reliable reward signals that generalize effectively to unseen scenarios and significantly reducing reward hacking.

- * **Easy to Customize:** Pre-trained checkpoints of POLAR are available, enabling researchers to conveniently fine-tune the RM for various customized scenarios, thus facilitating straightforward adaptation and expansion tailored to specific applications and experimental requirements.
+ * **Easy to Customize:** Pre-trained checkpoints of POLAR are available, enabling researchers to conveniently fine-tune the RM for various customized scenarios, thus facilitating straightforward adaptation and expansion tailored to specific applications and experimental requirements.

  <img src="./misc/intro.jpeg"/><br>

@@ -60,18 +67,18 @@ We conducted a comprehensive evaluation of POLAR-1.8B via the Proximal Policy Op

  You could employ the latest [xtuner](https://github.com/InternLM/xtuner) to fine-tune and use POLAR. Xtuner is an efficient, flexible and full-featured toolkit for fine-tuning LLMs.

- - It is recommended to build a Python-3.10 virtual environment using conda
+ - It is recommended to build a Python-3.10 virtual environment using conda

- ```bash
- conda create --name xtuner-env python=3.10 -y
- conda activate xtuner-env
- ```
+ ```bash
+ conda create --name xtuner-env python=3.10 -y
+ conda activate xtuner-env
+ ```

- - Install xtuner via pip
+ - Install xtuner via pip

- ```shell
- pip install 'git+https://github.com/InternLM/xtuner.git@main#egg=xtuner[deepspeed]'
- ```
+ ```shell
+ pip install 'git+https://github.com/InternLM/xtuner.git@main#egg=xtuner[deepspeed]'
+ ```

  ## Inference

@@ -128,7 +135,7 @@ print(rewards)

  #### Requirements

- - lmdeploy >= 0.9.1
+ - lmdeploy >= 0.9.1

  #### Server Launch

@@ -158,7 +165,7 @@ print(rewards)

  #### Requirements

- - 0.4.3.post4 <= sglang <= 0.4.4.post1
+ - 0.4.3.post4 <= sglang <= 0.4.4.post1

  #### Server Launch

@@ -189,7 +196,7 @@ print(rewards)

  #### Requirements

- - vllm >= 0.8.0
+ - vllm >= 0.8.0

  #### Server Launch

@@ -220,8 +227,8 @@ print(rewards)

  ### Requirements

- - flash_attn
- - tensorboard
+ - flash_attn
+ - tensorboard

  ### Data format

@@ -238,31 +245,31 @@ Unlike traditional reward models, POLAR requires an additional reference traject

  ### Training steps

- - **Step 0:** Prepare the config. We provide examplar ready-to-use configs [here](https://github.com/InternLM/POLAR/blob/main/examples/xtuner_configs/POLAR_1_8B_full_varlenattn_custom_dataset.py). If the provided configs cannot meet the requirements, please copy the provided config and do modification following the [xtuner guideline](https://github.com/InternLM/xtuner/blob/main/docs/en/get_started/quickstart.md). For more details of reward model training settings, please see the xtuner [reward model guideline](https://github.com/InternLM/xtuner/blob/main/docs/en/reward_model/modify_settings.md).
+ - **Step 0:** Prepare the config. We provide examplar ready-to-use configs [here](https://github.com/InternLM/POLAR/blob/main/examples/xtuner_configs/POLAR_1_8B_full_varlenattn_custom_dataset.py). If the provided configs cannot meet the requirements, please copy the provided config and do modification following the [xtuner guideline](https://github.com/InternLM/xtuner/blob/main/docs/en/get_started/quickstart.md). For more details of reward model training settings, please see the xtuner [reward model guideline](https://github.com/InternLM/xtuner/blob/main/docs/en/reward_model/modify_settings.md).

- - **Step 1:** Start fine-tuning.
+ - **Step 1:** Start fine-tuning.

  ```shell
  xtuner train ${CONFIG_FILE_PATH}
  ```

- For example, you can start the fine-tuning of POLAR-1_8B-Base by
+ For example, you can start the fine-tuning of POLAR-1_8B-Base by

- ```shell
- # On a single GPU
- xtuner train /path/to/POLAR_1_8B_full_varlenattn_custom_dataset.py --deepspeed deepspeed_zero2
+ ```shell
+ # On a single GPU
+ xtuner train /path/to/POLAR_1_8B_full_varlenattn_custom_dataset.py --deepspeed deepspeed_zero2

- # On multiple GPUs
- NPROC_PER_NODE=${GPU_NUM} xtuner train /path/to/POLAR_1_8B_full_varlenattn_custom_dataset.py --deepspeed deepspeed_zero2
- ```
+ # On multiple GPUs
+ NPROC_PER_NODE=${GPU_NUM} xtuner train /path/to/POLAR_1_8B_full_varlenattn_custom_dataset.py --deepspeed deepspeed_zero2
+ ```

- Here, `--deepspeed` means using [DeepSpeed](https://github.com/microsoft/DeepSpeed) to optimize the training. Xtuner comes with several integrated strategies including ZeRO-1, ZeRO-2, and ZeRO-3. If you wish to disable this feature, simply remove this argument.
+ Here, `--deepspeed` means using [DeepSpeed](https://github.com/microsoft/DeepSpeed) to optimize the training. Xtuner comes with several integrated strategies including ZeRO-1, ZeRO-2, and ZeRO-3. If you wish to disable this feature, simply remove this argument.

- - **Step 2:** Convert the saved PTH model (if using DeepSpeed, it will be a directory) to Hugging Face model, by
+ - **Step 2:** Convert the saved PTH model (if using DeepSpeed, it will be a directory) to Hugging Face model, by

- ```shell
- xtuner convert pth_to_hf ${CONFIG_FILE_PATH} ${PTH} ${SAVE_PATH}
- ```
+ ```shell
+ xtuner convert pth_to_hf ${CONFIG_FILE_PATH} ${PTH} ${SAVE_PATH}
+ ```

  # Examples

@@ -297,7 +304,9 @@ rewards = client(data)
  sorted_res = sorted(zip(outputs, rewards), key=lambda x: x[1], reverse=True)

  for output, reward in sorted_res:
- print(f"Output: {output}\nReward: {reward}\n")
+ print(f"Output: {output}
+ Reward: {reward}
+ ")
  ```

  ```txt
@@ -351,7 +360,9 @@ rewards = client(data)
  sorted_res = sorted(zip(outputs, rewards), key=lambda x: x[1], reverse=True)

  for output, reward in sorted_res:
- print(f"Output: {output}\nReward: {reward}\n")
+ print(f"Output: {output}
+ Reward: {reward}
+ ")
  ```

  ```txt
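
The ranking loop shown in the example hunks above can be reproduced standalone as follows. Note that the `+` lines split the f-string across physical lines, which a plain (non-triple-quoted) f-string does not allow, so the single-line escaped form from the original README is used here; the `outputs` and `rewards` values below are placeholders, not model outputs:

```python
# Self-contained sketch of the ranking loop; outputs and rewards are placeholder
# values standing in for the POLAR client results shown in the diff above.
outputs = ["answer A", "answer B", "answer C"]
rewards = [0.12, -0.53, 0.87]

# Sort candidate outputs from highest to lowest reward.
sorted_res = sorted(zip(outputs, rewards), key=lambda x: x[1], reverse=True)

for output, reward in sorted_res:
    # Keeping "\n" as an escape sequence keeps the f-string on one line and valid.
    print(f"Output: {output}\nReward: {reward}\n")
```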