nielsr (HF Staff) committed
Commit 2fb9d59 · verified · 1 Parent(s): 961acae

Add pipeline tag and library name to model card


This PR improves the model card by adding the `pipeline_tag: text-ranking` to enhance discoverability on the Hugging Face Hub, ensuring users can find your model at https://huggingface.co/models?pipeline_tag=text-ranking. It also adds `library_name: transformers` to correctly associate the model with the Transformers library, enabling the "how to use" code snippets on the model page.
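
For reference, the model card's YAML front matter after this change reads as follows (reproduced from the diff below; only the `pipeline_tag` and `library_name` keys are new, the remaining metadata is reordered but otherwise unchanged):

```yaml
---
base_model:
- internlm/internlm2_5-7b
language:
- en
- zh
license: apache-2.0
tags:
- Reward
- RL
- RFT
- Reward Model
pipeline_tag: text-ranking
library_name: transformers
---
```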

Files changed (1)
  1. README.md +44 -37
README.md CHANGED
@@ -1,16 +1,19 @@
 ---
-license: apache-2.0
+base_model:
+- internlm/internlm2_5-7b
 language:
 - en
 - zh
-base_model:
-- internlm/internlm2_5-7b
+license: apache-2.0
 tags:
 - Reward
 - RL
 - RFT
 - Reward Model
+pipeline_tag: text-ranking
+library_name: transformers
 ---
+
 <div align="center">

 <img src="./misc/logo.png" width="400"/><br>
@@ -35,13 +38,13 @@ tags:

 POLAR represents a significant breakthrough in scalar-based reward models achieved through large-scale pre-training. It leverages the innovative **POL**icy Discrimin**A**tive Lea**R**ning (**POLAR**) paradigm——a scalable, high-level optimization objective——to effectively discriminate between policies using a large-scale synthetic corpora. Following pre-training, POLAR RMs are fine-tuned with minimal preference data, rapidly aligning with human preferences. Key features of POLAR include:

-* **Innovative Pre-training Paradigm:** POLAR trains a reward model to discern identical policies and discriminate different ones. Unlike traditional reward modeling methods relying on absolute preferences, POLAR captures the relative difference between two policies, which is a scalable, high-level optimization objective suitable for modeling generic ranking relationships.
+* **Innovative Pre-training Paradigm:** POLAR trains a reward model to discern identical policies and discriminate different ones. Unlike traditional reward modeling methods relying on absolute preferences, POLAR captures the relative difference between two policies, which is a scalable, high-level optimization objective suitable for modeling generic ranking relationships.

-* **Tailored for Reinforcement Fine-tuning:** POLAR assigns rewards to LLM trajectories based on given references, perfectly aligning with the Reinforcement Fine-tuning (RFT) framework. POLAR provides a promising solution for applying RFT in generic scenarios.
+* **Tailored for Reinforcement Fine-tuning:** POLAR assigns rewards to LLM trajectories based on given references, perfectly aligning with the Reinforcement Fine-tuning (RFT) framework. POLAR provides a promising solution for applying RFT in generic scenarios.

-* **Superior Performance and Generalization:** POLAR achieves state-of-the-art results on downstream reinforcement learning tasks, consistently delivering accurate and reliable reward signals that generalize effectively to unseen scenarios and significantly reducing reward hacking.
+* **Superior Performance and Generalization:** POLAR achieves state-of-the-art results on downstream reinforcement learning tasks, consistently delivering accurate and reliable reward signals that generalize effectively to unseen scenarios and significantly reducing reward hacking.

-* **Easy to Customize:** Pre-trained checkpoints of POLAR are available, enabling researchers to conveniently fine-tune the RM for various customized scenarios, thus facilitating straightforward adaptation and expansion tailored to specific applications and experimental requirements.
+* **Easy to Customize:** Pre-trained checkpoints of POLAR are available, enabling researchers to conveniently fine-tune the RM for various customized scenarios, thus facilitating straightforward adaptation and expansion tailored to specific applications and experimental requirements.

 <img src="./misc/intro.jpeg"/><br>

@@ -60,18 +63,18 @@ We conducted a comprehensive evaluation of POLAR-7B via the Proximal Policy Opti

 You could employ the latest [xtuner](https://github.com/InternLM/xtuner) to fine-tune and use POLAR. Xtuner is an efficient, flexible and full-featured toolkit for fine-tuning LLMs.

-It is recommended to build a Python-3.10 virtual environment using conda
+- It is recommended to build a Python-3.10 virtual environment using conda

-```bash
-conda create --name xtuner-env python=3.10 -y
-conda activate xtuner-env
-```
+```bash
+conda create --name xtuner-env python=3.10 -y
+conda activate xtuner-env
+```

-- Install xtuner via pip
+- Install xtuner via pip

-```shell
-pip install 'git+https://github.com/InternLM/xtuner.git@main#egg=xtuner[deepspeed]'
-```
+```shell
+pip install 'git+https://github.com/InternLM/xtuner.git@main#egg=xtuner[deepspeed]'
+```

 ## Inference

@@ -128,7 +131,7 @@ print(rewards)

 #### Requirements

-lmdeploy >= 0.9.1
+- lmdeploy >= 0.9.1

 #### Server Launch

@@ -158,7 +161,7 @@ print(rewards)

 #### Requirements

-0.4.3.post4 <= sglang <= 0.4.4.post1
+- 0.4.3.post4 <= sglang <= 0.4.4.post1

 #### Server Launch

@@ -189,7 +192,7 @@ print(rewards)

 #### Requirements

-vllm >= 0.8.0
+- vllm >= 0.8.0

 #### Server Launch

@@ -220,8 +223,8 @@ print(rewards)

 ### Requirements

-flash_attn
-tensorboard
+- flash_attn
+- tensorboard

 ### Data format

@@ -238,31 +241,31 @@ Unlike traditional reward models, POLAR requires an additional reference traject

 ### Training steps

-**Step 0:** Prepare the config. We provide examplar ready-to-use configs [here](https://github.com/InternLM/POLAR/blob/main/examples/xtuner_configs/POLAR_7B_full_varlenattn_custom_dataset.py). If the provided configs cannot meet the requirements, please copy the provided config and do modification following the [xtuner guideline](https://github.com/InternLM/xtuner/blob/main/docs/en/get_started/quickstart.md). For more details of reward model training settings, please see the xtuner [reward model guideline](https://github.com/InternLM/xtuner/blob/main/docs/en/reward_model/modify_settings.md).
+- **Step 0:** Prepare the config. We provide examplar ready-to-use configs [here](https://github.com/InternLM/POLAR/blob/main/examples/xtuner_configs/POLAR_7B_full_varlenattn_custom_dataset.py). If the provided configs cannot meet the requirements, please copy the provided config and do modification following the [xtuner guideline](https://github.com/InternLM/xtuner/blob/main/docs/en/get_started/quickstart.md). For more details of reward model training settings, please see the xtuner [reward model guideline](https://github.com/InternLM/xtuner/blob/main/docs/en/reward_model/modify_settings.md).

-**Step 1:** Start fine-tuning.
+- **Step 1:** Start fine-tuning.

 ```shell
 xtuner train ${CONFIG_FILE_PATH}
 ```

-For example, you can start the fine-tuning of POLAR-7B-Base by
+For example, you can start the fine-tuning of POLAR-7B-Base by

-```shell
-# On a single GPU
-xtuner train /path/to/POLAR_7B_full_varlenattn_custom_dataset.py --deepspeed deepspeed_zero2
+```shell
+# On a single GPU
+xtuner train /path/to/POLAR_7B_full_varlenattn_custom_dataset.py --deepspeed deepspeed_zero2

-# On multiple GPUs
-NPROC_PER_NODE=${GPU_NUM} xtuner train /path/to/POLAR_7B_full_varlenattn_custom_dataset.py --deepspeed deepspeed_zero2
-```
+# On multiple GPUs
+NPROC_PER_NODE=${GPU_NUM} xtuner train /path/to/POLAR_7B_full_varlenattn_custom_dataset.py --deepspeed deepspeed_zero2
+```

-Here, `--deepspeed` means using [DeepSpeed](https://github.com/microsoft/DeepSpeed) to optimize the training. Xtuner comes with several integrated strategies including ZeRO-1, ZeRO-2, and ZeRO-3. If you wish to disable this feature, simply remove this argument.
+Here, `--deepspeed` means using [DeepSpeed](https://github.com/microsoft/DeepSpeed) to optimize the training. Xtuner comes with several integrated strategies including ZeRO-1, ZeRO-2, and ZeRO-3. If you wish to disable this feature, simply remove this argument.

-**Step 2:** Convert the saved PTH model (if using DeepSpeed, it will be a directory) to Hugging Face model, by
+- **Step 2:** Convert the saved PTH model (if using DeepSpeed, it will be a directory) to Hugging Face model, by

-```shell
-xtuner convert pth_to_hf ${CONFIG_FILE_PATH} ${PTH} ${SAVE_PATH}
-```
+```shell
+xtuner convert pth_to_hf ${CONFIG_FILE_PATH} ${PTH} ${SAVE_PATH}
+```

 # Examples

@@ -297,7 +300,9 @@ rewards = client(data)
 sorted_res = sorted(zip(outputs, rewards), key=lambda x: x[1], reverse=True)

 for output, reward in sorted_res:
-    print(f"Output: {output}\nReward: {reward}\n")
+    print(f"Output: {output}
+Reward: {reward}
+")
 ```

 ```txt
@@ -351,7 +356,9 @@ rewards = client(data)
 sorted_res = sorted(zip(outputs, rewards), key=lambda x: x[1], reverse=True)

 for output, reward in sorted_res:
-    print(f"Output: {output}\nReward: {reward}\n")
+    print(f"Output: {output}
+Reward: {reward}
+")
 ```

 ```txt