Add pipeline tag and library name to model card
This PR improves the model card by adding `pipeline_tag: text-ranking`, which enhances discoverability on the Hugging Face Hub and ensures users can find your model at https://huggingface.co/models?pipeline_tag=text-ranking. It also adds `library_name: transformers` to correctly associate the model with the Transformers library, enabling the "how to use" code snippets on the model page.
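For reference, the two new fields sit at the end of the card's YAML front matter, roughly like this (abridged excerpt of the header shown in the diff below):

```yaml
tags:
- Reward
- RL
- RFT
- Reward Model
pipeline_tag: text-ranking   # surfaces the model under the text-ranking filter on the Hub
library_name: transformers   # enables the Transformers "how to use" snippet on the model page
```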
README.md
CHANGED
@@ -1,16 +1,19 @@
---
+ base_model:
+ - internlm/internlm2_5-7b
language:
- en
- zh
+ license: apache-2.0
tags:
- Reward
- RL
- RFT
- Reward Model
+ pipeline_tag: text-ranking
+ library_name: transformers
---
+
<div align="center">

<img src="./misc/logo.png" width="400"/><br>

@@ -35,13 +38,13 @@

POLAR represents a significant breakthrough in scalar-based reward models achieved through large-scale pre-training. It leverages the innovative **POL**icy Discrimin**A**tive Lea**R**ning (**POLAR**) paradigm (a scalable, high-level optimization objective) to effectively discriminate between policies using large-scale synthetic corpora. Following pre-training, POLAR RMs are fine-tuned with minimal preference data, rapidly aligning with human preferences. Key features of POLAR include:

+ * **Innovative Pre-training Paradigm:** POLAR trains a reward model to discern identical policies and discriminate different ones. Unlike traditional reward modeling methods relying on absolute preferences, POLAR captures the relative difference between two policies, which is a scalable, high-level optimization objective suitable for modeling generic ranking relationships.

+ * **Tailored for Reinforcement Fine-tuning:** POLAR assigns rewards to LLM trajectories based on given references, perfectly aligning with the Reinforcement Fine-tuning (RFT) framework. POLAR provides a promising solution for applying RFT in generic scenarios.

+ * **Superior Performance and Generalization:** POLAR achieves state-of-the-art results on downstream reinforcement learning tasks, consistently delivering accurate and reliable reward signals that generalize effectively to unseen scenarios and significantly reduce reward hacking.

+ * **Easy to Customize:** Pre-trained checkpoints of POLAR are available, enabling researchers to conveniently fine-tune the RM for various customized scenarios, thus facilitating straightforward adaptation and expansion tailored to specific applications and experimental requirements.

<img src="./misc/intro.jpeg"/><br>

@@ -60,18 +63,18 @@

You can employ the latest [xtuner](https://github.com/InternLM/xtuner) to fine-tune and use POLAR. Xtuner is an efficient, flexible, and full-featured toolkit for fine-tuning LLMs.

+ - It is recommended to build a Python 3.10 virtual environment using conda:

+ ```bash
+ conda create --name xtuner-env python=3.10 -y
+ conda activate xtuner-env
+ ```

+ - Install xtuner via pip:

+ ```shell
+ pip install 'git+https://github.com/InternLM/xtuner.git@main#egg=xtuner[deepspeed]'
+ ```

## Inference

@@ -128,7 +131,7 @@

#### Requirements

+ - lmdeploy >= 0.9.1

#### Server Launch

@@ -158,7 +161,7 @@

#### Requirements

+ - 0.4.3.post4 <= sglang <= 0.4.4.post1

#### Server Launch

@@ -189,7 +192,7 @@

#### Requirements

+ - vllm >= 0.8.0

#### Server Launch

@@ -220,8 +223,8 @@

### Requirements

+ - flash_attn
+ - tensorboard

### Data format

@@ -238,31 +241,31 @@

### Training steps

+ - **Step 0:** Prepare the config. We provide example ready-to-use configs [here](https://github.com/InternLM/POLAR/blob/main/examples/xtuner_configs/POLAR_7B_full_varlenattn_custom_dataset.py). If the provided configs do not meet your requirements, copy one of them and modify it following the [xtuner guideline](https://github.com/InternLM/xtuner/blob/main/docs/en/get_started/quickstart.md). For more details on reward model training settings, please see the xtuner [reward model guideline](https://github.com/InternLM/xtuner/blob/main/docs/en/reward_model/modify_settings.md).

+ - **Step 1:** Start fine-tuning.

```shell
xtuner train ${CONFIG_FILE_PATH}
```

+ For example, you can start fine-tuning POLAR-7B-Base with:

+ ```shell
+ # On a single GPU
+ xtuner train /path/to/POLAR_7B_full_varlenattn_custom_dataset.py --deepspeed deepspeed_zero2
+
+ # On multiple GPUs
+ NPROC_PER_NODE=${GPU_NUM} xtuner train /path/to/POLAR_7B_full_varlenattn_custom_dataset.py --deepspeed deepspeed_zero2
+ ```

+ Here, `--deepspeed` means using [DeepSpeed](https://github.com/microsoft/DeepSpeed) to optimize the training. Xtuner comes with several integrated strategies, including ZeRO-1, ZeRO-2, and ZeRO-3. If you wish to disable this feature, simply remove this argument.

+ - **Step 2:** Convert the saved PTH model (if using DeepSpeed, it will be a directory) to a Hugging Face model:

+ ```shell
+ xtuner convert pth_to_hf ${CONFIG_FILE_PATH} ${PTH} ${SAVE_PATH}
+ ```

# Examples

@@ -297,7 +300,9 @@

sorted_res = sorted(zip(outputs, rewards), key=lambda x: x[1], reverse=True)

for output, reward in sorted_res:
+     print(f"Output: {output}\nReward: {reward}\n")
```

```txt

@@ -351,7 +356,9 @@

sorted_res = sorted(zip(outputs, rewards), key=lambda x: x[1], reverse=True)

for output, reward in sorted_res:
+     print(f"Output: {output}\nReward: {reward}\n")
```

```txt