---
base_model:
- internlm/internlm2_5-7b
language:
- en
- zh
license: apache-2.0
tags:
- Reward
- RL
- RFT
- Reward Model
pipeline_tag: text-classification
library_name: transformers
---
<div align="center">
<img src="./misc/logo.png" width="400"/><br>
[💻 Github](https://github.com/InternLM/POLAR) |
[📜 Paper](https://arxiv.org/abs/2507.05197)<br>
[English](./README.md) |
[简体中文](./README_zh-CN.md)
</div>
# Introduction
POLAR represents a significant advance in scalar-based reward models, achieved through large-scale pre-training. It leverages the novel **POL**icy Discrimin**A**tive Lea**R**ning (**POLAR**) paradigm, a scalable, high-level optimization objective, to effectively discriminate between policies using a large-scale synthetic corpus. Following pre-training, POLAR RMs are fine-tuned with minimal preference data and rapidly align with human preferences. Key features of POLAR include:
* **Innovative Pre-training Paradigm:** POLAR trains a reward model to recognize identical policies and discriminate between different ones. Unlike traditional reward modeling methods that rely on absolute preferences, POLAR captures the relative difference between two policies, a scalable, high-level optimization objective suited to modeling generic ranking relationships.
* **Tailored for Reinforcement Fine-tuning:** POLAR assigns rewards to LLM trajectories based on given references, perfectly aligning with the Reinforcement Fine-tuning (RFT) framework. POLAR provides a promising solution for applying RFT in generic scenarios.
* **Superior Performance and Generalization:** POLAR achieves state-of-the-art results on downstream reinforcement learning tasks, consistently delivering accurate and reliable reward signals that generalize effectively to unseen scenarios and significantly reduce reward hacking.
* **Easy to Customize:** Pre-trained checkpoints of POLAR are available, enabling researchers to conveniently fine-tune the RM for various customized scenarios, thus facilitating straightforward adaptation and expansion tailored to specific applications and experimental requirements.
<img src="./misc/intro.jpeg"/><br>
# POLAR-7B-Base
**POLAR-7B-Base** is the pre-trained-only checkpoint, ideal for customized fine-tuning according to specific preferences. The ready-to-use checkpoint **POLAR-7B** has already been fine-tuned on general preference data, making it suitable for immediate use in most scenarios.
We conducted a comprehensive evaluation of POLAR-7B via the Proximal Policy Optimization (PPO) algorithm, assessing the downstream RL performance of four different policy models with [OpenCompass](https://github.com/internLM/OpenCompass/). More details are available in our [Paper](https://arxiv.org/abs/2507.05197).
<img src="./misc/result.png"/><br>
# Quick Start
## Installation
You can use the latest [xtuner](https://github.com/InternLM/xtuner) to fine-tune and run POLAR. Xtuner is an efficient, flexible, and full-featured toolkit for fine-tuning LLMs.
- It is recommended to build a Python-3.10 virtual environment using conda
```bash
conda create --name xtuner-env python=3.10 -y
conda activate xtuner-env
```
- Install xtuner via pip
```shell
pip install 'xtuner[deepspeed]'==0.2.0
```
- Alternatively, install xtuner from the latest source code
```shell
pip install 'git+https://github.com/InternLM/xtuner.git@main#egg=xtuner[deepspeed]'
```
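After installation, a quick import check can confirm that xtuner is visible in the active environment. This is only a minimal sanity-check sketch; it assumes the installed package exposes a `__version__` attribute, which recent xtuner releases do.
```python
# Minimal sanity check (assumes xtuner exposes __version__).
import xtuner

print(xtuner.__version__)
```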
## Inference
We support reward inference through [lmdeploy](https://github.com/InternLM/lmdeploy/), [sglang](https://github.com/sgl-project/sglang/), and [vllm](https://github.com/vllm-project/vllm/). We recommend setting up a virtual environment with conda when using these inference engines to prevent potential dependency conflicts.
### Data format
Unlike traditional reward models, POLAR requires an additional reference trajectory as a demonstration and evaluates candidate trajectories by measuring their consistency with the provided reference.
```python
data = [
    {
        "prompt": [{"role": "user", "content": "What is the capital of China?"}],
        "reference": [{"role": "assistant", "content": "Beijing."}],
        "output": [{"role": "assistant", "content": "Beijing."}]
    },
    {
        "prompt": [{"role": "user", "content": "What is the capital of China?"}],
        "reference": [{"role": "assistant", "content": "Beijing."}],
        "output": [{"role": "assistant", "content": "Shanghai."}]
    }
]
```
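For larger batches, a list like this can be assembled programmatically. The snippet below is a small illustrative sketch (the variable names are ours, not part of the xtuner API) that expands parallel lists of prompts, references, and candidate outputs into the format shown above.
```python
# Illustrative only: build the POLAR input format from parallel lists.
prompts = ["What is the capital of China?"]
references = ["Beijing."]
candidates = [["Beijing.", "Shanghai."]]  # candidate outputs per prompt

data = [
    {
        "prompt": [{"role": "user", "content": p}],
        "reference": [{"role": "assistant", "content": r}],
        "output": [{"role": "assistant", "content": o}],
    }
    for p, r, outs in zip(prompts, references, candidates)
    for o in outs
]
```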
### Inference with transformers
#### Reward request
To load the POLAR model using transformers, use the following code to get rewards:
```python
from transformers import AutoModel, AutoTokenizer
from xtuner.utils import RewardModelClient

model_name = 'internlm/POLAR-7B'

model = AutoModel.from_pretrained(
    model_name,
    device_map="cuda",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Encode the prompt/reference/output triplets (the `data` list defined in the
# Data format section above) into the model's input format.
client = RewardModelClient(model_name)
encoded_data = client.encode(data)

# One scalar reward per input is returned.
batch = tokenizer(encoded_data, return_tensors='pt', padding=True).to('cuda')
outputs = model(**batch)
rewards = outputs[0].squeeze(-1).cpu().tolist()
print(rewards)
```
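If GPU memory is limited, the 7B model can optionally be loaded in half precision. This is our assumption rather than guidance from the original card; the sketch simply passes `torch_dtype=torch.bfloat16` to `from_pretrained`.
```python
# Optional (assumption, not from the original card): load POLAR in bfloat16
# to roughly halve GPU memory for the 7B checkpoint.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "internlm/POLAR-7B",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    trust_remote_code=True,
)
```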
### Inference with lmdeploy
[LMDeploy](https://github.com/InternLM/lmdeploy) is a toolkit for compressing, deploying, and serving LLMs.
#### Requirements
- lmdeploy >= 0.9.1
#### Server Launch
```bash
lmdeploy serve api_server internlm/POLAR-7B --backend pytorch --server-port 30000
```
#### Client Request
```python
from xtuner.utils import RewardModelClient
client = RewardModelClient("internlm/POLAR-7B",
                           server_type="lmdeploy",
                           server_address="127.0.0.1:30000")
# Request rewards directly
rewards = client(data)
print(rewards)
# First encode data and then get rewards via the request function.
encoded_data = client.encode(data)
rewards = client.lmdeploy_request_reward(encoded_data)
print(rewards)
```
### Inference with sglang
#### Requirements
- 0.4.3.post4 <= sglang <= 0.4.4.post1
#### Server Launch
```bash
python3 -m sglang.launch_server --model internlm/POLAR-7B --trust-remote-code --is-embedding --dp 4 --tp 2 --mem-fraction-static 0.9 --port 30000
```
#### Client Request
```python
from xtuner.utils import RewardModelClient
client = RewardModelClient("internlm/POLAR-7B",
                           server_type="sglang",
                           server_address="127.0.0.1:30000")
# Request rewards directly
rewards = client(data)
print(rewards)
# First encode data and then get rewards via the request function.
encoded_data = client.encode(data)
rewards = client.sglang_request_reward(encoded_data)
print(rewards)
```
### Inference with vllm
#### Requirements
- vllm >= 0.8.0
#### Server Launch
```bash
vllm serve internlm/POLAR-7B --task=reward --trust-remote-code --tensor-parallel-size=2 --port 30000
```
#### Client Request
```python
from xtuner.utils import RewardModelClient
client = RewardModelClient("internlm/POLAR-7B",
                           server_type="vllm",
                           server_address="127.0.0.1:30000")
# Request rewards directly
rewards = client(data)
print(rewards)
# First encode data and then get rewards via the request function.
encoded_data = client.encode(data)
rewards = client.vllm_request_reward(encoded_data)
print(rewards)
```
## Fine-tune
### Requirements
- flash_attn
- tensorboard
### Data format
Unlike traditional reward models, POLAR requires an additional reference trajectory as a demonstration during fine-tuning, along with a chosen trajectory and a rejected trajectory. You can construct your fine-tuning data in a `train.jsonl` file, formatted as follows:
```json
{
    "prompt": [{"role": "user", "content": "What is the capital of China?"}],
    "reference": [{"role": "assistant", "content": "Beijing."}],
    "chosen": [{"role": "assistant", "content": "Beijing."}],
    "rejected": [{"role": "assistant", "content": "Shanghai."}]
}
```
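If your preference data lives in Python objects, a short script can serialize it into `train.jsonl`. The helper below is a hypothetical sketch (not part of xtuner) that writes one JSON object per line in the format shown above.
```python
# Hypothetical helper: write preference samples to train.jsonl, one JSON object per line.
import json

samples = [
    {
        "prompt": [{"role": "user", "content": "What is the capital of China?"}],
        "reference": [{"role": "assistant", "content": "Beijing."}],
        "chosen": [{"role": "assistant", "content": "Beijing."}],
        "rejected": [{"role": "assistant", "content": "Shanghai."}],
    },
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```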
### Training steps
- **Step 0:** Prepare the config. We provide example ready-to-use configs [here](https://github.com/InternLM/POLAR/blob/main/examples/xtuner_configs/POLAR_7B_full_varlenattn_custom_dataset.py). If the provided configs do not meet your requirements, copy one of them and modify it following the [xtuner guideline](https://github.com/InternLM/xtuner/blob/main/docs/en/get_started/quickstart.md). For more details on reward model training settings, please see the xtuner [reward model guideline](https://github.com/InternLM/xtuner/blob/main/docs/en/reward_model/modify_settings.md).
- **Step 1:** Start fine-tuning.
```shell
xtuner train ${CONFIG_FILE_PATH}
```
For example, you can start fine-tuning POLAR-7B-Base with:
```shell
# On a single GPU
xtuner train /path/to/POLAR_7B_full_varlenattn_custom_dataset.py --deepspeed deepspeed_zero2
# On multiple GPUs
NPROC_PER_NODE=${GPU_NUM} xtuner train /path/to/POLAR_7B_full_varlenattn_custom_dataset.py --deepspeed deepspeed_zero2
```
Here, `--deepspeed` enables [DeepSpeed](https://github.com/microsoft/DeepSpeed) to optimize training. Xtuner integrates several ZeRO strategies, including ZeRO-1, ZeRO-2, and ZeRO-3. To disable this feature, simply remove the argument.
- **Step 2:** Convert the saved PTH model (a directory if DeepSpeed was used) to a Hugging Face model:
```shell
xtuner convert pth_to_hf ${CONFIG_FILE_PATH} ${PTH} ${SAVE_PATH}
```
# Examples
## Closed-ended questions
```python
from xtuner.utils import RewardModelClient
prompt = "How many 'r's are there in the word 'strawberry'?"
reference = "There are 3 'r's in the word 'strawberry'. Here's how we can count them: 's', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y'. So, the answer is 3."
outputs = [
    # Same as the reference response.
    "There are 3 'r's in the word 'strawberry'. Here's how we can count them: 's', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y'. So, the answer is 3.",
    # Correct answer with correct thoughts.
    "Let's count the 'r's in 'strawberry': 's', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y'. There are three 'r's, so the answer is three.",
    # Wrong answer with wrong thoughts.
    "Let's count the 'r's in 'strawberry': 's', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y'. There are two 'r's, so the answer is two.",
    # Wrong answer with correct thoughts.
    "Let's count the 'r's in 'strawberry': 's', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y'. There are three 'r's, so the answer is two.",
    # Correct answer with wrong thoughts.
    "Let's count the 'r's in 'strawberry': 's', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y'. There are two 'r's, so the answer is three.",
    # Correct answer without thoughts.
    "There are 3 'r's in the word 'strawberry'.",
    # Wrong answer without thoughts.
    "There are 2 'r's in the word 'strawberry'.",
]
data = [{"prompt": prompt, "reference": reference, "output": output} for output in outputs]
client = RewardModelClient("internlm/POLAR-7B", server_type="sglang", server_address="127.0.0.1:30000")
rewards = client(data)
sorted_res = sorted(zip(outputs, rewards), key=lambda x: x[1], reverse=True)
for output, reward in sorted_res:
    print(f"Output: {output}\nReward: {reward}\n")
```
```txt
Output: There are 3 'r's in the word 'strawberry'. Here's how we can count them: 's', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y'. So, the answer is 3.
Reward: 0.054595947265625
Output: Let's count the 'r's in 'strawberry': 's', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y'. There are three 'r's, so the answer is three.
Reward: -2.005859375
Output: There are 3 'r's in the word 'strawberry'.
Reward: -6.70703125
Output: Let's count the 'r's in 'strawberry': 's', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y'. There are two 'r's, so the answer is three.
Reward: -7.10546875
Output: Let's count the 'r's in 'strawberry': 's', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y'. There are three 'r's, so the answer is two.
Reward: -7.1328125
Output: Let's count the 'r's in 'strawberry': 's', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y'. There are two 'r's, so the answer is two.
Reward: -8.46875
Output: There are 2 'r's in the word 'strawberry'.
Reward: -10.8203125
```
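Since POLAR scores each candidate by its consistency with the reference, the rewards above can be used directly for best-of-n selection. Below is a minimal sketch that reuses the `outputs` and `rewards` variables from the example.
```python
# Best-of-n selection: keep the candidate with the highest POLAR reward.
best_output, best_reward = max(zip(outputs, rewards), key=lambda pair: pair[1])
print(f"Selected output: {best_output}\nReward: {best_reward}")
```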
## Open-ended questions
```python
from xtuner.utils import RewardModelClient
prompt = "Summarize the first book of Frank Herbert’s Dune in one witty short sentence."
reference = "Royal teen discovers that life’s a beach—minus the ocean, plus spice, giant sandworms and deadly politics."
outputs = [
    # Same as the reference response.
    "Royal teen discovers that life’s a beach—minus the ocean, plus spice, giant sandworms and deadly politics.",
    # Closely resembles the reference response but includes factual errors.
    "Royal teen discovers that life’s a beach—minus the ocean, plus magic, dark wizards and deadly politics.",
    # A distinct yet concise and witty summary that draws analogies from other dramas—markedly different from the reference response.
    "Young noble’s move to desert planet turns into galactic Game of Thrones with fewer dragons, more worms.",
    # A concise summary, but lacking wit—fails to meet the requirement.
    "A noble family’s fall sparks a young heir’s rise as a leader on a harsh desert planet governed by prophecy and survival.",
    # A witty summary, but overly long—fails to meet the requirement.
    "Paul Atreides loses his father, gains prophetic powers, learns to ride a sandworm, leads a holy war, and discovers that being the chosen one comes with a lot of blood, sand, and questionable decisions.",
    # A concise and witty summary that draws from multiple Dune books rather than just the first—fails to follow the instruction.
    "Boy gets planet, becomes god, loses soul — family drama ensues across galaxies."
]
data = [{"prompt": prompt, "reference": reference, "output": output} for output in outputs]
client = RewardModelClient("internlm/POLAR-7B", server_type="sglang", server_address="127.0.0.1:30000")
rewards = client(data)
sorted_res = sorted(zip(outputs, rewards), key=lambda x: x[1], reverse=True)
for output, reward in sorted_res:
    print(f"Output: {output}\nReward: {reward}\n")
```
```txt
Output: Royal teen discovers that life’s a beach—minus the ocean, plus spice, giant sandworms and deadly politics.
Reward: 0.466552734375
Output: Young noble’s move to desert planet turns into galactic Game of Thrones with fewer dragons, more worms.
Reward: -6.91796875
Output: Royal teen discovers that life’s a beach—minus the ocean, plus magic, dark wizards and deadly politics.
Reward: -7.70703125
Output: Paul Atreides loses his father, gains prophetic powers, learns to ride a sandworm, leads a holy war, and discovers that being the chosen one comes with a lot of blood, sand, and questionable decisions.
Reward: -8.4296875
Output: A noble family’s fall sparks a young heir’s rise as a leader on a harsh desert planet governed by prophecy and survival.
Reward: -8.6484375
Output: Boy gets planet, becomes god, loses soul — family drama ensues across galaxies.
Reward: -10.359375
```
# License
Code and model weights are licensed under Apache-2.0.
# Citation
```bibtex
@article{dou2025pretrained,
  title={Pre-Trained Policy Discriminators are General Reward Models},
  author={Dou, Shihan and Liu, Shichun and Yang, Yuming and Zou, Yicheng and Zhou, Yunhua and Xing, Shuhao and Huang, Chenhao and Ge, Qiming and Song, Demin and Lv, Haijun and others},
  journal={arXiv preprint arXiv:2507.05197},
  year={2025}
}
```