Add pipeline tag and library name to model card
This PR improves the model card by adding `pipeline_tag: text-ranking`, which enhances discoverability on the Hugging Face Hub and ensures users can find your model at https://huggingface.co/models?pipeline_tag=text-ranking. It also adds `library_name: transformers` to correctly associate the model with the Transformers library, enabling the "how to use" code snippets on the model page.
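For reference, the two new fields sit at the end of the card's YAML front matter, roughly like this (abridged excerpt of the header shown in the diff below):

```yaml
tags:
- Reward
- RL
- RFT
- Reward Model
pipeline_tag: text-ranking   # surfaces the model under the text-ranking filter on the Hub
library_name: transformers   # enables the Transformers "how to use" snippet on the model page
```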
README.md
CHANGED
@@ -1,16 +1,19 @@
---
+ base_model:
+ - internlm/internlm2_5-7b
language:
- en
- zh
+ license: apache-2.0
tags:
- Reward
- RL
- RFT
- Reward Model
+ pipeline_tag: text-ranking
+ library_name: transformers
---
+
<div align="center">

<img src="./misc/logo.png" width="400"/><br>

@@ -35,13 +38,13 @@

POLAR represents a significant breakthrough in scalar-based reward models achieved through large-scale pre-training. It leverages the innovative **POL**icy Discrimin**A**tive Lea**R**ning (**POLAR**) paradigm (a scalable, high-level optimization objective) to effectively discriminate between policies using large-scale synthetic corpora. Following pre-training, POLAR RMs are fine-tuned with minimal preference data, rapidly aligning with human preferences. Key features of POLAR include:

+ * **Innovative Pre-training Paradigm:** POLAR trains a reward model to discern identical policies and discriminate different ones. Unlike traditional reward modeling methods relying on absolute preferences, POLAR captures the relative difference between two policies, which is a scalable, high-level optimization objective suitable for modeling generic ranking relationships.

+ * **Tailored for Reinforcement Fine-tuning:** POLAR assigns rewards to LLM trajectories based on given references, perfectly aligning with the Reinforcement Fine-tuning (RFT) framework. POLAR provides a promising solution for applying RFT in generic scenarios.

+ * **Superior Performance and Generalization:** POLAR achieves state-of-the-art results on downstream reinforcement learning tasks, consistently delivering accurate and reliable reward signals that generalize effectively to unseen scenarios and significantly reduce reward hacking.

+ * **Easy to Customize:** Pre-trained checkpoints of POLAR are available, enabling researchers to conveniently fine-tune the RM for various customized scenarios, thus facilitating straightforward adaptation and expansion tailored to specific applications and experimental requirements.

<img src="./misc/intro.jpeg"/><br>

@@ -60,18 +63,18 @@

You can employ the latest [xtuner](https://github.com/InternLM/xtuner) to fine-tune and use POLAR. Xtuner is an efficient, flexible, and full-featured toolkit for fine-tuning LLMs.

+ - It is recommended to build a Python 3.10 virtual environment using conda:

+ ```bash
+ conda create --name xtuner-env python=3.10 -y
+ conda activate xtuner-env
+ ```

+ - Install xtuner via pip:

+ ```shell
+ pip install 'git+https://github.com/InternLM/xtuner.git@main#egg=xtuner[deepspeed]'
+ ```

## Inference

@@ -128,7 +131,7 @@

#### Requirements

+ - lmdeploy >= 0.9.1

#### Server Launch

@@ -158,7 +161,7 @@

#### Requirements

+ - 0.4.3.post4 <= sglang <= 0.4.4.post1

#### Server Launch

@@ -189,7 +192,7 @@

#### Requirements

+ - vllm >= 0.8.0

#### Server Launch

@@ -220,8 +223,8 @@

### Requirements

+ - flash_attn
+ - tensorboard

### Data format

@@ -238,31 +241,31 @@

### Training steps

+ - **Step 0:** Prepare the config. We provide example ready-to-use configs [here](https://github.com/InternLM/POLAR/blob/main/examples/xtuner_configs/POLAR_7B_full_varlenattn_custom_dataset.py). If the provided configs do not meet your requirements, copy one of them and modify it following the [xtuner guideline](https://github.com/InternLM/xtuner/blob/main/docs/en/get_started/quickstart.md). For more details on reward model training settings, please see the xtuner [reward model guideline](https://github.com/InternLM/xtuner/blob/main/docs/en/reward_model/modify_settings.md).

+ - **Step 1:** Start fine-tuning.

```shell
xtuner train ${CONFIG_FILE_PATH}
```

+ For example, you can start fine-tuning POLAR-7B-Base with:

+ ```shell
+ # On a single GPU
+ xtuner train /path/to/POLAR_7B_full_varlenattn_custom_dataset.py --deepspeed deepspeed_zero2
+
+ # On multiple GPUs
+ NPROC_PER_NODE=${GPU_NUM} xtuner train /path/to/POLAR_7B_full_varlenattn_custom_dataset.py --deepspeed deepspeed_zero2
+ ```

+ Here, `--deepspeed` means using [DeepSpeed](https://github.com/microsoft/DeepSpeed) to optimize the training. Xtuner comes with several integrated strategies, including ZeRO-1, ZeRO-2, and ZeRO-3. If you wish to disable this feature, simply remove this argument.

+ - **Step 2:** Convert the saved PTH model (if using DeepSpeed, it will be a directory) to a Hugging Face model:

+ ```shell
+ xtuner convert pth_to_hf ${CONFIG_FILE_PATH} ${PTH} ${SAVE_PATH}
+ ```

# Examples

@@ -297,7 +300,9 @@

sorted_res = sorted(zip(outputs, rewards), key=lambda x: x[1], reverse=True)

for output, reward in sorted_res:
+     print(f"Output: {output}\nReward: {reward}\n")
```

```txt

@@ -351,7 +356,9 @@

sorted_res = sorted(zip(outputs, rewards), key=lambda x: x[1], reverse=True)

for output, reward in sorted_res:
+     print(f"Output: {output}\nReward: {reward}\n")
```

```txt