Update README.md
README.md CHANGED
@@ -28,7 +28,27 @@ In order to fine-tune [`mistralai/Mistral-7B-v0.1`](https://huggingface.co/mistr
`orpo` from [🤗`trl`](https://github.com/huggingface/trl) has been used, thanks to the invaluable and quick contribution of @kashif.

ORPO stands for Odds Ratio Preference Optimization, and it defines a new paradigm for fine-tuning LLMs, “combining” both the SFT and the PPO/DPO stages into a single one, thanks to the proposed loss function that starts directly from a preference dataset, i.e. chosen-rejected pairs.
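
Roughly speaking (the notation below is my own simplified sketch, see the paper for the exact formulation), the chosen response is learned with the usual SFT loss, while an additional odds-ratio term increases the likelihood of the chosen response relative to the rejected one:

$$
\mathcal{L}_{\mathrm{ORPO}} = \mathbb{E}_{(x,\,y_w,\,y_l)}\big[\mathcal{L}_{\mathrm{SFT}} + \lambda \cdot \mathcal{L}_{\mathrm{OR}}\big],
\qquad
\mathcal{L}_{\mathrm{OR}} = -\log \sigma\!\left(\log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)}\right)
$$

where $\mathrm{odds}_\theta(y \mid x) = P_\theta(y \mid x) / \big(1 - P_\theta(y \mid x)\big)$, with $y_w$ the chosen and $y_l$ the rejected completion.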

Some key features about ORPO:
- ⚡️ Faster to train, as it is now a single-stage fine-tuning
- 👨🏻‍🏫 Requires preference data, i.e. (prompt, chosen, rejected)-like datasets (see the example right after this list)
- ⬇️ Less memory than PPO/DPO, as it doesn't need a reference model
- 🏆 SOTA results for Phi-2 (2.7B), Llama-2 (7B), and Mistral (7B) when fine-tuned using single-turn UltraFeedback
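
For reference, a row of such a preference dataset could look roughly like the following (a made-up example, not taken from the dataset described below):

```python
# Made-up example of a single preference row: one prompt plus a
# preferred (chosen) and a dispreferred (rejected) completion.
preference_row = {
    "prompt": "Explain what a preference dataset is.",
    "chosen": "A preference dataset pairs each prompt with a preferred and a dispreferred completion...",
    "rejected": "I don't know.",
}
```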

Some notes on the experiments mentioned in the paper:
- 📌 Up to 7B parameter LLMs were fine-tuned, achieving better performance compared to other 7B counterparts and even to 13B LLMs
- 📌 Not yet trained with multi-turn datasets such as Capybara (may be an interesting experiment to run)
- 📌 The OPT models were fine-tuned with HH-RLHF from Anthropic, truncated and padded to 1024 tokens, filtering out the prompts with > 1024 tokens
- 📌 Phi-2, Mistral (7B), and Llama 2 (7B) were fine-tuned with UltraFeedback from OpenBMB, truncated and padded to 2048 tokens, filtering out the prompts with > 1024 tokens
- 📌 Fine-tuned for 10 epochs, using the evaluation loss as the metric for selecting the best models

For more information about ORPO, I highly recommend reading their paper, titled [`ORPO: Monolithic Preference Optimization without Reference Model`](https://huggingface.co/papers/2403.07691), as it contains a lot of information and details not only on the ORPO method itself, but also on the experiments they ran, the results they got, and much more.

📅 Fine-tuning code will be shared soon, stay tuned!
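
In the meantime, here is a rough sketch of what such a run could look like with `ORPOTrainer` from 🤗`trl` (illustrative only: the dataset name and hyper-parameters below are placeholders/assumptions, not the actual setup, and argument names may vary across `trl` versions):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model_id = "mistralai/Mistral-7B-v0.1"

# Load the base model and its tokenizer
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# Any preference dataset with `prompt`, `chosen`, and `rejected` columns
# ("preference-dataset" is a placeholder, not the dataset described below)
dataset = load_dataset("preference-dataset", split="train")

# ORPOConfig extends the usual TrainingArguments with ORPO-specific knobs
args = ORPOConfig(
    output_dir="./mistral-7b-orpo",
    beta=0.1,                       # weight of the odds-ratio term (lambda in the paper)
    max_length=2048,
    max_prompt_length=1024,
    per_device_train_batch_size=2,  # illustrative values, tune for your hardware
    gradient_accumulation_steps=8,
    logging_steps=10,
)

trainer = ORPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
```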

## About the dataset