|
--- |
|
license: apache-2.0 |
|
base_model: meta-llama/Llama-3.2-1B-Instruct |
|
tags: |
|
- dpo |
|
- lora |
|
- peft |
|
- llama-3.2 |
|
- iterative-dpo |
|
- self-rewarding |
|
library_name: peft |
|
--- |
|
|
|
# Iterative DPO Fine-Tune of Llama-3.2-1B (Iteration 2) |
|
|
|
This repository contains the LoRA adapters from the **second and final iteration** of a Direct Preference Optimization (DPO) fine-tuning process on the `meta-llama/Llama-3.2-1B-Instruct` model. |
|
|
|
This model is a further refinement of the Iteration 1 model and demonstrates a self-improvement loop in which the model learns from preferences over its own generated outputs. The approach was inspired by the "Self-Rewarding Language Models" paper (Yuan et al., 2024).
|
|
|
- **Repository for Iteration 1:** [NilayR/llama32-iterative-dpo-iter1](https://huggingface.co/NilayR/llama32-iterative-dpo-iter1) |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
This model is the result of the second fine-tuning cycle in an iterative DPO pipeline. The process began with the model from Iteration 1 generating a new set of responses. These responses were then evaluated by an LLM Judge (GPT-3.5-Turbo) to create a fresh preference dataset. This new dataset was used to further fine-tune the model, resulting in the adapters contained in this repository. |
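
Schematically, one cycle of this loop can be sketched as below. This is an illustrative outline only: the helper callables (`generate_responses`, `judge_pair`, `run_dpo`) are hypothetical placeholders for the generation, judging, and DPO-training steps, not the actual scripts used to train this model.

```python
# Illustrative outline of one self-improvement cycle (not the actual training code).
# The three callables are hypothetical placeholders supplied by the caller.
def iteration_step(current_model, instructions, generate_responses, judge_pair, run_dpo):
    preference_pairs = []
    for instruction in instructions:
        # 1. The current model samples candidate responses for the instruction.
        candidates = generate_responses(current_model, instruction)
        # 2. The LLM judge (GPT-3.5-Turbo in this project) compares candidates and
        #    labels the better one as "chosen" and the weaker one as "rejected".
        chosen, rejected = judge_pair(instruction, candidates)
        preference_pairs.append(
            {"prompt": instruction, "chosen": chosen, "rejected": rejected}
        )
    # 3. DPO fine-tuning on the fresh preference pairs yields the next iteration's model.
    return run_dpo(current_model, preference_pairs)
```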
|
|
|
The goal of this iteration was to demonstrate that the model could continue to improve its alignment with desired behaviors (accuracy, helpfulness, clarity) using its own outputs as a foundation for learning. |
|
|
|
- **Developed by:** NilayR |
|
- **Model type:** Causal Language Model |
|
- **Language(s):** English |
|
- **License:** apache-2.0 |
|
- **Finetuned from model:** `meta-llama/Llama-3.2-1B-Instruct` (with adapters from Iteration 1) |
|
|
|
## How to Get Started with the Model |
|
|
|
To use these LoRA adapters, load the base model (`meta-llama/Llama-3.2-1B-Instruct`) and then apply the adapters from this repository. |
|
|
|
```python |
|
import torch |
|
from peft import PeftModel |
|
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig |
|
|
|
# Set base model ID and adapter path |
|
base_model_id = "meta-llama/Llama-3.2-1B-Instruct" |
|
adapter_id = "NilayR/llama32-iterative-dpo-iter2" |
|
|
|
# Configure BitsAndBytes for 4-bit quantization |
|
bnb_config = BitsAndBytesConfig( |
|
load_in_4bit=True, |
|
bnb_4bit_quant_type="nf4", |
|
bnb_4bit_compute_dtype=torch.bfloat16 |
|
) |
|
|
|
# Load the base model with quantization |
|
base_model = AutoModelForCausalLM.from_pretrained( |
|
base_model_id, |
|
quantization_config=bnb_config, |
|
device_map="auto", |
|
trust_remote_code=True, |
|
) |
|
|
|
# Load the tokenizer |
|
tokenizer = AutoTokenizer.from_pretrained(base_model_id) |
|
tokenizer.pad_token = tokenizer.eos_token |
|
|
|
# Load and apply the PEFT adapters |
|
model = PeftModel.from_pretrained(base_model, adapter_id) |
|
|
|
# --- Generate a response --- |
|
prompt = "What are the key benefits of meditation?" |
|
messages = [ |
|
{"role": "system", "content": "You are a helpful assistant."}, |
|
{"role": "user", "content": prompt} |
|
] |
|
|
|
input_ids = tokenizer.apply_chat_template( |
|
messages, |
|
add_generation_prompt=True, |
|
return_tensors="pt" |
|
).to(model.device) |
|
|
|
outputs = model.generate( |
|
input_ids, |
|
max_new_tokens=200, |
|
do_sample=True, |
|
temperature=0.7, |
|
top_p=0.95 |
|
) |
|
|
|
response = tokenizer.decode(outputs[0], skip_special_tokens=True) |
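# Keep only the assistant's reply (the text after the final "assistant" header)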
|
print(response.split("assistant")[-1].strip()) |
|
``` |
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
The model was trained on a preference dataset generated by the **Iteration 1 model** (`NilayR/llama32-iterative-dpo-iter1`). |
|
|
|
* **Data Generation Process:** |
|
  1. **Response Generation:** The Iteration 1 model generated candidate responses to 20 instructions drawn from the LIMA dataset.
|
  2. **Preference Labeling:** A custom LLM judge powered by `GPT-3.5-Turbo` compared pairs of these responses, producing a dataset of **57 chosen/rejected pairs** (the format expected by `DPOTrainer` is sketched below).
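
For reference, `DPOTrainer` consumes preference data as `prompt`/`chosen`/`rejected` text columns. The snippet below is a minimal, hypothetical illustration of that format; the strings are placeholders, not entries from the actual 57-pair dataset.

```python
from datasets import Dataset

# Hypothetical illustration of the chosen/rejected format consumed by DPOTrainer.
# The real pairs were generated by the Iteration 1 model and labeled by the
# GPT-3.5-Turbo judge; these strings are placeholders only.
preference_dataset = Dataset.from_dict({
    "prompt": ["What are the key benefits of meditation?"],
    "chosen": ["Meditation can improve focus, lower stress, and support better sleep..."],
    "rejected": ["Meditation is just sitting still, so there isn't much to say about it."],
})
```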
|
|
|
### Training Procedure |
|
|
|
The model was trained for one epoch using the TRL library's `DPOTrainer`; an approximate `DPOConfig` reconstruction of these settings is sketched after the hyperparameter list below.
|
|
|
#### Training Hyperparameters |
|
|
|
* **Framework:** `trl.DPOTrainer` |
|
* **Epochs:** 1 |
|
* **Batch Size:** 1 |
|
* **Gradient Accumulation Steps:** 2 (Effective Batch Size: 2) |
|
* **Optimizer:** `paged_adamw_8bit` |
|
* **Learning Rate:** 2e-5 |
|
* **DPO Beta (β):** 0.1 |
|
* **Max Steps:** 50 |
|
* **Final Training Loss:** `0.6343` |
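
The hyperparameters above correspond roughly to the following `DPOConfig`. This is a hedged reconstruction: argument names follow recent TRL releases, and `output_dir` is an illustrative placeholder rather than the path actually used.

```python
from trl import DPOConfig

# Approximate reconstruction of the training arguments listed above.
dpo_args = DPOConfig(
    output_dir="llama32-dpo-iter2",  # illustrative placeholder
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=2,   # effective batch size of 2
    learning_rate=2e-5,
    optim="paged_adamw_8bit",
    max_steps=50,
    beta=0.1,                        # DPO beta
)
```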
|
|
|
#### LoRA Configuration |
|
|
|
* **Rank (`r`):** 16 |
|
* **Alpha (`lora_alpha`):** 32 |
|
* **Target Modules:** `q_proj`, `k_proj`, `v_proj`, `o_proj` |
|
* **Dropout:** 0.05 |
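
In `peft`, these settings map onto a `LoraConfig` as sketched below, which can then be passed to `DPOTrainer` together with the `DPOConfig` above. Again, this is a reconstruction under assumptions (recent TRL/PEFT versions), not the exact training script; the `bias` and `task_type` values are assumed defaults for a causal-LM LoRA setup.

```python
from peft import LoraConfig
from trl import DPOTrainer

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",            # assumed default
    task_type="CAUSAL_LM",
)

# Sketch of the trainer assembly; with a PEFT adapter, ref_model=None lets the
# frozen base model act as the DPO reference policy.
trainer = DPOTrainer(
    model=base_model,                 # quantized base carrying the Iteration 1 adapters
    ref_model=None,
    args=dpo_args,
    train_dataset=preference_dataset,
    processing_class=tokenizer,       # older TRL releases use `tokenizer=` instead
    peft_config=lora_config,
)
trainer.train()
```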
|
|
|
### Compute Infrastructure |
|
|
|
* **Hardware:** 1x NVIDIA A100 40GB GPU |
|
* **Cloud Provider:** Google Colab |
|
* **Software:** `transformers`, `peft`, `trl`, `bitsandbytes` |
|
|
|
----- |
|
|
|
|
|