Qwen3-4B GRPO Fine-tune on GSM8k (Unsloth)
This is a Qwen/Qwen3-4B model fine-tuned on a 10% subset of the GSM8k dataset using GRPO (Group Relative Policy Optimization). Training was accelerated with the Unsloth library.
The model was trained to solve grade-school math problems by following a strict XML-like format for reasoning and providing a final answer.
Model Details
- Base Model: Qwen/Qwen3-4B
- Fine-tuning Method: GRPO (via trl.GRPOTrainer)
- Framework: Unsloth for LoRA and performance optimization
- Dataset: a 10% subset of openai/gsm8k
- Language: English
How to Use
This is a LoRA adapter and must be loaded on top of the base model using Unsloth for the best performance and compatibility.
```python
import torch
from unsloth import FastLanguageModel
# Load the base model with Unsloth, using the same settings as training
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="Qwen/Qwen3-4B",
max_seq_length=2048,
load_in_4bit=True,
dtype=torch.bfloat16, # Use bfloat16 for inference
)
# Attach the LoRA adapter from the Hub to the base model via PEFT
from peft import PeftModel

model = PeftModel.from_pretrained(model, "tahamajs/Qwen3-4B-GSM8k-GRPO-Unsloth")

# Switch Unsloth into inference mode for faster generation
FastLanguageModel.for_inference(model)
# --- Define Prompt and Generate ---
SYSTEM_PROMPT = (
"You are a helpful assistant.\n"
"First think through the problem, then provide the answer.\n"
"Use this strict format:\n"
"<reasoning>\n"
"your step-by-step reasoning here\n"
"</reasoning>\n"
"<answer>\n"
"The final answer is [final_number].\n"
"</answer>\n"
)
question = "Natalia sold 48 liters of milk in the morning. In the afternoon, she sold 27 liters less than in the morning. In the evening, she sold 15 liters more than in the afternoon. How many liters of milk did she sell in total?"
messages = [
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": question},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=512, use_cache=True)
# Decode only the newly generated tokens. Note that skip_special_tokens removes the
# chat markers, so splitting on "<|im_start|>assistant" would not recover the reply.
generated_part = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(generated_part)
```
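Because the model is trained to reply in the strict `<reasoning>`/`<answer>` format, the final answer can be pulled out of the generated text with a small helper. The `extract_answer` function below is an illustrative addition, not part of the original training or inference code:

```python
import re

def extract_answer(text: str) -> str:
    """Return the contents of the <answer> block, or an empty string if absent."""
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

# For the question above, a correct completion would yield
# "The final answer is 105." inside the <answer> block.
print(extract_answer(generated_part))
```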
Training Procedure
The model was trained using trl's GRPOTrainer with the following configuration (an illustrative sketch of how these settings fit together follows the list):

- Reward Function: a correctness-based reward (r_correctness) with a weight of 2.0
- GRPO Beta (β): 0.01
- LoRA Rank (r): 16
- Learning Rate: 0.0005
- Batch Size: 9
- Number of Generations (k): 3
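The snippet below is a rough sketch of how these settings could be wired together with Unsloth and trl; it is not the original training script. The dataset column handling, the answer-matching regex, the LoRA target modules and lora_alpha, and the folding of the 2.0 weight into r_correctness are all assumptions made for illustration.

```python
import re
from datasets import load_dataset
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer

# Base model plus a rank-16 LoRA, matching the settings in the usage example above
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-4B", max_seq_length=2048, load_in_4bit=True
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,  # assumed value
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# 10% subset of GSM8k, converted to chat-style prompts
dataset = load_dataset("openai/gsm8k", "main", split="train[:10%]")
dataset = dataset.map(lambda ex: {
    "prompt": [
        {"role": "system", "content": SYSTEM_PROMPT},  # same prompt as above
        {"role": "user", "content": ex["question"]},
    ]
})

def r_correctness(prompts, completions, answer, **kwargs):
    """Give 2.0 when the <answer> block contains the gold number, else 0.0 (assumed logic)."""
    rewards = []
    for completion, gold in zip(completions, answer):
        text = completion[0]["content"]  # conversational completion format
        match = re.search(r"<answer>\s*(.*?)\s*</answer>", text, re.DOTALL)
        predicted = match.group(1) if match else ""
        gold_number = gold.split("####")[-1].strip()  # GSM8k gold answers end in "#### N"
        rewards.append(2.0 if gold_number in predicted else 0.0)
    return rewards

training_args = GRPOConfig(
    beta=0.01,                       # GRPO KL coefficient
    learning_rate=5e-4,              # 0.0005
    per_device_train_batch_size=9,
    num_generations=3,               # k completions per prompt
    max_completion_length=512,
    output_dir="outputs",
)

trainer = GRPOTrainer(
    model=model,
    reward_funcs=[r_correctness],
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```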