Qwen3-4B GRPO Fine-tune on GSM8k (Unsloth)
This is a Qwen/Qwen3-4B model fine-tuned on a 10% subset of the GSM8k dataset using GRPO (Group Relative Policy Optimization). Training was accelerated with the Unsloth library.
The model was trained to solve grade-school math problems by following a strict XML-like format for reasoning and providing a final answer.
Model Details
- Base Model: Qwen/Qwen3-4B
- Fine-tuning Method: GRPO (via trl.GRPOTrainer)
- Framework: Unsloth for LoRA and performance optimization
- Dataset: a 10% subset of openai/gsm8k
- Language: English
How to Use
This is a LoRA adapter and must be loaded on top of the base model using Unsloth for the best performance and compatibility.
```python
import torch
from unsloth import FastLanguageModel
# Load the base model with Unsloth, using the same settings as training
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="Qwen/Qwen3-4B",
max_seq_length=2048,
load_in_4bit=True,
dtype=torch.bfloat16, # Use bfloat16 for inference
)
# Attach the LoRA adapter from the Hub to the base model via PEFT
from peft import PeftModel

model = PeftModel.from_pretrained(model, "tahamajs/Qwen3-4B-GSM8k-GRPO-Unsloth")

# Switch Unsloth into inference mode for faster generation
FastLanguageModel.for_inference(model)
# --- Define Prompt and Generate ---
SYSTEM_PROMPT = (
"You are a helpful assistant.\n"
"First think through the problem, then provide the answer.\n"
"Use this strict format:\n"
"<reasoning>\n"
"your step-by-step reasoning here\n"
"</reasoning>\n"
"<answer>\n"
"The final answer is [final_number].\n"
"</answer>\n"
)
question = "Natalia sold 48 liters of milk in the morning. In the afternoon, she sold 27 liters less than in the morning. In the evening, she sold 15 liters more than in the afternoon. How many liters of milk did she sell in total?"
messages = [
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": question},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=512, use_cache=True)
# Decode only the newly generated tokens. Note that skip_special_tokens removes the
# chat markers, so splitting on "<|im_start|>assistant" would not recover the reply.
generated_part = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(generated_part)
```
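Because the model is trained to reply in the strict `<reasoning>`/`<answer>` format, the final answer can be pulled out of the generated text with a small helper. The `extract_answer` function below is an illustrative addition, not part of the original training or inference code:

```python
import re

def extract_answer(text: str) -> str:
    """Return the contents of the <answer> block, or an empty string if absent."""
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else ""

# For the question above, a correct completion would yield
# "The final answer is 105." inside the <answer> block.
print(extract_answer(generated_part))
```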
Training Procedure
The model was trained using trl's GRPOTrainer with the following configuration (an illustrative sketch of how these settings fit together follows the list):

- Reward Function: a correctness-based reward (r_correctness) with a weight of 2.0
- GRPO Beta (β): 0.01
- LoRA Rank (r): 16
- Learning Rate: 0.0005
- Batch Size: 9
- Number of Generations (k): 3
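The snippet below is a rough sketch of how these settings could be wired together with Unsloth and trl; it is not the original training script. The dataset column handling, the answer-matching regex, the LoRA target modules and lora_alpha, and the folding of the 2.0 weight into r_correctness are all assumptions made for illustration.

```python
import re
from datasets import load_dataset
from unsloth import FastLanguageModel
from trl import GRPOConfig, GRPOTrainer

# Base model plus a rank-16 LoRA, matching the settings in the usage example above
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-4B", max_seq_length=2048, load_in_4bit=True
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,  # assumed value
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# 10% subset of GSM8k, converted to chat-style prompts
dataset = load_dataset("openai/gsm8k", "main", split="train[:10%]")
dataset = dataset.map(lambda ex: {
    "prompt": [
        {"role": "system", "content": SYSTEM_PROMPT},  # same prompt as above
        {"role": "user", "content": ex["question"]},
    ]
})

def r_correctness(prompts, completions, answer, **kwargs):
    """Give 2.0 when the <answer> block contains the gold number, else 0.0 (assumed logic)."""
    rewards = []
    for completion, gold in zip(completions, answer):
        text = completion[0]["content"]  # conversational completion format
        match = re.search(r"<answer>\s*(.*?)\s*</answer>", text, re.DOTALL)
        predicted = match.group(1) if match else ""
        gold_number = gold.split("####")[-1].strip()  # GSM8k gold answers end in "#### N"
        rewards.append(2.0 if gold_number in predicted else 0.0)
    return rewards

training_args = GRPOConfig(
    beta=0.01,                       # GRPO KL coefficient
    learning_rate=5e-4,              # 0.0005
    per_device_train_batch_size=9,
    num_generations=3,               # k completions per prompt
    max_completion_length=512,
    output_dir="outputs",
)

trainer = GRPOTrainer(
    model=model,
    reward_funcs=[r_correctness],
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```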