Qwen3-4B — GRPO LoRA (Unsloth Adapter)
A lightweight LoRA adapter for Qwen3-4B trained with Group Relative Policy Optimization (GRPO) using TRL and Unsloth. The adapter improves reasoning-style text generation (step-by-step answers, math/programming explanations) while staying fast and memory-efficient thanks to 4-bit quantization of the base weights.
TL;DR: Plug this adapter into unsloth/qwen3-4b-unsloth-bnb-4bit (or any compatible Qwen3-4B checkpoint) and you get a GRPO-tuned reasoning model that runs comfortably on a single consumer GPU.
Model Details
Model Description
- Base model: unsloth/qwen3-4b-unsloth-bnb-4bit (Qwen3-4B with 4-bit quantization for faster/cheaper training & inference)
- Adapter type: LoRA (via 🤗 PEFT)
- Training objective: GRPO (policy optimization for group-comparative preferences/rewards)
- Intended style: Helpful, step-by-step, reasoning-focused generation
- Quantization: Base weights in 4-bit (NF4) via bitsandbytes; LoRA adapter in full precision
- Context length: trained with 2k-token sequences by default (the base model supports longer contexts; adjust as needed)
Developed by: Your name or org here
Shared by: Hugging Face handle here
Finetuned from: unsloth/qwen3-4b-unsloth-bnb-4bit
License: Inherits the base Qwen3 license; add an adapter license (e.g., Apache-2.0) if desired
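Once the repo id is filled in, you can confirm these details directly from the adapter config. A minimal sketch (the adapter id below is a placeholder):
from peft import PeftConfig

adapter_id = "your-username/your-adapter-repo"  # placeholder; replace with this repo id
cfg = PeftConfig.from_pretrained(adapter_id)

# Shows the base checkpoint the adapter was trained against and the LoRA hyperparameters
print(cfg.base_model_name_or_path)  # expected: unsloth/qwen3-4b-unsloth-bnb-4bit
print(cfg)                          # r, lora_alpha, target_modules, etc.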
Model Sources
- Repository: Add your HF repo URL here
- Paper (optional): Link to blog/paper describing your GRPO recipe
- Demo (optional): Spaces/Colab link if available
Uses
Direct Use
- Reasoning-style text generation (math explanations, step-by-step solutions, debugging hints)
- General Q&A, tutoring-like explanations
- Chain-of-thought style output when user explicitly asks (do not disclose hidden private reasoning in production apps)
Downstream Use
- Further preference optimization (DPO/GRPO/RLHF) on a target domain
- Task adapters stacked via PEFT
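For example, a minimal sketch of merging the adapter into a full-precision Qwen3-4B copy before further fine-tuning or deployment (repo ids are placeholders; merging directly into the 4-bit base is not recommended):
from transformers import AutoModelForCausalLM
from peft import PeftModel
import torch

full_precision_base = "Qwen/Qwen3-4B"            # same architecture as the 4-bit Unsloth base
adapter_id = "your-username/your-adapter-repo"   # placeholder; replace with this repo id

# Load the base in bf16 so the LoRA deltas can be folded into the weights cleanly
base = AutoModelForCausalLM.from_pretrained(full_precision_base, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_id)

merged = model.merge_and_unload()                # plain Qwen3-4B model with the adapter baked in
merged.save_pretrained("qwen3-4b-grpo-merged")   # ready for DPO/GRPO continuation or deployment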
Out-of-Scope Use
- Unsafe content generation (hate, harassment, illegal activities)
- High-stakes decision making without human oversight
- Factual tasks requiring up-to-the-minute knowledge (the model has no browsing by default)
Safety note: Always apply your own content filters and human review in production settings.
Bias, Risks, and Limitations
- The adapter inherits biases and limitations of the base model and training data.
- GRPO-tuned behavior may over-index on verbose step-by-step explanations.
- The model can hallucinate facts or produce misleading justifications.
- Performance on non-English text is not guaranteed.
Recommendations
- Keep a human-in-the-loop for critical use cases.
- Add guardrails (toxicity filters, refusal policies).
- Evaluate on domain-specific benchmarks before deployment.
How to Get Started
1) Load with vanilla 🤗 Transformers + PEFT
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch
base_id = "unsloth/qwen3-4b-unsloth-bnb-4bit" # Base 4-bit model
adapter_id = "your-username/your-adapter-repo" # <- replace with this repo id
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tok = AutoTokenizer.from_pretrained(base_id, use_fast=True)
base = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, adapter_id).eval()

prompt = "Explain why the derivative of x^2 is 2x, step by step."
inputs = tok(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )
print(tok.decode(out[0], skip_special_tokens=True))
2) Load with Unsloth's FastLanguageModel
from unsloth import FastLanguageModel
import torch
base_id = "unsloth/qwen3-4b-unsloth-bnb-4bit"
adapter_id = "your-username/your-adapter-repo"
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=base_id,
    max_seq_length=2048,
    load_in_4bit=True,
    dtype=torch.bfloat16,
)
model.load_adapter(adapter_id)          # attach the LoRA adapter via the PEFT integration
FastLanguageModel.for_inference(model)  # switch on Unsloth's faster inference path
model.eval()

prompt = "List three key differences between GRPO and PPO."
inp = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inp, max_new_tokens=256, temperature=0.7, top_p=0.9, do_sample=True)
print(tokenizer.decode(out[0], skip_special_tokens=True))
3) Chat-style prompting (Qwen template)
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch
base_id = "unsloth/qwen3-4b-unsloth-bnb-4bit"
adapter_id = "your-username/your-adapter-repo"
tok = AutoTokenizer.from_pretrained(base_id)
# The Unsloth bnb-4bit checkpoint is pre-quantized, so no extra quantization arguments are needed.
base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_id).eval()

messages = [
    {"role": "system", "content": "You are a helpful AI assistant specialized in step-by-step reasoning."},
    {"role": "user", "content": "Solve: If x + y = 10 and x - y = 2, what is x * y? Show the steps."},
]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok([text], return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.7, top_p=0.9, do_sample=True)
print(tok.decode(outputs[0], skip_special_tokens=True))
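Qwen3 chat templates also expose a thinking toggle. If your tokenizer's template supports it, hidden reasoning can be switched off like this (a small sketch; check the template shipped with your checkpoint):
# enable_thinking=False skips the <think> block and yields a direct answer
text = tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)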
Training Details
Data
Use any preference- or reward-style data suitable for GRPO (e.g., verified solutions vs. distractors). Example recipe (replace with your actual sources):
- HuggingFaceH4/aime_2024 for math-style prompts
- Your curated reasoning/chat data
Please attach dataset cards or links, and ensure you have the right to use and share the data.
Procedure
- Algorithm: GRPO via 🤗 TRL
- Base: 4-bit Qwen3-4B (NF4); train only LoRA parameters
- LoRA config (example): r = 16–64, alpha = 16–64, dropout = 0.05
- Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj (adjust to taste)
- Sequence length: 2k (can be extended via RoPE scaling if the base checkpoint supports it)
- Batching (example):
- Per-device batch size: 12
- Gradient accumulation: 2
- Effective batch size: 24
- Optimization (example):
- Optimizer: AdamW (β1=0.9, β2=0.999, weight_decay=0.1)
- LR: 1e-5 to 3e-5 (warmup 3–5%)
- Scheduler: cosine
- Mixed precision: bfloat16 compute with 4-bit base weights
- Steps/epochs: Adjust to your dataset size; start small (1–3 epochs) and monitor stability
Tip: If loss oscillates, lower LR, increase group size, and ensure reward normalization/stability in your GRPO config.
Example: TRL GRPO skeleton
from trl import GRPOConfig, GRPOTrainer
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
base_id = "unsloth/qwen3-4b-unsloth-bnb-4bit"
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(base_id, quantization_config=bnb, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base_id, use_fast=True)

peft_config = LoraConfig(
    r=32, lora_alpha=32, lora_dropout=0.05, bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)

grpo_config = GRPOConfig(
    output_dir="grpo-lora-adapter",
    learning_rate=1e-5,
    beta=0.01,                        # KL penalty coefficient
    num_generations=4,                # completions sampled per prompt (the GRPO "group size")
    per_device_train_batch_size=12,   # effective batch must be divisible by num_generations
    gradient_accumulation_steps=2,
    num_train_epochs=3,
    logging_steps=10,
)

trainer = GRPOTrainer(
    model=model,
    args=grpo_config,
    processing_class=tokenizer,
    peft_config=peft_config,
    train_dataset=your_train_dataset,  # replace: dataset with a "prompt" column
    reward_funcs=[your_reward_fn],     # replace: see the reward function sketch below
)
trainer.train()
trainer.save_model("grpo-lora-adapter")
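For reference, TRL's GRPOTrainer calls each reward function with the sampled completions (plus any extra dataset columns passed as keyword arguments) and expects one float per completion. A minimal sketch, assuming a plain-text prompt format and a hypothetical answer column in the training set:
import re

def exact_match_reward(completions, answer, **kwargs):
    """Toy reward: 1.0 if the last number in the completion matches the reference answer."""
    rewards = []
    for completion, reference in zip(completions, answer):
        numbers = re.findall(r"-?\d+\.?\d*", completion)
        rewards.append(1.0 if numbers and numbers[-1] == str(reference) else 0.0)
    return rewards
GRPO normalizes these rewards within each group of num_generations completions, so keeping the reward scale consistent (e.g., 0/1) helps training stability.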
Speeds, Sizes, Times (example env)
- GPU: NVIDIA T4 / A10 / 4090-class works well
- Transformers: 4.55.0
- PEFT: 0.17.0
- TRL: 0.14+ (earlier releases do not include GRPOTrainer)
- Torch: 2.4+
- Checkpoint size: Adapter typically tens to hundreds of MB (depends on r/targets)
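To sanity-check the adapter footprint, PEFT can report the trainable parameter count before training. A quick sketch reusing model and peft_config from the skeleton above:
from peft import get_peft_model

# Wrap the quantized base with the LoRA config purely to inspect parameter counts
peft_model = get_peft_model(model, peft_config)
peft_model.print_trainable_parameters()  # a few tens of millions of LoRA weights vs. ~4B frozen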
Evaluation
Data, Factors & Metrics
- Data: Use domain-relevant sets (e.g., GSM8K, AIME-style sets) aligned with your goals
- Metrics: exact match / pass@1 for math; ROUGE/BLEU for summaries; win-rate vs. baseline for preference tasks
- Factors: prompt style, temperature, sampling settings, chain-of-thought visibility
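As an illustration, a minimal exact-match loop over a small prompt/answer set (a sketch only; generate_answer stands in for whichever generation call from the examples above you wire in):
eval_set = [
    {"prompt": "What is 17 + 25? Answer with the number only.", "answer": "42"},
    # ... add domain-specific items here
]

def exact_match_accuracy(generate_answer):
    """generate_answer(prompt) -> model output string (user-supplied callable)."""
    correct = 0
    for item in eval_set:
        prediction = generate_answer(item["prompt"]).strip()
        correct += int(prediction.endswith(item["answer"]))
    return correct / len(eval_set)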
Results
- Provide your benchmark table here once available. PRs with community evals are welcome.
Summary
- Qualitatively, the model produces more structured, step-by-step answers than the base.
- Quantitative evaluation pending.
Environmental Impact
Estimate emissions with the ML CO2 Impact calculator.
- Hardware: e.g., 1×T4 for 2 hours
- Cloud: e.g., GCP us-central1
- Energy/CO₂: Report if tracked
Technical Specifications
Architecture & Objective
- Qwen3-4B decoder-only transformer
- GRPO objective over grouped responses to the same prompt
- LoRA adapter on attention and MLP projections
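For context (this is the standard GRPO formulation, not something specific to this adapter): each prompt is sampled G times, every completion i receives a reward r_i, and its advantage is normalized within the group as A_i = (r_i - mean(r_1..r_G)) / std(r_1..r_G); this advantage then enters a PPO-style clipped objective with a KL penalty toward the reference model.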
Compute
- Hardware: single-GPU fine-tuning feasible (>=12–16 GB VRAM with 4-bit base)
- Software: PyTorch, Transformers, TRL, PEFT, bitsandbytes, Unsloth
Citation
If you use this work, please cite Qwen and TRL; cite your own report/blog if available.
@software{qwen3_2024,
title={Qwen3 Language Models},
author={Qwen Team},
year={2024},
url={https://huggingface.co/Qwen}
}
@misc{trl_library,
title={{TRL}: Transformer Reinforcement Learning},
author={von Werra, L. and others},
year={2023},
howpublished={\url{https://github.com/huggingface/trl}}
}
Glossary
- LoRA: Low-Rank Adapters enabling parameter-efficient fine-tuning.
- GRPO: Group Relative Policy Optimization — an RL-style objective using grouped responses and relative rewards.
- NF4: NormalFloat4 quantization format (bitsandbytes) for efficient 4-bit base weights.
More Information
- Replace placeholders with your repo id, dataset links, and any blog/demo.
- Add a separate LICENSE file if you want a distinct license for the adapter.
Model Card Authors
Taha Majlesi
Contact
- LinkedIn: https://www.linkedin.com/in/tahamajlesi/
- GitHub: https://github.com/tahamajs
Framework versions
- Transformers 4.55.0
- TRL 0.14+
- PEFT 0.17.0
- PyTorch 2.4+
- bitsandbytes 0.43+
- Unsloth 2025.x