Qwen3-4B — GRPO LoRA (Unsloth Adapter)
A lightweight LoRA adapter for Qwen3-4B trained with Group Relative Policy Optimization (GRPO) using TRL and Unsloth. The adapter improves reasoning-style text generation (step-by-step answers, math/programming explanations) while staying fast and memory-efficient thanks to 4-bit quantization of the base weights.
TL;DR: Plug this adapter into unsloth/qwen3-4b-unsloth-bnb-4bit (or any compatible Qwen3-4B checkpoint) and you get a GRPO-tuned reasoning model that runs comfortably on a single consumer GPU.
Model Details
Model Description
- Base model: unsloth/qwen3-4b-unsloth-bnb-4bit (Qwen3-4B with 4-bit quantization for faster/cheaper training & inference)
- Adapter type: LoRA (via 🤗 PEFT)
- Training objective: GRPO (policy optimization for group-comparative preferences/rewards)
- Intended style: Helpful, step-by-step, reasoning-focused generation
- Quantization: Base weights in 4-bit (NF4) via bitsandbytes; LoRA adapter in full precision
- Context length: trained with 2k-token sequences by default (the base model supports longer contexts; adjust as needed)
Developed by: Your name or org here
Shared by: Hugging Face handle here
Finetuned from: unsloth/qwen3-4b-unsloth-bnb-4bit
License: Inherits the base Qwen3 license; add an adapter license (e.g., Apache-2.0) if desired
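Once the repo id is filled in, you can confirm these details directly from the adapter config. A minimal sketch (the adapter id below is a placeholder):
from peft import PeftConfig

adapter_id = "your-username/your-adapter-repo"  # placeholder; replace with this repo id
cfg = PeftConfig.from_pretrained(adapter_id)

# Shows the base checkpoint the adapter was trained against and the LoRA hyperparameters
print(cfg.base_model_name_or_path)  # expected: unsloth/qwen3-4b-unsloth-bnb-4bit
print(cfg)                          # r, lora_alpha, target_modules, etc.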
Model Sources
- Repository: Add your HF repo URL here
- Paper (optional): Link to blog/paper describing your GRPO recipe
- Demo (optional): Spaces/Colab link if available
Uses
Direct Use
- Reasoning-style text generation (math explanations, step-by-step solutions, debugging hints)
- General Q&A, tutoring-like explanations
- Chain-of-thought style output when user explicitly asks (do not disclose hidden private reasoning in production apps)
Downstream Use
- Further preference optimization (DPO/GRPO/RLHF) on a target domain
- Task adapters stacked via PEFT
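For example, a minimal sketch of merging the adapter into a full-precision Qwen3-4B copy before further fine-tuning or deployment (repo ids are placeholders; merging directly into the 4-bit base is not recommended):
from transformers import AutoModelForCausalLM
from peft import PeftModel
import torch

full_precision_base = "Qwen/Qwen3-4B"            # same architecture as the 4-bit Unsloth base
adapter_id = "your-username/your-adapter-repo"   # placeholder; replace with this repo id

# Load the base in bf16 so the LoRA deltas can be folded into the weights cleanly
base = AutoModelForCausalLM.from_pretrained(full_precision_base, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_id)

merged = model.merge_and_unload()                # plain Qwen3-4B model with the adapter baked in
merged.save_pretrained("qwen3-4b-grpo-merged")   # ready for DPO/GRPO continuation or deployment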
Out-of-Scope Use
- Unsafe content generation (hate, harassment, illegal activities)
- High-stakes decision making without human oversight
- Factual tasks requiring up-to-the-minute knowledge (the model has no browsing by default)
Safety note: Always apply your own content filters and human review in production settings.
Bias, Risks, and Limitations
- The adapter inherits biases and limitations of the base model and training data.
- GRPO-tuned behavior may over-index on verbose step-by-step explanations.
- The model can hallucinate facts or produce misleading justifications.
- Performance on non-English text is not guaranteed.
Recommendations
- Keep a human-in-the-loop for critical use cases.
- Add guardrails (toxicity filters, refusal policies).
- Evaluate on domain-specific benchmarks before deployment.
How to Get Started
1) Load with vanilla 🤗 Transformers + PEFT
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch
base_id = "unsloth/qwen3-4b-unsloth-bnb-4bit" # Base 4-bit model
adapter_id = "your-username/your-adapter-repo" # <- replace with this repo id
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tok = AutoTokenizer.from_pretrained(base_id, use_fast=True)
base = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, adapter_id).eval()

prompt = "Explain why the derivative of x^2 is 2x, step by step."
inputs = tok(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )
print(tok.decode(out[0], skip_special_tokens=True))
2) Load with Unsloth's FastLanguageModel
from unsloth import FastLanguageModel
import torch
base_id = "unsloth/qwen3-4b-unsloth-bnb-4bit"
adapter_id = "your-username/your-adapter-repo"
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=base_id,
    max_seq_length=2048,
    load_in_4bit=True,
    dtype=torch.bfloat16,
)
model.load_adapter(adapter_id)          # attach the LoRA adapter via the PEFT integration
FastLanguageModel.for_inference(model)  # switch on Unsloth's faster inference path
model.eval()

prompt = "List three key differences between GRPO and PPO."
inp = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inp, max_new_tokens=256, temperature=0.7, top_p=0.9, do_sample=True)
print(tokenizer.decode(out[0], skip_special_tokens=True))
3) Chat-style prompting (Qwen template)
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch
base_id = "unsloth/qwen3-4b-unsloth-bnb-4bit"
adapter_id = "your-username/your-adapter-repo"
tok = AutoTokenizer.from_pretrained(base_id)
# The Unsloth bnb-4bit checkpoint is pre-quantized, so no extra quantization arguments are needed.
base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_id).eval()

messages = [
    {"role": "system", "content": "You are a helpful AI assistant specialized in step-by-step reasoning."},
    {"role": "user", "content": "Solve: If x + y = 10 and x - y = 2, what is x * y? Show the steps."},
]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok([text], return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.7, top_p=0.9, do_sample=True)
print(tok.decode(outputs[0], skip_special_tokens=True))
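Qwen3 chat templates also expose a thinking toggle. If your tokenizer's template supports it, hidden reasoning can be switched off like this (a small sketch; check the template shipped with your checkpoint):
# enable_thinking=False skips the <think> block and yields a direct answer
text = tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)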
Training Details
Data
Use any preference- or reward-style data suitable for GRPO (e.g., verified solutions vs. distractors). Example recipe (replace with your actual sources):
- HuggingFaceH4/aime_2024 for math-style prompts
- Your curated reasoning/chat data
Please attach dataset cards or links, and ensure you have the right to use and share the data.
Procedure
- Algorithm: GRPO via 🤗 TRL
- Base: 4-bit Qwen3-4B (NF4); train only LoRA parameters
- LoRA config (example): r = 16–64, alpha = 16–64, dropout = 0.05
- Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj (adjust to taste)
- Sequence length: 2k (can be extended via RoPE scaling if the base checkpoint supports it)
- Batching (example):
- Per-device batch size: 12
- Gradient accumulation: 2
- Effective batch size: 24
- Optimization (example):
- Optimizer: AdamW (β1=0.9, β2=0.999, weight_decay=0.1)
- LR: 1e-5 to 3e-5 (warmup 3–5%)
- Scheduler: cosine
- Mixed precision: bfloat16 compute with 4-bit base weights
- Steps/epochs: Adjust to your dataset size; start small (1–3 epochs) and monitor stability
Tip: If loss oscillates, lower LR, increase group size, and ensure reward normalization/stability in your GRPO config.
Example: TRL GRPO skeleton
from trl import GRPOConfig, GRPOTrainer
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
base_id = "unsloth/qwen3-4b-unsloth-bnb-4bit"
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(base_id, quantization_config=bnb, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base_id, use_fast=True)

peft_config = LoraConfig(
    r=32, lora_alpha=32, lora_dropout=0.05, bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)

grpo_config = GRPOConfig(
    output_dir="grpo-lora-adapter",
    learning_rate=1e-5,
    beta=0.01,                        # KL penalty coefficient
    num_generations=4,                # completions sampled per prompt (the GRPO "group size")
    per_device_train_batch_size=12,   # effective batch must be divisible by num_generations
    gradient_accumulation_steps=2,
    num_train_epochs=3,
    logging_steps=10,
)

trainer = GRPOTrainer(
    model=model,
    args=grpo_config,
    processing_class=tokenizer,
    peft_config=peft_config,
    train_dataset=your_train_dataset,  # replace: dataset with a "prompt" column
    reward_funcs=[your_reward_fn],     # replace: see the reward function sketch below
)
trainer.train()
trainer.save_model("grpo-lora-adapter")
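For reference, TRL's GRPOTrainer calls each reward function with the sampled completions (plus any extra dataset columns passed as keyword arguments) and expects one float per completion. A minimal sketch, assuming a plain-text prompt format and a hypothetical answer column in the training set:
import re

def exact_match_reward(completions, answer, **kwargs):
    """Toy reward: 1.0 if the last number in the completion matches the reference answer."""
    rewards = []
    for completion, reference in zip(completions, answer):
        numbers = re.findall(r"-?\d+\.?\d*", completion)
        rewards.append(1.0 if numbers and numbers[-1] == str(reference) else 0.0)
    return rewards
GRPO normalizes these rewards within each group of num_generations completions, so keeping the reward scale consistent (e.g., 0/1) helps training stability.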
Speeds, Sizes, Times (example env)
- GPU: NVIDIA T4 / A10 / 4090-class works well
- Transformers: 4.55.0
- PEFT: 0.17.0
- TRL: 0.14+ (earlier releases do not include GRPOTrainer)
- Torch: 2.4+
- Checkpoint size: Adapter typically tens to hundreds of MB (depends on r/targets)
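To sanity-check the adapter footprint, PEFT can report the trainable parameter count before training. A quick sketch reusing model and peft_config from the skeleton above:
from peft import get_peft_model

# Wrap the quantized base with the LoRA config purely to inspect parameter counts
peft_model = get_peft_model(model, peft_config)
peft_model.print_trainable_parameters()  # a few tens of millions of LoRA weights vs. ~4B frozen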
Evaluation
Data, Factors & Metrics
- Data: Use domain-relevant sets (e.g., GSM8K, AIME-style sets) aligned with your goals
- Metrics: exact match / pass@1 for math; ROUGE/BLEU for summaries; win-rate vs. baseline for preference tasks
- Factors: prompt style, temperature, sampling settings, chain-of-thought visibility
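As an illustration, a minimal exact-match loop over a small prompt/answer set (a sketch only; generate_answer stands in for whichever generation call from the examples above you wire in):
eval_set = [
    {"prompt": "What is 17 + 25? Answer with the number only.", "answer": "42"},
    # ... add domain-specific items here
]

def exact_match_accuracy(generate_answer):
    """generate_answer(prompt) -> model output string (user-supplied callable)."""
    correct = 0
    for item in eval_set:
        prediction = generate_answer(item["prompt"]).strip()
        correct += int(prediction.endswith(item["answer"]))
    return correct / len(eval_set)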
Results
- Provide your benchmark table here once available. PRs with community evals are welcome.
Summary
- Qualitatively, the model produces more structured, step-by-step answers than the base.
- Quantitative evaluation pending.
Environmental Impact
Estimate emissions with the ML CO2 Impact calculator.
- Hardware: e.g., 1×T4 for 2 hours
- Cloud: e.g., GCP us-central1
- Energy/CO₂: Report if tracked
Technical Specifications
Architecture & Objective
- Qwen3-4B decoder-only transformer
- GRPO objective over grouped responses to the same prompt
- LoRA adapter on attention and MLP projections
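For context (this is the standard GRPO formulation, not something specific to this adapter): each prompt is sampled G times, every completion i receives a reward r_i, and its advantage is normalized within the group as A_i = (r_i - mean(r_1..r_G)) / std(r_1..r_G); this advantage then enters a PPO-style clipped objective with a KL penalty toward the reference model.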
Compute
- Hardware: single-GPU fine-tuning feasible (>=12–16 GB VRAM with 4-bit base)
- Software: PyTorch, Transformers, TRL, PEFT, bitsandbytes, Unsloth
Citation
If you use this work, please cite Qwen and TRL; cite your own report/blog if available.
@software{qwen3_2024,
title={Qwen3 Language Models},
author={Qwen Team},
year={2024},
url={https://huggingface.co/Qwen}
}
@misc{trl_library,
title={{TRL}: Transformer Reinforcement Learning},
author={von Werra, L. and others},
year={2023},
howpublished={\url{https://github.com/huggingface/trl}}
}
Glossary
- LoRA: Low-Rank Adapters enabling parameter-efficient fine-tuning.
- GRPO: Group Relative Policy Optimization — an RL-style objective using grouped responses and relative rewards.
- NF4: NormalFloat4 quantization format (bitsandbytes) for efficient 4-bit base weights.
More Information
- Replace placeholders with your repo id, dataset links, and any blog/demo.
- Add a separate LICENSE file if you want a distinct license for the adapter.
Model Card Authors
Taha Majlesi
Contact
- LinkedIn: https://www.linkedin.com/in/tahamajlesi/
- GitHub: https://github.com/tahamajs
Framework versions
- Transformers 4.55.0
- TRL 0.14+
- PEFT 0.17.0
- PyTorch 2.4+
- bitsandbytes 0.43+
- Unsloth 2025.x