Qwen3-4B — GRPO LoRA (Unsloth Adapter)

A lightweight LoRA adapter for Qwen3-4B trained with Group Relative Policy Optimization (GRPO) using TRL and Unsloth. This adapter improves reasoning-style text generation (step-by-step answers, math/programming explanations) while remaining fast and memory-efficient thanks to 4-bit quantization of the base weights.

TL;DR: Plug this adapter into unsloth/qwen3-4b-unsloth-bnb-4bit (or any compatible Qwen3-4B checkpoint), and you get a GRPO-tuned reasoning model that runs comfortably on a single consumer GPU.


Model Details

Model Description

  • Base model: unsloth/qwen3-4b-unsloth-bnb-4bit (Qwen3-4B with 4-bit quantization for faster/cheaper training & inference)
  • Adapter type: LoRA (via 🤗 PEFT)
  • Training objective: GRPO (policy optimization for group-comparative preferences/rewards)
  • Intended style: Helpful, step-by-step, reasoning-focused generation
  • Quantization: Base weights in 4-bit (NF4) via bitsandbytes; LoRA adapter in full precision
  • Context length: 2,048 tokens in this recipe (the base model supports longer contexts; raise max_seq_length as needed)

  • Developed by: Your name or org here
  • Shared by: Hugging Face handle here
  • Finetuned from: unsloth/qwen3-4b-unsloth-bnb-4bit
  • License: Inherits the base Qwen3 license; add an adapter license (e.g., Apache-2.0) for the adapter weights if desired

Model Sources

  • Repository: Add your HF repo URL here
  • Paper (optional): Link to blog/paper describing your GRPO recipe
  • Demo (optional): Spaces/Colab link if available

Uses

Direct Use

  • Reasoning-style text generation (math explanations, step-by-step solutions, debugging hints)
  • General Q&A, tutoring-like explanations
  • Chain-of-thought style output when user explicitly asks (do not disclose hidden private reasoning in production apps)

Downstream Use

  • Further preference optimization (DPO/GRPO/RLHF) on a target domain
  • Task adapters stacked or swapped via PEFT (see the sketch below)
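
A minimal sketch of adapter stacking/switching with PEFT's multi-adapter API. The repo ids are placeholders, and the second adapter ("your-username/your-other-task-adapter") is purely hypothetical; load_adapter and set_adapter are standard PeftModel methods.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

base = AutoModelForCausalLM.from_pretrained(
    "unsloth/qwen3-4b-unsloth-bnb-4bit",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16),
    device_map="auto",
)

# Attach this GRPO adapter under a name, then add a second (hypothetical) task adapter.
model = PeftModel.from_pretrained(base, "your-username/your-adapter-repo", adapter_name="grpo")
model.load_adapter("your-username/your-other-task-adapter", adapter_name="task")  # hypothetical repo id

model.set_adapter("grpo")   # route generation through the GRPO adapter
# model.set_adapter("task") # ...or switch to the task adapter at runtime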

Out-of-Scope Use

  • Unsafe content generation (hate, harassment, illegal activities)
  • High-stakes decision making without human oversight
  • Factual tasks requiring up-to-the-minute knowledge (the model has no browsing by default)

Safety note: Always apply your own content filters and human review in production settings.


Bias, Risks, and Limitations

  • The adapter inherits biases and limitations of the base model and training data.
  • GRPO-tuned behavior may over-index on verbose step-by-step explanations.
  • The model can hallucinate facts or produce misleading justifications.
  • Performance on non-English text is not guaranteed.

Recommendations

  • Keep a human-in-the-loop for critical use cases.
  • Add guardrails (toxicity filters, refusal policies).
  • Evaluate on domain-specific benchmarks before deployment.

How to Get Started

1) Load with vanilla 🤗 Transformers + PEFT

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

base_id = "unsloth/qwen3-4b-unsloth-bnb-4bit"   # Base 4-bit model
adapter_id = "your-username/your-adapter-repo"  # <- replace with this repo id

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tok = AutoTokenizer.from_pretrained(base_id, use_fast=True)
base = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=bnb_config,
    device_map="auto"
)

model = PeftModel.from_pretrained(base, adapter_id).eval()

prompt = "Explain why the derivative of x^2 is 2x, step by step."
inputs = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.7,
        top_p=0.9,
        do_sample=True
    )
print(tok.decode(out[0], skip_special_tokens=True))

2) Load with Unsloth's FastLanguageModel

from unsloth import FastLanguageModel
import torch

base_id = "unsloth/qwen3-4b-unsloth-bnb-4bit"
adapter_id = "your-username/your-adapter-repo"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=base_id,
    max_seq_length=2048,
    load_in_4bit=True,
    dtype=torch.bfloat16,
)
model = FastLanguageModel.from_pretrained(model=model, model_name=adapter_id)  # attach LoRA
model.eval()

prompt = "List three key differences between GRPO and PPO."
inp = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inp, max_new_tokens=256, temperature=0.7, top_p=0.9, do_sample=True)
print(tokenizer.decode(out[0], skip_special_tokens=True))

3) Chat-style prompting (Qwen template)

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

base_id = "unsloth/qwen3-4b-unsloth-bnb-4bit"
adapter_id = "your-username/your-adapter-repo"

tok = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16),
    device_map="auto",
)
model = PeftModel.from_pretrained(base, adapter_id).eval()

messages = [
    {"role": "system", "content": "You are a helpful AI assistant specialized in step-by-step reasoning."},
    {"role": "user", "content": "Solve: If x + y = 10 and x - y = 2, what is x * y? Show the steps."},
]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok([text], return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.7, top_p=0.9, do_sample=True)
print(tok.decode(outputs[0], skip_special_tokens=True))

Training Details

Data

Use any preference- or reward-style data suitable for GRPO (e.g., prompts with verifiable answers so a reward function can score sampled completions against distractors). Example recipe (replace with your actual sources):

  • HuggingFaceH4/aime_2024 for math-style prompts
  • Your curated reasoning/chat data

Please attach dataset cards or links, and ensure you have the right to use and share the data.

Procedure

  • Algorithm: GRPO via 🤗 TRL
  • Base: 4-bit Qwen3-4B (NF4); train only LoRA parameters
  • LoRA config (example):
    • r = 16–64, alpha = 16–64, dropout = 0.05
    • target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj (adjust to taste)
  • Sequence length: 2,048 (can be extended with RoPE scaling if the base checkpoint supports it)
  • Batching (example):
    • Per-device batch size: 12
    • Gradient accumulation: 2
    • Effective batch size: 24
  • Optimization (example):
    • Optimizer: AdamW (β1=0.9, β2=0.999, weight_decay=0.1)
    • LR: 1e-5 to 3e-5 (warmup 3–5%)
    • Scheduler: cosine
    • Mixed precision: bfloat16 compute with 4-bit base weights
  • Steps/epochs: Adjust to your dataset size; start small (1–3 epochs) and monitor stability

Tip: If the loss oscillates, lower the learning rate, increase the group size (num_generations in TRL), and make sure rewards are normalized and stable in your GRPO setup.

Example: TRL GRPO skeleton

from trl import GRPOConfig, GRPOTrainer
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

base_id = "unsloth/qwen3-4b-unsloth-bnb-4bit"

bnb = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(base_id, quantization_config=bnb, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base_id, use_fast=True)

peft_config = LoraConfig(
    r=32, lora_alpha=32, lora_dropout=0.05, bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
)

grpo_config = GRPOConfig(
    output_dir="grpo-lora-adapter",
    learning_rate=1e-5,
    beta=0.01,                       # KL penalty coefficient toward the reference model
    num_generations=4,               # group size: completions sampled per prompt
    per_device_train_batch_size=12,  # global batch must be divisible by num_generations
    gradient_accumulation_steps=2,
    num_train_epochs=3,
    logging_steps=10,
)

trainer = GRPOTrainer(
    model=model,
    args=grpo_config,
    processing_class=tokenizer,
    peft_config=peft_config,
    train_dataset=your_train_dataset,  # replace
    reward_funcs=[your_reward_fn],      # replace
)

trainer.train()
trainer.save_model("grpo-lora-adapter")
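
Reward functions in TRL receive the sampled completions plus any dataset columns as keyword arguments and return one float per completion. Below is a minimal sketch of a verifiable-answer reward; the answer column name and the message-list completion format are assumptions about your dataset, so adapt them.

import re

def exact_match_reward(completions, answer, **kwargs):
    """Score 1.0 when the last number in a completion matches the reference answer, else 0.0."""
    rewards = []
    for completion, ref in zip(completions, answer):
        # Conversational datasets yield a list of messages; plain datasets yield a string.
        text = completion[0]["content"] if isinstance(completion, list) else completion
        numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
        rewards.append(1.0 if numbers and numbers[-1] == str(ref).strip() else 0.0)
    return rewards

# Then: GRPOTrainer(..., reward_funcs=[exact_match_reward])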

Speeds, Sizes, Times (example env)

  • GPU: an NVIDIA T4, A10, or RTX 4090-class card works well
  • Transformers: 4.55.0
  • PEFT: 0.17.0
  • TRL: 0.9+
  • Torch: 2.4+
  • Checkpoint size: the adapter is typically tens to hundreds of MB, depending on r and the target modules (see the sketch below)
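
A quick way to verify these numbers for your own run, a sketch assuming the adapter was saved locally to grpo-lora-adapter as in the training skeleton above and that PEFT wrote adapter_model.safetensors (its current default filename):

import os
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("unsloth/qwen3-4b-unsloth-bnb-4bit", device_map="auto")
model = PeftModel.from_pretrained(base, "grpo-lora-adapter")  # local path from the training skeleton

# Count LoRA parameters directly (independent of requires_grad flags).
lora_params = sum(p.numel() for n, p in model.named_parameters() if "lora_" in n)
print(f"LoRA parameters: {lora_params:,}")

# On-disk size of the saved adapter weights (adapter_model.bin on older PEFT versions).
adapter_file = os.path.join("grpo-lora-adapter", "adapter_model.safetensors")
print(f"adapter size on disk: {os.path.getsize(adapter_file) / 1e6:.1f} MB")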

Evaluation

Data, Factors & Metrics

  • Data: Use domain-relevant sets (e.g., GSM8K, AIME-style sets) aligned with your goals
  • Metrics: exact match / pass@1 for math; ROUGE/BLEU for summaries; win-rate vs. a baseline for preference tasks (a pass@1 sketch follows this list)
  • Factors: prompt style, temperature, sampling settings, chain-of-thought visibility
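
A minimal pass@1 / exact-match sketch, assuming model and tok are already loaded as in "How to Get Started" and that you supply (prompt, reference answer) pairs; the toy eval_set below is a placeholder.

import re
import torch

eval_set = [("What is 17 + 25? Answer with a number.", "42")]  # placeholder; swap in GSM8K-style items

correct = 0
for prompt, ref in eval_set:
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=256, do_sample=False)  # greedy decoding for pass@1
    # Decode only the newly generated tokens, then extract the last number.
    text = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    correct += int(bool(numbers) and numbers[-1] == ref)

print(f"pass@1 (exact match): {correct / len(eval_set):.2%}")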

Results

  • Provide your benchmark table here once available. PRs with community evals are welcome.

Summary

  • Qualitatively, the model produces more structured, step-by-step answers than the base model.
  • Quantitative evaluation pending.

Environmental Impact

Estimate with the ML CO2 Impact calculator (https://mlco2.github.io/impact).

  • Hardware: e.g., 1×T4 for 2 hours
  • Cloud: e.g., GCP us-central1
  • Energy/CO₂: Report if tracked

Technical Specifications

Architecture & Objective

  • Qwen3-4B decoder-only transformer
  • GRPO objective over grouped responses to the same prompt
  • LoRA adapter on attention and MLP projections

Compute

  • Hardware: single-GPU fine-tuning is feasible (≥12–16 GB VRAM with the 4-bit base)
  • Software: PyTorch, Transformers, TRL, PEFT, bitsandbytes, Unsloth

Citation

If you use this work, please cite Qwen and TRL; cite your own report/blog if available.

@software{qwen3_2025,
  title={Qwen3 Language Models},
  author={Qwen Team},
  year={2025},
  url={https://huggingface.co/Qwen}
}

@misc{vonwerra2020trl,
  title={{TRL}: Transformer Reinforcement Learning},
  author={von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallou{\'e}dec, Quentin},
  year={2020},
  howpublished={\url{https://github.com/huggingface/trl}}
}

Glossary

  • LoRA: Low-Rank Adapters enabling parameter-efficient fine-tuning.
  • GRPO: Group Relative Policy Optimization — an RL-style objective that samples a group of responses per prompt and scores each one relative to the rest of its group (see the sketch after this list).
  • NF4: NormalFloat4 quantization format (bitsandbytes) for efficient 4-bit base weights.
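
To make the "relative rewards" idea concrete, here is a minimal sketch of the group-relative advantage used in GRPO: each completion's reward is normalized against the other completions sampled for the same prompt. TRL computes this internally; the snippet is for intuition only.

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """rewards: shape (num_prompts, group_size) — one row per prompt, one column per sampled completion."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)  # completions better than their group get positive advantages

rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0]])  # 4 completions for one prompt, two scored correct
print(group_relative_advantages(rewards))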

More Information

  • Replace placeholders with your repo id, dataset links, and any blog/demo.
  • Add a separate LICENSE file if you want a distinct license for the adapter.

Model Card Authors

Taha Majlesi

Contact

  • LinkedIn: https://www.linkedin.com/in/tahamajlesi/
  • GitHub: https://github.com/tahamajs

Framework versions

  • Transformers 4.55.0
  • TRL 0.9+
  • PEFT 0.17.0
  • PyTorch 2.4+
  • bitsandbytes 0.43+
  • Unsloth 2025.x