Qwen3.5 4B — Claude Opus Reasoning Distillation

A careful approach to distillation: Premium reasoning capabilities transferred in a single epoch with minimal capability loss.

[Image: General Benchmark Comparison Chart]

Before you dismiss this as yet another community distillation with the usual quality tradeoffs — stop and read this.

This model takes a more careful approach to distillation. We've transferred Claude Opus 4.6's reasoning patterns and conversational style into Qwen3.5-4B while avoiding the catastrophic forgetting that plagues many community distillation attempts. The result: net improvements across most benchmarks with only minor tradeoffs.


🎯 Why This Model is Different

The Distillation Problem Everyone Ignores

Most community distillations follow a predictable pattern:

  1. Collect synthetic data from a frontier model
  2. Train for multiple epochs until loss looks good
  3. Ship it and hope for the best

The result? Models that feel different but perform worse. They lose capabilities on benchmarks, develop repetition issues, forget how to follow instructions properly, perform noticeably worse on coding & math tasks, and exhibit the telltale signs of overfitting that make them unreliable for real-world use.

We took a completely different approach.

The Single-Epoch Revolution

Our methodology proves that quality dramatically outweighs quantity in distillation:

| Aspect | Typical Community Distills | Our Approach |
|---|---|---|
| Epochs | 2-4 epochs | 1 epoch |
| Data Quality | Mass-generated synthetic | Hand-curated Opus reasoning traces |
| Capability Retention | Significant regressions | Mostly preserved with net gains |
| Overfitting | Common | None observed |
| Output Quality | Degraded task completion | Clean, purposeful generation |

By training for exactly one epoch on curated data, we achieve style transfer while minimizing damage to the model's foundational capabilities. Most of the base model's knowledge remains intact while gaining reasoning patterns from Claude Opus.
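To get a feel for how light a single-epoch run is, the arithmetic below estimates the optimizer-step budget for the ~4,000-example corpus. The batch size and gradient accumulation values are hypothetical placeholders, not the actual run's published hyperparameters:

```python
import math

# Corpus size from the dataset composition table (887 + 799 + 250 + 2100)
num_examples = 4036

# Hypothetical settings for illustration only -- the actual run's
# hyperparameters are not published in this card.
per_device_batch_size = 2
gradient_accumulation_steps = 8
effective_batch_size = per_device_batch_size * gradient_accumulation_steps

# One epoch means each example is seen exactly once.
steps_per_epoch = math.ceil(num_examples / effective_batch_size)
print(steps_per_epoch)  # 253 optimizer steps for a single epoch
```

A few hundred steps on curated data is enough to transfer style while barely perturbing the pretrained weights; multi-epoch runs revisit the same examples and drift toward memorization.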


🧠 What Makes the Training Data Special

Premium Reasoning from Claude Opus 4.6

This isn't data scraped from random API calls or generated with lazy prompting. Almost every training example comes from Claude Opus 4.6 — Anthropic's most capable reasoning model — executing complex, multi-step reasoning tasks. To strengthen the corpus, roughly 800 additional examples came from Claude Sonnet 4.6.

The dataset includes:

  • Deep analytical reasoning with explicit thinking traces
  • Multi-turn conversations that maintain coherent context
  • Complex problem decomposition showing how to break down difficult problems
  • Self-correction patterns where the model catches and fixes its own mistakes
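A single training record combining these properties might look like the following. The field names and schema are illustrative assumptions, not the source datasets' actual format:

```python
import json

# Hypothetical record schema; the actual TeichAI/Crownelius datasets
# may use different field names.
record = json.loads("""
{
  "messages": [
    {"role": "user",
     "content": "Why does a single-epoch run overfit less than a multi-epoch one?"},
    {"role": "assistant",
     "content": "<think>The model sees each example once, so it cannot memorize specific token sequences. Let me double-check that reasoning before answering.</think>Each example contributes exactly one gradient pass, so the model absorbs the style of the data rather than its exact wording."}
  ]
}
""")

# The <think> trace stays inside the assistant turn, so the
# reasoning style itself becomes part of the training signal.
assert "<think>" in record["messages"][1]["content"]
```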

Mixed Tool + Non-Tool Corpus

Our training corpus intentionally includes:

  • ~92% pure reasoning examples — analytical thinking, problem-solving, explanations
  • ~8% tool-use examples — web search, data fetching, structured operations

This ratio mirrors realistic assistant usage patterns and ensures the model:

  1. Doesn't over-index on tool calling when it's unnecessary
  2. Knows when and how to invoke tools appropriately
  3. Maintains strong reasoning even when tools are available but not needed
  4. Keeps all code-related post-training intact

Tools included: web_search, web_fetch, grep
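For the ~8% of tool-use examples, each trace must pair a tool call with its result in a well-formed way. The turn structure below is a generic assumption about what such a trace looks like, not the exact wire format used in training:

```python
# Generic tool-use trace: the assistant requests a call, a tool message
# returns the result, and the assistant continues with reasoning.
# The exact schema used in the training data is an assumption here.
trace = [
    {"role": "user", "content": "What is the latest stable Python release?"},
    {"role": "assistant", "tool_calls": [
        {"name": "web_search", "arguments": {"query": "latest stable Python release"}}
    ]},
    {"role": "tool", "name": "web_search", "content": "(search results)"},
    {"role": "assistant", "content": "Based on the search results, ..."},
]

# The kind of check implied by "broken tool calls removed":
# every call must name a tool from the allowed set.
allowed_tools = {"web_search", "web_fetch", "grep"}
for turn in trace:
    for call in turn.get("tool_calls", []):
        assert call["name"] in allowed_tools
```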


📊 Benchmark Results

Head-to-head against the base unsloth/Qwen3.5-4B:

| Benchmark | Base | Fine-tuned | Δ | Result |
|---|---|---|---|---|
| ifeval | 0.262 | 0.309 | +17.6% | ✅ Win |
| arc_challenge | 0.346 | 0.392 | +13.3% | ✅ Win |
| winogrande | 0.589 | 0.638 | +8.3% | ✅ Win |
| hellaswag | 0.496 | 0.500 | +0.9% | ✅ Win |
| gpqa_diamond | 0.283 | 0.283 | 0% | ➖ Tie |
| truthfulqa_mc2 | 0.545 | 0.530 | -2.7% | ❌ Loss |
| mmlu | 0.256 | 0.232 | -9.6% | ❌ Loss |

Summary: 4 wins, 2 losses, 1 tie.
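The tally can be reproduced directly from the table's scores:

```python
# (benchmark, base, fine-tuned) triples from the table above.
results = [
    ("ifeval",         0.262, 0.309),
    ("arc_challenge",  0.346, 0.392),
    ("winogrande",     0.589, 0.638),
    ("hellaswag",      0.496, 0.500),
    ("gpqa_diamond",   0.283, 0.283),
    ("truthfulqa_mc2", 0.545, 0.530),
    ("mmlu",           0.256, 0.232),
]

wins = sum(ft > base for _, base, ft in results)
ties = sum(ft == base for _, base, ft in results)
losses = sum(ft < base for _, base, ft in results)
print(wins, losses, ties)  # 4 2 1
```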

[Image: MMLU Subject Breakdown]

What This Means

  • Reasoning & instruction following improved — IFEval (+17.6%), ARC (+13.3%), and Winogrande (+8.3%) gains show better logical reasoning and instruction adherence
  • Knowledge tradeoff on MMLU — The -9.6% MMLU drop suggests some factual recall displacement (common in style transfers)
  • TruthfulQA mostly preserved — Only -2.7% loss, indicating the model didn't pick up hallucination tendencies

Qualitative Improvements

  • Reduced token generation — More concise outputs without verbose padding
  • Fixed thinking loops — Base model's tendency to get stuck in reasoning cycles is reduced
  • Deeper reasoning traces — <think> blocks show more structured analytical depth
  • Better conversational flow — Responses feel more natural and contextually aware

🔬 Technical Details

Key Methodological Choices

  1. Response-only training — Loss computed only on assistant outputs, not user inputs
  2. Preserved reasoning traces — <think> blocks kept intact for reasoning-style transfer
  3. Strict data validation — Malformed traces, duplicates, and broken tool calls removed
  4. Consistent formatting — Unified chat template across all sources
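Response-only training (point 1) is usually implemented by masking non-assistant tokens out of the loss with the conventional ignore index -100. A minimal sketch, assuming the token span for each role is already known (in practice, libraries such as TRL provide collators that do this automatically):

```python
IGNORE_INDEX = -100  # conventional "skip this token in the loss" label


def mask_labels(token_ids, role_spans):
    """Copy token_ids into labels, replacing every token that is not
    inside an assistant span with IGNORE_INDEX, so the loss is
    computed only on assistant outputs."""
    labels = [IGNORE_INDEX] * len(token_ids)
    for role, start, end in role_spans:
        if role == "assistant":
            labels[start:end] = token_ids[start:end]
    return labels


# Toy sequence: 4 user tokens followed by 3 assistant tokens.
ids = [11, 12, 13, 14, 21, 22, 23]
spans = [("user", 0, 4), ("assistant", 4, 7)]
print(mask_labels(ids, spans))  # [-100, -100, -100, -100, 21, 22, 23]
```

Because the user tokens never contribute gradient, the model's ability to read instructions is left untouched; only its outputs are pulled toward the Opus style.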

📦 Dataset Composition

| Source | Examples | Type |
|---|---|---|
| TeichAI/Claude-Opus-4.6-Reasoning-887x | 887 | Mixed |
| TeichAI/Claude-Sonnet-4.6-Reasoning-799x | 799 | Pure reasoning |
| TeichAI/claude-4.5-opus-high-reasoning-250x | 250 | High complexity |
| Crownelius/Opus-4.6-Reasoning-2100x-formatted | 2100 | Pure reasoning |
| **Total** | ~4000 | Mixed tool/non-tool |
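A sketch of how the four sources could be merged and cleaned before training. The filtering rules are illustrative assumptions; the card only states that malformed traces, duplicates, and broken tool calls were removed:

```python
import hashlib
import json


def curate(sources):
    """Concatenate example lists, drop malformed records, and
    deduplicate on a content hash. Illustrative only."""
    seen, merged = set(), []
    for examples in sources:
        for ex in examples:
            # Malformed check: require a non-empty assistant turn.
            turns = ex.get("messages", [])
            if not any(t.get("role") == "assistant" and t.get("content")
                       for t in turns):
                continue
            # Exact-duplicate check via a canonical content hash.
            key = hashlib.sha256(
                json.dumps(ex, sort_keys=True).encode()).hexdigest()
            if key in seen:
                continue
            seen.add(key)
            merged.append(ex)
    return merged


a = [{"messages": [{"role": "user", "content": "q"},
                   {"role": "assistant", "content": "a"}]}]
b = a + [{"messages": [{"role": "user", "content": "q2"}]}]  # no assistant turn
print(len(curate([a, b])))  # 1 unique valid example survives
```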

💡 Lessons Learned

What Worked

  1. Single epoch training — Avoided the overfitting that causes catastrophic forgetting in multi-epoch runs
  2. Quality over quantity — ~4000 curated examples outperformed what we'd expect from larger noisy datasets
  3. Mixed tool/non-tool data — Kept the model grounded in both reasoning and tool-use contexts
  4. Response-only loss — Training only on assistant outputs preserved instruction-following

Tradeoffs to Consider

  • The MMLU (-9.6%) and TruthfulQA (-2.7%) regressions suggest some factual knowledge displacement
  • Style transfer always has costs — this approach minimizes but doesn't eliminate them
  • Your mileage may vary depending on use case

🙏 Acknowledgments

This model was trained 2x faster with Unsloth and Hugging Face's TRL library.


📜 License

Apache 2.0 — Use freely, build boldly.
