# Cerium-Qwen3-R1-Dev-bf16-mlx

## Quick-look comparison
| Model  | ARC-c | ARC-e | BoolQ | HellaSwag | OpenBookQA | PiQA  | Winogrande | Avg (7) |
|--------|-------|-------|-------|-----------|------------|-------|------------|---------|
| bf16   | 0.306 | 0.377 | 0.379 | 0.436     | 0.356      | 0.657 | 0.539      | 0.4357  |
| q6-hi  | 0.310 | 0.381 | 0.401 | 0.437     | 0.348      | 0.653 | 0.538      | 0.4383  |
| q6     | 0.313 | 0.386 | 0.382 | 0.439     | 0.356      | 0.653 | 0.523      | 0.4363  |
| q8-hi  | 0.305 | 0.373 | 0.379 | 0.435     | 0.354      | 0.661 | 0.533      | 0.4343  |
| q8     | 0.308 | 0.375 | 0.380 | 0.434     | 0.350      | 0.658 | 0.534      | 0.4341  |
| qx6-hi | 0.305 | 0.374 | 0.382 | 0.437     | 0.348      | 0.656 | 0.535      | 0.4341  |
| qx6    | 0.308 | 0.375 | 0.382 | 0.437     | 0.346      | 0.653 | 0.538      | 0.4341  |
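The Avg (7) column is the unweighted mean of the seven task scores (means of the rounded values above agree with it to within about 0.0003, so the averages were presumably computed before rounding). A quick way to reproduce it, and to see which variant leads each task, with the numbers copied straight from the table (variable names are just for illustration):

```python
# Reproduce the Avg (7) column (mean of the seven task scores) and list the
# per-task leader. Scores are copied from the table above.
scores = {
    "bf16":   [0.306, 0.377, 0.379, 0.436, 0.356, 0.657, 0.539],
    "q6-hi":  [0.310, 0.381, 0.401, 0.437, 0.348, 0.653, 0.538],
    "q6":     [0.313, 0.386, 0.382, 0.439, 0.356, 0.653, 0.523],
    "q8-hi":  [0.305, 0.373, 0.379, 0.435, 0.354, 0.661, 0.533],
    "q8":     [0.308, 0.375, 0.380, 0.434, 0.350, 0.658, 0.534],
    "qx6-hi": [0.305, 0.374, 0.382, 0.437, 0.348, 0.656, 0.535],
    "qx6":    [0.308, 0.375, 0.382, 0.437, 0.346, 0.653, 0.538],
}
tasks = ["ARC-c", "ARC-e", "BoolQ", "HellaSwag", "OpenBookQA", "PiQA", "Winogrande"]

# Unweighted mean of the rounded scores; agrees with the Avg (7) column
# to within ~0.0003 (the card's averages were likely computed before rounding).
for name, vals in scores.items():
    print(f"{name:7s} avg={sum(vals) / len(vals):.4f}")  # q6-hi -> 0.4383

# Best variant per task (ties resolved by dictionary order).
for i, task in enumerate(tasks):
    best = max(scores, key=lambda m: scores[m][i])
    print(f"{task:10s} best: {best} ({scores[best][i]:.3f})")
```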
Key take-aways:

- q6-hi posts the single best BoolQ score (0.401); otherwise the "high-precision" variants (q8-hi, qx6-hi) track their plain counterparts very closely on BoolQ and HellaSwag.
- The raw BF16 baseline remains competitive on the harder ARC tasks and on PiQA, where it edges out q6-hi by a hair (0.657 vs 0.653).
- Overall averages cluster tightly around 0.435; the best performer is q6-hi (0.4383), about 0.002 points above the runner-up q6 (0.4363).
## What each variant brings
- q6-hi: BoolQ 0.401 (top), HellaSwag 0.437, ARC-easy 0.381; OpenBookQA 0.348 (lowest). Highest overall average; best for tasks needing factual recall and commonsense reasoning.
- bf16: ARC-challenge 0.306, PiQA 0.657 (second only to q8-hi); Winogrande 0.539 (highest of the group). Pure BF16 is the most well-rounded baseline, with no quantization-induced pitfalls.
- q6: ARC-easy 0.386 and HellaSwag 0.439 (both top); BoolQ 0.382, Winogrande 0.523 (lowest). Slightly better than bf16 on ARC-easy and HellaSwag but dips on Winogrande.
- q8-hi: PiQA 0.661 (top); ARC-challenge 0.305 (lowest, tied with qx6-hi), Winogrande 0.533. Good for physical-reasoning tasks, though its overall average sits near the bottom of the group.
- q8: PiQA 0.658, Winogrande 0.534, ARC-easy 0.375, ARC-challenge 0.308. Very close to q8-hi; negligible difference in average.
- qx6 / qx6-hi: comparable to the q8 series, with slightly better Winogrande (0.538 for qx6, 0.535 for qx6-hi) but slightly lower ARC-challenge and OpenBookQA. Not a clear win over the q8 variants, and their averages sit slightly below bf16.
## Recommendation
- q6-hi: the highest average (0.4383), with the strongest BoolQ score and balanced performance across ARC and PiQA.
- bf16: slightly lower average, but strong PiQA (0.657) and decent ARC results, with no quantization-induced drop on OpenBookQA (0.356, tied for best).
- q8-hi: the highest PiQA score of the group (0.661); the pick if physical reasoning matters most.
- q6 or q8, if you need a smaller quantized model: only about 0.002-0.004 lower average than q6-hi, with a potentially smaller memory footprint or faster inference.
Bottom line:

- If your application values overall mean performance (e.g., a multi-benchmark leaderboard), go with q6-hi.
- If you prefer a pure BF16 baseline or need the best PiQA score, bf16 or q8-hi are solid alternatives.
This model Cerium-Qwen3-R1-Dev-bf16-mlx was converted to MLX format from prithivMLmods/Cerium-Qwen3-R1-Dev using mlx-lm version 0.26.3.
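For reference, this kind of conversion is done with the `mlx_lm.convert` command (with mlx-lm installed, see below). The invocation here is a sketch against recent mlx-lm releases; the quantization settings behind the q6/q8/qx6 and "-hi" variants are not documented in this card, so the 6-bit line is only illustrative:

```bash
# bf16 conversion, as used for this repository
mlx_lm.convert --hf-path prithivMLmods/Cerium-Qwen3-R1-Dev \
    --mlx-path Cerium-Qwen3-R1-Dev-bf16-mlx --dtype bfloat16

# Quantized variants are produced the same way, e.g. a 6-bit build
mlx_lm.convert --hf-path prithivMLmods/Cerium-Qwen3-R1-Dev \
    --mlx-path Cerium-Qwen3-R1-Dev-q6-mlx -q --q-bits 6
```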
## Use with mlx

```bash
pip install mlx-lm
```
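For a quick smoke test from the shell, mlx-lm also installs a small command-line generator. The flag names below follow recent mlx-lm releases; treat the invocation as a sketch:

```bash
mlx_lm.generate --model nightmedia/Cerium-Qwen3-R1-Dev-bf16-mlx \
    --prompt "hello" --max-tokens 256
```

The Python API below does the same thing, applying the model's chat template when one is defined: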
```python
from mlx_lm import load, generate

# Load the weights and tokenizer from the Hub (a local path also works).
model, tokenizer = load("nightmedia/Cerium-Qwen3-R1-Dev-bf16-mlx")

prompt = "hello"

# Wrap the prompt in the model's chat template when one is defined.
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```
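Generation length and sampling behavior can be tuned through extra keyword arguments. Continuing from the snippet above, the sketch below assumes a recent mlx-lm release where sampling is configured via `make_sampler`; the temperature and top-p values are arbitrary examples:

```python
from mlx_lm.sample_utils import make_sampler

# Nucleus sampling at a mild temperature; max_tokens caps the response length.
sampler = make_sampler(temp=0.7, top_p=0.95)
response = generate(
    model, tokenizer, prompt=prompt,
    max_tokens=512, sampler=sampler, verbose=True,
)
```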
## Model tree for nightmedia/Cerium-Qwen3-R1-Dev-bf16-mlx

Base model: Qwen/Qwen3-0.6B-Base