Cerium-Qwen3-R1-Dev-bf16-mlx

Quick‑look comparison

Model     ARC‑c   ARC‑e   BoolQ   HellaS  OBQA    PiQA    WGrande  Avg (7)
bf16      0.306   0.377   0.379   0.436   0.356   0.657   0.539    0.4357
q6‑hi     0.310   0.381   0.401   0.437   0.348   0.653   0.538    0.4383
q6        0.313   0.386   0.382   0.439   0.356   0.653   0.523    0.4363
q8‑hi     0.305   0.373   0.379   0.435   0.354   0.661   0.533    0.4343
q8        0.308   0.375   0.380   0.434   0.350   0.658   0.534    0.4341
qx6‑hi    0.305   0.374   0.382   0.437   0.348   0.656   0.535    0.4341
qx6       0.308   0.375   0.382   0.437   0.346   0.653   0.538    0.4341
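
The Avg (7) column is the arithmetic mean of the seven task scores. As a quick sanity check (not part of any benchmark harness), the snippet below recomputes it from the rounded numbers in the table; the last digit can differ by a few ten-thousandths from the published averages, which were presumably computed before rounding.

# Per-task scores in table order: ARC-c, ARC-e, BoolQ, HellaSwag, OBQA, PiQA, Winogrande
scores = {
    "bf16":   [0.306, 0.377, 0.379, 0.436, 0.356, 0.657, 0.539],
    "q6-hi":  [0.310, 0.381, 0.401, 0.437, 0.348, 0.653, 0.538],
    "q6":     [0.313, 0.386, 0.382, 0.439, 0.356, 0.653, 0.523],
    "q8-hi":  [0.305, 0.373, 0.379, 0.435, 0.354, 0.661, 0.533],
    "q8":     [0.308, 0.375, 0.380, 0.434, 0.350, 0.658, 0.534],
    "qx6-hi": [0.305, 0.374, 0.382, 0.437, 0.348, 0.656, 0.535],
    "qx6":    [0.308, 0.375, 0.382, 0.437, 0.346, 0.653, 0.538],
}

# Print models sorted by mean score, highest first.
for name, vals in sorted(scores.items(), key=lambda kv: -sum(kv[1]) / len(kv[1])):
    print(f"{name:7s} {sum(vals) / len(vals):.4f}")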

Key take‑aways:

The “high‑precision” q6‑hi variant clearly beats plain q6 on BoolQ (0.401 vs 0.382); for the q8 and qx6 pairs, the ‑hi and plain versions stay within 0.001 of each other on BoolQ and HellaSwag.

The raw BF16 baseline remains competitive across the board and edges out q6‑hi on PiQA by a hair (0.657 vs 0.653), though q8‑hi (0.661) is the PiQA leader.

Overall averages cluster tightly around 0.435; the best performer is q6‑hi (0.4383), about 0.002 points higher than the next best (q6, 0.4363).

What each variant brings

q6‑hi: BoolQ 0.401, HellaSwag 0.437, ARC‑easy 0.381

OpenBookQA 0.348 (tied with qx6‑hi for lowest)

Highest overall avg; best for tasks needing factual recall & commonsense reasoning.

bf16: ARC‑challenge 0.306, PiQA 0.657 (second only to q8‑hi), Winogrande 0.539 (top)

BoolQ 0.379 (tied with q8‑hi for lowest)

Pure BF16 is the most “well‑rounded” baseline; no quant‑induced pitfalls.

q6: ARC‑easy 0.386, HellaSwag 0.439 (top)

BoolQ 0.382, Winogrande 0.523 (lowest of all variants)

Slightly better on ARC‑easy and HellaSwag than bf16 but dips in Winogrande.

q8‑hi: PiQA 0.661 (top)

ARC‑challenge 0.305 (tied for lowest), Winogrande 0.533 (second lowest)

Good for physical‑reasoning tasks; overall avg (0.4343) sits mid‑pack, just above the plain q8/qx6 variants.

q8: Same as q8‑hi but with PiQA 0.658, Winogrande 0.534, ARC‑easy 0.375, ARC‑challenge 0.308

Very close to q8‑hi; negligible difference in avg.

qx6 / qx6‑hi: Comparable to the q8 series, with slightly better Winogrande: 0.538 (qx6) and 0.535 (qx6‑hi) vs 0.534/0.533 for q8/q8‑hi

Slightly lower on OpenBookQA; ARC‑challenge is essentially unchanged

Not a clear win over the q8 variants; avg (0.4341) matches q8 and sits slightly below bf16.

Recommendation

q6‑hi – about 0.002 higher avg than any other variant, with the strongest BoolQ score and near‑top HellaSwag. Balanced performance across ARC & PiQA

bf16 – slightly lower avg but strong PiQA (0.657, second only to q8‑hi) and decent ARC results; no quant‑induced drop on OpenBookQA or Winogrande

q8‑hi – PiQA 0.661, the highest of any variant; the pick when physical‑commonsense reasoning matters most

q6 or q8 – only ~0.002–0.004 lower avg than q6‑hi, but with a smaller memory footprint and potentially faster inference; the pick if you need a very small quantized model

Bottom line:

If your application values overall mean performance (e.g., a multi‑benchmark leaderboard), go with q6‑hi.

If you prefer a pure BF16 baseline or need the best PiQA score, bf16 or q8‑hi are solid alternatives.

This model Cerium-Qwen3-R1-Dev-bf16-mlx was converted to MLX format from prithivMLmods/Cerium-Qwen3-R1-Dev using mlx-lm version 0.26.3.
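
For reference, the conversion can be reproduced with mlx-lm's convert utility. The call below is a minimal sketch assuming the convert helper exported by mlx-lm 0.26.x; treat the exact argument names as assumptions and check mlx_lm.convert --help for your installed version.

from mlx_lm import convert

# Convert the original Hugging Face checkpoint to MLX format without quantization.
# hf_path / mlx_path / quantize are assumed argument names; verify against your mlx-lm version.
convert(
    hf_path="prithivMLmods/Cerium-Qwen3-R1-Dev",
    mlx_path="Cerium-Qwen3-R1-Dev-bf16-mlx",
    quantize=False,
)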

Use with mlx

pip install mlx-lm

from mlx_lm import load, generate

# Load the model and tokenizer from the Hugging Face Hub (or a local path).
model, tokenizer = load("nightmedia/Cerium-Qwen3-R1-Dev-bf16-mlx")

prompt = "hello"

# Apply the model's chat template when one is defined.
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
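
For longer outputs you may prefer token-by-token streaming. The sketch below assumes the stream_generate helper shipped with recent mlx-lm releases, where each yielded chunk exposes the newly generated text as a .text attribute; check the API of your installed version.

from mlx_lm import load, stream_generate

model, tokenizer = load("nightmedia/Cerium-Qwen3-R1-Dev-bf16-mlx")

messages = [{"role": "user", "content": "Explain what MLX is in one paragraph."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# Print tokens as they are produced instead of waiting for the full response.
for chunk in stream_generate(model, tokenizer, prompt=prompt, max_tokens=256):
    print(chunk.text, end="", flush=True)
print()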

Model tree: Qwen/Qwen3-0.6B → prithivMLmods/Cerium-Qwen3-R1-Dev → nightmedia/Cerium-Qwen3-R1-Dev-bf16-mlx (this model)