
Focus areas:
- Text Generation & Chat Assistants
- Model Compression & Quantization (Q4/Q5/Q6/Q8, gs32; see the conversion sketch below)
- Inference & Serving (on-prem, low-latency)
- RAG / Retrieval
- Agents & Tool Use
- Distillation / LoRA / Fine-tuning
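As a sketch of how a gs32 build like these can be produced with mlx-lm's converter (assumptions: `pip install mlx-lm`, openai/gpt-oss-20b as the upstream source, and keyword names from recent mlx-lm releases, which may differ across versions):

```python
# Sketch: producing a 5-bit, group-size-32 MLX quantization with mlx-lm.
# The upstream repo id and exact keyword names are assumptions based on
# recent mlx-lm releases.
from mlx_lm import convert

convert(
    "openai/gpt-oss-20b",                    # upstream weights (assumption)
    mlx_path="gpt-oss-20b-MLX-5bit-gs32",    # local output directory
    quantize=True,
    q_bits=5,                                # 4/5/6 for the builds listed below
    q_group_size=32,                         # gs32: finer groups, better accuracy
)
```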
High-quality, Apple-Silicon–optimized MLX builds, tools, and evals — focused on practical, on-prem inference for small teams.
We publish Mixture-of-Experts (MoE) models and MLX quantizations tuned for M-series Macs (Metal + unified memory).
Target use: fast, reliable interactive chat and light batch workloads.
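A minimal interactive-chat sketch with mlx-lm (assuming `pip install mlx-lm`; `generate()` options vary slightly across mlx-lm versions):

```python
# Minimal chat-style generation on an M-series Mac with mlx-lm.
from mlx_lm import load, generate

model, tokenizer = load("halley-ai/gpt-oss-20b-MLX-5bit-gs32")

messages = [{"role": "user", "content": "Outline an on-prem RAG pipeline in three steps."}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)
print(generate(model, tokenizer, prompt=prompt, max_tokens=256))
```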
| Repo | Bits / group size | Footprint | Notes |
|---|---|---|---|
| halley-ai/gpt-oss-20b-MLX-4bit-gs32 | Q4 / 32 | ~13.1 GB | Trades accuracy for footprint; use when RAM is constrained or throughput is the priority. |
| halley-ai/gpt-oss-20b-MLX-5bit-gs32 | Q5 / 32 | ~15.8 GB | Small drop vs 6-bit/gs32 and 8-bit/gs64 (~3–6% PPL); the "fits in 16 GB" option when GPU buffer limits matter. |
| halley-ai/gpt-oss-20b-MLX-6bit-gs32 | Q6 / 32 | ~18.4 GB | Best of the group; edges out 8-bit/gs64 slightly at a smaller footprint. |
| Reference (8-bit) | Q8 / 64 | — | See upstream: lmstudio-community/gpt-oss-20b-MLX-8bit. |
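The Footprint figures are consistent with simple effective-bits arithmetic: MLX's affine quantization stores a scale and a bias per group, so gs32 adds roughly one extra bit per weight (two 16-bit values per 32 weights). A back-of-the-envelope check, treating all ~20.9B parameters as quantized (an approximation; embeddings and any unquantized layers shift the totals slightly):

```python
# Approximate on-disk footprint for gs32 affine quantization.
# Assumes ~20.9e9 quantized parameters and one fp16 scale + fp16 bias
# per 32-weight group, i.e. ~1 extra bit per weight.
PARAMS = 20.9e9
GROUP_SIZE = 32
OVERHEAD_BITS = 2 * 16 / GROUP_SIZE  # scale + bias per group -> 1.0 bit/weight

for bits in (4, 5, 6):
    gb = PARAMS * (bits + OVERHEAD_BITS) / 8 / 1e9
    print(f"Q{bits}/gs{GROUP_SIZE}: ~{gb:.1f} GB")
# Prints ~13.1, ~15.7, ~18.3 GB, in line with the table above.
```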
Format: MLX (not GGUF). For Linux/Windows or non-MLX stacks, use a GGUF build with llama.cpp.
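For those platforms, a minimal llama-cpp-python sketch (assuming `pip install llama-cpp-python`; the GGUF path is a placeholder, since these repos ship MLX weights only):

```python
# Sketch for non-MLX stacks: run a GGUF build via llama-cpp-python.
# The model path is a placeholder; source a GGUF of gpt-oss-20b separately.
from llama_cpp import Llama

llm = Llama(model_path="path/to/gpt-oss-20b.gguf", n_ctx=4096)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello from a non-MLX stack."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```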