Kimi-K2.7-Code-Alis-MLX-Dynamic-3.6bpw

A sensitivity-graded 3.6-bit MLX quantization of moonshotai/Kimi-K2.7-Code — a ~1T-parameter (32B active) DeepSeek-V3-style MoE coding model — built to run on Apple-Silicon M3 Ultra hardware.

📘 This is the recommended build for text / code on MLX. It quantizes only the language model and targets the mature mlx-lm stack — the validated two-machine pipeline (≈18 tok/s, below) and mlx_lm.server. pipeline_tag is text-generation; it does not take image input.

Need vision? The identical model + MoonViT vision tower is the -VLM sibling — byte-identical LLM weights, just +0.9 GB of vision tensors, runs on mlx-vlm. Kimi-K2.7-Code is natively image-text-to-text; this text build trades the vision path for the leaner, more battle-tested mlx-lm text toolchain. For pure text/code, prefer this one; grab -VLM only if you need image/video.

465 GB (433 GiB) on disk. It fits a single clean 512 GB M3 Ultra, and runs with huge headroom split across two 512 GB machines (≈233 GB per box) over Thunderbolt.


Why this build exists

Moonshot ships Kimi-K2.7-Code with its routed experts already INT4 (compressed-tensors, group-size 32, QAT) and everything else in bf16, about 595 GB in total. The community MLX conversions either keep the experts at 4-bit, and so need ~600–768 GB of memory (too much for a single 512 GB box), or drop uniformly to ~3.5-bit.

This build takes a different route:

  • Re-quantizes from the INT4 master correctly. mlx-lm mishandles compressed-tensors checkpoints during re-quantization (#907) — asking for 3-bit yields 4.99 bpw / 640 GB. We dequantize the packed INT4 experts in-memory before re-quantizing, so the experts actually land below 4-bit.
  • Spends bits where they matter. Routed experts (≈99% of params) go to 3-bit; the residual-writing down_proj is upgraded to 4-bit on a spread of 16 layers (the most quantization-sensitive expert projection); attention (MLA), shared experts, the dense layer and the embeddings/head all stay at 6-bit; the MoE router stays bf16.
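The dequantize-first step can be sketched in NumPy. This is a minimal illustration, not the conversion script used for this build: it assumes a symmetric layout with eight 4-bit values per int32 word (low nibble first, stored with a +8 offset) and one scale per group of 32 weights. The function names `pack_int4` and `dequantize_int4` are hypothetical, and the real layout should be read from the checkpoint's quantization config.

```python
import numpy as np

SHIFTS = np.arange(8, dtype=np.uint32) * 4  # bit offset of each nibble in a word


def pack_int4(q):
    """Pack signed int4 values (-8..7) into int32 words (assumed layout)."""
    u = (q.astype(np.int64) + 8).astype(np.uint32)   # shift to 0..15
    u = u.reshape(q.shape[0], -1, 8)                 # 8 nibbles per word
    return np.bitwise_or.reduce(u << SHIFTS, axis=-1).view(np.int32)


def dequantize_int4(packed, scales, group_size=32):
    """Unpack int32 words to float32 weights using per-group scales."""
    words = packed.view(np.uint32)
    nibbles = (words[..., None] >> SHIFTS) & np.uint32(0xF)
    q = nibbles.reshape(packed.shape[0], -1).astype(np.int8) - 8
    grouped = q.astype(np.float32).reshape(q.shape[0], -1, group_size)
    return (grouped * scales[..., None]).reshape(q.shape[0], -1)


# Round trip on toy data: 4 rows of 64 weights, i.e. 2 groups per row.
rng = np.random.default_rng(0)
q = rng.integers(-8, 8, size=(4, 64))
scales = rng.uniform(0.01, 0.1, size=(4, 2)).astype(np.float32)
w = dequantize_int4(pack_int4(q), scales)
print(w.shape, bool(np.isfinite(w).all()))
```

The dense float array is what would then be handed to the MLX quantizer at the target bit-width, so the 3-bit request applies to real weights rather than to already-packed integers.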

The result is a genuinely 512-GB-class build that keeps reasoning and coding quality close to the 4-bit reference; the measured cost is reported under Benchmarks.

Recipe (verified from the output config.json)


| Component | Params share | Bits | Group |
|---|---|---|---|
| Routed experts gate_proj / up_proj | bulk | 3-bit | 64 |
| Routed experts down_proj | bulk | 4-bit on 16 / 60 layers, 3-bit on the rest | 64 |
| Attention (MLA q_a/q_b/kv_a/kv_b/o) | small | 6-bit | 64 |
| Shared expert · dense MLP (layer 0) | small | 6-bit | 64 |
| Token embedding · LM head | small | 6-bit | 64 |
| MoE router gate | tiny | bf16 (never quantized) | n/a |

Effective 3.62 bits/weight. The down_proj-on-a-subset upgrade is the cheapest quality lever for a low-bit MoE — the residual-writing projection is by far the most sensitive expert matrix.
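The effective figure can be roughly reproduced from the recipe. The sketch below rests on several assumptions that are not stated in the config: that each quantized group of 64 weights carries a 16-bit scale and a 16-bit bias (so +0.5 bits per weight), that gate_proj, up_proj and down_proj hold equal shares of the expert parameters, that experts are exactly 99% of all weights, and that the remaining 1% is all 6-bit (the bf16 router is ignored as negligible).

```python
def bits_per_weight(bits, group_size=64, meta_bits=32):
    """Storage cost per weight: payload bits plus per-group scale and bias."""
    return bits + meta_bits / group_size


# down_proj: 4-bit on 16 of the 60 MoE layers, 3-bit on the other 44.
down = (16 * bits_per_weight(4) + 44 * bits_per_weight(3)) / 60

# gate_proj and up_proj at 3-bit; the three projections weighted equally.
experts = (2 * bits_per_weight(3) + down) / 3

# Assumed split: 99% routed experts, 1% everything else at 6-bit.
EXPERT_SHARE = 0.99
effective = EXPERT_SHARE * experts + (1 - EXPERT_SHARE) * bits_per_weight(6)

print(f"experts: {experts:.3f} bpw, whole model: {effective:.2f} bpw")
```

Under these assumptions the estimate rounds to the same 3.62 bits/weight, which suggests the 16-layer down_proj upgrade costs only about 0.09 bits/weight over a flat 3-bit expert recipe.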

Memory & how to run it


Kimi-K2.7-Code uses Multi-head Latent Attention (MLA), so the KV cache is tiny: only the compressed latent (kv_lora_rank 512 + rope 64 = 576 values) is stored per token per layer. At fp16 that is 576 × 2 bytes × 61 layers = 70,272 bytes (≈ 68.6 KiB) per token.

| Context | KV cache (fp16) | KV cache (int8) |
|---|---|---|
| 32K | 2.3 GB | 1.2 GB |
| 128K | 9.2 GB | 4.6 GB |
| 256K (native max) | 18.4 GB | 9.2 GB |

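The model's native context window is 256K tokens (262,144; YaRN-extended, rope_theta 50000). Even at the full 256K, weights + KV fit in memory on either deployment: about 465 GB + 18.4 GB ≈ 483 GB on a single 512 GB box with an fp16 cache, and far less per machine when split across two.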
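The KV-cache figures follow directly from the latent size and layer count quoted above. A small sketch for estimating the cache at other context lengths; it counts only the cached values (decimal GB) and ignores any per-group scale overhead an int8 cache implementation may add. The helper name `kv_cache_bytes` is illustrative.

```python
LATENT_VALUES = 512 + 64  # kv_lora_rank + rope dims cached per token per layer
NUM_LAYERS = 61


def kv_cache_bytes(tokens, bytes_per_value=2):
    """Approximate MLA KV-cache size; 2 bytes/value for fp16, 1 for int8."""
    return tokens * NUM_LAYERS * LATENT_VALUES * bytes_per_value


for label, tokens in [("32K", 32 * 1024), ("128K", 128 * 1024), ("256K", 256 * 1024)]:
    fp16_gb = kv_cache_bytes(tokens) / 1e9
    int8_gb = kv_cache_bytes(tokens, bytes_per_value=1) / 1e9
    print(f"{label:>4}: fp16 {fp16_gb:.1f} GB, int8 {int8_gb:.1f} GB")
```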

Single clean 512 GB box — 465 GB weights leave ~47 GB; fine for inference, comfortable with int8 KV.

Two M3 Ultras over Thunderbolt (pipeline-parallel, recommended for headroom) — each machine loads only its layer-half (~233 GB), so neither box approaches its 512 GB limit, leaving room for other workloads.

# 2-machine pipeline-parallel (mlx-lm, ring backend over a Thunderbolt bridge)
mlx.launch --backend ring --hosts <ip0>,<ip1> python your_generate.py
# each rank: model, tok = mlx_lm.utils.pipeline_load("/path/to/this/model")

How it compares (Kimi-K2.7-Code MLX builds)


| Build | bpw | Size | Fits 512 GB single? |
|---|---|---|---|
| pipenetwork 4bit-hiprec | ~5.0 | ~600 GB | ✗ (needs ~768 GB) |
| spicyneuron 3.6bit | ~3.6 | ~460 GB | borderline |
| this build | 3.62 | 465 GB | ✓ (clean box) / ✓✓ split across 2 |

This build is the only one shipped with the experts re-quantized below their INT4 master via the #907 fix, a down_proj-protected sensitivity-graded recipe, and a verified two-machine pipeline path.

Benchmarks

Same harness as the reference runs (mlx_lm.perplexity on allenai/tulu-3-sft-mixture, seq 2048, 50 samples).

| Metric | Value |
|---|---|
| Perplexity (tulu, seq 2048, n=50) | 3.735 ± 0.033 |
| KL(4-bit ref ‖ this) · mean per-token, n=4096 | 0.199 ± 0.009 nats (median 0.006) |
| top-1 flip-rate vs 4-bit ref | 10.2% (416/4096) |
| Decode throughput (2-machine pipeline, TB) | ~18 tok/s |
| Prefill throughput (warm) | ~59 tok/s |
| Peak memory, split across 2 machines | 233 GB / 226 GB |

PPL is reported for build-vs-build comparison only — absolute perplexity is misleading at low bit-width. The KL divergence against a 4-bit reference of this same model (experts 4-bit g64, everything else 6-bit; the experts' native master is INT4, so 4-bit is effectively near-lossless for them) is the truer low-bit quality metric — it exposes distribution drift that PPL's averaging hides (Accuracy is Not All You Need). The picture is two-sided and worth stating honestly:

  • On typical tokens the 3-bit experts cost almost nothing — the median per-token KL is 0.006 nats, and mean KL is only 0.076 across the ~90% of positions whose greedy top-1 is unchanged.
  • But on ~10% of positions the greedy top-1 token flips vs the 4-bit reference, and those flips are mostly decisive, not coin-flips: at a flip the reference assigned its own pick ≈0.59 probability while this build's pick got ≈0.14 (only 15% of flips are near-ties). That tail (mean KL ≈1.29 on flipped positions) lifts the overall mean KL to 0.199 nats.
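For readers who want to compute the same kind of statistics on their own builds, here is a minimal NumPy sketch of per-token KL(ref ‖ test) and the greedy top-1 flip rate. It assumes both models' logits have already been collected at the same positions as arrays of shape (positions, vocab); it is not the harness used for the numbers above, and the function names are illustrative.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def kl_and_flips(ref_logits, test_logits):
    """Per-position KL(ref || test) in nats and a mask of top-1 disagreements."""
    lp = log_softmax(ref_logits.astype(np.float64))
    lq = log_softmax(test_logits.astype(np.float64))
    kl = (np.exp(lp) * (lp - lq)).sum(axis=-1)
    flips = lp.argmax(axis=-1) != lq.argmax(axis=-1)
    return kl, flips


def summarize(kl, flips):
    """Mean/median KL plus the split between flipped and unchanged positions."""
    return {
        "mean_kl": float(kl.mean()),
        "median_kl": float(np.median(kl)),
        "flip_rate": float(flips.mean()),
        "mean_kl_flipped": float(kl[flips].mean()) if flips.any() else 0.0,
        "mean_kl_unchanged": float(kl[~flips].mean()) if (~flips).any() else 0.0,
    }
```

The reported figures are consistent with this decomposition: (1 − 416/4096) × 0.076 + (416/4096) × 1.29 ≈ 0.199 nats.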

So: distributionally very close to a 4-bit build most of the time, with a real ~10% greedy-divergence cost for dropping the bulk of the experts to 3-bit — the expected price of fitting a ~1T model into 465 GB. PPL (3.735) also sits right next to far larger MLX MoE builds on the same harness.

What the 4-bit reference is: it is effectively the original. The routed experts (≈99% of the weights) ship as a native INT4 master; there is no higher-precision source for them. So every 4/6/8-bit build sits at roughly that master (you cannot beat INT4) and differs only in size. This 3.6-bit build is therefore the one meaningful step below the original, and the KL 0.199 / 10.2% flip rate above is its measured cost against the original itself, not against an arbitrary higher-bit sibling. Going to 6- or 8-bit would only grow the file, not the quality.

Correctness

  • Real-weight reconstruction of the INT4 experts verified sane across early/mid/late layers (finite, expected magnitudes).
  • Coherent generation in the distributed setup: correct iterative-Fibonacci and merge_intervals implementations (with passing asserts), correct technical Q&A. Kimi-K2.7-Code is a reasoning model and emits <think>…</think> before its answer.
  • Per-machine peak memory in the 2-machine run: 233 GB / 226 GB.

Usage

This build needs an mlx-lm that loads compressed-tensors-derived Kimi (model_type: kimi_k25, DeepSeek-V3 engine). Use trust_remote_code=True (the tokenizer is tiktoken-based — pip install tiktoken blobfile).

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tok = load("avlp12/Kimi-K2.7-Code-Alis-MLX-Dynamic-3.6bpw",
                  trust_remote_code=True,
                  tokenizer_config={"trust_remote_code": True})
msgs = [{"role": "user", "content": "Write a Python LRU cache with O(1) get/put."}]
prompt = tok.apply_chat_template(msgs, add_generation_prompt=True)
print(generate(model, tok, prompt=prompt, max_tokens=512, sampler=make_sampler(temp=0.0)))
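Because the model emits <think>…</think> before its answer, callers often want the two parts separately. The helper below is a hypothetical convenience, not part of mlx-lm. It assumes the tags appear literally in the decoded text, and it also handles two edge cases: an opening tag that lives in the prompt rather than the output, and generation that hits max_tokens before the closing tag.

```python
def split_reasoning(text):
    """Return (reasoning, answer) from a decoded generation."""
    head, closing, tail = text.partition("</think>")
    if not closing:
        if "<think>" in text:
            # Stopped mid-thought: everything after the tag is reasoning.
            return text.split("<think>", 1)[1].strip(), ""
        return "", text.strip()
    # Drop the opening tag if present; otherwise the whole head is reasoning.
    reasoning = head.split("<think>", 1)[-1].strip()
    return reasoning, tail.strip()


reasoning, answer = split_reasoning("<think>Use an OrderedDict.</think>\nclass LRU: ...")
print(answer)
```

With a small max_tokens budget the answer part may come back empty, so it is worth raising the budget when the reasoning is long.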

Credits

Citation

Kimi-K2.7-Code-MLX-3.6bpw — sensitivity-graded 3.6-bit MLX quantization of Kimi-K2.7-Code, 2026. Base model: Moonshot AI, Kimi-K2.7-Code.
