Kimi-K2.7-Code-Alis-MLX-Dynamic-3.6bpw
A sensitivity-graded 3.6-bit MLX quantization of moonshotai/Kimi-K2.7-Code — a ~1T-parameter (32B active) DeepSeek-V3-style MoE coding model — built to run on Apple-Silicon M3 Ultra hardware.
📘 This is the recommended build for text / code on MLX. It quantizes only the language model and targets the mature `mlx-lm` stack: the validated two-machine pipeline (≈18 tok/s, below) and `mlx_lm.server`. `pipeline_tag` is `text-generation`; it does not take image input.

Need vision? The identical model + MoonViT vision tower is the `-VLM` sibling: byte-identical LLM weights, just +0.9 GB of vision tensors, and it runs on `mlx-vlm`. Kimi-K2.7-Code is natively image-text-to-text; this text build trades the vision path for the leaner, more battle-tested mlx-lm text toolchain. For pure text/code, prefer this one; grab `-VLM` only if you need image/video.
465 GB (433 GiB) on disk. It fits a single clean 512 GB M3 Ultra, and runs with huge headroom split across two 512 GB machines (≈233 GB per box) over Thunderbolt.
Why this build exists
Moonshot ships Kimi-K2.7-Code with its routed experts already INT4 (compressed-tensors, group-size 32, QAT) and everything else in bf16 — about 595 GB. The community MLX conversions either keep the experts at 4-bit and so need ~600–768 GB of memory (don't fit a single 512 GB box), or drop uniformly to ~3.5-bit.
This build takes a different route:
- Re-quantizes from the INT4 master correctly. `mlx-lm` mishandles `compressed-tensors` checkpoints during re-quantization (#907): asking for 3-bit yields 4.99 bpw / 640 GB. We dequantize the packed INT4 experts in-memory before re-quantizing, so the experts actually land below 4-bit.
- Spends bits where they matter. Routed experts (≈99% of params) go to 3-bit; the residual-writing `down_proj` is upgraded to 4-bit on a spread of 16 layers (the most quantization-sensitive expert projection); attention (MLA), shared experts, the dense layer and the embeddings/head all stay at 6-bit; the MoE router stays bf16.
The result is a genuinely 512-GB-class build that keeps reasoning and coding quality intact.
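The dequantize-then-requantize idea can be sketched in NumPy. This is an illustration only, not the code used for this build: the symmetric INT4 layout (values in [-8, 7] with one scale per group of 32), the min/max affine 3-bit scheme, and every function name below are assumptions made for the example.

```python
import numpy as np

def dequantize_int4_symmetric(q, scales, group_size=32):
    """Rebuild float weights from INT4 codes in [-8, 7] and per-group scales (assumed layout)."""
    rows, cols = q.shape
    grouped = q.reshape(rows, cols // group_size, group_size).astype(np.float32)
    return (grouped * scales[..., None]).reshape(rows, cols)

def quantize_affine(w, bits=3, group_size=64):
    """Min/max affine quantization per group: w is approximated by codes * scale + bias."""
    rows, cols = w.shape
    g = w.reshape(rows, cols // group_size, group_size)
    w_min = g.min(axis=-1, keepdims=True)
    w_max = g.max(axis=-1, keepdims=True)
    levels = 2 ** bits - 1
    scale = (w_max - w_min) / levels
    scale = np.where(scale == 0, 1.0, scale)  # constant groups: avoid dividing by zero
    codes = np.clip(np.round((g - w_min) / scale), 0, levels).astype(np.uint8)
    return codes, scale, w_min

def dequantize_affine(codes, scale, bias):
    rows, n_groups, group_size = codes.shape
    return (codes.astype(np.float32) * scale + bias).reshape(rows, n_groups * group_size)

# Toy "expert" matrix stored as INT4 with group size 32 (random stand-in data).
rng = np.random.default_rng(0)
q4 = rng.integers(-8, 8, size=(4, 128)).astype(np.int8)
s4 = rng.uniform(0.01, 0.05, size=(4, 128 // 32)).astype(np.float32)

w = dequantize_int4_symmetric(q4, s4)        # step 1: back to real-valued weights
codes, scale, bias = quantize_affine(w, 3)   # step 2: re-quantize to 3-bit, group 64
w3 = dequantize_affine(codes, scale, bias)
max_err = float(np.abs(w3 - w).max())        # bounded by half a quantization step
```

The point of step 1 is that the 3-bit quantizer sees real-valued weights rather than packed integer codes; skipping it is what leaves the experts at their original width.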
Recipe (verified from the output config.json)
| Component | Params share | Bits | Group |
|---|---|---|---|
| Routed experts gate_proj / up_proj | bulk | 3-bit | 64 |
| Routed experts down_proj | bulk | 4-bit on 16 / 60 layers, 3-bit on the rest | 64 |
| Attention (MLA q_a/q_b/kv_a/kv_b/o) | small | 6-bit | 64 |
| Shared expert · dense MLP (layer 0) | small | 6-bit | 64 |
| Token embedding · LM head | small | 6-bit | 64 |
| MoE router gate | tiny | bf16 (never quantized) | n/a |
Effective 3.62 bits/weight. The down_proj-on-a-subset upgrade is the cheapest quality lever for a low-bit MoE — the residual-writing projection is by far the most sensitive expert matrix.
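As a rough sanity check, the effective figure can be approximated from the recipe table. The sketch below assumes each group of 64 weights carries one 16-bit scale and one 16-bit bias (0.5 extra bits per weight), that gate_proj, up_proj and down_proj hold equal parameter counts, and that routed experts are exactly 99% of the weights with everything else at 6-bit; the bf16 router is ignored as negligible. These are simplifications for illustration, not the method used to produce the reported number.

```python
GROUP_SIZE = 64
# Assumption: one fp16 scale + one fp16 bias stored per quantization group.
OVERHEAD_BITS = 2 * 16 / GROUP_SIZE

def bpw(bits):
    """Stored bits per weight for a given nominal bit-width, including group overhead."""
    return bits + OVERHEAD_BITS

# down_proj: 4-bit on 16 of 60 layers, 3-bit on the remaining 44 (from the table).
down_proj = (16 * bpw(4) + 44 * bpw(3)) / 60

# Assumption: the three expert projections have the same number of parameters.
routed_experts = (bpw(3) + bpw(3) + down_proj) / 3

EXPERT_SHARE = 0.99  # "≈99% of params" per the text; treated as exact here
effective = EXPERT_SHARE * routed_experts + (1 - EXPERT_SHARE) * bpw(6)

print(f"routed experts: {routed_experts:.3f} bpw, whole model: {effective:.2f} bpw")
```

Under these assumptions the estimate lands at about 3.62 bpw, in line with the stated figure.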
Memory & how to run it
Kimi-K2.7-Code uses Multi-head Latent Attention (MLA), so the KV cache is tiny — only the compressed latent (kv_lora_rank 512 + rope 64 = 576 values) per token per layer, ≈ 68.6 KB/token across all 61 layers.
| Context | KV cache (fp16) | KV cache (int8) |
|---|---|---|
| 32K | 2.3 GB | 1.2 GB |
| 128K | 9.2 GB | 4.6 GB |
| 256K (native max) | 18.4 GB | 9.2 GB |
The model's native context window is 256K tokens (262,144; YaRN-extended, rope_theta 50000). Even at the full 256K with an fp16 cache, weights + KV come to about 483 GB (465 + 18.4), which stays under 512 GB on a single box and far under it when split across two.
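The KV-cache figures in the table follow directly from the numbers above (576 latent values per token per layer, 61 layers). A minimal sketch of that arithmetic, with sizes reported in decimal GB and a hypothetical helper name:

```python
KV_LORA_RANK = 512   # compressed latent width (from the text)
ROPE_DIM = 64        # decoupled RoPE key width (from the text)
NUM_LAYERS = 61

def kv_cache_bytes(tokens, bytes_per_value=2):
    """MLA cache size: one 576-value latent per token per layer (2 bytes = fp16, 1 = int8)."""
    return tokens * (KV_LORA_RANK + ROPE_DIM) * NUM_LAYERS * bytes_per_value

per_token_kib = kv_cache_bytes(1) / 1024  # about 68.6 KiB per token at fp16

for label, tokens in [("32K", 32 * 1024), ("128K", 128 * 1024), ("256K", 256 * 1024)]:
    fp16_gb = kv_cache_bytes(tokens, 2) / 1e9
    int8_gb = kv_cache_bytes(tokens, 1) / 1e9
    print(f"{label:>5}: fp16 {fp16_gb:.1f} GB, int8 {int8_gb:.1f} GB")
```

The int8 column ignores any per-group scale storage an int8 cache might need, so real usage can be slightly higher.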
Single clean 512 GB box — 465 GB weights leave ~47 GB; fine for inference, comfortable with int8 KV.
Two M3 Ultras over Thunderbolt (pipeline-parallel, recommended for headroom) — each machine loads only its layer-half (~233 GB), so neither box approaches its 512 GB limit, leaving room for other workloads.
# 2-machine pipeline-parallel (mlx-lm, ring backend over a Thunderbolt bridge)
mlx.launch --backend ring --hosts <ip0>,<ip1> python your_generate.py
# each rank: model, tok = mlx_lm.utils.pipeline_load("/path/to/this/model")
How it compares (Kimi-K2.7-Code MLX builds)
| Build | bpw | Size | Fits 512 GB single? |
|---|---|---|---|
| pipenetwork 4bit-hiprec | ~5.0 | ~600 GB | ✗ (needs ~768 GB) |
| spicyneuron 3.6bit | ~3.6 | ~460 GB | borderline |
| this build | 3.62 | 465 GB | ✓ (clean box) / ✓✓ split across 2 |
This build is the only one shipped with the experts re-quantized below their INT4 master via the #907 fix, a down_proj-protected sensitivity-graded recipe, and a verified two-machine pipeline path.
Benchmarks
Same harness as the reference runs (mlx_lm.perplexity on allenai/tulu-3-sft-mixture, seq 2048, 50 samples).
| Metric | Value |
|---|---|
| Perplexity (tulu, seq 2048, n=50) | 3.735 ± 0.033 |
| KL(4-bit ref ‖ this) · mean per-token, n=4096 | 0.199 ± 0.009 nats (median 0.006) |
| top-1 flip-rate vs 4-bit ref | 10.2% (416/4096) |
| Decode throughput (2-machine pipeline, TB) | ~18 tok/s |
| Prefill throughput (warm) | ~59 tok/s |
| Peak memory, split across 2 machines | 233 GB / 226 GB |
PPL is reported for build-vs-build comparison only — absolute perplexity is misleading at low bit-width. The KL divergence against a 4-bit reference of this same model (experts 4-bit g64, everything else 6-bit; the experts' native master is INT4, so 4-bit is effectively near-lossless for them) is the truer low-bit quality metric — it exposes distribution drift that PPL's averaging hides (Accuracy is Not All You Need). The picture is two-sided and worth stating honestly:
- On typical tokens the 3-bit experts cost almost nothing — the median per-token KL is 0.006 nats, and mean KL is only 0.076 across the ~90% of positions whose greedy top-1 is unchanged.
- But on ~10% of positions the greedy top-1 token flips vs the 4-bit reference, and those flips are mostly decisive, not coin-flips: at a flip the reference assigned its own pick ≈0.59 probability while this build's pick got ≈0.14 (only 15% of flips are near-ties). That tail (mean KL ≈1.29 on flipped positions) lifts the overall mean KL to 0.199 nats.
So: distributionally very close to a 4-bit build most of the time, with a real ~10% greedy-divergence cost for dropping the bulk of the experts to 3-bit — the expected price of fitting a ~1T model into 465 GB. PPL (3.735) also sits right next to far larger MLX MoE builds on the same harness.
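For readers who want to reproduce this kind of comparison, here is a minimal NumPy sketch of per-token KL(ref ‖ test) and the top-1 flip rate, given two logit arrays of shape (positions, vocab). It is not the harness used for the numbers above; the function names and the random toy logits are placeholders. The last lines check that the reported conditional means recombine to the reported overall mean.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def kl_and_flips(ref_logits, test_logits):
    """Per-position KL(ref || test) in nats, plus a mask of positions whose argmax differs."""
    ref_logp = log_softmax(ref_logits.astype(np.float64))
    test_logp = log_softmax(test_logits.astype(np.float64))
    kl = (np.exp(ref_logp) * (ref_logp - test_logp)).sum(axis=-1)
    flipped = ref_logits.argmax(axis=-1) != test_logits.argmax(axis=-1)
    return kl, flipped

# Toy stand-in for real model outputs: reference logits plus noise.
rng = np.random.default_rng(0)
ref = rng.normal(size=(256, 50))
test = ref + 0.3 * rng.normal(size=ref.shape)

kl, flipped = kl_and_flips(ref, test)
summary = {
    "mean_kl": float(kl.mean()),
    "median_kl": float(np.median(kl)),
    "flip_rate": float(flipped.mean()),
}

# Consistency check on the reported figures: 0.076 on unchanged positions,
# 1.29 on flipped positions, flip rate 416/4096.
flip_rate = 416 / 4096
mixed_mean = (1 - flip_rate) * 0.076 + flip_rate * 1.29  # about 0.199 nats
```

Reporting the mean and the median together, as the table does, is what separates a heavy tail from a uniform shift.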
What the 4-bit reference is — it's the original. The routed experts (≈99% of the weights) ship as a native INT4 master; there is no higher-precision source for them. So every 4/6/8-bit build sits at ≈ that master (you cannot beat INT4) and differs only in size — a 4-bit build is effectively the original model. This 3.6-bit build is therefore the one meaningful step below the original, and the KL 0.199 / 10.2 % flip above is its measured cost against the original itself, not against an arbitrary higher-bit sibling. Going to 6- or 8-bit would only grow the file, not the quality.
Correctness
- Real-weight reconstruction of the INT4 experts verified sane across early/mid/late layers (finite, expected magnitudes).
- Coherent generation in the distributed setup: correct iterative-Fibonacci and `merge_intervals` implementations (with passing asserts), correct technical Q&A. Kimi-K2.7-Code is a reasoning model and emits `<think>…</think>` before its answer.
- Per-machine peak memory in the 2-machine run: 233 GB / 226 GB.
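Because the model emits a reasoning block before its answer, downstream code often wants the two parts separately. A small sketch, assuming the decoded text contains literal `<think>` and `</think>` tags; `split_reasoning` is a hypothetical helper, not part of mlx-lm:

```python
import re

_THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text):
    """Return (reasoning, answer); reasoning is empty if no <think> block is found."""
    match = _THINK_BLOCK.search(text)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()

reasoning, answer = split_reasoning(
    "<think>Use an OrderedDict and move_to_end.</think>\nHere is the cache class."
)
```

If generation is cut off by `max_tokens` before the closing tag, no match is found and the whole text is returned as the answer, so callers may want to handle truncated output separately.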
Usage
This build needs an mlx-lm version that can load compressed-tensors-derived Kimi checkpoints (model_type: kimi_k25, DeepSeek-V3 engine). Pass trust_remote_code=True; the tokenizer is tiktoken-based, so install its dependencies first (pip install tiktoken blobfile).
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler
model, tok = load(
    "avlp12/Kimi-K2.7-Code-Alis-MLX-Dynamic-3.6bpw",
    trust_remote_code=True,
    tokenizer_config={"trust_remote_code": True},
)
msgs = [{"role": "user", "content": "Write a Python LRU cache with O(1) get/put."}]
prompt = tok.apply_chat_template(msgs, add_generation_prompt=True)
print(generate(model, tok, prompt=prompt, max_tokens=512, sampler=make_sampler(temp=0.0)))
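When the model is served through `mlx_lm.server` instead, it can be queried over the OpenAI-compatible chat endpoint. The sketch below uses only the standard library and assumes the server is listening on `localhost:8080`; adjust `BASE_URL` if it was started with a different host or port. The helper names are illustrative.

```python
import json
import urllib.request

MODEL_ID = "avlp12/Kimi-K2.7-Code-Alis-MLX-Dynamic-3.6bpw"
BASE_URL = "http://localhost:8080/v1"  # assumption: server address and port

def build_chat_request(prompt, max_tokens=512, temperature=0.0):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt, **kwargs):
    """Send the request and return the assistant message text (requires a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Write a Python LRU cache with O(1) get/put."))` mirrors the in-process example above.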
Credits
- Base model: moonshotai/Kimi-K2.7-Code (Modified MIT).
- Built with mlx-lm on Apple MLX.
Citation
Kimi-K2.7-Code-MLX-3.6bpw — sensitivity-graded 3.6-bit MLX quantization of Kimi-K2.7-Code, 2026. Base model: Moonshot AI, Kimi-K2.7-Code.


