Libraries

How to use avlp12/Kimi-K2.7-Code-Alis-MLX-Dynamic-3.6bpw with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("avlp12/Kimi-K2.7-Code-Alis-MLX-Dynamic-3.6bpw")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

Pi

How to use avlp12/Kimi-K2.7-Code-Alis-MLX-Dynamic-3.6bpw with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "avlp12/Kimi-K2.7-Code-Alis-MLX-Dynamic-3.6bpw"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "avlp12/Kimi-K2.7-Code-Alis-MLX-Dynamic-3.6bpw"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use avlp12/Kimi-K2.7-Code-Alis-MLX-Dynamic-3.6bpw with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "avlp12/Kimi-K2.7-Code-Alis-MLX-Dynamic-3.6bpw"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default avlp12/Kimi-K2.7-Code-Alis-MLX-Dynamic-3.6bpw

Run Hermes

hermes

MLX LM

How to use avlp12/Kimi-K2.7-Code-Alis-MLX-Dynamic-3.6bpw with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "avlp12/Kimi-K2.7-Code-Alis-MLX-Dynamic-3.6bpw"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "avlp12/Kimi-K2.7-Code-Alis-MLX-Dynamic-3.6bpw"
# In a second terminal, call the OpenAI-compatible server with curl
# (mlx_lm.server listens on port 8080 unless --port is given)
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "avlp12/Kimi-K2.7-Code-Alis-MLX-Dynamic-3.6bpw",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
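
The same request can be sent from Python. The following is a minimal sketch using only the standard library; it assumes the server is reachable on port 8080 (the port used in the Pi and Hermes configurations above) and that the response follows the usual OpenAI chat-completions shape. The helper names `build_chat_request` and `chat` are illustrative, not part of mlx-lm.

```python
import json
import urllib.request

# Assumption: mlx_lm.server is running locally on port 8080.
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "avlp12/Kimi-K2.7-Code-Alis-MLX-Dynamic-3.6bpw"


def build_chat_request(user_text, base_url=BASE_URL, model=MODEL_ID, max_tokens=512):
    """Build (but do not send) a POST request for the chat-completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_text):
    """Send the request and return the assistant's message text."""
    with urllib.request.urlopen(build_chat_request(user_text)) as resp:
        body = json.load(resp)
    # Assumption: OpenAI-style response with a `choices` list.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` should print the reply. Because this is a reasoning model, the reply may include its thinking before the answer.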

Kimi-K2.7-Code-Alis-MLX-Dynamic-3.6bpw

A sensitivity-graded 3.6-bit MLX quantization of moonshotai/Kimi-K2.7-Code — a ~1T-parameter (32B active) DeepSeek-V3-style MoE coding model — built to run on Apple-Silicon M3 Ultra hardware.

📘 This is the recommended build for text / code on MLX. It quantizes only the language model and targets the mature mlx-lm stack — the validated two-machine pipeline (≈18 tok/s, below) and mlx_lm.server. pipeline_tag is text-generation; it does not take image input.

Need vision? The identical model + MoonViT vision tower is the -VLM sibling — byte-identical LLM weights, just +0.9 GB of vision tensors, runs on mlx-vlm. Kimi-K2.7-Code is natively image-text-to-text; this text build trades the vision path for the leaner, more battle-tested mlx-lm text toolchain. For pure text/code, prefer this one; grab -VLM only if you need image/video.

465 GB (433 GiB) on disk. It fits a single clean 512 GB M3 Ultra, and runs with huge headroom split across two 512 GB machines (≈233 GB per box) over Thunderbolt.

Why this build exists

Moonshot ships Kimi-K2.7-Code with its routed experts already INT4 (compressed-tensors, group-size 32, QAT) and everything else in bf16, about 595 GB in total. The community MLX conversions either keep the experts at 4-bit, and so need ~600–768 GB of memory (too much for a single 512 GB box), or drop uniformly to ~3.5-bit.

This build takes a different route:

Re-quantizes from the INT4 master correctly. mlx-lm mishandles compressed-tensors checkpoints during re-quantization (#907) — asking for 3-bit yields 4.99 bpw / 640 GB. We dequantize the packed INT4 experts in-memory before re-quantizing, so the experts actually land below 4-bit.
Spends bits where they matter. Routed experts (≈99% of params) go to 3-bit; the residual-writing down_proj is upgraded to 4-bit on a spread of 16 layers (the most quantization-sensitive expert projection); attention (MLA), shared experts, the dense layer and the embeddings/head all stay at 6-bit; the MoE router stays bf16.

The result is a genuinely 512-GB-class build that keeps reasoning and coding quality intact.
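
To make the "dequantize first, then re-quantize" step concrete, here is a toy sketch in plain Python. It is not the code used to produce this build and it does not reproduce the real packing layout of compressed-tensors or MLX. It assumes a symmetric INT4 master (integer codes times one scale per group) and a min/max affine target, and it uses a single 8-value group instead of the real group sizes. All function names are hypothetical.

```python
def dequantize_int4(codes, scale):
    """Symmetric INT4: integer codes in [-8, 7] times one scale per group."""
    return [c * scale for c in codes]


def quantize_affine(weights, bits):
    """Min/max affine quantization of one group down to `bits` bits."""
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1                      # 7 steps for 3-bit
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo                       # lo acts as the group bias


def dequantize_affine(codes, scale, bias):
    """Inverse of quantize_affine: w ~= code * scale + bias."""
    return [c * scale + bias for c in codes]


# Step 1: recover real-valued weights from the INT4 master.
int4_codes = [-8, -5, -3, -1, 0, 2, 4, 7]
master = dequantize_int4(int4_codes, scale=0.01)

# Step 2: re-quantize those real values to 3 bits.
codes3, scale3, bias3 = quantize_affine(master, bits=3)
restored = dequantize_affine(codes3, scale3, bias3)
max_err = max(abs(a - b) for a, b in zip(master, restored))
```

The point of the two steps is that the 3-bit quantizer sees real weight values rather than packed integers. The rounding error per weight is then bounded by half of the new 3-bit step size.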

Recipe (verified from the output `config.json`)

Component	Params share	Bits	Group
Routed experts `gate_proj` / `up_proj`	bulk	3-bit	64
Routed experts `down_proj`	bulk	4-bit on 16 / 60 layers, 3-bit on the rest	64
Attention (MLA `q_a/q_b/kv_a/kv_b/o`)	small	6-bit	64
Shared expert · dense MLP (layer 0)	small	6-bit	64
Token embedding · LM head	small	6-bit	64
MoE router `gate`	tiny	bf16 (never quantized)	—

Effective 3.62 bits/weight. The down_proj-on-a-subset upgrade is the cheapest quality lever for a low-bit MoE — the residual-writing projection is by far the most sensitive expert matrix.
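
As a sanity check, the effective figure can be approximated from the recipe table. The sketch below rests on several assumptions that are not stated in this card: each group of 64 weights carries one fp16 scale and one fp16 bias (32 extra bits per group), `gate_proj`, `up_proj` and `down_proj` are equally sized, the routed experts are exactly 99% of the parameters, and the bf16 router is small enough to ignore.

```python
def bits_per_weight(bits, group_size=64, overhead_bits=32):
    """Stored bits per weight, including per-group scale and bias (assumed fp16 each)."""
    return bits + overhead_bits / group_size


EXPERT_SHARE = 0.99        # routed experts' share of parameters (approximate, from the card)
DOWN_4BIT_LAYERS = 16      # layers whose down_proj is kept at 4-bit
MOE_LAYERS = 60            # MoE layers (layer 0 is dense)


def effective_bpw():
    b3, b4, b6 = bits_per_weight(3), bits_per_weight(4), bits_per_weight(6)
    # down_proj: 4-bit on 16 of 60 layers, 3-bit on the remaining 44.
    down = (DOWN_4BIT_LAYERS * b4 + (MOE_LAYERS - DOWN_4BIT_LAYERS) * b3) / MOE_LAYERS
    # Assumption: gate_proj, up_proj and down_proj each hold a third of the expert weights.
    experts = (2 * b3 + down) / 3
    # Everything outside the routed experts is treated as 6-bit.
    return EXPERT_SHARE * experts + (1 - EXPERT_SHARE) * b6


print(f"{effective_bpw():.2f} bits/weight")
```

Under these assumptions the estimate comes out at roughly 3.62 bits/weight, in line with the reported figure. The 4-bit `down_proj` upgrade accounts for only about 0.09 bits/weight of that.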

Memory & how to run it

Kimi-K2.7-Code uses Multi-head Latent Attention (MLA), so the KV cache is tiny: only the compressed latent (kv_lora_rank 512 + rope 64 = 576 values) is stored per token per layer. At fp16 that is 576 × 2 bytes × 61 layers = 70,272 bytes, or ≈ 68.6 KiB/token across all 61 layers.

Context	KV cache (fp16)	KV cache (int8)
32K	2.3 GB	1.2 GB
128K	9.2 GB	4.6 GB
256K (native max)	18.4 GB	9.2 GB
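
The table values follow directly from the per-token latent size. A short sketch of the arithmetic, assuming decimal gigabytes (10^9 bytes) and one cached 576-value latent per token per layer; the function name is illustrative:

```python
def kv_cache_bytes(tokens, bytes_per_value=2, latent_dim=512, rope_dim=64, layers=61):
    """Bytes of MLA KV cache: one compressed latent per token per layer."""
    return tokens * layers * (latent_dim + rope_dim) * bytes_per_value


for label, tokens in [("32K", 32 * 1024), ("128K", 128 * 1024), ("256K", 256 * 1024)]:
    fp16_gb = kv_cache_bytes(tokens, bytes_per_value=2) / 1e9
    int8_gb = kv_cache_bytes(tokens, bytes_per_value=1) / 1e9
    print(f"{label}: fp16 {fp16_gb:.1f} GB, int8 {int8_gb:.1f} GB")
```

The int8 column ignores any per-group scale overhead a quantized cache would add, so real int8 usage will be slightly higher.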

The model's native context window is 256K tokens (262,144; YaRN-extended, rope_theta 50000). Even at the full 256K, weights + fp16 KV cache fit on either deployment: about 465 GB + 18.4 GB ≈ 483 GB on a single 512 GB box, and far less per machine when split across two.

Single clean 512 GB box — 465 GB weights leave ~47 GB; fine for inference, comfortable with int8 KV.

Two M3 Ultras over Thunderbolt (pipeline-parallel, recommended for headroom) — each machine loads only its layer-half (~233 GB), so neither box approaches its 512 GB limit, leaving room for other workloads.

# 2-machine pipeline-parallel (mlx-lm, ring backend over a Thunderbolt bridge)
mlx.launch --backend ring --hosts <ip0>,<ip1> python your_generate.py
# each rank: model, tok = mlx_lm.utils.pipeline_load("/path/to/this/model")

How it compares (Kimi-K2.7-Code MLX builds)

Build	bpw	Size	Fits 512 GB single?
pipenetwork 4bit-hiprec	~5.0	~600 GB	✗ (needs ~768 GB)
spicyneuron 3.6bit	~3.6	~460 GB	borderline
this build	3.62	465 GB	✓ (clean box) / ✓✓ split across 2

This build is the only one shipped with the experts re-quantized below their INT4 master via the #907 fix, a down_proj-protected sensitivity-graded recipe, and a verified two-machine pipeline path.

Benchmarks

Same harness as the reference runs (mlx_lm.perplexity on allenai/tulu-3-sft-mixture, seq 2048, 50 samples).

Metric	Value
Perplexity (tulu, seq 2048, n=50)	3.735 ± 0.033
KL(4-bit ref ‖ this) · mean per-token, n=4096	0.199 ± 0.009 nats (median 0.006)
top-1 flip-rate vs 4-bit ref	10.2% (416/4096)
Decode throughput (2-machine pipeline, TB)	~18 tok/s
Prefill throughput (warm)	~59 tok/s
Peak memory, split across 2 machines	233 GB / 226 GB

PPL is reported for build-vs-build comparison only — absolute perplexity is misleading at low bit-width. The KL divergence against a 4-bit reference of this same model (experts 4-bit g64, everything else 6-bit; the experts' native master is INT4, so 4-bit is effectively near-lossless for them) is the truer low-bit quality metric — it exposes distribution drift that PPL's averaging hides (Accuracy is Not All You Need). The picture is two-sided and worth stating honestly:

On typical tokens the 3-bit experts cost almost nothing — the median per-token KL is 0.006 nats, and mean KL is only 0.076 across the ~90% of positions whose greedy top-1 is unchanged.
But on ~10% of positions the greedy top-1 token flips vs the 4-bit reference, and those flips are mostly decisive, not coin-flips: at a flip the reference assigned its own pick ≈0.59 probability while this build's pick got ≈0.14 (only 15% of flips are near-ties). That tail (mean KL ≈1.29 on flipped positions) lifts the overall mean KL to 0.199 nats.

So: distributionally very close to a 4-bit build most of the time, with a real ~10% greedy-divergence cost for dropping the bulk of the experts to 3-bit — the expected price of fitting a ~1T model into 465 GB. PPL (3.735) also sits right next to far larger MLX MoE builds on the same harness.
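
For readers who want to reproduce this kind of comparison, the sketch below shows how per-token KL and top-1 flip-rate can be computed from two sets of logits, and checks that the reported split (0.076 on unchanged positions, 1.29 on flipped ones) recombines to the overall mean. It is an illustration in plain Python, not the harness used for the numbers above, and the helper names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def token_kl(ref_logits, test_logits):
    """KL(ref || test) in nats for a single position."""
    lp, lq = log_softmax(ref_logits), log_softmax(test_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)


def compare(ref_positions, test_positions):
    """Return (mean per-token KL, top-1 flip-rate) over aligned positions."""
    pairs = list(zip(ref_positions, test_positions))
    kls = [token_kl(r, t) for r, t in pairs]
    flips = [argmax(r) != argmax(t) for r, t in pairs]
    return sum(kls) / len(kls), sum(flips) / len(flips)


def mixture_mean(flip_rate, kl_unchanged, kl_flipped):
    """Overall mean KL as a weighted mix of unchanged and flipped positions."""
    return (1 - flip_rate) * kl_unchanged + flip_rate * kl_flipped


# Recombine the figures reported above: 416 flips out of 4096 positions.
overall = mixture_mean(416 / 4096, kl_unchanged=0.076, kl_flipped=1.29)
print(f"overall mean KL ~ {overall:.3f} nats")
```

The recombined value agrees with the reported 0.199 nats. It also shows that roughly two thirds of the mean KL comes from the ~10% of flipped positions.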

What the 4-bit reference is: it's the original. The routed experts (≈99% of the weights) ship as a native INT4 master; there is no higher-precision source for them. So every 4/6/8-bit build sits at ≈ that master (you cannot beat INT4) and differs only in size; a 4-bit build is effectively the original model. This 3.6-bit build is therefore the one meaningful step below the original, and the KL 0.199 / 10.2% flip-rate above is its measured cost against the original itself, not against an arbitrary higher-bit sibling. Going to 6- or 8-bit would only grow the file, not the quality.

Correctness

Real-weight reconstruction of the INT4 experts verified sane across early/mid/late layers (finite, expected magnitudes).
Coherent generation in the distributed setup: correct iterative-Fibonacci and merge_intervals implementations (with passing asserts), correct technical Q&A. Kimi-K2.7-Code is a reasoning model and emits <think>…</think> before its answer.
Per-machine peak memory in the 2-machine run: 233 GB / 226 GB.

Usage

This build needs an mlx-lm that loads compressed-tensors-derived Kimi (model_type: kimi_k25, DeepSeek-V3 engine). Use trust_remote_code=True (the tokenizer is tiktoken-based — pip install tiktoken blobfile).

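from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

# Load weights and tokenizer; the tiktoken-based tokenizer needs trust_remote_code.
model, tok = load("avlp12/Kimi-K2.7-Code-Alis-MLX-Dynamic-3.6bpw",
                  trust_remote_code=True,
                  tokenizer_config={"trust_remote_code": True})

# Wrap the request in the model's chat template.
msgs = [{"role": "user", "content": "Write a Python LRU cache with O(1) get/put."}]
prompt = tok.apply_chat_template(msgs, add_generation_prompt=True)

# temp=0.0 gives greedy decoding. The output starts with a <think>…</think> block,
# so raise max_tokens if the answer is cut off.
print(generate(model, tok, prompt=prompt, max_tokens=512, sampler=make_sampler(temp=0.0)))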

Credits

Base model: moonshotai/Kimi-K2.7-Code (Modified MIT).
Built with mlx-lm on Apple MLX.

Citation

Kimi-K2.7-Code-MLX-3.6bpw — sensitivity-graded 3.6-bit MLX quantization of Kimi-K2.7-Code, 2026. Base model: Moonshot AI, Kimi-K2.7-Code.


Paper cited in Benchmarks

Accuracy is Not All You Need. arXiv:2407.09141, published Jul 12, 2024.