---
quantized_by: gghfez
pipeline_tag: text-generation
base_model: deepseek-ai/DeepSeek-V3.1
license: mit
base_model_relation: quantized
tags:
  - mla
  - imatrix
  - deepseek_v3.1
  - conversational
  - ik_llama.cpp
---

# ik_llama.cpp imatrix Quantizations of deepseek-ai/DeepSeek-V3.1

This quant REQUIRES the ik_llama.cpp fork, which supports ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

NOTE: ik_llama.cpp can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc., if you want to try it out before downloading my quants.

I made this quant for my own RAM+VRAM setup. For more ik_llama.cpp quants of this model, plus discussions and perplexity measurements, see @ubergarm's DeepSeek-V3.1 Collection.
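As a starting point, here is a sketch of a llama-server launch for a hybrid RAM+VRAM setup. The `-mla`, `-fa`, `-fmoe`, `-amb`, and `-ot` flags are ik_llama.cpp options; the context size, host/port, and exact offload pattern are assumptions to tune for your own hardware:

```bash
# Sketch only: offload all layers to GPU (-ngl 99), then use -ot to keep the
# large routed-expert tensors (ffn_*_exps) in system RAM.
./build/bin/llama-server \
    --model /fast2/quants/DeepSeek-V3.1-IQ2_KS.gguf \
    --ctx-size 32768 \
    -ngl 99 \
    -mla 3 -fa \
    -fmoe \
    -amb 512 \
    -ot exps=CPU \
    --host 127.0.0.1 --port 8080
```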

<details>

<summary>👈 Quant details</summary>

```bash
#!/usr/bin/env bash

custom="
# First 3 dense layers (0-2) (GPU)
# Using q8_0 for attn_k_b since imatrix might not have these tensors
blk\.[0-2]\.attn_k_b.*=q8_0
blk\.[0-2]\.attn_.*=iq5_ks
blk\.[0-2]\.ffn_down.*=iq5_ks
blk\.[0-2]\.ffn_(gate|up).*=iq4_ks
blk\.[0-2]\..*=iq5_ks

# All attention, norm weights, and bias tensors for MoE layers (3-60) (GPU)
# Using q8_0 for attn_k_b since imatrix might not have these tensors
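# Three patterns together cover MoE layers 3-60:
# [3-9] matches 3-9, [1-5][0-9] matches 10-59, and 60 matches the final layer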
blk\.[3-9]\.attn_k_b.*=q8_0
blk\.[1-5][0-9]\.attn_k_b.*=q8_0
blk\.60\.attn_k_b.*=q8_0

blk\.[3-9]\.attn_.*=iq5_ks
blk\.[1-5][0-9]\.attn_.*=iq5_ks
blk\.60\.attn_.*=iq5_ks

# Shared Expert (3-60) (GPU)
blk\.[3-9]\.ffn_down_shexp\.weight=iq5_ks
blk\.[1-5][0-9]\.ffn_down_shexp\.weight=iq5_ks
blk\.60\.ffn_down_shexp\.weight=iq5_ks

blk\.[3-9]\.ffn_(gate|up)_shexp\.weight=iq4_ks
blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=iq4_ks
blk\.60\.ffn_(gate|up)_shexp\.weight=iq4_ks

# Routed Experts (3-60) (CPU)
blk\.[3-9]\.ffn_down_exps\.weight=iq3_ks
blk\.[1-5][0-9]\.ffn_down_exps\.weight=iq3_ks
blk\.60\.ffn_down_exps\.weight=iq3_ks

blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq2_ks
blk\.[1-5][0-9]\.ffn_(gate|up)_exps\.weight=iq2_ks
blk\.60\.ffn_(gate|up)_exps\.weight=iq2_ks

# Token embedding and output tensors (GPU)
token_embd\.weight=iq5_k
# Output tensor at q8_0 (a trailing inline comment here would leak into the custom-q string)
output\.weight=q8_0
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
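
# At this point $custom is a single comma-separated list of regex=type pairs,
# e.g. "blk\.[0-2]\.attn_k_b.*=q8_0,blk\.[0-2]\.attn_.*=iq5_ks,..."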

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /fast/DeepSeek-V3.1.imatrix \
    /fast/bf16/DeepSeek-V3-00001-of-00030.gguf \
    /fast2/quants/DeepSeek-V3.1-IQ2_KS.gguf \
    IQ2_KS
```

</details>