IQ2_XXS quant of DeepSeek-V3-0324 that I made for my 192GB DDR5 + 3090/4090 rig, following the recipe below:

* IQ2_XXS 169.590 GiB (2.168 BPW)

Not a generally recommended size, but it should be faster and higher quality than the IQ1_S, and it works with full offload on multi-GPU setups. It should also be fine for hybrid CPU+GPU inference if this size fits your rig. For full GPU offload you probably want the IQ2_KT instead.
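For hybrid CPU+GPU inference the usual ik_llama.cpp pattern is to offload everything to the GPU and then override the routed expert tensors back to system RAM. A minimal sketch of such a launch command (the model path, context size, thread count, and the specific values for the ik_llama.cpp options -mla/-fmoe/-amb are illustrative assumptions, not taken from this card):

# Keep attention, dense layers, and shared experts on GPU; routed experts stay in system RAM.
# -mla, -fmoe, and -amb are ik_llama.cpp-specific options; values shown are illustrative.
./build/bin/llama-server \
    --model /path/to/DeepSeek-V3-0324-IQ2_XXS.gguf \
    --ctx-size 32768 \
    -ctk q8_0 \
    -mla 2 -fa \
    -fmoe \
    -amb 512 \
    --n-gpu-layers 99 \
    --override-tensor exps=CPU \
    --threads 16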

Special mix: IQ2_XXS for the ffn_(gate|up)_exps and IQ2_KS for the ffn_down_exps routed experts, mostly iq4_ks/iq3_ks for attention and the shared expert, iq4_k for token_embd, and iq5_k for the output "head".

👈 Secret Recipe
#!/usr/bin/env bash

custom="
# First 3 dense layers (0-2) (GPU)
# Exception: blk.*.attn_k_b.weight has a row size not divisible by 256, so it only supports qN_0
blk\.[0-2]\.attn_k_b.*=q4_0
blk\.[0-2]\.attn_.*=iq4_ks
blk\.[0-2]\.ffn_down.*=iq4_ks
blk\.[0-2]\.ffn_(gate|up).*=iq3_ks
blk\.[0-2]\..*=iq4_ks

# All attention, norm weights, and bias tensors for MoE layers (3-60) (GPU)
# Exception: blk.*.attn_k_b.weight has a row size not divisible by 256, so it only supports qN_0
blk\.[3-9]\.attn_k_b.*=q4_0
blk\.[1-5][0-9]\.attn_k_b.*=q4_0
blk\.60\.attn_k_b.*=q4_0

blk\.[3-9]\.attn_.*=iq4_ks
blk\.[1-5][0-9]\.attn_.*=iq4_ks
blk\.60\.attn_.*=iq4_ks

# Shared Expert (3-60) (GPU)
blk\.[3-9]\.ffn_down_shexp\.weight=iq4_ks
blk\.[1-5][0-9]\.ffn_down_shexp\.weight=iq4_ks
blk\.60\.ffn_down_shexp\.weight=iq4_ks

blk\.[3-9]\.ffn_(gate|up)_shexp\.weight=iq3_ks
blk\.[1-5][0-9]\.ffn_(gate|up)_shexp\.weight=iq3_ks
blk\.60\.ffn_(gate|up)_shexp\.weight=iq3_ks

# Routed Experts (3-60) (CPU)
blk\.[3-9]\.ffn_down_exps\.weight=iq2_ks
blk\.[1-5][0-9]\.ffn_down_exps\.weight=iq2_ks
blk\.60\.ffn_down_exps\.weight=iq2_ks

blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq2_xxs
blk\.[1-5][0-9]\.ffn_(gate|up)_exps\.weight=iq2_xxs
blk\.60\.ffn_(gate|up)_exps\.weight=iq2_xxs

# Token embedding and output tensors (GPU)
token_embd\.weight=iq4_k
output\.weight=iq5_k
"
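
The rule list above is normally flattened into a single comma-separated string and passed to ik_llama.cpp's llama-quantize via --custom-q. A minimal sketch of that final step, with placeholder input/output paths, imatrix file, and thread count (not part of the original script):

# Strip comment lines and join the remaining rules with commas.
custom=$(
  echo "$custom" | grep -v '^#' | sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

# Placeholder paths and thread count; adjust for your own setup.
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /path/to/imatrix.dat \
    /path/to/DeepSeek-V3-0324-BF16.gguf \
    /path/to/DeepSeek-V3-0324-IQ2_XXS.gguf \
    IQ2_XXS \
    24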

Prompt format

<|begin▁of▁sentence|>{system_prompt}<|User|>{prompt}<|Assistant|><|end▁of▁sentence|><|Assistant|>

ik_llama.cpp quantizations of DeepSeek-V3-0324

NOTE: These quants MUST be run using the llama.cpp fork, ik_llama.cpp
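A typical way to build ik_llama.cpp with CUDA support before running these quants (the cmake flags reflect a common setup and are an assumption, not taken from this card):

# Clone and build ik_llama.cpp with CUDA support (assumed common configuration).
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)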

Credits to @ubergarm for his DeepSeek quant recipes, on which these quants are based.
