ik_llama.cpp
imatrix Quantizations of zai-org/GLM-4.5
This quant collection REQUIRES ik_llama.cpp fork to support the ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc!
NOTE ik_llama.cpp
can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc if you want to try it out before downloading my quants.
Some of ik's new quants are supported with Nexesenex/croco.cpp fork of KoboldCPP with Windows builds for CUDA 12.9. Also check for Windows builds by Thireus here. which have been CUDA 12.8.
These quants provide best in class perplexity for the given memory footprint.
Big Thanks
Shout out to Wendell and the Level1Techs crew, the community Forums, YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!
Also thanks to all the folks in the quanting and inferencing community on BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!
Quant Collection
Perplexity computed against wiki.test.raw.
Ahh jeeze the Perplexity not well behaved, pretty funny the IQ5_K has the "baseline" perplexity oof...
I ran some quick KLD comparisons as well which show how much the smaller quants deviate from the original BF16 outputs with the Cor(ln(PPL(Q)), ln(PPL(base)))
metric:
Quant | Cor(ln(PPL(Q)), ln(PPL(base))) |
---|---|
BF16 | Baseline |
Q8_0 | 99.90% |
IQ5_K | 99.85% |
IQ4_K | 99.78% |
IQ4_KSS | 99.59% |
IQ3_KT | 99.33% |
IQ2_KL | 98.87% |
IQ1_KT | 96.52% |
These first two are just test quants for baseline perplexity comparison:
BF16
667.598 GiB (16.003 BPW)- Final estimate: PPL = 3.1788 +/- 0.01790
Q8_0
354.794 GiB (8.505 BPW)- Final estimate: PPL = 3.1746 +/- 0.01784
IQ5_K 250.296 GiB (6.000 BPW)
Final estimate: PPL = 3.1690 +/- 0.01779
👈 Secret Recipe
#/usr/bin/env bash
custom="
# 93 Repeating Layers [0-92]
# Attention
blk\..*\.attn_q.*=q8_0
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=q8_0
# First 3 Dense Layers [0-2]
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0
# Shared Expert Layers [3-92]
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0
# Routed Experts Layers [3-92]
blk\..*\.ffn_down_exps\.weight=iq6_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq5_k
# NextN MTP Layer [92]
blk\..*\.nextn\.embed_tokens\.weight=iq6_k
blk\..*\.nextn\.shared_head_head\.weight=iq6_k
blk\..*\.nextn\.eh_proj\.weight=q8_0
# Non-Repeating Layers
token_embd\.weight=iq6_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
numactl -N 0 -m 0 \
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/GLM-4.5-GGUF/imatrix-GLM-4.5-BF16.dat \
/mnt/raid/models/ubergarm/GLM-4.5-GGUF/GLM-160x21B-4.5-BF16-00001-of-00015.gguf \
/mnt/raid/models/ubergarm/GLM-4.5-GGUF/GLM-4.5-IQ5_K.gguf \
IQ5_K \
192
IQ4_K 205.756 GiB (4.932 BPW)
Final estimate: PPL = 3.2189 +/- 0.01818
👈 Secret Recipe
#/usr/bin/env bash
custom="
# 93 Repeating Layers [0-92]
# Attention
blk\..*\.attn_q.*=iq6_k
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=iq6_k
# First 3 Dense Layers [0-2]
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=iq6_k
# Shared Expert Layers [3-92]
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=iq6_k
# Routed Experts Layers [3-92]
blk\..*\.ffn_down_exps\.weight=iq5_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_k
# NextN MTP Layer [92]
blk\..*\.nextn\.embed_tokens\.weight=iq5_k
blk\..*\.nextn\.shared_head_head\.weight=iq5_k
blk\..*\.nextn\.eh_proj\.weight=q8_0
# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
numactl -N 0 -m 0 \
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/GLM-4.5-GGUF/imatrix-GLM-4.5-BF16.dat \
/mnt/raid/models/ubergarm/GLM-4.5-GGUF/GLM-160x21B-4.5-BF16-00001-of-00015.gguf \
/mnt/raid/models/ubergarm/GLM-4.5-GGUF/GLM-4.5-IQ4_K.gguf \
IQ4_K \
192
IQ4_KSS 173.726 GiB (4.164 BPW)
Final estimate: PPL = 3.3261 +/- 0.01899
👈 Secret Recipe
#/usr/bin/env bash
custom="
# 93 Repeating Layers [0-92]
# Attention
blk\.(0|1|2)\.attn_q.*=q8_0
blk\.(0|1|2)\.attn_k.*=q8_0
blk\.(0|1|2)\.attn_v.*=q8_0
blk\.(0|1|2)\.attn_output.*=q8_0
blk\..*\.attn_q.*=iq5_ks
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq5_ks
# First 3 Dense Layers [0-2]
blk\..*\.ffn_down\.weight=iq5_ks
blk\..*\.ffn_(gate|up)\.weight=iq4_ks
# Shared Expert Layers [3-92]
blk\..*\.ffn_down_shexp\.weight=iq5_ks
blk\..*\.ffn_(gate|up)_shexp\.weight=iq4_ks
# Routed Experts Layers [3-92]
blk\..*\.ffn_down_exps\.weight=iq4_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss
# NextN MTP Layer [92]
blk\..*\.nextn\.embed_tokens\.weight=iq5_ks
blk\..*\.nextn\.shared_head_head\.weight=iq5_ks
blk\..*\.nextn\.eh_proj\.weight=q8_0
# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
numactl -N 1 -m 1 \
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/GLM-4.5-GGUF/imatrix-GLM-4.5-BF16.dat \
/mnt/raid/models/ubergarm/GLM-4.5-GGUF/GLM-160x21B-4.5-BF16-00001-of-00015.gguf \
/mnt/raid/models/ubergarm/GLM-4.5-GGUF/GLM-4.5-IQ4_KSS.gguf \
IQ4_KSS \
192
IQ3_KT 147.565 GiB (3.537 BPW)
Final estimate: PPL = 3.4369 +/- 0.01975
Designed for Dual RTX 6000 Pro Blackwell 192GB VRAM full offload.
👈 Secret Recipe
#!/usr/bin/env bash
custom="
# 93 Repeating Layers [0-92]
# Attention
blk\.(0|1|2)\.attn_q.*=q8_0
blk\.(0|1|2)\.attn_k.*=q8_0
blk\.(0|1|2)\.attn_v.*=q8_0
blk\.(0|1|2)\.attn_output.*=q8_0
blk\..*\.attn_q.*=iq5_ks
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=iq5_ks
# First 3 Dense Layers [0-2]
blk\..*\.ffn_down\.weight=iq5_ks
blk\..*\.ffn_(gate|up)\.weight=iq4_ks
# Shared Expert Layers [3-92]
blk\..*\.ffn_down_shexp\.weight=iq5_ks
blk\..*\.ffn_(gate|up)_shexp\.weight=iq4_ks
# Routed Experts Layers [3-92]
blk\..*\.ffn_down_exps\.weight=iq4_kss
blk\..*\.ffn_(gate|up)_exps\.weight=iq3_kt
# NextN MTP Layer [92]
blk\..*\.nextn\.embed_tokens\.weight=iq5_ks
blk\..*\.nextn\.shared_head_head\.weight=iq5_ks
blk\..*\.nextn\.eh_proj\.weight=q8_0
# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
numactl -N 1 -m 1 \
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/GLM-4.5-GGUF/imatrix-GLM-4.5-BF16.dat \
/mnt/raid/models/ubergarm/GLM-4.5-GGUF/GLM-160x21B-4.5-BF16-00001-of-00015.gguf \
/mnt/raid/models/ubergarm/GLM-4.5-GGUF/GLM-4.5-IQ3_KT.gguf \
IQ3_KT \
192
IQ2_KL 127.746 GiB (3.062 BPW)
Final estimate: PPL = 3.7569 +/- 0.02217
👈 Secret Recipe
#/usr/bin/env bash
custom="
# 93 Repeating Layers [0-92]
# Attention
blk\..*\.attn_q.*=iq5_ks
blk\..*\.attn_k.*=iq5_ks
blk\..*\.attn_v.*=iq5_ks
blk\..*\.attn_output.*=iq5_ks
# First 3 Dense Layers [0-2]
blk\..*\.ffn_down\.weight=iq5_ks
blk\..*\.ffn_(gate|up)\.weight=iq4_ks
# Shared Expert Layers [3-92]
blk\..*\.ffn_down_shexp\.weight=iq5_ks
blk\..*\.ffn_(gate|up)_shexp\.weight=iq4_ks
# Routed Experts Layers [3-92]
blk\..*\.ffn_down_exps\.weight=iq3_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl
# NextN MTP Layer [92]
blk\..*\.nextn\.embed_tokens\.weight=iq5_ks
blk\..*\.nextn\.shared_head_head\.weight=iq5_ks
blk\..*\.nextn\.eh_proj\.weight=q8_0
# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
numactl -N 1 -m 1 \
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/GLM-4.5-GGUF/imatrix-GLM-4.5-BF16.dat \
/mnt/raid/models/ubergarm/GLM-4.5-GGUF/GLM-160x21B-4.5-BF16-00001-of-00015.gguf \
/mnt/raid/models/ubergarm/GLM-4.5-GGUF/GLM-4.5-IQ2_KL.gguf \
IQ2_KL \
192
IQ1_KT 83.827 GiB (2.009 BPW)
Final estimate: PPL = Final estimate: PPL = 5.3270 +/- 0.03442
Good luck everybody! 😅
👈 Secret Recipe
#/usr/bin/env bash
custom="
# 93 Repeating Layers [0-92]
# Attention
blk\..*\.attn_q.*=iq4_kt
blk\..*\.attn_k.*=iq4_kt
blk\..*\.attn_v.*=iq4_kt
blk\..*\.attn_output.*=iq4_kt
# First 3 Dense Layers [0-2]
blk\..*\.ffn_down\.weight=iq4_kt
blk\..*\.ffn_(gate|up)\.weight=iq4_kt
# Shared Expert Layers [3-92]
blk\..*\.ffn_down_shexp\.weight=iq4_kt
blk\..*\.ffn_(gate|up)_shexp\.weight=iq4_kt
# Routed Experts Layers [3-92]
blk\..*\.ffn_down_exps\.weight=iq2_kt
blk\..*\.ffn_(gate|up)_exps\.weight=iq1_kt
# NextN MTP Layer [92]
blk\..*\.nextn\.embed_tokens\.weight=iq4_kt
blk\..*\.nextn\.shared_head_head\.weight=iq4_kt
blk\..*\.nextn\.eh_proj\.weight=q8_0
# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
numactl -N 0 -m 0 \
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/GLM-4.5-GGUF/imatrix-GLM-4.5-BF16.dat \
/mnt/raid/models/ubergarm/GLM-4.5-GGUF/GLM-160x21B-4.5-BF16-00001-of-00015.gguf \
/mnt/raid/models/ubergarm/GLM-4.5-GGUF/GLM-4.5-IQ1_KT.gguf \
IQ1_KT \
192
Quick Start
If you want to disable thinking, add /nothink
(correct, no underscore) at the end of your prompt.
# Clone and checkout
$ git clone https://github.com/ikawrakow/ik_llama.cpp
$ cd ik_llama.cpp
# Build for hybrid CPU+CUDA
$ cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1
$ cmake --build build --config Release -j $(nproc)
# Run API server
$ ./build/bin/llama-server \
--model GLM-4.5-IQ4_KSS-00001-of-00004.gguf \
--alias ubergarm/GLM-4.5-IQ4_KSS \
--ctx-size 32768 \
-fa -fmoe \
-ctk q8_0 -ctv q8_0 \
-ub 4096 -b 4096 \
-ngl 99 \
-ot exps=CPU \
--parallel 1 \
--threads 8 \
--host 127.0.0.1 \
--port 8080 \
--no-mmap
References
- Downloads last month
- 10,495
2-bit
Model tree for ubergarm/GLM-4.5-GGUF
Base model
zai-org/GLM-4.5