ubergarm/GLM-4.5-GGUF

`ik_llama.cpp` imatrix Quantizations of zai-org/GLM-4.5

This quant collection REQUIRES ik_llama.cpp fork to support the ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc!

NOTE ik_llama.cpp can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc if you want to try it out before downloading my quants.

Some of ik's new quants are supported with Nexesenex/croco.cpp fork of KoboldCPP with Windows builds for CUDA 12.9. Also check for Windows builds by Thireus here. which have been CUDA 12.8.

These quants provide best in class perplexity for the given memory footprint.

Big Thanks

Shout out to Wendell and the Level1Techs crew, the community Forums, YouTube Channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on BeaverAI Club Discord and on r/LocalLLaMA for tips and tricks helping each other run, test, and benchmark all the fun new models!

Quant Collection

Perplexity computed against wiki.test.raw.

Ahh jeeze the Perplexity not well behaved, pretty funny the IQ5_K has the "baseline" perplexity oof...

I ran some quick KLD comparisons as well which show how much the smaller quants deviate from the original BF16 outputs with the Cor(ln(PPL(Q)), ln(PPL(base))) metric:

Quant	`Cor(ln(PPL(Q)), ln(PPL(base)))`
BF16	Baseline
Q8_0	99.90%
IQ5_K	99.85%
IQ4_K	99.78%
IQ4_KSS	99.59%
IQ3_KT	99.33%
IQ2_KL	98.87%
IQ1_KT	96.52%

These first two are just test quants for baseline perplexity comparison:

BF16 667.598 GiB (16.003 BPW)
- Final estimate: PPL = 3.1788 +/- 0.01790
Q8_0 354.794 GiB (8.505 BPW)
- Final estimate: PPL = 3.1746 +/- 0.01784

IQ5_K 250.296 GiB (6.000 BPW)

Final estimate: PPL = 3.1690 +/- 0.01779

👈 Secret Recipe

#/usr/bin/env bash

custom="
# 93 Repeating Layers [0-92]

# Attention
blk\..*\.attn_q.*=q8_0
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=q8_0

# First 3 Dense Layers [0-2]
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

# Shared Expert Layers [3-92]
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

# Routed Experts Layers [3-92]
blk\..*\.ffn_down_exps\.weight=iq6_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq5_k

# NextN MTP Layer [92]
blk\..*\.nextn\.embed_tokens\.weight=iq6_k
blk\..*\.nextn\.shared_head_head\.weight=iq6_k
blk\..*\.nextn\.eh_proj\.weight=q8_0

# Non-Repeating Layers
token_embd\.weight=iq6_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 0 -m 0 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/GLM-4.5-GGUF/imatrix-GLM-4.5-BF16.dat \
    /mnt/raid/models/ubergarm/GLM-4.5-GGUF/GLM-160x21B-4.5-BF16-00001-of-00015.gguf \
    /mnt/raid/models/ubergarm/GLM-4.5-GGUF/GLM-4.5-IQ5_K.gguf \
    IQ5_K \
    192

IQ4_K 205.756 GiB (4.932 BPW)

Final estimate: PPL = 3.2189 +/- 0.01818

👈 Secret Recipe

#/usr/bin/env bash
custom="
# 93 Repeating Layers [0-92]

# Attention
blk\..*\.attn_q.*=iq6_k
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=iq6_k

# First 3 Dense Layers [0-2]
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=iq6_k

# Shared Expert Layers [3-92]
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=iq6_k

# Routed Experts Layers [3-92]
blk\..*\.ffn_down_exps\.weight=iq5_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_k

# NextN MTP Layer [92]
blk\..*\.nextn\.embed_tokens\.weight=iq5_k
blk\..*\.nextn\.shared_head_head\.weight=iq5_k
blk\..*\.nextn\.eh_proj\.weight=q8_0

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 0 -m 0 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/GLM-4.5-GGUF/imatrix-GLM-4.5-BF16.dat \
    /mnt/raid/models/ubergarm/GLM-4.5-GGUF/GLM-160x21B-4.5-BF16-00001-of-00015.gguf \
    /mnt/raid/models/ubergarm/GLM-4.5-GGUF/GLM-4.5-IQ4_K.gguf \
    IQ4_K \
    192

IQ4_KSS 173.726 GiB (4.164 BPW)

Final estimate: PPL = 3.3261 +/- 0.01899

👈 Secret Recipe

#/usr/bin/env bash

custom="
# 93 Repeating Layers [0-92]

# Attention
blk\.(0|1|2)\.attn_q.*=q8_0
blk\.(0|1|2)\.attn_k.*=q8_0
blk\.(0|1|2)\.attn_v.*=q8_0
blk\.(0|1|2)\.attn_output.*=q8_0

blk\..*\.attn_q.*=iq5_ks
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq5_ks

# First 3 Dense Layers [0-2]
blk\..*\.ffn_down\.weight=iq5_ks
blk\..*\.ffn_(gate|up)\.weight=iq4_ks

# Shared Expert Layers [3-92]
blk\..*\.ffn_down_shexp\.weight=iq5_ks
blk\..*\.ffn_(gate|up)_shexp\.weight=iq4_ks

# Routed Experts Layers [3-92]
blk\..*\.ffn_down_exps\.weight=iq4_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss

# NextN MTP Layer [92]
blk\..*\.nextn\.embed_tokens\.weight=iq5_ks
blk\..*\.nextn\.shared_head_head\.weight=iq5_ks
blk\..*\.nextn\.eh_proj\.weight=q8_0

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 1 -m 1 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/GLM-4.5-GGUF/imatrix-GLM-4.5-BF16.dat \
    /mnt/raid/models/ubergarm/GLM-4.5-GGUF/GLM-160x21B-4.5-BF16-00001-of-00015.gguf \
    /mnt/raid/models/ubergarm/GLM-4.5-GGUF/GLM-4.5-IQ4_KSS.gguf \
    IQ4_KSS \
    192

IQ3_KT 147.565 GiB (3.537 BPW)

Final estimate: PPL = 3.4369 +/- 0.01975

Designed for Dual RTX 6000 Pro Blackwell 192GB VRAM full offload.

👈 Secret Recipe

#!/usr/bin/env bash

custom="
# 93 Repeating Layers [0-92]

# Attention
blk\.(0|1|2)\.attn_q.*=q8_0
blk\.(0|1|2)\.attn_k.*=q8_0
blk\.(0|1|2)\.attn_v.*=q8_0
blk\.(0|1|2)\.attn_output.*=q8_0

blk\..*\.attn_q.*=iq5_ks
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=iq5_ks

# First 3 Dense Layers [0-2]
blk\..*\.ffn_down\.weight=iq5_ks
blk\..*\.ffn_(gate|up)\.weight=iq4_ks

# Shared Expert Layers [3-92]
blk\..*\.ffn_down_shexp\.weight=iq5_ks
blk\..*\.ffn_(gate|up)_shexp\.weight=iq4_ks

# Routed Experts Layers [3-92]
blk\..*\.ffn_down_exps\.weight=iq4_kss
blk\..*\.ffn_(gate|up)_exps\.weight=iq3_kt

# NextN MTP Layer [92]
blk\..*\.nextn\.embed_tokens\.weight=iq5_ks
blk\..*\.nextn\.shared_head_head\.weight=iq5_ks
blk\..*\.nextn\.eh_proj\.weight=q8_0

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 1 -m 1 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/GLM-4.5-GGUF/imatrix-GLM-4.5-BF16.dat \
    /mnt/raid/models/ubergarm/GLM-4.5-GGUF/GLM-160x21B-4.5-BF16-00001-of-00015.gguf \
    /mnt/raid/models/ubergarm/GLM-4.5-GGUF/GLM-4.5-IQ3_KT.gguf \
    IQ3_KT \
    192

IQ2_KL 127.746 GiB (3.062 BPW)

Final estimate: PPL = 3.7569 +/- 0.02217

👈 Secret Recipe

#/usr/bin/env bash

custom="
# 93 Repeating Layers [0-92]

# Attention
blk\..*\.attn_q.*=iq5_ks
blk\..*\.attn_k.*=iq5_ks
blk\..*\.attn_v.*=iq5_ks
blk\..*\.attn_output.*=iq5_ks

# First 3 Dense Layers [0-2]
blk\..*\.ffn_down\.weight=iq5_ks
blk\..*\.ffn_(gate|up)\.weight=iq4_ks

# Shared Expert Layers [3-92]
blk\..*\.ffn_down_shexp\.weight=iq5_ks
blk\..*\.ffn_(gate|up)_shexp\.weight=iq4_ks

# Routed Experts Layers [3-92]
blk\..*\.ffn_down_exps\.weight=iq3_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl

# NextN MTP Layer [92]
blk\..*\.nextn\.embed_tokens\.weight=iq5_ks
blk\..*\.nextn\.shared_head_head\.weight=iq5_ks
blk\..*\.nextn\.eh_proj\.weight=q8_0

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 1 -m 1 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/GLM-4.5-GGUF/imatrix-GLM-4.5-BF16.dat \
    /mnt/raid/models/ubergarm/GLM-4.5-GGUF/GLM-160x21B-4.5-BF16-00001-of-00015.gguf \
    /mnt/raid/models/ubergarm/GLM-4.5-GGUF/GLM-4.5-IQ2_KL.gguf \
    IQ2_KL \
    192

IQ1_KT 83.827 GiB (2.009 BPW)

Final estimate: PPL = Final estimate: PPL = 5.3270 +/- 0.03442

Good luck everybody! 😅

👈 Secret Recipe

#/usr/bin/env bash

custom="
# 93 Repeating Layers [0-92]

# Attention
blk\..*\.attn_q.*=iq4_kt
blk\..*\.attn_k.*=iq4_kt
blk\..*\.attn_v.*=iq4_kt
blk\..*\.attn_output.*=iq4_kt

# First 3 Dense Layers [0-2]
blk\..*\.ffn_down\.weight=iq4_kt
blk\..*\.ffn_(gate|up)\.weight=iq4_kt

# Shared Expert Layers [3-92]
blk\..*\.ffn_down_shexp\.weight=iq4_kt
blk\..*\.ffn_(gate|up)_shexp\.weight=iq4_kt

# Routed Experts Layers [3-92]
blk\..*\.ffn_down_exps\.weight=iq2_kt
blk\..*\.ffn_(gate|up)_exps\.weight=iq1_kt

# NextN MTP Layer [92]
blk\..*\.nextn\.embed_tokens\.weight=iq4_kt
blk\..*\.nextn\.shared_head_head\.weight=iq4_kt
blk\..*\.nextn\.eh_proj\.weight=q8_0

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 0 -m 0 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/GLM-4.5-GGUF/imatrix-GLM-4.5-BF16.dat \
    /mnt/raid/models/ubergarm/GLM-4.5-GGUF/GLM-160x21B-4.5-BF16-00001-of-00015.gguf \
    /mnt/raid/models/ubergarm/GLM-4.5-GGUF/GLM-4.5-IQ1_KT.gguf \
    IQ1_KT \
    192

Quick Start

If you want to disable thinking, add /nothink (correct, no underscore) at the end of your prompt.

# Clone and checkout
$ git clone https://github.com/ikawrakow/ik_llama.cpp
$ cd ik_llama.cpp

# Build for hybrid CPU+CUDA
$ cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1
$ cmake --build build --config Release -j $(nproc)

# Run API server
$ ./build/bin/llama-server \
    --model GLM-4.5-IQ4_KSS-00001-of-00004.gguf \
    --alias ubergarm/GLM-4.5-IQ4_KSS \
    --ctx-size 32768 \
    -fa -fmoe \
    -ctk q8_0 -ctv q8_0 \
    -ub 4096 -b 4096 \
    -ngl 99 \
    -ot exps=CPU \
    --parallel 1 \
    --threads 8 \
    --host 127.0.0.1 \
    --port 8080 \
    --no-mmap

ubergarm
/

GLM-4.5-GGUF

`ik_llama.cpp` imatrix Quantizations of zai-org/GLM-4.5

Big Thanks

Quant Collection

IQ5_K 250.296 GiB (6.000 BPW)

IQ4_K 205.756 GiB (4.932 BPW)

IQ4_KSS 173.726 GiB (4.164 BPW)

IQ3_KT 147.565 GiB (3.537 BPW)

IQ2_KL 127.746 GiB (3.062 BPW)

IQ1_KT 83.827 GiB (2.009 BPW)

Quick Start

References

Model tree for ubergarm/GLM-4.5-GGUF

ik_llama.cpp imatrix Quantizations of zai-org/GLM-4.5

Big Thanks

Quant Collection

IQ5_K 250.296 GiB (6.000 BPW)

IQ4_K 205.756 GiB (4.932 BPW)

IQ4_KSS 173.726 GiB (4.164 BPW)

IQ3_KT 147.565 GiB (3.537 BPW)

IQ2_KL 127.746 GiB (3.062 BPW)

IQ1_KT 83.827 GiB (2.009 BPW)

Quick Start

References

Model tree for ubergarm/GLM-4.5-GGUF

`ik_llama.cpp` imatrix Quantizations of zai-org/GLM-4.5