---
license: apache-2.0
base_model: Qwen/Qwen3-Embedding-8B
base_model_relation: quantized
tags:
  - gguf
  - quantized
  - llama.cpp
  - embeddings
model_type: qwen3
quantized_by: Jonathan Middleton
---

# Qwen3-Embedding-8B-GGUF

## Purpose

Multilingual text-embedding model in GGUF format for efficient CPU/GPU inference with llama.cpp and derivatives.
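
A minimal usage sketch (assuming llama.cpp is built with its `llama-embedding` tool and one of the files listed below has been downloaded; file names and paths are illustrative):

```bash
# Compute an embedding for a single sentence with the Q8_0 build;
# llama-embedding prints the resulting vector to stdout.
./llama-embedding \
    -m Qwen3-Embedding-8B-Q8_0.gguf \
    -p "What is the capital of France?"
```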

## Files

| Filename | Precision | Size* | Est. MTEB Δ vs FP16 | Notes |
|---|---|---|---|---|
| Qwen3-Embedding-8B-F16.gguf | FP16 | 15.1 GB | 0 | Direct conversion; reference quality |
| Qwen3-Embedding-8B-Q8_0.gguf | Q8_0 | 8.6 GB | ≈ +0.02 | Full-precision parity for most tasks |
| Qwen3-Embedding-8B-Q6_K.gguf | Q6_K | 6.9 GB | ≈ +0.20 | Balanced size / quality |
| Qwen3-Embedding-8B-Q5_K_M.gguf | Q5_K_M | 6.16 GB | ≈ +0.35 | Good recall under tight memory |
| Qwen3-Embedding-8B-Q4_K_M.gguf | Q4_K_M | 5.41 GB | ≈ +0.60 | Lowest-size CPU-friendly build |
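
To fetch a single file, `huggingface-cli download` works well (the repository id below is an assumption; substitute the actual repo path):

```bash
# Download only the Q8_0 build into the current directory
# (repo id assumed for illustration).
huggingface-cli download JonathanMiddleton/Qwen3-Embedding-8B-GGUF \
    Qwen3-Embedding-8B-Q8_0.gguf --local-dir .
```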

## Upstream source

[Qwen/Qwen3-Embedding-8B](https://huggingface.co/Qwen/Qwen3-Embedding-8B)

## Conversion

- **Code base:** llama.cpp commit `a20f0a1` + PR #14029 (Qwen embedding support); see the checkout sketch below.
- **Command:**

  ```bash
  python convert_hf_to_gguf.py Qwen/Qwen3-Embedding-8B \
        --outfile Qwen3-Embedding-8B-F16.gguf \
        --leave-output-tensor \
        --outtype f16
  ```
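
A rough way to reproduce that code base (a sketch; the repository URL and the PR-fetch workflow are assumptions based on standard GitHub usage):

```bash
# Clone llama.cpp, branch from the stated commit, and merge PR #14029 on top.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git switch -c qwen-embedding a20f0a1
git fetch origin pull/14029/head
git merge --no-edit FETCH_HEAD
```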
    
- **Quantisation:**

  ```bash
  # SRC: the F16 GGUF produced by the conversion step above
  # (adjust if the file lives elsewhere).
  SRC="Qwen3-Embedding-8B-F16.gguf"
  BASE=$(basename "${SRC%.*}")   # file name without extension
  DIR=$(dirname "$SRC")          # directory for the outputs

  # Keep token embeddings in F16 and leave the output tensor unquantised.
  EMB_OPT="--token-embedding-type F16 --leave-output-tensor"

  for QT in Q4_K_M Q5_K_M Q6_K Q8_0; do
    OUT="${DIR}/${BASE}-${QT}.gguf"
    echo ">> quantising ${QT}  ->  $(basename "$OUT")"
    llama-quantize $EMB_OPT "$SRC" "$OUT" "$QT" $(nproc)
  done
  ```
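
An optional sanity check (a sketch, assuming the llama.cpp binaries are available and using the published file names): embed the same prompt with the F16 reference and a quantized build and confirm the vectors stay close.

```bash
# Embed one prompt with the reference and the smallest quantized build;
# the printed vectors should differ only slightly.
for F in Qwen3-Embedding-8B-F16.gguf Qwen3-Embedding-8B-Q4_K_M.gguf; do
  echo "== ${F} =="
  ./llama-embedding -m "$F" -p "quantisation sanity check" | head -c 300
  echo
done
```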