Llama.cpp hybrid layer quantization of gpt-oss-20b by openai
Original model: https://huggingface.co/openai/gpt-oss-20b
WARNING: EITHER THIS MODEL or LLAMA.CPP has a major bug as of 08/07/2025. The perplexity evaluation of the model is very bad due to an incorrect token probability distribution: https://github.com/ggml-org/llama.cpp/issues/15155 This problem needs to be addressed before the model can be used confidently. The bug is most likely related to the custom swiglu with clip and/or the RMS layer norms for the model being way off, resulting in output probabilities that are all very similar and low in value, which causes generation instability. The entire need for this hybrid quant may be related to this bug, so expect the quant to be updated, or even to become unnecessary, once the layer norm problem is resolved.
The hybrid quant employs different quantization levels on a per-layer basis. For this model, the hybrid layer quant is used to help stabilize generation (as much as possible) with greedy decode, allowing direct greedy decode for highest probability solutions and/or enabling high probability solutions at low temperature (such as 0.2).
For this file the layer quants are as follows:
LAYER_TYPES='[
[0 ,"MXFP4" ],[1 ,"MXFP4" ],[2 ,"Q8_0" ],[3 ,"MXFP4" ],[4 ,"MXFP4" ],[5 ,"MXFP4" ],[6 ,"MXFP4" ],[7 ,"MXFP4" ],
[8 ,"MXFP4" ],[9 ,"MXFP4" ],[10,"MXFP4" ],[11,"MXFP4" ],[12,"MXFP4" ],[13,"MXFP4" ],[14,"MXFP4" ],[15,"MXFP4" ],
[16,"MXFP4" ],[17,"MXFP4" ],[18,"MXFP4" ],[19,"MXFP4" ],[20,"MXFP4" ],[21,"MXFP4" ],[22,"MXFP4" ],[23,"Q8_0" ]
]'
FLAGS="--allow-requantize --token-embedding-type Q4_0 --output-tensor-type Q4_0 --layer-types-high"
The layer quants were optimized for stable (as possible) generation using both -ot exps=CPU (model evaluated on CPU) and full cuda offload of the model using 2 4070s and RPC. The homogeneous MXFP4 quant with token embedding at Q8_0 and output tensor at Q8_0 results in the model falling into infinite repeat patterns of varying length on most generations when using greedy decode. The primary mechanism used to combat this effect is to add a controlled level of nonlinearity by setting the token embedding and output tensor both to Q4_0. This somewhat stabilizes both CPU decode and full cuda offload in the presence of the llama.cpp layer norm bug for the model when combined with use of the specific system prompt documented below.
Comparison:
Quant | Size (bytes) | PPL | Comment |
---|---|---|---|
MXFP4 | 12.1e9 | 459 | Q8_0 embed and output, massively unstable with greedy sampling |
MXFP4_H | 12.4e9 | 300.5 | Q4_0 embed Q4_0 output, borderline stable with greedy sampling |
The above PPL were computed using llama-perplexity and are a red flag that something major is broken.
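For reference, a perplexity run along the following lines can reproduce this kind of measurement; the test corpus and the expert offload flag are assumptions, since the exact evaluation setup is not stated:
# hedged sketch: corpus file is a placeholder
llama-perplexity -m gpt-oss-20b.MXFP4_H.gguf -f wiki.test.raw -ot exps=CPU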
Usage:
This is a RL trained moe thinking model. The model can be efficiently run by offloading expert tensors to CPU via -ot exps=CPU to open up very large context space. It can also run fully offloaded on GPU via RPC or high VRAM GPU.
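A minimal launch sketch for the expert-offload configuration follows; the context size and GPU layer count are assumptions:
# expert tensors stay on CPU, remaining layers go to the GPU
llama-server -m gpt-oss-20b.MXFP4_H.gguf -ngl 99 -ot exps=CPU -c 32768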
The model has not been tested with speculation, but it is fairly fast for both CPU and GPU inference due to being a MoE:
Config | non speculated gen speed |
---|---|
2 4070, RPC, fully offloaded to GPU | 62 t/s |
1 4070, -ot exps=CPU, CPU=9900k | 18 t/s |
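A corresponding sketch for the fully offloaded RPC configuration is given below; host, port, and context size are placeholders, and rpc-server must be built with the RPC backend enabled:
# on the machine hosting the second 4070
rpc-server -p 50052
# on the main machine, offloading all layers across both GPUs
llama-server -m gpt-oss-20b.MXFP4_H.gguf -ngl 99 --rpc 192.168.1.2:50052 -c 32768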
System prompt:
A system prompt needs to be used with this model. The following system prompt, together with the hybrid layer quant, was found to be necessary to help stop generation instability and block tool calls. The prompt defined below in shell syntax is recommended to be used verbatim with the quant:
if [[ ! $EFFORT ]]; then
EFFORT=medium
fi
SYSTEM="Knowledge cutoff: 2024-06
Current date: 2025-??-??
Reasoning: $EFFORT
Never use tool calls in any responses.
"
Further tests show this system prompt also works well combined with the hybrid quant:
SYSTEM="Knowledge cutoff: 2024-06
Current date: 2025-??-??
Reasoning: $EFFORT
Do not use tool calls.
"
The trailing newline is significant and makes a difference in stabilizing the output, as the model appears to be right on the fringe of instability even when using the hybrid layer quant. This system prompt voodoo helps kick good initial numbers into the autoregressive feedback to bootstrap the buggy metastable model into good generations which (mostly, but not always) do not fall into repeat loops.
For deterministic outputs do not enter the current date; leave it as ??-?? so the generation does not change when the date changes. This model will also output tool calls by default, so the system prompt is used to shut that off if the inference platform does not support the OpenAI-syntax tool calls.
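As one way to apply the prompt, the hedged example below sends it through llama-server's OpenAI-compatible chat endpoint with greedy decoding; the port and user message are placeholders, and note the trailing \n kept at the end of the system content:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "messages": [
    {"role": "system", "content": "Knowledge cutoff: 2024-06\nCurrent date: 2025-??-??\nReasoning: medium\nNever use tool calls in any responses.\n"},
    {"role": "user", "content": "Explain the birthday paradox."}
  ],
  "temperature": 0.0
}'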
ROPE:
The model uses ROPE YARN to extend context. It is known that use of ROPE with long contexts degrades inference performance. Therefore the following ROPE configuration can be used with a context sized at 32k tokens, which should be more than adequate for most problems:
--rope-scaling yarn --rope-scale 8 --yarn-orig-ctx 4096
If a context smaller than 32k is used, then set the rope scale to the value context_length / 4096 (for example, an 8192 context would use a rope scale of 2.0), as sketched below.
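A small shell sketch of this scaling rule, using a placeholder context size:
CTX=8192                          # placeholder context below 32k
ROPE_SCALE=$(( CTX / 4096 ))      # 8192 / 4096 = 2
llama-server -m gpt-oss-20b.MXFP4_H.gguf -c $CTX \
    --rope-scaling yarn --rope-scale $ROPE_SCALE --yarn-orig-ctx 4096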
Long context test:
A long context problem of 85k tokens was given to the model and was found to be unusably slow, both for prompt processing of the 85k prompt and for the subsequent generation, which promptly went into a repeat loop due to the borderline instability of the model. Llama.cpp b6100 was used for the test. More info on the slow processing: https://github.com/ggml-org/llama.cpp/issues/15163
Benchmarks:
Evals for the model will eventually be given here: https://huggingface.co/spaces/steampunque/benchlm.
Download the file from below:
Link | Type | Size/e9 B | Notes |
---|---|---|---|
gpt-oss-20b.MXFP4_H.gguf | MXFP4_H | 12.4e9 B | ~MXFP4 size |
A discussion thread about the hybrid layer quant approach can be found here on the llama.cpp git repository: