# Hybrid inference (AMD NPU+GPU) for gpt-oss-20b
This is a version of gpt-oss-20b set up for hybrid NPU+GPU inference on AMD Ryzen AI hardware: many of the MatMuls are scheduled on the NPU, which should make it run faster than GPU-only runtimes such as llama.cpp.
NOTE: this doesn't run on Ryzen AI yet. gpt-oss-20b uses MoE layers with the SwiGLU activation, support for which was only recently added to onnxruntime. AMD still needs to rebuild onnxruntime-genai-directml-ryzenai against that version; once that happens, you should be able to run this model in Lemonade.
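Once that rebuild ships, generation should follow the standard onnxruntime-genai loop. Below is a minimal sketch using the usual onnxruntime-genai Python API; the model path, prompt, and search options are placeholders, and the hybrid Ryzen AI build may differ in detail:

```python
# Hedged sketch: running this model with the onnxruntime-genai Python API.
# Untested against the (not-yet-released) rebuilt Ryzen AI hybrid runtime;
# the model directory and prompt are placeholders.
import onnxruntime_genai as og

model = og.Model("./gpt-oss-20b-onnx-hybrid")  # local checkout of this repo
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("What is hybrid NPU+GPU inference?"))

# Decode token by token until the generator stops.
while not generator.is_done():
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
print()
```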
## How this was made
gpt-oss-20b-onnx converted gpt-oss-20b to ONNX, translating the MoE layers into the QMoE ONNX operator. I took that model and ran it through model_generate for hybrid inference, working around various bugs and incompatibilities along the way. In particular, the hybrid_llm_gqo pass is removed (it doesn't support a bias term in the GQO MatMul), and the matmulnbits pass is skipped for the LM head only (its dimensions are incompatible). No further quantization was performed.
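As a rough sanity check of those edits, the onnx Python package can count the relevant node types in the exported graph: the MoE layers should appear as QMoE nodes, and the LM head should remain a plain (unquantized) MatMul. The filename below is an assumption about this repo's layout:

```python
# Hedged sketch: inspect the exported graph to confirm the QMoE translation
# and the skipped LM-head quantization. Adjust the filename to the actual
# .onnx file in this repo.
from collections import Counter

import onnx

model = onnx.load("model.onnx", load_external_data=False)
ops = Counter(node.op_type for node in model.graph.node)

print("QMoE nodes:", ops.get("QMoE", 0))            # MoE layers
print("MatMulNBits nodes:", ops.get("MatMulNBits", 0))  # quantized MatMuls
print("Plain MatMul nodes:", ops.get("MatMul", 0))   # LM head should be here
```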
Base model: openai/gpt-oss-20b