Hybrid inference (AMD NPU+GPU) for gpt-oss-20b

This is a version of gpt-oss-20b set up for hybrid NPU+GPU inference on AMD Ryzen AI hardware (many of the MatMuls are scheduled on the NPU). This should make it run faster than GPU-only implementations such as llama.cpp.

NOTE: this doesn't yet run on Ryzen AI. gpt-oss-20b uses MoE layers with the SwiGLU activation, support for which was only recently added to onnxruntime. AMD still needs to rebuild onnxruntime-genai-directml-ryzenai against that version. Once that happens, you should be able to run it in Lemonade.
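
Once a compatible build ships, running the model should look roughly like the generic onnxruntime-genai flow sketched below. This is only a sketch under assumptions: the model directory name and search options are placeholders, the onnxruntime-genai-directml-ryzenai / Lemonade build may expose a slightly different surface, and older onnxruntime-genai releases set `params.input_ids` instead of calling `append_tokens`.

```python
# Minimal sketch: stream tokens from an ONNX LLM via the onnxruntime-genai
# Python API. The model directory path is a placeholder, and the exact API
# surface of the Ryzen AI / DirectML build is an assumption.
import onnxruntime_genai as og

model = og.Model("gpt-oss-20b-onnx-hybrid")  # directory with the ONNX model + genai_config.json
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

prompt = "Explain hybrid NPU+GPU inference in one sentence."
input_tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)

generator = og.Generator(model, params)
generator.append_tokens(input_tokens)

# Decode token by token, streaming text to stdout.
while not generator.is_done():
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(stream.decode(new_token), end="", flush=True)
print()
```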

How this was made

gpt-oss-20b-onnx converted gpt-oss-20b to ONNX, taking care to translate the MoE code into the QMoE ONNX operator. I took that model and ran it through model_generate for hybrid inference, working around various bugs and incompatibilities along the way. In particular, the hybrid_llm_gqo pass is removed (it doesn't support a bias term in the GQO MatMul), and the matmulnbits pass is skipped just for the LM head (its dimensions are incompatible). I didn't perform any further quantization. A quick way to check what ended up in the graph is sketched below.
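
For anyone reproducing this, one sanity check is to count the operator types in the exported graph with the standard onnx Python package. This is a minimal sketch under assumptions: the model.onnx filename is a placeholder, and the QMoE / MatMulNBits nodes typically live in the com.microsoft operator domain.

```python
# Sketch: count op types in the converted graph, e.g. to confirm the MoE
# layers became QMoE nodes and to see which MatMuls stayed unquantized.
# The filename "model.onnx" is a placeholder.
from collections import Counter
import onnx

model = onnx.load("model.onnx", load_external_data=False)
op_counts = Counter(node.op_type for node in model.graph.node)

for op, count in sorted(op_counts.items()):
    print(f"{op}: {count}")
```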
