Update README.md
README.md
CHANGED
---
license: apache-2.0
base_model:
- openai/gpt-oss-20b
---

# Hybrid inference (AMD NPU+GPU) for gpt-oss-20b

This is a version of gpt-oss-20b set up for hybrid NPU+GPU inference on AMD Ryzen AI hardware: many of the MatMul operations are scheduled on the NPU. This should make it run faster than GPU-only implementations such as llama.cpp.

**NOTE**: this doesn't run on Ryzen AI yet. gpt-oss-20b uses MoE layers with the SwiGLU activation, which were added to onnxruntime [only recently](https://github.com/microsoft/onnxruntime/pull/25619). AMD still needs to rebuild [onnxruntime-genai-directml-ryzenai](https://pypi.amd.com/simple/onnxruntime-genai-directml-ryzenai/) against that change; once that's done, you should be able to run it in [Lemonade](https://lemonade-server.ai/).
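For reference, once a compatible wheel is published, loading the model directly with the onnxruntime-genai Python API should look roughly like the sketch below. This is an unverified sketch of the generic onnxruntime-genai generation loop, not something tested against this model: the model directory, prompt, and search options are placeholders, and the exact API surface varies between onnxruntime-genai versions.

```python
import onnxruntime_genai as og

# Directory containing the ONNX model and genai_config.json (placeholder path)
model = og.Model("./gpt-oss-20b-hybrid")
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)  # placeholder search options

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("What does the NPU accelerate here?"))

# Standard token-by-token generation loop
while not generator.is_done():
    generator.generate_next_token()

print(tokenizer.decode(generator.get_sequence(0)))
```

Lemonade should take care of this loop (and device selection) when serving the model, so the snippet is mainly useful for standalone debugging.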
## How this was made

The [gpt-oss-20b-onnx](https://huggingface.co/onnxruntime/gpt-oss-20b-onnx) release converted gpt-oss-20b to ONNX, translating the MoE layers into the [QMoE ONNX operator](https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#com.microsoft.QMoE). I took that model and ran it through model_generate for hybrid inference, working around various bugs and incompatibilities along the way. In particular, the hybrid_llm_gqo pass is removed (it doesn't support a bias term in the GQO MatMul), and the matmulnbits pass is skipped for the LM head only (its dimensions are incompatible). I didn't perform any further quantization.
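If you want to check which operators ended up in the converted graph (for example, that the MoE layers really are QMoE nodes and that the LM head was left out of the MatMulNBits pass), a quick inspection with the onnx Python package works. The file name below is a placeholder for the actual ONNX file in this repo.

```python
import onnx
from collections import Counter

# Load only the graph structure; skip the external weight files (placeholder file name)
model = onnx.load("model.onnx", load_external_data=False)

op_counts = Counter(node.op_type for node in model.graph.node)
print("QMoE nodes:       ", op_counts["QMoE"])
print("MatMulNBits nodes:", op_counts["MatMulNBits"])
```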