---
license: apache-2.0
base_model:
- openai/gpt-oss-20b
---

# Hybrid inference (AMD NPU+GPU) for gpt-oss-20b

This is a version of gpt-oss-20b prepared for hybrid NPU+GPU inference on AMD Ryzen AI hardware: most of the MatMul operations are scheduled on the NPU. This should make it run faster than GPU-only implementations such as llama.cpp.

**NOTE**: this does not yet run on Ryzen AI. gpt-oss-20b uses MoE layers with the SwiGLU activation, support for which was added [just recently](https://github.com/microsoft/onnxruntime/pull/25619) to onnxruntime. AMD still needs to rebuild [onnxruntime-genai-directml-ryzenai](https://pypi.amd.com/simple/onnxruntime-genai-directml-ryzenai/) against that change. Once it does, you should be able to run this model in [Lemonade](https://lemonade-server.ai/), as sketched below.

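Once that works, querying the model should look like any other Lemonade model. Here is a minimal sketch, assuming Lemonade Server is running locally with its OpenAI-compatible API at the default `http://localhost:8000/api/v1` (verify the port and path for your install); the model id `gpt-oss-20b-hybrid` is a hypothetical placeholder.

```python
# Minimal sketch: query this model through Lemonade's OpenAI-compatible API.
# Assumptions (not confirmed by this README): the server listens on
# http://localhost:8000/api/v1, and the model is registered as
# "gpt-oss-20b-hybrid" (hypothetical name; use whatever your install lists).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/api/v1",  # assumed default Lemonade endpoint
    api_key="lemonade",  # placeholder; adjust if your server requires a real key
)

response = client.chat.completions.create(
    model="gpt-oss-20b-hybrid",  # hypothetical model id
    messages=[{"role": "user", "content": "Explain hybrid NPU+GPU inference in one sentence."}],
)
print(response.choices[0].message.content)
```
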
## How this was made

[gpt-oss-20b-onnx](https://huggingface.co/onnxruntime/gpt-oss-20b-onnx) converted gpt-oss-20b to ONNX, translating the MoE code into the [QMoE ONNX operator](https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#com.microsoft.QMoE). I took that model and ran it through model_generate for hybrid inference, working around various bugs and incompatibilities along the way. In particular, the hybrid_llm_gqo pass is removed (it doesn't support a bias term in the GQO MatMul), and the matmulnbits pass is skipped for the LM head only (its dimensions are incompatible). I didn't perform any further quantization.
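
You can check the resulting graph for these properties yourself. Below is a minimal sketch using the standard `onnx` Python package; the file name `model.onnx` is an assumption, and large exports usually keep their weights in external data files alongside it.

```python
# Minimal sketch: count the contrib operators in the exported graph to confirm
# that MoE layers became QMoE nodes and to see where MatMulNBits was applied.
from collections import Counter

import onnx

# load_external_data=False skips the (large) weight files; we only need the graph.
model = onnx.load("model.onnx", load_external_data=False)  # assumed file name
ops = Counter(node.op_type for node in model.graph.node)

print("QMoE nodes:       ", ops.get("QMoE", 0))
print("MatMulNBits nodes:", ops.get("MatMulNBits", 0))
print("Plain MatMul nodes:", ops.get("MatMul", 0))  # the LM head should be among these
```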