Update README.md
README.md
CHANGED
---
license: apache-2.0
base_model:
- openai/gpt-oss-20b
---

# Hybrid inference (AMD NPU+GPU) for gpt-oss-20b

This is a version of gpt-oss-20b set up for hybrid NPU+GPU inference on AMD Ryzen AI hardware: many of the MatMul operations are scheduled on the NPU. This should make it run faster than GPU-only implementations such as llama.cpp.

**NOTE**: this doesn't run on Ryzen AI yet. gpt-oss-20b uses MoE layers with the SwiGLU activation, which were added to onnxruntime [only recently](https://github.com/microsoft/onnxruntime/pull/25619). AMD still needs to rebuild [onnxruntime-genai-directml-ryzenai](https://pypi.amd.com/simple/onnxruntime-genai-directml-ryzenai/) against that change; once that's done, you should be able to run it in [Lemonade](https://lemonade-server.ai/).
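For reference, once a compatible wheel is published, loading the model directly with the onnxruntime-genai Python API should look roughly like the sketch below. This is an unverified sketch of the generic onnxruntime-genai generation loop, not something tested against this model: the model directory, prompt, and search options are placeholders, and the exact API surface varies between onnxruntime-genai versions.

```python
import onnxruntime_genai as og

# Directory containing the ONNX model and genai_config.json (placeholder path)
model = og.Model("./gpt-oss-20b-hybrid")
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)  # placeholder search options

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("What does the NPU accelerate here?"))

# Standard token-by-token generation loop
while not generator.is_done():
    generator.generate_next_token()

print(tokenizer.decode(generator.get_sequence(0)))
```

Lemonade should take care of this loop (and device selection) when serving the model, so the snippet is mainly useful for standalone debugging.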
## How this was made

The [gpt-oss-20b-onnx](https://huggingface.co/onnxruntime/gpt-oss-20b-onnx) release converted gpt-oss-20b to ONNX, translating the MoE layers into the [QMoE ONNX operator](https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#com.microsoft.QMoE). I took that model and ran it through model_generate for hybrid inference, working around various bugs and incompatibilities along the way. In particular, the hybrid_llm_gqo pass is removed (it doesn't support a bias term in the GQO MatMul), and the matmulnbits pass is skipped for the LM head only (its dimensions are incompatible). I didn't perform any further quantization.
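If you want to check which operators ended up in the converted graph (for example, that the MoE layers really are QMoE nodes and that the LM head was left out of the MatMulNBits pass), a quick inspection with the onnx Python package works. The file name below is a placeholder for the actual ONNX file in this repo.

```python
import onnx
from collections import Counter

# Load only the graph structure; skip the external weight files (placeholder file name)
model = onnx.load("model.onnx", load_external_data=False)

op_counts = Counter(node.op_type for node in model.graph.node)
print("QMoE nodes:       ", op_counts["QMoE"])
print("MatMulNBits nodes:", op_counts["MatMulNBits"])
```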