tags:
- directml
- windows
---
# Model Card for DeepSeek-R1-Distill-Llama-8B ONNX DirectML INT4

## Model Details
deepseek-ai/DeepSeek-R1-Distill-Llama-8B quantized to ONNX GenAI INT4 with Microsoft DirectML optimization.<br>
The output is reformatted so that each sentence starts on a new line, improving readability.<br>
The output will start with chain-of-thought (CoT) reasoning. The reformatting is done by this excerpt:
<pre>
...
# Decode the newly generated token into text.
vNewDecoded = tokenizer_stream.decode(new_token)
# If the previous chunk was exactly ".", ":" or ";" and the new chunk starts
# a plain word (not a markdown bullet " *"), break onto a new line.
if re.findall("^[\x2E\x3A\x3B]$", vPreviousDecoded) and vNewDecoded.startswith(" ") and not vNewDecoded.startswith(" *"):
    vNewDecoded = "\n" + vNewDecoded.replace(" ", "", 1)
print(vNewDecoded, end='', flush=True)
vPreviousDecoded = vNewDecoded
...
</pre>
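
For context, this excerpt sits inside the token-streaming loop of onnxgenairun.py. A minimal sketch of such a loop, assuming the API used by the older phi3-qa.py example (og.Model, og.GeneratorParams, compute_logits; newer onnxruntime-genai releases have changed some of these calls):
<pre>
import re
import onnxruntime_genai as og

model = og.Model(".")                        # folder with model.onnx and genai_config.json
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.input_ids = tokenizer.encode("Why is the sky blue?")  # example prompt
generator = og.Generator(model, params)

vPreviousDecoded = ""
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    # ... reformatting excerpt shown above goes here ...
</pre>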

### Model Description
deepseek-ai/DeepSeek-R1-Distill-Llama-8B (a distillation based on Meta's Llama 3.1 8B) quantized to ONNX GenAI INT4 with Microsoft DirectML optimization<br>
https://onnxruntime.ai/docs/genai/howto/install.html#directml

Created using ONNX Runtime GenAI's builder.py:<br>
https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/src/python/py/models/builder.py

Build options:<br>
INT4 accuracy level: FP32 (float32)
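
A hypothetical build command, assuming builder.py's flags as documented at the link above (-m model, -o output folder, -p precision, -e execution provider); the int4_accuracy_level extra option and its FP32 value of 1 are an assumption and may differ by builder.py version:
<pre>
rem hypothetical invocation; verify flags with: python builder.py --help
python builder.py -m deepseek-ai/DeepSeek-R1-Distill-Llama-8B -o .\output -p int4 -e dml --extra_options int4_accuracy_level=1
</pre>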

- **Developed by:** Mochamad Aris Zamroni

### Model Sources [optional]
https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B

### Direct Use
This is a Microsoft Windows DirectML optimized model.<br>
It might not work with ONNX execution providers other than DmlExecutionProvider.<br>
The needed Python scripts are included in this repository.

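To check that DirectML is visible to ONNX Runtime before running the model (an optional sanity check, not part of the original steps; it assumes the separate onnxruntime-directml package is also installed):
<pre>
import onnxruntime as ort
# "DmlExecutionProvider" should appear in this list on a DirectML-capable machine.
print(ort.get_available_providers())
</pre>
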
Prerequisites:<br>
1. Install Python 3.10 from the Windows Store:<br>
https://apps.microsoft.com/detail/9pjpw5ldxlz5?hl=en-us&gl=US

2. Open a command prompt (cmd.exe).

3. Create a Python virtual environment, activate it, then install onnxruntime-genai-directml:<br>
mkdir c:\temp<br>
cd c:\temp<br>
python -m venv dmlgenai<br>
dmlgenai\Scripts\activate.bat<br>
pip install onnxruntime-genai-directml

4. Use onnxgenairun.py to get a chat interface.<br>
It is a modified version of "https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/phi3-qa.py".<br>
The modification breaks the output onto a new line after ".", ":" and ";" so it is easier to read.

rem Change directory to where the model and script files are stored<br>
cd this_onnx_model_directory<br>
python onnxgenairun.py --help<br>
python onnxgenairun.py -m . -v -g

5. (Optional but recommended) Device-specific optimization (a sketch of what the script might do follows below).<br>
a. Open "dml-device-specific-optim.py" in a text editor and change the file paths accordingly.<br>
b. Run the script: python dml-device-specific-optim.py<br>
c. Rename the original model.onnx to another file name, then rename the optimized ONNX file from step 5.b to model.onnx.<br>
d. Rerun step 4.

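The contents of dml-device-specific-optim.py are not reproduced here. As a rough sketch, ONNX Runtime's documented optimized_model_filepath option saves the graph after provider-specific optimization; the paths below are placeholders, this assumes the onnxruntime-directml package, and whether it applies to a given GenAI model may depend on the runtime version:
<pre>
import onnxruntime as ort

# Re-optimize the generic graph for this machine's GPU.
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# ONNX Runtime writes the optimized graph here when the session is created.
sess_options.optimized_model_filepath = r"c:\temp\model_optimized.onnx"  # placeholder

# Creating the session triggers optimization with the DirectML execution provider.
ort.InferenceSession(r"c:\temp\model.onnx", sess_options,
                     providers=["DmlExecutionProvider"])
</pre>
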
#### Speeds, Sizes, Times [optional]
15 tokens/s on a Radeon 780M with 8GB of pre-allocated RAM.<br>
This increases to 16 tokens/s with the device-specific optimized model.onnx.<br>
For comparison, LM Studio running a GGUF INT4 model with Vulkan GPU acceleration reaches 13 tokens/s.

#### Hardware
AMD Ryzen 7840U (Zen 4) with integrated Radeon 780M GPU<br>
32GB RAM

#### Software
Microsoft DirectML on Windows 10

## Model Card Authors [optional]
Mochamad Aris Zamroni

## Model Card Contact
https://www.linkedin.com/in/zamroni/