tags:
- directml
- windows
---
# Model Card for DeepSeek-R1-Distill-Llama-8B ONNX DirectML INT4

## Model Details
deepseek-ai/DeepSeek-R1-Distill-Llama-8B quantized to ONNX GenAI INT4 with Microsoft DirectML optimization.<br>
The output is reformatted so that each sentence starts on a new line, improving readability.<br>
The output will start with chain-of-thought (CoT) reasoning. The reformatting is done by this excerpt:
<pre>
...
# Decode the newly generated token into text.
vNewDecoded = tokenizer_stream.decode(new_token)
# If the previous chunk was exactly ".", ":" or ";" and the new chunk starts
# a plain word (not a markdown bullet " *"), break onto a new line.
if re.findall("^[\x2E\x3A\x3B]$", vPreviousDecoded) and vNewDecoded.startswith(" ") and not vNewDecoded.startswith(" *"):
    vNewDecoded = "\n" + vNewDecoded.replace(" ", "", 1)
print(vNewDecoded, end='', flush=True)
vPreviousDecoded = vNewDecoded
...
</pre>
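
For context, this excerpt sits inside the token-streaming loop of onnxgenairun.py. A minimal sketch of such a loop, assuming the API used by the older phi3-qa.py example (og.Model, og.GeneratorParams, compute_logits; newer onnxruntime-genai releases have changed some of these calls):
<pre>
import re
import onnxruntime_genai as og

model = og.Model(".")                        # folder with model.onnx and genai_config.json
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.input_ids = tokenizer.encode("Why is the sky blue?")  # example prompt
generator = og.Generator(model, params)

vPreviousDecoded = ""
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    # ... reformatting excerpt shown above goes here ...
</pre>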

### Model Description
deepseek-ai/DeepSeek-R1-Distill-Llama-8B (a distillation based on Meta's Llama 3.1 8B) quantized to ONNX GenAI INT4 with Microsoft DirectML optimization<br>
https://onnxruntime.ai/docs/genai/howto/install.html#directml

Created using ONNX Runtime GenAI's builder.py:<br>
https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/src/python/py/models/builder.py

Build options:<br>
INT4 accuracy level: FP32 (float32)
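
A hypothetical build command, assuming builder.py's flags as documented at the link above (-m model, -o output folder, -p precision, -e execution provider); the int4_accuracy_level extra option and its FP32 value of 1 are an assumption and may differ by builder.py version:
<pre>
rem hypothetical invocation; verify flags with: python builder.py --help
python builder.py -m deepseek-ai/DeepSeek-R1-Distill-Llama-8B -o .\output -p int4 -e dml --extra_options int4_accuracy_level=1
</pre>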

- **Developed by:** Mochamad Aris Zamroni

### Model Sources [optional]
https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B

### Direct Use
This is a Microsoft Windows DirectML optimized model.<br>
It might not work with ONNX execution providers other than DmlExecutionProvider.<br>
The needed Python scripts are included in this repository.

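To check that DirectML is visible to ONNX Runtime before running the model (an optional sanity check, not part of the original steps; it assumes the separate onnxruntime-directml package is also installed):
<pre>
import onnxruntime as ort
# "DmlExecutionProvider" should appear in this list on a DirectML-capable machine.
print(ort.get_available_providers())
</pre>
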
Prerequisites:<br>
1. Install Python 3.10 from the Windows Store:<br>
https://apps.microsoft.com/detail/9pjpw5ldxlz5?hl=en-us&gl=US

2. Open a command prompt (cmd.exe).

3. Create a Python virtual environment, activate it, then install onnxruntime-genai-directml:<br>
mkdir c:\temp<br>
cd c:\temp<br>
python -m venv dmlgenai<br>
dmlgenai\Scripts\activate.bat<br>
pip install onnxruntime-genai-directml

4. Use onnxgenairun.py to get a chat interface.<br>
It is a modified version of "https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/phi3-qa.py".<br>
The modification breaks the output onto a new line after ".", ":" and ";" so it is easier to read.

rem Change directory to where the model and script files are stored<br>
cd this_onnx_model_directory<br>
python onnxgenairun.py --help<br>
python onnxgenairun.py -m . -v -g

5. (Optional but recommended) Device-specific optimization (a sketch of what the script might do follows below).<br>
a. Open "dml-device-specific-optim.py" in a text editor and change the file paths accordingly.<br>
b. Run the script: python dml-device-specific-optim.py<br>
c. Rename the original model.onnx to another file name, then rename the optimized ONNX file from step 5.b to model.onnx.<br>
d. Rerun step 4.

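The contents of dml-device-specific-optim.py are not reproduced here. As a rough sketch, ONNX Runtime's documented optimized_model_filepath option saves the graph after provider-specific optimization; the paths below are placeholders, this assumes the onnxruntime-directml package, and whether it applies to a given GenAI model may depend on the runtime version:
<pre>
import onnxruntime as ort

# Re-optimize the generic graph for this machine's GPU.
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# ONNX Runtime writes the optimized graph here when the session is created.
sess_options.optimized_model_filepath = r"c:\temp\model_optimized.onnx"  # placeholder

# Creating the session triggers optimization with the DirectML execution provider.
ort.InferenceSession(r"c:\temp\model.onnx", sess_options,
                     providers=["DmlExecutionProvider"])
</pre>
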
#### Speeds, Sizes, Times [optional]
15 tokens/s on a Radeon 780M with 8GB of pre-allocated RAM.<br>
This increases to 16 tokens/s with the device-specific optimized model.onnx.<br>
For comparison, LM Studio running a GGUF INT4 model with Vulkan GPU acceleration reaches 13 tokens/s.

#### Hardware
AMD Ryzen 7840U (Zen 4) with integrated Radeon 780M GPU<br>
32GB RAM

#### Software
Microsoft DirectML on Windows 10

## Model Card Authors [optional]
Mochamad Aris Zamroni

## Model Card Contact
https://www.linkedin.com/in/zamroni/