---
license: mit
language:
- en
base_model:
- openbmb/MiniCPM4-0.5B
pipeline_tag: text-generation
tags:
- minicpm4
- int8
---


# MiniCPM4-0.5B-Int8

This version of MiniCPM4-0.5B has been converted to run on the Axera NPU using **w8a16** quantization.

This model has been optimized with the following LoRA: 

Compatible with Pulsar2 version: 4.2 (not yet released)

## Conversion tool links

If you are interested in model conversion, you can try exporting the axmodel from the original repo:
https://huggingface.co/openbmb/MiniCPM4-0.5B

[Pulsar2 documentation: how to convert an LLM from Hugging Face to axmodel](https://pulsar2-docs.readthedocs.io/en/latest/appendix/build_llm.html)

[AXera NPU LLM Runtime](https://github.com/AXERA-TECH/ax-llm) 


## Supported platforms

- AX650
  - [M4N-Dock(爱芯派Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
  - [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)
- AX630C
  - [爱芯派2](https://axera-pi-2-docs-cn.readthedocs.io/zh-cn/latest/index.html)
  - [Module-LLM](https://docs.m5stack.com/zh_CN/module/Module-LLM)
  - [LLM630 Compute Kit](https://docs.m5stack.com/zh_CN/core/LLM630%20Compute%20Kit)
 
|Chips|w8a16|w4a16|
|--|--|--|
|AX650| 36 tokens/sec|TBD|
|AX630C| 12 tokens/sec|TBD|

## How to use

Download all files from this repository to the device.
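
If you prefer to fetch the files programmatically, the `huggingface_hub` Python package can download the whole repository in one call. The snippet below is only a sketch; the `repo_id` is an assumption based on this model card's name, so adjust it to the repository you are actually downloading from.

```python
# Sketch: download all files of this repository to the device.
# The repo_id below is an assumption -- replace it with the actual repository id.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="AXERA-TECH/MiniCPM4-0.5B-Int8",  # assumed repository id
    local_dir="minicpm4-0.5b-ctx",            # matches the directory used below
)
```

After downloading, the directory should look like this: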

```
root@ax650:/mnt/qtang/llm-test/minicpm4-0.5b-ctx# tree -L 1
.
|-- main_ax650
|-- main_axcl_aarch64
|-- main_axcl_x86
|-- minicpm4-0.5b-int8-ctx-ax650
|-- minicpm4_tokenizer
|-- minicpm4_tokenizer_uid.py
|-- post_config.json
|-- run_minicpm4_0.5b_int8_ctx_ax650.sh
`-- run_minicpm4_0.5b_int8_ctx_axcl_x86.sh
2 directories, 7 files
```

#### Start the Tokenizer service

Install the required Python packages:

```
pip install transformers jinja2
```

```
root@ax650:/mnt/qtang/llm-test/minicpm4-0.5b-ctx# python3 minicpm4_tokenizer_uid.py
Server running at http://0.0.0.0:12345
```
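
For reference, `minicpm4_tokenizer_uid.py` wraps the Hugging Face tokenizer of the original model behind a small HTTP service that the on-device runtime queries. The sketch below only illustrates that idea and is **not** the shipped script: the endpoint names and JSON payloads are assumptions, and the real protocol used by ax-llm may differ.

```python
# Minimal illustration of a host-side tokenizer service (NOT the shipped script).
# Endpoint names and payload formats are assumptions made for illustration only.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openbmb/MiniCPM4-0.5B", trust_remote_code=True)

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        if self.path == "/encode":            # hypothetical endpoint: text -> token ids
            result = {"token_ids": tokenizer.encode(body["text"])}
        else:                                 # hypothetical endpoint: token ids -> text
            result = {"text": tokenizer.decode(body["token_ids"])}
        payload = json.dumps(result).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    print("Server running at http://0.0.0.0:12345")
    HTTPServer(("0.0.0.0", 12345), Handler).serve_forever()
```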

#### Inference on an AX650 host, such as M4N-Dock (爱芯派Pro) or the AX650N DEMO board

Open another terminal and run `run_minicpm4_0.5b_int8_ctx_ax650.sh`.

```
root@ax650:/mnt/qtang/llm-test/minicpm4-0.5b-ctx# ./run_minicpm4_0.5b_int8_ctx_ax650.sh
[I][                            Init][ 110]: LLM init start
[I][                            Init][  34]: connect http://127.0.0.1:12345 ok
[I][                            Init][  57]: uid: c779ded0-ff14-4877-869b-1aacc948f2d8
bos_id: 1, eos_id: 73440
100% | ████████████████████████████████ |  27 /  27 [2.53s<2.53s, 10.67 count/s] init post axmodel ok,remain_cmm(4244 MB)
[I][                            Init][ 188]: max_token_len : 1023
[I][                            Init][ 193]: kv_cache_size : 128, kv_cache_num: 1023
[I][                            Init][ 201]: prefill_token_num : 128
[I][                            Init][ 205]: grp: 1, prefill_max_token_num : 1
[I][                            Init][ 205]: grp: 2, prefill_max_token_num : 128
[I][                            Init][ 205]: grp: 3, prefill_max_token_num : 512
[I][                            Init][ 209]: prefill_max_token_num : 512
[I][                     load_config][ 282]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": false,
    "enable_top_k_sampling": true,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 1,
    "top_p": 0.8
}

[I][                            Init][ 218]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
[I][          GenerateKVCachePrefill][ 271]: input token num : 25, prefill_split_num : 1 prefill_grpid : 2
[I][          GenerateKVCachePrefill][ 308]: input_num_token:25
[I][                            main][ 230]: precompute_len: 25
[I][                            main][ 231]: system_prompt: You are MiniCPM4, created by ModelBest. You are a helpful assistant.
prompt >> 你是谁?
[I][                      SetKVCache][ 531]: prefill_grpid:2 kv_cache_num:128 precompute_len:25 input_num_token:12
[I][                      SetKVCache][ 534]: current prefill_max_token_num:384
[I][                             Run][ 660]: input token num : 12, prefill_split_num : 1
[I][                             Run][ 686]: input_num_token:12
[I][                             Run][ 829]: ttft: 147.65 ms
你好,我是MiniCPM系列模型,由面壁智能和OpenBMB开源社区开发。详细信息请访问https://github.com/OpenBMB/

[N][                             Run][ 943]: hit eos,avg 35.75 token/s

[I][                      GetKVCache][ 500]: precompute_len:162, remaining:350
prompt >> 9.9与9.11
[I][                      SetKVCache][ 531]: prefill_grpid:3 kv_cache_num:512 precompute_len:162 input_num_token:17
[I][                      SetKVCache][ 534]: current prefill_max_token_num:256
[I][                             Run][ 660]: input token num : 17, prefill_split_num : 1
[I][                             Run][ 686]: input_num_token:17
[I][                             Run][ 829]: ttft: 274.38 ms
9.9比9.11大。

[N][                             Run][ 943]: hit eos,avg 35.44 token/s

[I][                      GetKVCache][ 500]: precompute_len:189, remaining:323
prompt >> q
root@ax650:/mnt/qtang/llm-test/minicpm4-0.5b-ctx#
```
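
The sampling behaviour printed in the `load config` section above comes from `post_config.json` in the same directory. If you want to experiment with different decoding settings, a small script such as the one below can rewrite the file before the next run. The keys are the ones shown in the log; which combinations the runtime actually honours is an assumption you should verify on your device.

```python
# Sketch: adjust the sampling settings in post_config.json before the next run.
# Keys mirror the ones printed by the runtime; supported combinations are an assumption.
import json
from pathlib import Path

cfg_path = Path("post_config.json")
cfg = json.loads(cfg_path.read_text())

cfg["enable_temperature"] = True   # enable temperature scaling
cfg["temperature"] = 0.7
cfg["enable_top_k_sampling"] = True
cfg["top_k"] = 40                  # sample from the 40 most likely tokens

cfg_path.write_text(json.dumps(cfg, indent=4))
print(json.dumps(cfg, indent=4))
```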