---
license: mit
language:
- en
base_model:
- openbmb/MiniCPM4-0.5B
pipeline_tag: text-generation
tags:
- minicpm4
- int8
---
# MiniCPM4-0.5B-Int8
This version of MiniCPM4-0.5B has been converted to run on the Axera NPU using **w8a16** quantization.
Compatible with Pulsar2 version: 4.2 (not yet released)
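In the w8a16 scheme, weights are stored as int8 with per-channel scales while activations remain in 16-bit floats. Below is a minimal numpy sketch of the idea only; the actual quantization is performed by Pulsar2 during model conversion, not by this code.
```
import numpy as np

def quantize_w8a16(weight: np.ndarray):
    """Symmetric per-output-channel int8 weight quantization (illustrative only)."""
    # One scale per output channel, chosen so the largest magnitude maps to 127.
    scale = np.abs(weight).max(axis=1, keepdims=True) / 127.0
    w_int8 = np.clip(np.round(weight / scale), -128, 127).astype(np.int8)
    return w_int8, scale.astype(np.float16)

def linear_w8a16(x_fp16: np.ndarray, w_int8: np.ndarray, scale: np.ndarray):
    """Matmul with int8 weights dequantized on the fly; activations stay fp16."""
    w_fp16 = w_int8.astype(np.float16) * scale
    return x_fp16 @ w_fp16.T

# Example: a 4x8 weight matrix and a small batch of fp16 activations.
w = np.random.randn(4, 8).astype(np.float32)
w_q, s = quantize_w8a16(w)
x = np.random.randn(2, 8).astype(np.float16)
print(linear_w8a16(x, w_q, s).shape)  # (2, 4)
```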
## Conversion tool links
If you are interested in model conversion, you can export the axmodel yourself starting from the original repository:
https://huggingface.co/openbmb/MiniCPM4-0.5B
[Pulsar2 Link, How to Convert LLM from Huggingface to axmodel](https://pulsar2-docs.readthedocs.io/en/latest/appendix/build_llm.html)
[AXera NPU LLM Runtime](https://github.com/AXERA-TECH/ax-llm)
## Supported platforms
- AX650
- [M4N-Dock(爱芯派Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
- [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)
- AX630C
- [爱芯派2](https://axera-pi-2-docs-cn.readthedocs.io/zh-cn/latest/index.html)
- [Module-LLM](https://docs.m5stack.com/zh_CN/module/Module-LLM)
- [LLM630 Compute Kit](https://docs.m5stack.com/zh_CN/core/LLM630%20Compute%20Kit)
|Chips|w8a16|w4a16|
|--|--|--|
|AX650| 36 tokens/sec|TBD|
|AX630C| 12 tokens/sec|TBD|
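As a rough rule of thumb, the wall-clock time for one reply is the prefill time (ttft) plus the number of generated tokens divided by the decode rate. A small sketch using the w8a16 decode rates from the table above; the ttft value is an assumption for illustration, not a measured figure.
```
# Rough latency estimate from the decode rates above (ttft is an assumed value).
def reply_latency_s(output_tokens: int, tokens_per_sec: float, ttft_ms: float = 150.0) -> float:
    return ttft_ms / 1000.0 + output_tokens / tokens_per_sec

print(f"AX650:  {reply_latency_s(128, 36):.1f} s for a 128-token reply")
print(f"AX630C: {reply_latency_s(128, 12):.1f} s for a 128-token reply")
```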
## How to use
Download all files from this repository to the device:
```
root@ax650:/mnt/qtang/llm-test/minicpm4-0.5b-ctx# tree -L 1
.
|-- main_ax650
|-- main_axcl_aarch64
|-- main_axcl_x86
|-- minicpm4-0.5b-int8-ctx-ax650
|-- minicpm4_tokenizer
|-- minicpm4_tokenizer_uid.py
|-- post_config.json
|-- run_minicpm4_0.5b_int8_ctx_ax650.sh
`-- run_minicpm4_0.5b_int8_ctx_axcl_x86.sh
2 directories, 7 files
```
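If you prefer to fetch the files programmatically rather than cloning, `huggingface_hub` can mirror the repository. The `repo_id` below is a placeholder; replace it with the actual id of this repository.
```
from huggingface_hub import snapshot_download

# repo_id is a placeholder -- substitute the real id of this repository.
snapshot_download(
    repo_id="AXERA-TECH/MiniCPM4-0.5B-Int8",
    local_dir="minicpm4-0.5b-ctx",
)
```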
#### Start the Tokenizer service
Install the requirements:
```
pip install transformers jinja2
```
Then start the tokenizer server:
```
root@ax650:/mnt/qtang/llm-test/minicpm4-0.5b-ctx# python3 minicpm4_tokenizer_uid.py
Server running at http://0.0.0.0:12345
```
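The tokenizer runs as a separate Python process that the on-device runtime talks to over HTTP (the init log below shows it connecting to port 12345). A quick way to confirm the service is listening before starting inference, assuming the default host and port:
```
import socket

# Assumes the tokenizer service was started with its default port (12345).
try:
    with socket.create_connection(("127.0.0.1", 12345), timeout=2):
        print("tokenizer service is up")
except OSError:
    print("tokenizer service is not reachable; start minicpm4_tokenizer_uid.py first")
```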
#### Inference on an AX650 host, such as the M4N-Dock (爱芯派Pro) or an AX650N demo board
Open another terminal and run `run_minicpm4_0.5b_int8_ctx_ax650.sh`:
```
root@ax650:/mnt/qtang/llm-test/minicpm4-0.5b-ctx# ./run_minicpm4_0.5b_int8_ctx_ax650.sh
[I][ Init][ 110]: LLM init start
[I][ Init][ 34]: connect http://127.0.0.1:12345 ok
[I][ Init][ 57]: uid: c779ded0-ff14-4877-869b-1aacc948f2d8
bos_id: 1, eos_id: 73440
100% | ████████████████████████████████ | 27 / 27 [2.53s<2.53s, 10.67 count/s] init post axmodel ok,remain_cmm(4244 MB)
[I][ Init][ 188]: max_token_len : 1023
[I][ Init][ 193]: kv_cache_size : 128, kv_cache_num: 1023
[I][ Init][ 201]: prefill_token_num : 128
[I][ Init][ 205]: grp: 1, prefill_max_token_num : 1
[I][ Init][ 205]: grp: 2, prefill_max_token_num : 128
[I][ Init][ 205]: grp: 3, prefill_max_token_num : 512
[I][ Init][ 209]: prefill_max_token_num : 512
[I][ load_config][ 282]: load config:
{
"enable_repetition_penalty": false,
"enable_temperature": false,
"enable_top_k_sampling": true,
"enable_top_p_sampling": false,
"penalty_window": 20,
"repetition_penalty": 1.2,
"temperature": 0.9,
"top_k": 1,
"top_p": 0.8
}
[I][ Init][ 218]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
[I][ GenerateKVCachePrefill][ 271]: input token num : 25, prefill_split_num : 1 prefill_grpid : 2
[I][ GenerateKVCachePrefill][ 308]: input_num_token:25
[I][ main][ 230]: precompute_len: 25
[I][ main][ 231]: system_prompt: You are MiniCPM4, created by ModelBest. You are a helpful assistant.
prompt >> 你是谁?
[I][ SetKVCache][ 531]: prefill_grpid:2 kv_cache_num:128 precompute_len:25 input_num_token:12
[I][ SetKVCache][ 534]: current prefill_max_token_num:384
[I][ Run][ 660]: input token num : 12, prefill_split_num : 1
[I][ Run][ 686]: input_num_token:12
[I][ Run][ 829]: ttft: 147.65 ms
你好,我是MiniCPM系列模型,由面壁智能和OpenBMB开源社区开发。详细信息请访问https://github.com/OpenBMB/
[N][ Run][ 943]: hit eos,avg 35.75 token/s
[I][ GetKVCache][ 500]: precompute_len:162, remaining:350
prompt >> 9.9与9.11
[I][ SetKVCache][ 531]: prefill_grpid:3 kv_cache_num:512 precompute_len:162 input_num_token:17
[I][ SetKVCache][ 534]: current prefill_max_token_num:256
[I][ Run][ 660]: input token num : 17, prefill_split_num : 1
[I][ Run][ 686]: input_num_token:17
[I][ Run][ 829]: ttft: 274.38 ms
9.9比9.11大。
[N][ Run][ 943]: hit eos,avg 35.44 token/s
[I][ GetKVCache][ 500]: precompute_len:189, remaining:323
prompt >> q
root@ax650:/mnt/qtang/llm-test/minicpm4-0.5b-ctx#
```
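The `load_config` lines in the log above show the sampling parameters read from `post_config.json`. A hedged sketch of how you might adjust them, assuming the file uses exactly the keys printed in the log (the defaults amount to greedy decoding via `top_k: 1`):
```
import json

# Keys taken from the "load config" section of the runtime log above.
with open("post_config.json") as f:
    cfg = json.load(f)

# Example tweak: switch from greedy decoding to temperature + top-k sampling.
cfg["enable_temperature"] = True
cfg["temperature"] = 0.7
cfg["top_k"] = 20

with open("post_config.json", "w") as f:
    json.dump(cfg, f, indent=2)
```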