This version of Janus-Pro-1B has been converted to run on the Axera NPU using w8a16 quantization.
This model has been optimized with the following LoRA:
Compatible with Pulsar2 version: 3.4
If you are interested in model conversion, you can try exporting the axmodel yourself from the original repo: https://huggingface.co/deepseek-ai/Janus-Pro-1B
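As a rough illustration of what the w8a16 label implies (8-bit weights, 16-bit activations), the sketch below quantizes one weight row to int8 with a symmetric per-row scale and then dequantizes it. This is a generic textbook scheme written only for illustration; the function names are hypothetical, and nothing here is claimed to match what Pulsar2 does internally.

```python
def quantize_row_int8(row):
    """Symmetric int8 quantization of one weight row (illustrative only)."""
    max_abs = max((abs(x) for x in row), default=0.0)
    # One scale per row; fall back to 1.0 for an all-zero row.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in row]
    return codes, scale


def dequantize_row(codes, scale):
    """Map int8 codes back to floats; activations would stay 16-bit."""
    return [c * scale for c in codes]


weights = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_row_int8(weights)
restored = dequantize_row(codes, scale)
# The rounding error per weight is bounded by half a quantization step.
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_err)
```

The point of the example is only that each weight is stored as one signed byte plus a shared scale, which is where the memory saving relative to 16-bit weights comes from.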
Chip | Image encoder (384) | TTFT | w8a16 decode speed |
---|---|---|---|
AX650 | 142.682 ms | 4560.214 ms | 11.43 tokens/sec |
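The figures in the table can be combined into a back-of-the-envelope latency estimate for a reply of a given length. The sketch below is an assumption-laden approximation: the source does not say whether the TTFT figure already includes the image encoder time, so the helper (a hypothetical name) lets you count the encoder either way, and it treats the decode speed as constant.

```python
ENCODER_MS = 142.682           # image encoder, 384 input (from the table)
TTFT_MS = 4560.214             # time to first token (from the table)
DECODE_TOKENS_PER_SEC = 11.43  # w8a16 decode speed (from the table)


def estimate_latency_ms(tokens_after_first, include_encoder=True):
    """Estimate end-to-end latency in milliseconds.

    tokens_after_first: tokens generated after the first one.
    include_encoder: add the encoder time on top of TTFT; set to False
    if TTFT is assumed to already cover it (not stated in the source).
    """
    decode_ms = tokens_after_first / DECODE_TOKENS_PER_SEC * 1000.0
    total = TTFT_MS + decode_ms
    if include_encoder:
        total += ENCODER_MS
    return total


for n in (0, 50, 100):
    print(n, round(estimate_latency_ms(n), 1), "ms")
```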
Download all files from this repository to the device.
Using AX650 Board
root@ax650 ~/yongqiang/push_hugging_face/Janus-Pro-1B # tree -L 1
.
├── assets
├── config.json
├── embeds
├── img_gen_onnx
├── imgs
├── infer_axmodel_gen.py
├── infer_axmodel_und.py
├── janus_pro_1b_axmodel
├── janus_pro_1b_tokenizer
├── README.md
└── vit_axmodel
8 directories, 3 files
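After downloading, it can be useful to confirm that the working directory matches the listing above before running anything. The following is a minimal sketch that only checks for the entry names shown in the `tree` output; the helper name is hypothetical and it does not validate file contents.

```python
from pathlib import Path

# Entry names copied from the `tree -L 1` listing above.
EXPECTED_ENTRIES = [
    "assets",
    "config.json",
    "embeds",
    "img_gen_onnx",
    "imgs",
    "infer_axmodel_gen.py",
    "infer_axmodel_und.py",
    "janus_pro_1b_axmodel",
    "janus_pro_1b_tokenizer",
    "README.md",
    "vit_axmodel",
]


def missing_entries(root="."):
    """Return the expected entries that are absent under `root`."""
    root = Path(root)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]
```

Calling `missing_entries(".")` from the repository directory should return an empty list when the download is complete.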
Install the upstream Janus Python package:
$ git clone https://github.com/deepseek-ai/Janus
$ cd Janus
$ pip3 install -e .
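A quick way to confirm the installation step worked is to check that the required modules can be found without importing them. The sketch below assumes the Janus repo installs a top-level module named `janus` and that the Axera runtime seen in the logs is importable as `axengine`; both names are assumptions, so adjust them to your environment.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be located on this system."""
    return [n for n in names if importlib.util.find_spec(n) is None]


# Assumed module names; not confirmed by this README.
REQUIRED = ["janus", "axengine"]
print("missing:", missing_modules(REQUIRED))
```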
Multimodal Understanding
input text:
Please describe the picture.
log information:
root@ax650 ~/yongqiang/push_hugging_face/Janus-Pro-1B # python3 infer_axmodel_und.py --tokenizer_dir janus_pro_1b_tokenizer --axmodel_path janus_pro_1b_axmodel --vit_axmodel_path vit_axmodel/janus_warp_vit.axmodel -i ./imgs/image.png
[INFO] Available providers: ['AxEngineExecutionProvider']
[INFO] Chip type: ChipType.MC50
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Engine version: 2.11.0a
vit_output.shape is (1, 576, 2048), vit feature extract done!
Init InferenceSession: 100%|██████████████████████████████████████████████████████████| 24/24 [00:04<00:00, 4.94it/s]
model load done!
prefill done!
Decoder: 62%|█████████████████████████████████████████▍ | 634/1024 [00:00<00:00, 2505.28it/s]
Decoder: 72%|█████████████████████████████████████████████████▉ | 741/1024 [00:19<00:10, 27.69it/s]
hit eos!
Decoder: 74%|███████████████████████████████████████████████████▎ | 762/1024 [00:23<00:08, 31.84it/s]
Janus Answers: The image depicts three astronauts standing in a lush, green forest. They are wearing traditional white space suits with various patches and equipment attached. The suits have a reflective visor on their helmets, and they appear to be in a relaxed pose, with one astronaut raising his arms and the others standing or crouching. The forest is dense with tall trees and dense foliage, creating a serene and somewhat mysterious atmosphere.
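If you want to run the understanding script over several images, the command line shown in the log can be assembled programmatically. This is a small sketch that uses only the flags visible in that command; the helper name is hypothetical, and the defaults are the paths from this repository's layout.

```python
import shlex


def build_und_command(
    image_path,
    tokenizer_dir="janus_pro_1b_tokenizer",
    axmodel_path="janus_pro_1b_axmodel",
    vit_axmodel_path="vit_axmodel/janus_warp_vit.axmodel",
):
    """Build the argv list for infer_axmodel_und.py."""
    return [
        "python3", "infer_axmodel_und.py",
        "--tokenizer_dir", tokenizer_dir,
        "--axmodel_path", axmodel_path,
        "--vit_axmodel_path", vit_axmodel_path,
        "-i", image_path,
    ]


cmd = build_und_command("./imgs/image.png")
print(shlex.join(cmd))
```

On the board, the resulting list can be passed to `subprocess.run(cmd, check=True)`; passing a list rather than a single string avoids shell quoting issues with image paths.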
Text-to-Image Generation
input text:
"A close-up high-contrast photo of Sydney Opera House sitting next to Eiffel tower, under a blue night sky of roiling energy, exploding yellow stars, and radiating swirls of blue."
log information:
root@ax650 ~/yongqiang/push_hugging_face/Janus-Pro-1B # python3 infer_axmodel_gen.py --tokenizer_dir janus_pro_1b_tokenizer/ --axmodel_path janus_pro_1b_axmodel/
[INFO] Available providers: ['AxEngineExecutionProvider']
Init InferenceSession: 0%| | 0/24 [00:00<?, ?it/s]
[INFO] Chip type: ChipType.MC50
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Engine version: 2.11.0a
Init InferenceSession: 100%|██████████████████████████████████████████████████████████| 24/24 [00:14<00:00, 1.68it/s]
2025-04-14 15:55:23.408 | INFO | __main__:<module>:269 - model load done!
2025-04-14 15:55:33.104 | DEBUG | __main__:generate:158 - prefill completed!
ImageToken: 18%|████████████ | 104/575 [00:39<02:58, 2.64it/s]
ImageToken: 45%|██████████████████████████████▍ | 261/575 [01:39<01:58, 2.65it/s]
ImageToken: 73%|████████████████████████████████████████████████▊ | 419/575 [02:39<00:58, 2.66it/s]
ImageToken: 100%|███████████████████████████████████████████████████████████████████| 575/575 [03:38<00:00, 2.63it/s]
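The image-token throughput can be cross-checked from the progress output itself by dividing the completed count by the elapsed time. The sketch below parses a tqdm-style progress line with a regular expression; it assumes the elapsed time is printed as `MM:SS` (as in this log) and rejects anything else, and the function name is hypothetical.

```python
import re

# Matches e.g. "575/575 [03:38<" and captures done, total, minutes, seconds.
TQDM_RE = re.compile(r"(\d+)/(\d+) \[(\d+):(\d+)<")


def tokens_per_second(progress_line):
    """Average rate from the last progress update found in the line."""
    matches = TQDM_RE.findall(progress_line)
    if not matches:
        raise ValueError("no MM:SS tqdm progress found in line")
    done, _total, minutes, seconds = map(int, matches[-1])
    elapsed = minutes * 60 + seconds
    if elapsed == 0:
        raise ValueError("elapsed time is zero; rate is undefined")
    return done / elapsed


line = "ImageToken: 100%|███| 575/575 [03:38<00:00, 2.63it/s]"
print(f"{tokens_per_second(line):.2f} image tokens/s")
```

For the final line of this log, that is 575 tokens over 218 seconds, or roughly 2.6 image tokens per second, which is in line with the rate tqdm reports.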
output image
Base model: deepseek-ai/Janus-Pro-1B