File size: 3,479 Bytes
6d89162 4ee0ad8 6d89162 683b60c 6d89162 bca2104 6d89162 bca2104 6d89162 bca2104 6d89162 7c1644c |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 |
---
license: mit
language:
- en
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
tags:
- arxiv:2508.15144
---
# GUI-Owl
<div align="center">
<img src=https://youke1.picui.cn/s1/2025/08/18/68a2f82fef3d4.png width="40%"/>
</div>
GUI-Owl is a model series developed as part of the Mobile-Agent-V3 project. It achieves state-of-the-art performance across a range of GUI automation benchmarks, including ScreenSpot-V2, ScreenSpot-Pro, OSWorld-G, MMBench-GUI, Android Control, Android World, and OSWorld. Furthermore, it can be instantiated as various specialized agents within the Mobile-Agent-V3 multi-agent framework to accomplish more complex tasks.
* **Paper**: [Paper Link](https://github.com/X-PLUG/MobileAgent/blob/main/Mobile-Agent-v3/assets/MobileAgentV3_Tech.pdf)
* **GitHub Repository**: https://github.com/X-PLUG/MobileAgent
* **Online Demo**: Comming soon
## Performance
### ScreenSpot-V2, ScreenSpot-Pro and OSWorld-G
<img src="https://github.com/X-PLUG/MobileAgent/blob/main/Mobile-Agent-v3/assets/screenspot_v2.jpg?raw=true" width="80%"/>
<img src="https://github.com/X-PLUG/MobileAgent/blob/main/Mobile-Agent-v3/assets/screenspot_pro.jpg?raw=true" width="80%"/>
<img src="https://github.com/X-PLUG/MobileAgent/blob/main/Mobile-Agent-v3/assets/osworld_g.jpg?raw=true" width="80%"/>
### MMBench-GUI L1, L2 and Android Control
<img src="https://github.com/X-PLUG/MobileAgent/blob/main/Mobile-Agent-v3/assets/mmbench_gui_l1.jpg?raw=true" width="80%"/>
<img src="https://github.com/X-PLUG/MobileAgent/blob/main/Mobile-Agent-v3/assets/mmbench_gui_l2.jpg?raw=true" width="80%"/>
<img src="https://github.com/X-PLUG/MobileAgent/blob/main/Mobile-Agent-v3/assets/android_control.jpg?raw=true" width="60%"/>
### Android World and OSWorld-Verified
<img src="https://github.com/X-PLUG/MobileAgent/blob/main/Mobile-Agent-v3/assets/online.jpg?raw=true" width="60%"/>
## Usage
Please refer to our cookbook.
## Deploy
We recommand deploy GUI-Owl-7B through vllm
This script has been validated on an A100 with 96 GB of VRAM.
```bash
PIXEL_ARGS='{"min_pixels":3136,"max_pixels":10035200}'
IMAGE_LIMIT_ARGS='image=2'
MP_SIZE=1
MM_KWARGS=(
--mm-processor-kwargs $PIXEL_ARGS
--limit-mm-per-prompt $IMAGE_LIMIT_ARGS
)
vllm serve $CKPT \
--max-model-len 32768 ${MM_KWARGS[@]} \
--tensor-parallel-size $MP_SIZE \
--allowed-local-media-path '/' \
--port 4243
```
If you want GUI-Owl to recieve more than two images, you could increase `IMAGE_LIMIT_ARGS` and reduce `max_pixels`.
For example:
```bash
PIXEL_ARGS='{"min_pixels":3136,"max_pixels":3211264}'
IMAGE_LIMIT_ARGS='image=5'
MP_SIZE=1
MM_KWARGS=(
--mm-processor-kwargs $PIXEL_ARGS
--limit-mm-per-prompt $IMAGE_LIMIT_ARGS
)
vllm serve $CKPT \
--max-model-len 32768 ${MM_KWARGS[@]} \
--tensor-parallel-size $MP_SIZE \
--allowed-local-media-path '/' \
--port 4243
```
## Citation
If you find our paper and model useful in your research, feel free to give us a cite.
```
@misc{ye2025mobileagentv3foundamentalagentsgui,
title={Mobile-Agent-v3: Foundamental Agents for GUI Automation},
author={Jiabo Ye and Xi Zhang and Haiyang Xu and Haowei Liu and Junyang Wang and Zhaoqing Zhu and Ziwei Zheng and Feiyu Gao and Junjie Cao and Zhengxi Lu and Jitong Liao and Qi Zheng and Fei Huang and Jingren Zhou and Ming Yan},
year={2025},
eprint={2508.15144},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2508.15144},
}
```
|