base_model:
- openbmb/MiniCPM-V-2_6
pipeline_tag: image-text-to-text
---

# AgentCPM-GUI

[GitHub](https://github.com/OpenBMB/AgentCPM-GUI) | Technical Blog

## News

* [2025-05-13] 🚀🚀🚀 We have open-sourced **AgentCPM-GUI**, an on-device GUI agent capable of operating Chinese & English apps and equipped with RFT-enhanced reasoning abilities.

## Overview

**AgentCPM-GUI** is an open-source on-device LLM agent model jointly developed by [THUNLP](https://nlp.csai.tsinghua.edu.cn) and [ModelBest](https://modelbest.cn/en). Built on [MiniCPM-V](https://github.com/OpenBMB/MiniCPM-V) with 8 billion parameters, it accepts smartphone screenshots as input and autonomously executes user-specified tasks.

Key features include:

- **High-quality GUI grounding** — Pre-training on a large-scale bilingual Android dataset significantly boosts localization and comprehension of common GUI widgets (buttons, input boxes, labels, icons, etc.).
- **Chinese-app operation** — The first open-source GUI agent finely tuned for Chinese apps, covering 30+ popular apps such as Amap, Dianping, bilibili, and Xiaohongshu.
- **Enhanced planning & reasoning** — Reinforcement fine-tuning (RFT) lets the model “think” before outputting an action, greatly improving success rates on complex tasks.
- **Compact action-space design** — An optimized action space and concise JSON format reduce the average action length to 9.7 tokens, boosting on-device inference efficiency.

Demo Case (1x speed):

https://github.com/user-attachments/assets/5472a659-cd71-4bce-a181-0981129c6a81

## Quick Start

### Install dependencies

```bash
git clone https://github.com/OpenBMB/AgentCPM-GUI
cd AgentCPM-GUI
conda create -n gui_agent python=3.11
conda activate gui_agent
pip install -r requirements.txt
```

### Download the model

Download [AgentCPM-GUI](https://huggingface.co/openbmb/AgentCPM-GUI) from Hugging Face and place it in `model/AgentCPM-GUI`.
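
If you prefer to script the download, a minimal sketch using the `huggingface_hub` package is shown below (this package is an extra assumption; any method that places the weights in `model/AgentCPM-GUI` works just as well):

```python
# Sketch: download the checkpoint into model/AgentCPM-GUI via huggingface_hub.
# Assumes `pip install huggingface_hub`; the web UI or git clone route is equivalent.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="openbmb/AgentCPM-GUI",
    local_dir="model/AgentCPM-GUI",
)
```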

#### Huggingface Inference

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from PIL import Image
import json

# 1. Load the model and tokenizer
model_path = "model/AgentCPM-GUI"  # model path
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, torch_dtype=torch.bfloat16)
model = model.to("cuda:0")

# 2. Build the input
instruction = "请点击屏幕上的‘会员’按钮"  # "Please tap the '会员' (membership) button on the screen"
image_path = "assets/test.jpeg"
image = Image.open(image_path).convert("RGB")

# 3. Resize the longer side to 1120 px to save compute & memory
def __resize__(origin_img):
    resolution = origin_img.size
    w, h = resolution
    max_line_res = 1120
    if max_line_res is not None:
        max_line = max_line_res
        if h > max_line:
            w = int(w * max_line / h)
            h = max_line
        if w > max_line:
            h = int(h * max_line / w)
            w = max_line
    img = origin_img.resize((w, h), resample=Image.Resampling.LANCZOS)
    return img

image = __resize__(image)

# 4. Build the message format
messages = [{
    "role": "user",
    "content": [
        f"<Question>{instruction}</Question>\n当前屏幕截图:",  # "Current screenshot:"
        image
    ]
}]

# 5. Inference
ACTION_SCHEMA = json.load(open('eval/utils/schema/schema.json', encoding="utf-8"))
items = list(ACTION_SCHEMA.items())
insert_index = 3
items.insert(insert_index, ("required", ["thought"]))  # enable/disable thought by setting it to "required"/"optional"
ACTION_SCHEMA = dict(items)
# The system prompt (in Chinese) defines the agent role, the task, the output rules
# (compact JSON that must follow the schema), and embeds the action schema itself.
SYSTEM_PROMPT = f'''# Role
你是一名熟悉安卓系统触屏GUI操作的智能体,将根据用户的问题,分析当前界面的GUI元素和布局,生成相应的操作。

# Task
针对用户问题,根据输入的当前屏幕截图,输出下一步的操作。

# Rule
- 以紧凑JSON格式输出
- 输出操作必须遵循Schema约束

# Schema
{json.dumps(ACTION_SCHEMA, indent=None, ensure_ascii=False, separators=(',', ':'))}'''

outputs = model.chat(
    image=None,
    msgs=messages,
    system_prompt=SYSTEM_PROMPT,
    tokenizer=tokenizer,
    temperature=0.1,
    top_p=0.3,
    n=1,
)

# 6. Output
print(outputs)
```

Expected output:

```JSON
{"thought":"任务目标是点击屏幕上的‘会员’按钮。当前界面显示了应用的推荐页面,顶部有一个导航栏。点击‘会员’按钮可以访问应用的会员相关内容。","POINT":[729,69]}
```
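
The reply is a compact JSON string; here the `thought` field is the model's Chinese-language reasoning ("the goal is to tap the '会员' button; the current screen is the recommendation page with a navigation bar at the top...") and `POINT` is the tap location. Below is a minimal parsing sketch; the rescaling step assumes `POINT` coordinates are normalized to a 0-1000 range relative to the screenshot, which you should verify against the action schema before relying on it:

```python
import json

# `outputs` is the reply from model.chat above (take the first item if a list is returned).
raw = outputs if isinstance(outputs, str) else outputs[0]
action = json.loads(raw)

print(action.get("thought"))       # reasoning text, present when "thought" is required
x_norm, y_norm = action["POINT"]   # e.g. [729, 69]

# Assumption: POINT is normalized to a 0-1000 grid over the screenshot;
# rescale to the real device resolution before dispatching a tap.
screen_w, screen_h = 1080, 2400    # hypothetical device resolution
tap_x = int(x_norm / 1000 * screen_w)
tap_y = int(y_norm / 1000 * screen_h)
print(tap_x, tap_y)
```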

#### vLLM Inference

```bash
# Launch the vLLM server
vllm serve model/AgentCPM-GUI --served-model-name AgentCPM-GUI --tensor_parallel_size 1 --trust-remote-code
```
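
Once the server is up, you can sanity-check the OpenAI-compatible endpoint before sending screenshots; this small sketch only assumes the default port 8000 used in the client code below:

```python
import requests

# The vLLM server exposes an OpenAI-compatible API; /v1/models should list AgentCPM-GUI.
resp = requests.get("http://localhost:8000/v1/models", timeout=10)
print(resp.json())
```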

```python
import base64
import io
import json
import requests
from PIL import Image

END_POINT = "http://localhost:8000/v1/chat/completions"  # Replace with actual endpoint

# system prompt (same construction as in the Hugging Face example above)
ACTION_SCHEMA = json.load(open('eval/utils/schema/schema.json', encoding="utf-8"))
items = list(ACTION_SCHEMA.items())
insert_index = 3
items.insert(insert_index, ("required", ["thought"]))  # enable/disable thought by setting it to "required"/"optional"
ACTION_SCHEMA = dict(items)
SYSTEM_PROMPT = f'''# Role
你是一名熟悉安卓系统触屏GUI操作的智能体,将根据用户的问题,分析当前界面的GUI元素和布局,生成相应的操作。

# Task
针对用户问题,根据输入的当前屏幕截图,输出下一步的操作。

# Rule
- 以紧凑JSON格式输出
- 输出操作必须遵循Schema约束

# Schema
{json.dumps(ACTION_SCHEMA, indent=None, ensure_ascii=False, separators=(',', ':'))}'''

def encode_image(image: Image.Image) -> str:
    """Convert PIL Image to base64-encoded string."""
    with io.BytesIO() as in_mem_file:
        image.save(in_mem_file, format="JPEG")
        in_mem_file.seek(0)
        return base64.b64encode(in_mem_file.read()).decode("utf-8")

def __resize__(origin_img):
    """Resize the longer side to 1120 px to save compute & memory."""
    resolution = origin_img.size
    w, h = resolution
    max_line_res = 1120
    if max_line_res is not None:
        max_line = max_line_res
        if h > max_line:
            w = int(w * max_line / h)
            h = max_line
        if w > max_line:
            h = int(h * max_line / w)
            w = max_line
    img = origin_img.resize((w, h), resample=Image.Resampling.LANCZOS)
    return img

def predict(text_prompt: str, image: Image.Image):
    """Send one instruction + screenshot to the server and return the raw reply."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": [
            {"type": "text", "text": f"<Question>{text_prompt}</Question>\n当前屏幕截图:"},  # "Current screenshot:"
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encode_image(image)}"}}
        ]}
    ]

    payload = {
        "model": "AgentCPM-GUI",  # Your model name
        "temperature": 0.1,
        "messages": messages,
        "max_tokens": 2048,
    }

    headers = {
        "Content-Type": "application/json",
    }

    response = requests.post(END_POINT, headers=headers, json=payload)
    assistant_msg = response.json()["choices"][0]["message"]["content"]
    return assistant_msg

image = __resize__(Image.open("assets/test.jpeg"))
instruction = "请点击屏幕上的‘会员’按钮"  # "Please tap the '会员' (membership) button on the screen"
response = predict(instruction, image)
print(response)
```
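
Because the vLLM server speaks the OpenAI chat-completions protocol, the official `openai` Python client can replace the hand-built `requests` call. The sketch below rests on that assumption and reuses `SYSTEM_PROMPT`, `encode_image`, `instruction`, and the resized `image` from the previous snippet:

```python
# Sketch: same request via the openai client (pip install openai).
# SYSTEM_PROMPT, encode_image, instruction, and image come from the snippet above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key

completion = client.chat.completions.create(
    model="AgentCPM-GUI",
    temperature=0.1,
    max_tokens=2048,
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": [
            {"type": "text", "text": f"<Question>{instruction}</Question>\n当前屏幕截图:"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encode_image(image)}"}},
        ]},
    ],
)
print(completion.choices[0].message.content)
```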

## Fine-tuning

Source code for SFT and RFT training is provided — see [SFT](sft/readme.md) and [RFT](rft/readme.md).

## Performance Evaluation

### Grounding Benchmark

| Model                 | fun2point | text2point | bbox2text | average  |
| --------------------- | --------- | ---------- | --------- | -------- |
| **AgentCPM-GUI-8B**   | **79.1**  | **76.5**   | **58.2**  | **71.3** |
| Qwen2.5-VL-7B         | 36.8      | 52.0       | 44.1      | 44.3     |
| Intern2.5-VL-8B       | 17.2      | 24.2       | 45.9      | 29.1     |
| Intern2.5-VL-26B      | 14.8      | 16.6       | 36.3      | 22.6     |
| OS-Genesis-7B         | 8.3       | 5.8        | 4.0       | 6.0      |
| UI-TARS-7B            | 56.8      | 66.7       | 1.4       | 41.6     |
| OS-Atlas-7B           | 53.6      | 60.7       | 0.4       | 38.2     |
| Aguvis-7B             | 60.8      | **76.5**   | 0.2       | 45.8     |
| GPT-4o                | 22.1      | 19.9       | 14.3      | 18.8     |
| GPT-4o with Grounding | 44.3      | 44.0       | 14.3      | 44.2     |

### Agent Benchmark

| Dataset             | Android Control-Low TM | Android Control-Low EM | Android Control-High TM | Android Control-High EM | GUI-Odyssey TM | GUI-Odyssey EM | AITZ TM   | AITZ EM   | Chinese APP TM | Chinese APP EM |
| ------------------- | ---------------------- | ---------------------- | ----------------------- | ----------------------- | -------------- | -------------- | --------- | --------- | -------------- | -------------- |
| **AgentCPM-GUI-8B** | **94.39**              | **90.20**              | **77.70**               | **69.17**               | **90.85**      | **74.96**      | **85.71** | **76.38** | **96.86**      | **91.28**      |
| Qwen2.5-VL-7B       | 92.11                  | 82.12                  | 69.65                   | 57.36                   | 55.33          | 40.90          | 73.16     | 57.58     | 68.53          | 48.80          |
| UI-TARS-7B          | 93.52                  | 88.89                  | 68.53                   | 60.81                   | 78.79          | 57.33          | 71.74     | 55.31     | 71.01          | 53.92          |
| OS-Genesis-7B       | 90.74                  | 74.22                  | 65.92                   | 44.43                   | 11.67          | 3.63           | 19.98     | 8.45      | 38.10          | 14.50          |
| OS-Atlas-7B         | 73.03                  | 67.25                  | 70.36                   | 56.53                   | 91.83*         | 76.76*         | 74.13     | 58.45     | 81.53          | 55.89          |
| Aguvis-7B           | 93.85                  | 89.40                  | 65.56                   | 54.18                   | 26.71          | 13.54          | 35.71     | 18.99     | 67.43          | 38.20          |
| OdysseyAgent-7B     | 65.10                  | 39.16                  | 58.80                   | 32.74                   | 90.83          | 73.67          | 59.17     | 31.60     | 67.56          | 25.44          |
| GPT-4o              | -                      | 19.49                  | -                       | 20.80                   | -              | 20.39          | 70.00     | 35.30     | 3.67           | 3.67           |
| Gemini 2.0          | -                      | 28.50                  | -                       | 60.20                   | -              | 3.27           | -         | -         | -              | -              |
| Claude              | -                      | 19.40                  | -                       | 12.50                   | 60.90          | -              | -         | -         | -              | -              |

> \*Different train/test splits

All evaluation data and code are open-sourced — see [here](eval) for details.

## Evaluation Data

We provide **CAGUI**, an evaluation benchmark for Chinese apps covering **grounding** and **agent** tasks.
See the dataset on [Hugging Face](https://huggingface.co/datasets/openbmb/CAGUI).
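
A minimal loading sketch with the `datasets` library is shown below; the configuration and split names are not specified here, so treat them as placeholders and check the dataset card for the actual ones:

```python
# Sketch: pull CAGUI from the Hugging Face Hub with the `datasets` library.
# If the dataset card lists multiple configurations, pass the config name explicitly.
from datasets import load_dataset

cagui = load_dataset("openbmb/CAGUI")
print(cagui)
```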

## License

* Code in this repository is released under the [Apache-2.0](./LICENSE) license.

## Citation

If **AgentCPM-GUI** is useful for your research, please cite:

```bibtex
@misc{2025,
  author       = {THUNLP},
  title        = {AgentCPM-GUI},
  year         = {2025},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/OpenBMB/AgentCPM-GUI}}
}
```