---
base_model:
  - Qwen/Qwen2-VL-7B-Instruct
license: mit
library_name: transformers
pipeline_tag: image-text-to-text
---

GUI-Actor-7B with Qwen2-VL-7B as backbone VLM

This model was introduced in the paper GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents. It is built on Qwen2-VL-7B-Instruct, augmented with an attention-based action head, and fine-tuned to perform GUI grounding using the dataset here (coming soon).

For more details on model design and evaluation, please check: 🏠 Project Page | 💻 GitHub Repo | 📑 Paper.
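
The description above says only that the action head is attention-based, so the following is a toy sketch and not the GUI-Actor implementation. It assumes the head produces one score per image patch on a regular grid, and shows how such scores could be turned into top-k normalized click points similar in shape to the `topk_points` returned in the usage example. The function name, grid layout, and scores are all hypothetical.

```python
import math


def topk_points_from_attention(scores, grid_h, grid_w, k=3):
    """Map per-patch scores to the k most likely normalized (x, y) patch centers.

    Hypothetical helper: `scores` is a flat, row-major list with one score
    per patch of a grid_h x grid_w grid.
    """
    if len(scores) != grid_h * grid_w:
        raise ValueError("expected exactly one score per patch")

    # Softmax over patches, shifted by the max score for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Patch indices sorted by weight, highest first, truncated to k.
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:k]

    points = []
    for idx in order:
        row, col = divmod(idx, grid_w)
        # Use the patch center, normalized to [0, 1] in both axes.
        points.append(((col + 0.5) / grid_w, (row + 0.5) / grid_h))
    return points, [weights[i] for i in order]


# Toy 2x3 grid in which the patch at row 0, column 2 has the highest score.
toy_scores = [0.1, 0.2, 3.0, 0.0, 1.0, 0.3]
toy_points, toy_probs = topk_points_from_attention(toy_scores, grid_h=2, grid_w=3, k=2)
print(toy_points, toy_probs)
```

Returning patch centers instead of regressing raw coordinates is one way to read "coordinate-free"; the actual model design is documented in the paper and repository linked above.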

| Model Name | Hugging Face Link |
|---|---|
| GUI-Actor-7B-Qwen2-VL | 🤗 Hugging Face |
| GUI-Actor-2B-Qwen2-VL | 🤗 Hugging Face |
| GUI-Actor-7B-Qwen2.5-VL | 🤗 Hugging Face |
| GUI-Actor-3B-Qwen2.5-VL | 🤗 Hugging Face |
| GUI-Actor-Verifier-2B | 🤗 Hugging Face |

📊 Performance Comparison on GUI Grounding Benchmarks

Table 1. Main results on ScreenSpot-Pro, ScreenSpot, and ScreenSpot-v2 with Qwen2-VL as the backbone. † indicates scores obtained from our own evaluation of the official models on Hugging Face.

| Method | Backbone VLM | ScreenSpot-Pro | ScreenSpot | ScreenSpot-v2 |
|---|---|---|---|---|
| *72B models:* | | | | |
| AGUVIS-72B | Qwen2-VL | - | 89.2 | - |
| UGround-V1-72B | Qwen2-VL | 34.5 | 89.4 | - |
| UI-TARS-72B | Qwen2-VL | 38.1 | 88.4 | 90.3 |
| *7B models:* | | | | |
| OS-Atlas-7B | Qwen2-VL | 18.9 | 82.5 | 84.1 |
| AGUVIS-7B | Qwen2-VL | 22.9 | 84.4 | 86.0† |
| UGround-V1-7B | Qwen2-VL | 31.1 | 86.3 | 87.6† |
| UI-TARS-7B | Qwen2-VL | 35.7 | 89.5 | 91.6 |
| GUI-Actor-7B | Qwen2-VL | 40.7 | 88.3 | 89.5 |
| GUI-Actor-7B + Verifier | Qwen2-VL | 44.2 | 89.7 | 90.9 |
| *2B models:* | | | | |
| UGround-V1-2B | Qwen2-VL | 26.6 | 77.1 | - |
| UI-TARS-2B | Qwen2-VL | 27.7 | 82.3 | 84.7 |
| GUI-Actor-2B | Qwen2-VL | 36.7 | 86.5 | 88.6 |
| GUI-Actor-2B + Verifier | Qwen2-VL | 41.8 | 86.9 | 89.3 |

Table 2. Main results on ScreenSpot-Pro and ScreenSpot-v2 with Qwen2.5-VL as the backbone.

| Method | Backbone VLM | ScreenSpot-Pro | ScreenSpot-v2 |
|---|---|---|---|
| *7B models:* | | | |
| Qwen2.5-VL-7B | Qwen2.5-VL | 27.6 | 88.8 |
| Jedi-7B | Qwen2.5-VL | 39.5 | 91.7 |
| GUI-Actor-7B | Qwen2.5-VL | 44.6 | 92.1 |
| GUI-Actor-7B + Verifier | Qwen2.5-VL | 47.7 | 92.5 |
| *3B models:* | | | |
| Qwen2.5-VL-3B | Qwen2.5-VL | 25.9 | 80.9 |
| Jedi-3B | Qwen2.5-VL | 36.1 | 88.6 |
| GUI-Actor-3B | Qwen2.5-VL | 42.2 | 91.0 |
| GUI-Actor-3B + Verifier | Qwen2.5-VL | 45.9 | 92.4 |
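
This README does not spell out the scoring rule behind the numbers above. Assuming the usual ScreenSpot-style criterion, where a prediction counts as correct when the click point lies inside the ground-truth box, accuracy could be computed roughly as follows. The helper names and the second sample are made up for illustration; the first sample reuses the point and box printed in the usage example of this README.

```python
def point_in_bbox(point, bbox):
    """Return True if a normalized (x, y) point lies inside (x1, y1, x2, y2)."""
    x, y = point
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2


def grounding_accuracy(points, bboxes):
    """Percentage of predicted points that fall inside their ground-truth box."""
    if len(points) != len(bboxes):
        raise ValueError("points and bboxes must have the same length")
    if not points:
        return 0.0
    hits = sum(point_in_bbox(p, b) for p, b in zip(points, bboxes))
    return 100.0 * hits / len(points)


predicted = [
    (0.9709, 0.1548),  # point from the usage example in this README
    (0.5, 0.5),        # made-up miss
]
ground_truth = [
    (0.9479, 0.1444, 0.9938, 0.2074),  # box from the usage example
    (0.1, 0.1, 0.2, 0.2),              # made-up box
]
accuracy = grounding_accuracy(predicted, ground_truth)
print(f"accuracy: {accuracy:.1f}")
```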

🚀 Usage

import torch

from qwen_vl_utils import process_vision_info
from datasets import load_dataset
from transformers import Qwen2VLProcessor
from gui_actor.constants import chat_template
from gui_actor.modeling import Qwen2VLForConditionalGenerationWithPointer
from gui_actor.inference import inference


# load model
model_name_or_path = "microsoft/GUI-Actor-7B-Qwen2-VL"
data_processor = Qwen2VLProcessor.from_pretrained(model_name_or_path)
tokenizer = data_processor.tokenizer
model = Qwen2VLForConditionalGenerationWithPointer.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2"
).eval()

# prepare example
dataset = load_dataset("rootsautomation/ScreenSpot")["test"]
example = dataset[0]
print(f"Instruction: {example['instruction']}")
print(f"ground-truth action region (x1, y1, x2, y2): {[round(i, 4) for i in example['bbox']]}")

conversation = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are a GUI agent. You are given a task and a screenshot of the screen. You need to perform a series of pyautogui actions to complete the task.",
            }
        ]
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": example["image"], # a PIL.Image.Image or a path string
                # "image_url": "https://xxxxx.png" or "https://xxxxx.jpg" or "file://xxxxx.png" or "data:image/png;base64,xxxxxxxx", will be split by "base64,"
            },
            {
                "type": "text",
                "text": example["instruction"]
            },
        ],
    },
]

# inference
pred = inference(conversation, model, tokenizer, data_processor, use_placeholder=True, topk=3)
px, py = pred["topk_points"][0]
print(f"Predicted click point: [{round(px, 4)}, {round(py, 4)}]")

# >> Model Response
# Instruction: close this window
# ground-truth action region (x1, y1, x2, y2): [0.9479, 0.1444, 0.9938, 0.2074]
# Predicted click point: [0.9709, 0.1548]
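
The predicted point and the ground-truth box above are normalized to the range [0, 1]. To act on a real screen, the point has to be scaled by the screenshot size. A minimal sketch, assuming a 1920x1080 screenshot (the README does not state the image size; with a PIL image, `example["image"].size` gives `(width, height)`) and a hypothetical helper name `to_pixels`:

```python
def to_pixels(point, width, height):
    """Scale a normalized (x, y) point to integer pixel coordinates.

    The result is clamped so it always addresses a valid pixel.
    """
    x, y = point
    px = min(max(round(x * width), 0), width - 1)
    py = min(max(round(y * height), 0), height - 1)
    return px, py


# Predicted click point from the output above; the screen size is an assumption.
click_x, click_y = to_pixels((0.9709, 0.1548), width=1920, height=1080)
print(click_x, click_y)
```

The resulting pixel coordinates could then be handed to an automation library, for example `pyautogui.click(click_x, click_y)`, if the screenshot covers the full screen at native resolution.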

📝 Citation

@article{wu2025guiactor,
    title={GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents}, 
    author={Qianhui Wu and Kanzhi Cheng and Rui Yang and Chaoyun Zhang and Jianwei Yang and Huiqiang Jiang and Jian Mu and Baolin Peng and Bo Qiao and Reuben Tan and Si Qin and Lars Liden and Qingwei Lin and Huan Zhang and Tong Zhang and Jianbing Zhang and Dongmei Zhang and Jianfeng Gao},
    year={2025},
    eprint={2506.03143},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://www.arxiv.org/pdf/2506.03143},
}