microsoft/GUI-Actor-2B-Qwen2-VL

Image-Text-to-Text

text-generation-inference


qianhuiwu committed on Jun 3

Commit 9979c7f · verified · 1 parent: d6f77cc

update model card.

Files changed (1)

README.md +5 -5


@@ -6,6 +6,11 @@ base_model:
 # GUI-Actor-2B with Qwen2-VL-2B as backbone VLM
 | Model Name                                  | Hugging Face Link                         |
 |--------------------------------------------|--------------------------------------------|
 | **GUI-Actor-7B-Qwen2-VL**                   | [🤗 Hugging Face](https://huggingface.co/microsoft/GUI-Actor-7B-Qwen2-VL)         |
@@ -14,11 +19,6 @@ base_model:
 | **GUI-Actor-3B-Qwen2.5-VL (coming soon)**   | [🤗 Hugging Face](https://huggingface.co/microsoft/GUI-Actor-3B-Qwen2.5-VL)       |
 | **GUI-Actor-Verifier-2B**                   | [🤗 Hugging Face](https://huggingface.co/microsoft/GUI-Actor-Verifier-2B)        |
-This model was introduced in the paper [**GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents**](https://aka.ms/GUI-Actor).
-It is developed based on [Qwen2-VL-2B-Instruct ](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct), augmented by an attention-based action head and finetuned to perform GUI grounding using the dataset [here (coming soon)]().
-For more details on model design and evaluation, please check: [🏠 Project Page](https://aka.ms/GUI-Actor) | [💻 Github Repo](https://github.com/microsoft/GUI-Actor) | [📑 Paper]().
 ## 📊 Performance Comparison on GUI Grounding Benchmarks
 Table 1. Main results on ScreenSpot-Pro, ScreenSpot, and ScreenSpot-v2 with **Qwen2-VL** as the backbone. † indicates scores obtained from our own evaluation of the official models on Huggingface.
 | Method           | Backbone VLM | ScreenSpot-Pro | ScreenSpot | ScreenSpot-v2 |

 # GUI-Actor-2B with Qwen2-VL-2B as backbone VLM
+This model was introduced in the paper [**GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents**](https://aka.ms/GUI-Actor).
+It is developed based on [Qwen2-VL-2B-Instruct ](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct), augmented by an attention-based action head and finetuned to perform GUI grounding using the dataset [here (coming soon)]().
+For more details on model design and evaluation, please check: [🏠 Project Page](https://aka.ms/GUI-Actor) | [💻 Github Repo](https://github.com/microsoft/GUI-Actor) | [📑 Paper]().
 | Model Name                                  | Hugging Face Link                         |
 |--------------------------------------------|--------------------------------------------|
 | **GUI-Actor-7B-Qwen2-VL**                   | [🤗 Hugging Face](https://huggingface.co/microsoft/GUI-Actor-7B-Qwen2-VL)         |
 | **GUI-Actor-3B-Qwen2.5-VL (coming soon)**   | [🤗 Hugging Face](https://huggingface.co/microsoft/GUI-Actor-3B-Qwen2.5-VL)       |
 | **GUI-Actor-Verifier-2B**                   | [🤗 Hugging Face](https://huggingface.co/microsoft/GUI-Actor-Verifier-2B)        |
 ## 📊 Performance Comparison on GUI Grounding Benchmarks
 Table 1. Main results on ScreenSpot-Pro, ScreenSpot, and ScreenSpot-v2 with **Qwen2-VL** as the backbone. † indicates scores obtained from our own evaluation of the official models on Huggingface.
 | Method           | Backbone VLM | ScreenSpot-Pro | ScreenSpot | ScreenSpot-v2 |