AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents
Abstract
AndroidLab provides a systematic framework for training and evaluating Android agents that supports both large language models and multimodal models; instruction tuning within it substantially improves their task success rates.
Autonomous agents have become increasingly important for interacting with the real world, and Android agents in particular have recently emerged as a frequently discussed interaction method. However, existing work on training and evaluating Android agents lacks systematic study of both open-source and closed-source models. In this work, we propose AndroidLab as a systematic Android agent framework. It includes an operation environment with different modalities and an action space, together with a reproducible benchmark, and it supports both large language models (LLMs) and multimodal models (LMMs) in the same action space. The AndroidLab benchmark includes predefined Android virtual devices and 138 tasks across nine apps built on these devices. Using the AndroidLab environment, we develop an Android Instruction dataset and train six open-source LLMs and LMMs, lifting the average success rates from 4.59% to 21.50% for LLMs and from 1.93% to 13.28% for LMMs. AndroidLab is open-sourced and publicly available at https://github.com/THUDM/Android-Lab.
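To make the unified action space concrete, here is a minimal sketch of the observe-act loop such an agent runs, approximated with standard adb shell input commands. This is an illustration under assumptions, not the AndroidLab API itself; the real environment wrapper may expose these primitives differently, and `query_model` below is a hypothetical model call.

```python
# Minimal sketch of an Android agent's observe-act loop, approximating a
# unified action space (tap / swipe / type) with standard adb commands.
# Illustrative only; the actual AndroidLab environment API may differ.
import subprocess

def adb(*args: str) -> str:
    """Run an adb command against the connected emulator and return stdout."""
    return subprocess.run(["adb", *args], capture_output=True, text=True).stdout

def observe_xml() -> str:
    """Text-mode observation: dump the current UI hierarchy as XML."""
    adb("shell", "uiautomator", "dump", "/sdcard/ui.xml")
    return adb("shell", "cat", "/sdcard/ui.xml")

def tap(x: int, y: int) -> None:
    adb("shell", "input", "tap", str(x), str(y))

def swipe(x1: int, y1: int, x2: int, y2: int, duration_ms: int = 300) -> None:
    adb("shell", "input", "swipe", str(x1), str(y1), str(x2), str(y2), str(duration_ms))

def type_text(text: str) -> None:
    # `adb shell input text` requires spaces to be encoded as %s.
    adb("shell", "input", "text", text.replace(" ", "%s"))

# A model-in-the-loop episode would then look roughly like:
#   for _ in range(max_steps):
#       action = query_model(task, observe_xml())  # hypothetical model call
#       execute(action)                            # dispatch to tap/swipe/type
```

Because the action space is shared, text-only and multimodal agents can reuse the same executor; only the observation fed to the model changes.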
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation (2024)
- Lightweight Neural App Control (2024)
- ClickAgent: Enhancing UI Location Capabilities of Autonomous Agents (2024)
- OS-ATLAS: A Foundation Action Model for Generalist GUI Agents (2024)
- MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control (2024)
My read of this paper:
AndroidLab: First ever systematic benchmark for Android mobile agents shows that small, fine-tuned open models can power a JARVIS system on your smartphone 📱🔥
A team from Tsinghua University just released AndroidLab, the first systematic framework to evaluate and train Android mobile agents that works with both text-only and multimodal models.
They show that fine-tuning small open-source models can significantly boost performance, approaching that of much bigger closed models like GPT-4o.
The team built:
📊 A reproducible benchmark with 138 tasks across 9 apps to evaluate mobile agents systematically
📝📱 A framework supporting both text-only (via XML) and visual (via marked screenshots) interfaces (a sketch of the two modes follows after this list)
✅ An instruction dataset of 10.5k operation traces for training mobile agents
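Here is a hedged sketch of how the same task might be packaged for the two interface modes: compressed UI XML for a text-only LLM versus a screenshot annotated with numbered element marks for a multimodal LMM. The `Observation` dataclass, `build_prompt`, and the message layout are illustrative assumptions, not the paper's actual API.

```python
# Hedged sketch: one task, two observation modes. Text-only LLMs receive the
# UI as XML; multimodal LMMs receive a screenshot with numbered element marks.
# All names here (Observation, build_prompt, message layout) are illustrative.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Observation:
    mode: str                           # "xml" (text-only) or "som" (marked screenshot)
    ui_xml: Optional[str] = None        # compressed UI hierarchy for text mode
    marked_png: Optional[bytes] = None  # screenshot with numbered marks for LMMs

def build_prompt(task: str, obs: Observation) -> list:
    """Format one turn for a chat model; both modes share the same action space."""
    if obs.mode == "xml":
        return [{"role": "user", "content":
                 f"Task: {task}\nCurrent UI (XML):\n{obs.ui_xml}\n"
                 "Reply with exactly one action, e.g. tap(element=12)."}]
    # Multimodal mode: the model refers to elements by their mark index.
    return [{"role": "user", "content": [
        {"type": "text", "text": f"Task: {task}\nReply with exactly one action."},
        {"type": "image", "image": obs.marked_png},
    ]}]
```

Keeping the action format identical across modes is what lets the benchmark compare LLM and LMM agents head to head.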
Key insights:
- 📈 Fine-tuning improves performance BY A LOT: Open-source model Llama-3.1-8B improves from 2% to 24% success rate after training, nearly reaching GPT-4o performance although it’s much smaller
- ⚙️ Text-only agents match multimodal ones: XML-based agents achieve similar performance to screenshot-based multimodal agents.
Congrats on this great work 🤗