AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents
Abstract: Autonomous agents have become increasingly important for interacting with the real world. Android agents, in particular, have recently emerged as a frequently studied interaction method. However, existing work on training and evaluating Android agents lacks systematic study of both open-source and closed-source models. In this work, we propose AndroidLab, a systematic Android agent framework. It includes an operation environment with different modalities and an action space, together with a reproducible benchmark, and it supports both large language models (LLMs) and large multimodal models (LMMs) in the same action space. The AndroidLab benchmark comprises predefined Android virtual devices and 138 tasks across nine apps built on these devices. Using the AndroidLab environment, we develop an Android Instruction dataset and train six open-source LLMs and LMMs, lifting the average success rates from 4.59% to 21.50% for LLMs and from 1.93% to 13.28% for LMMs. AndroidLab is open-sourced and publicly available at https://github.com/THUDM/Android-Lab.
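To make the idea of a single action space shared by text-only and multimodal agents concrete, below is a minimal sketch, under stated assumptions, of what such a modality-agnostic action space driven through `adb` could look like: an LLM reading the accessibility/XML tree and an LMM reading screenshots would both emit the same actions. All names here (`Tap`, `Swipe`, `Type`, `PressKey`, `Finish`, `run_adb`, `execute`) are illustrative assumptions, not AndroidLab's actual API.

```python
import shlex
import subprocess
from dataclasses import dataclass
from typing import Union


def run_adb(cmd: str) -> None:
    """Forward an input command to the connected emulator via `adb shell`."""
    subprocess.run(["adb", "shell"] + shlex.split(cmd), check=True)


@dataclass
class Tap:
    x: int  # screen x coordinate, pixels
    y: int  # screen y coordinate, pixels


@dataclass
class Swipe:
    x1: int
    y1: int
    x2: int
    y2: int


@dataclass
class Type:
    text: str  # text typed into the currently focused field


@dataclass
class PressKey:
    key: str  # e.g. "BACK", "HOME", "ENTER"


@dataclass
class Finish:
    answer: str = ""  # final answer for query-style tasks


# One action type shared by all agents, regardless of input modality.
Action = Union[Tap, Swipe, Type, PressKey, Finish]


def execute(action: Action) -> None:
    """Dispatch one agent-emitted action to the device."""
    if isinstance(action, Tap):
        run_adb(f"input tap {action.x} {action.y}")
    elif isinstance(action, Swipe):
        run_adb(f"input swipe {action.x1} {action.y1} {action.x2} {action.y2}")
    elif isinstance(action, Type):
        # `adb shell input text` requires spaces to be encoded as %s.
        run_adb("input text " + action.text.replace(" ", "%s"))
    elif isinstance(action, PressKey):
        run_adb(f"input keyevent KEYCODE_{action.key}")
    # Finish has no device-side effect; the harness records the answer.


if __name__ == "__main__":
    execute(Tap(x=540, y=1200))  # tap the center of a 1080x2400 screen
```

Routing every agent, text-only or multimodal, through one executor like this is what makes success rates directly comparable across LLMs and LMMs.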