AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents
Abstract: Autonomous agents have become increasingly important for interacting with the real world. Android agents, in particular, have recently emerged as a frequently studied interaction method. However, existing work on training and evaluating Android agents lacks systematic study of both open-source and closed-source models. In this work, we propose AndroidLab, a systematic Android agent framework. It includes an operation environment with different modalities and an action space, together with a reproducible benchmark, and it supports both large language models (LLMs) and large multimodal models (LMMs) in the same action space. The AndroidLab benchmark comprises predefined Android virtual devices and 138 tasks across nine apps built on these devices. Using the AndroidLab environment, we develop an Android Instruction dataset and train six open-source LLMs and LMMs, lifting the average success rates from 4.59% to 21.50% for LLMs and from 1.93% to 13.28% for LMMs. AndroidLab is open-sourced and publicly available at https://github.com/THUDM/Android-Lab.
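To make the idea of a single action space shared by text-only and multimodal agents concrete, below is a minimal sketch, under stated assumptions, of what such a modality-agnostic action space driven through `adb` could look like: an LLM reading the accessibility/XML tree and an LMM reading screenshots would both emit the same actions. All names here (`Tap`, `Swipe`, `Type`, `PressKey`, `Finish`, `run_adb`, `execute`) are illustrative assumptions, not AndroidLab's actual API.

```python
import shlex
import subprocess
from dataclasses import dataclass
from typing import Union


def run_adb(cmd: str) -> None:
    """Forward an input command to the connected emulator via `adb shell`."""
    subprocess.run(["adb", "shell"] + shlex.split(cmd), check=True)


@dataclass
class Tap:
    x: int  # screen x coordinate, pixels
    y: int  # screen y coordinate, pixels


@dataclass
class Swipe:
    x1: int
    y1: int
    x2: int
    y2: int


@dataclass
class Type:
    text: str  # text typed into the currently focused field


@dataclass
class PressKey:
    key: str  # e.g. "BACK", "HOME", "ENTER"


@dataclass
class Finish:
    answer: str = ""  # final answer for query-style tasks


# One action type shared by all agents, regardless of input modality.
Action = Union[Tap, Swipe, Type, PressKey, Finish]


def execute(action: Action) -> None:
    """Dispatch one agent-emitted action to the device."""
    if isinstance(action, Tap):
        run_adb(f"input tap {action.x} {action.y}")
    elif isinstance(action, Swipe):
        run_adb(f"input swipe {action.x1} {action.y1} {action.x2} {action.y2}")
    elif isinstance(action, Type):
        # `adb shell input text` requires spaces to be encoded as %s.
        run_adb("input text " + action.text.replace(" ", "%s"))
    elif isinstance(action, PressKey):
        run_adb(f"input keyevent KEYCODE_{action.key}")
    # Finish has no device-side effect; the harness records the answer.


if __name__ == "__main__":
    execute(Tap(x=540, y=1200))  # tap the center of a 1080x2400 screen
```

Routing every agent, text-only or multimodal, through one executor like this is what makes success rates directly comparable across LLMs and LMMs.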