
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

Published 5 Dec 2024 in cs.CL | (2412.04454v2)

Abstract: Automating GUI tasks remains challenging due to reliance on textual representations, platform-specific action spaces, and limited reasoning capabilities. We introduce Aguvis, a unified vision-based framework for autonomous GUI agents that directly operates on screen images, standardizes cross-platform interactions and incorporates structured reasoning via inner monologue. To enable this, we construct Aguvis Data Collection, a large-scale dataset with multimodal grounding and reasoning annotations, and develop a two-stage training pipeline that separates GUI grounding from planning and reasoning. Experiments show that Aguvis achieves state-of-the-art performance across offline and real-world online benchmarks, marking the first fully autonomous vision-based GUI agent that operates without closed-source models. We open-source all datasets, models, and training recipes at https://aguvis-project.github.io to advance future research.

Summary

  • The paper presents a unified pure vision framework that automates GUI tasks using image-based observations across websites, desktops, and mobile devices.
  • The paper introduces a dual-stage training protocol that integrates visual grounding and planning, reducing reliance on verbose text-based representations.
  • The paper demonstrates superior performance on benchmarks like ScreenSpot and AndroidControl by employing a unified cross-platform action space for efficient reasoning.

Overview of Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

The paper presents Aguvis, a novel framework for building autonomous graphical user interface (GUI) agents, leveraging a unified pure vision approach. Aguvis targets the automation of task execution across digital environments, such as websites, desktops, and mobile devices, using a consistent action space and image-based observations. The framework addresses limitations of contemporary methods that depend on textual representations, which constrain generality, efficiency, and scalability.

Core Competencies and Challenges

The authors outline three critical competencies for effective GUI agents: understanding, grounding, and planning. Understanding high-resolution, complex, human-oriented interfaces gives Aguvis the contextual awareness needed for reasoning; grounding maps natural language instructions to elements in the GUI observation; and planning synthesizes this information to generate actionable steps toward task completion.

The paper identifies significant challenges in addressing grounding and reasoning for GUI tasks:

  1. Pure Vision Framework Enhancement: Traditional models often use textual representations like HTML or accessibility trees, which can be verbose, environment-specific, and hard to generalize. Aguvis utilizes image-based representations that provide a more uniform and efficient basis for GUI interpretation, aligning more closely with intuitive human cognition and lowering inference latency.
  2. Cross-Platform Action Space Unification: Variability across GUI-based interactive environments necessitates a unified action space to enable model generalization. Aguvis pairs vision-based grounding with a "pyautogui" command system, abstracting platform-specific differences into a standardized framework.
  3. Integration of Planning and Grounding: Conventional methods usually rely on closed-source LLMs for reasoning or directly map observations to actions without an explicit reasoning step. Aguvis integrates planning and grounding within a single vision-language model (VLM) pipeline, mitigating reliance on separate reasoning models.
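To make the unified action space concrete, here is an illustrative sketch (not the paper's actual implementation; the function name and action set are hypothetical) of how a model-predicted, platform-agnostic action with normalized screen coordinates could be rendered into a pyautogui-style call that works at any resolution:

```python
# Illustrative sketch of a unified, resolution-independent action space.
# The model predicts coordinates normalized to [0, 1]; the dispatcher
# converts them to absolute pixels for a concrete pyautogui-style command.
# This helper is hypothetical, not from the Aguvis codebase.

def to_pyautogui_command(action: str, x: float, y: float,
                         screen_w: int, screen_h: int) -> str:
    """Render a normalized-coordinate action as a pyautogui command string."""
    px, py = round(x * screen_w), round(y * screen_h)
    if action == "click":
        return f"pyautogui.click(x={px}, y={py})"
    if action == "moveTo":
        return f"pyautogui.moveTo(x={px}, y={py})"
    raise ValueError(f"unsupported action: {action}")

# The same predicted action (0.5, 0.5) maps correctly to any screen size:
print(to_pyautogui_command("click", 0.5, 0.5, 1920, 1080))
# → pyautogui.click(x=960, y=540)
```

Normalized coordinates are one plausible way to keep a single action vocabulary valid across phones, desktops, and browsers, which is the generalization property the unified action space is meant to provide.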

Methodological Innovation

Aguvis proposes an expansive data-collection strategy, unifying existing GUI interaction datasets and augmenting them through systematic template-based expansion. It operationalizes a two-stage training pipeline: the model first learns GUI grounding from large-scale grounding data, then learns planning and reasoning from multi-step trajectory datasets. This staged protocol lets the model master atomic visual grounding before tackling extended agent trajectories, enabling it to handle more complex tasks.
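The two-stage pipeline can be sketched as sequential fine-tuning passes over the same model. The skeleton below is a toy illustration under stated assumptions: the dataset names and the `train()` stand-in are hypothetical, not the paper's actual training code.

```python
# Toy sketch of a two-stage training recipe: stage 1 fits single-step
# grounding pairs (instruction -> screen location), stage 2 continues
# fine-tuning on multi-step agent trajectories with reasoning annotations.
# Names are illustrative; train() is a stand-in that just records stages.

STAGES = [
    {"name": "grounding", "dataset": "grounding_pairs"},             # stage 1
    {"name": "planning_reasoning", "dataset": "agent_trajectories"}, # stage 2
]

def train(model, dataset):
    # Toy stand-in: record which dataset the model was fine-tuned on.
    return model + [dataset]

def run_pipeline(model, stages):
    """Fine-tune the same model sequentially, one stage at a time."""
    for stage in stages:
        model = train(model, dataset=stage["dataset"])
    return model

print(run_pipeline([], STAGES))
# → ['grounding_pairs', 'agent_trajectories']
```

The key design choice captured here is ordering: grounding is learned first as an atomic skill, and the later planning stage builds on it rather than learning both jointly from scratch.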

Across extensive experiments, Aguvis outperforms state-of-the-art methods, achieving higher accuracy on ScreenSpot, Mind2Web, and AndroidControl and demonstrating its ability to autonomously perform tasks in real-world environments. The authors open-source the datasets, models, and training resources to encourage collaborative advancement in this domain.

Future Implications

Practically, the unified framework promises automation advances across complex digital interfaces. Theoretically, integrating diverse GUI environments into a single converging framework pushes forward work on task planning, reasoning, and cross-platform competence.

This paper underscores a potential future direction for autonomous agents: a shift toward purely vision-based models with stronger reasoning and planning abilities that do not depend on closed-source LLMs. The functionality and adaptability of such agents could play a crucial role in expanding AI applications and improving human-computer interaction.

In conclusion, Aguvis marks a significant step forward by pairing vision-language models with practical applications, offering an insightful perspective on advancing the automation capacities of AI agents across diverse GUI environments.
