Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms

Published 24 Oct 2024 in cs.CV, cs.CL, and cs.LG | arXiv:2410.18967v2

Abstract: Building a generalist model for user interface (UI) understanding is challenging due to various foundational issues, such as platform diversity, resolution variation, and data limitation. In this paper, we introduce Ferret-UI 2, a multimodal LLM (MLLM) designed for universal UI understanding across a wide range of platforms, including iPhone, Android, iPad, Webpage, and AppleTV. Building on the foundation of Ferret-UI, Ferret-UI 2 introduces three key innovations: support for multiple platform types, high-resolution perception through adaptive scaling, and advanced task training data generation powered by GPT-4o with set-of-mark visual prompting. These advancements enable Ferret-UI 2 to perform complex, user-centered interactions, making it highly versatile and adaptable for the expanding diversity of platform ecosystems. Extensive empirical experiments on referring, grounding, user-centric advanced tasks (comprising 9 subtasks $\times$ 5 platforms), GUIDE next-action prediction dataset, and GUI-World multi-platform benchmark demonstrate that Ferret-UI 2 significantly outperforms Ferret-UI, and also shows strong cross-platform transfer capabilities.


Summary

  • The paper demonstrates a novel multimodal model that achieves universal UI understanding across various platforms through adaptive scaling and efficient grid optimization.
  • The paper introduces enhanced dataset construction using human-collected annotations and GPT-4o synthesized visual prompts to improve training quality.
  • The paper reports high benchmark scores in elementary and multi-round tasks, showcasing robust performance and zero-shot cross-platform transferability.

Universal User Interface Understanding with Ferret-UI 2

Introduction to Ferret-UI 2

The study introduces Ferret-UI 2, a multimodal LLM (MLLM) designed for universal user interface (UI) understanding across multiple platforms, namely iPhone, Android, iPad, Webpage, and AppleTV. It builds on the foundation laid by Ferret-UI and introduces critical enhancements to overcome foundational challenges such as platform diversity, resolution variation, and data limitations. Ferret-UI 2 supports seamless UI interaction through single-step exchanges and improves high-resolution perception using adaptive scaling. The model also benefits from advanced training-data creation powered by GPT-4o with set-of-mark visual prompting (Figure 1).

Figure 1: Real examples of a single Ferret-UI 2 model interacting with four different platforms (iPhone, iPad, Webpage, and AppleTV) for UI understanding.

Dataset Construction and Task Generation

A significant advancement in Ferret-UI 2 lies in dataset construction (Figure 2). Central to this improvement is the use of human-collected annotations or HTML-parsed bounding boxes, rather than model-detected ones, which improves data quality. The "Core-set" dataset, drawn from diverse platform types, is filtered to ensure high annotation quality, excluding non-ASCII text and malformed bounding boxes. Task data generation covers elementary tasks (including referring and grounding) and advanced tasks synthesized by GPT-4o with visual prompting, supporting comprehensive UI understanding and user-centered interaction.

Figure 2: Illustration of the Core-set data generation pipeline.
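
The filtering step described above is straightforward to picture in code. The sketch below is a minimal, hypothetical illustration of that kind of annotation cleaning; the record fields and helper names are assumptions for illustration, not the paper's actual pipeline.

```python
# Hypothetical sketch of Core-set annotation filtering: drop non-ASCII labels
# and malformed bounding boxes. Field names are illustrative assumptions.
from typing import List, TypedDict


class Annotation(TypedDict):
    label: str           # UI widget text or type
    bbox: List[float]    # [x1, y1, x2, y2] in pixels


def is_ascii(text: str) -> bool:
    """Keep only annotations whose label is plain ASCII text."""
    return all(ord(ch) < 128 for ch in text)


def is_valid_bbox(bbox: List[float], width: int, height: int) -> bool:
    """Reject malformed boxes: wrong length, inverted corners, or out of bounds."""
    if len(bbox) != 4:
        return False
    x1, y1, x2, y2 = bbox
    return 0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height


def filter_annotations(anns: List[Annotation], width: int, height: int) -> List[Annotation]:
    """Filter one screenshot's annotations down to clean, well-formed entries."""
    return [a for a in anns
            if is_ascii(a["label"]) and is_valid_bbox(a["bbox"], width, height)]
```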

Advanced task data is generated by prompting GPT-4o with set-of-mark visual prompts: specific UI widgets are marked with minimalistic, corner-style bounding boxes paired with unique tags, which makes it easier for GPT-4o to reason about individual widgets and their interactions (Figure 3).

Figure 3: Example of set-of-mark visual prompting with generated advanced task training example.
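
To make the "corner-style bounding boxes with unique tags" idea concrete, here is a minimal sketch of how such marks could be drawn on a screenshot before sending it to GPT-4o. The styling choices (corner length, color, tag placement) are assumptions for illustration, not the paper's exact rendering.

```python
# Illustrative set-of-mark rendering: draw only the four corner ticks of each
# widget's box plus a numeric tag, instead of a full rectangle.
from PIL import Image, ImageDraw


def draw_corner_marks(image: Image.Image, boxes, corner: int = 12, color: str = "red"):
    """Mark each (x1, y1, x2, y2) box with corner ticks and an index tag."""
    draw = ImageDraw.Draw(image)
    for idx, (x1, y1, x2, y2) in enumerate(boxes, start=1):
        # Each corner gets a short horizontal and vertical tick pointing inward.
        for cx, cy, dx, dy in [(x1, y1, 1, 1), (x2, y1, -1, 1),
                               (x1, y2, 1, -1), (x2, y2, -1, -1)]:
            draw.line([(cx, cy), (cx + dx * corner, cy)], fill=color, width=3)
            draw.line([(cx, cy), (cx, cy + dy * corner)], fill=color, width=3)
        draw.text((x1 + 2, y1 + 2), str(idx), fill=color)  # unique tag near the box
    return image


# Usage sketch: the marked screenshot plus widget metadata is then passed to
# GPT-4o to synthesize advanced-task QA that refers to widgets by their tags.
# marked = draw_corner_marks(Image.open("screenshot.png"), widget_boxes)
```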

Model Architecture

Ferret-UI 2 builds on Ferret-UI with an updated architecture (Figure 4) featuring any-resolution capabilities for dynamic high-resolution encoding. Adaptive N-gridding improves local image feature extraction by minimizing aspect-ratio distortion. By choosing grid sizes appropriately, Ferret-UI 2 processes visual data efficiently while adhering to predefined inference-cost limits, preserving high-resolution support for interactions.

Figure 4: Overview of the Ferret-UI 2 model architecture, which allows for seamless UI understanding and user-centered single-step interactions with high-resolution support.

Adaptive N-gridding is a vital component: an algorithm selects the grid size for each screenshot by assessing how much the tiling would distort the image's resolution and aspect ratio.
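
The following is a minimal sketch, under stated assumptions, of such a grid-selection step: choose a (cols, rows) tiling whose aspect ratio best matches the screenshot, breaking ties toward more tiles, while keeping the tile count under a budget that stands in for the inference-cost limit. The scoring rule and defaults are illustrative assumptions, not the paper's exact algorithm.

```python
import math


def select_grid(img_w: int, img_h: int, max_tiles: int = 9):
    """Pick a (cols, rows) tiling whose aspect ratio best matches the image.

    Ties are broken toward more tiles (higher effective resolution), while
    cols * rows stays within the max_tiles inference-cost budget.
    """
    target = img_w / img_h
    best, best_dist = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            dist = abs(math.log(target * rows / cols))  # aspect-ratio distortion
            better = dist < best_dist - 1e-9
            same_but_bigger = abs(dist - best_dist) < 1e-9 and cols * rows > best[0] * best[1]
            if better or same_but_bigger:
                best, best_dist = (cols, rows), dist
    return best


# Example: a tall 1170 x 2532 iPhone screenshot yields (2, 4) under these
# defaults, i.e. the image is split into eight local tiles, each encoded at
# the base resolution of the image encoder.
```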

Experimental Evaluation

The Ferret-UI 2 model demonstrates superior performance across various benchmarks, indicating strong capability for UI understanding and interaction (Figure 5). Notably, it excels both in elementary tasks and in advanced multi-round perception and interaction QA tasks, and it transfers effectively across platforms, showcasing its adaptability and robustness.

Figure 5: Examples of visual prompting using GPT-4o to generate task data for Multi-Round Perception QA and Multi-Round Interaction QA.

Evaluation results show that Ferret-UI 2 achieves high scores in GPT-4o-based evaluations, outperforming existing models and confirming its efficacy in diverse UI environments.
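
For readers unfamiliar with GPT-4o-based scoring, the sketch below illustrates a generic LLM-as-judge setup of the kind implied here. The rubric prompt and 0-100 scale are hypothetical assumptions, not the paper's exact evaluation protocol.

```python
# Hypothetical GPT-4o-as-judge scoring of one advanced-task response.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = (
    "You are grading a UI-assistant answer. Given the question, the reference "
    "answer, and the model answer, reply with a single integer score from 0 to 100."
)


def judge(question: str, reference: str, prediction: str) -> int:
    """Ask GPT-4o to score one prediction against the reference answer."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question: {question}\n"
                                        f"Reference: {reference}\n"
                                        f"Model answer: {prediction}"},
        ],
    )
    return int(response.choices[0].message.content.strip())
```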

Cross-Platform Transferability

Ferret-UI 2's architecture allows for remarkable zero-shot cross-platform transferability, effectively learning from diverse content types and adapting to multiple platforms. This is evidenced by its consistent performance across various UI configurations, suggesting strong transfer learning capabilities that leverage similarities in data distribution and resolution among platforms.

Conclusions

Ferret-UI 2 establishes a robust framework for universal UI understanding across a wide range of platforms. Its advancements in dataset construction, adaptive scaling, and multimodal interaction provide significant improvements over prior models. The system's strong performance, coupled with its cross-platform capabilities, points to promising directions for building generalist agents for universal UI navigation and interaction.
