Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms

Published 24 Oct 2024 in cs.CV, cs.CL, and cs.LG | arXiv:2410.18967v2

Abstract: Building a generalist model for user interface (UI) understanding is challenging due to various foundational issues, such as platform diversity, resolution variation, and data limitation. In this paper, we introduce Ferret-UI 2, a multimodal LLM (MLLM) designed for universal UI understanding across a wide range of platforms, including iPhone, Android, iPad, Webpage, and AppleTV. Building on the foundation of Ferret-UI, Ferret-UI 2 introduces three key innovations: support for multiple platform types, high-resolution perception through adaptive scaling, and advanced task training data generation powered by GPT-4o with set-of-mark visual prompting. These advancements enable Ferret-UI 2 to perform complex, user-centered interactions, making it highly versatile and adaptable for the expanding diversity of platform ecosystems. Extensive empirical experiments on referring, grounding, user-centric advanced tasks (comprising 9 subtasks $\times$ 5 platforms), GUIDE next-action prediction dataset, and GUI-World multi-platform benchmark demonstrate that Ferret-UI 2 significantly outperforms Ferret-UI, and also shows strong cross-platform transfer capabilities.


Summary

  • The paper demonstrates a novel multimodal model that achieves universal UI understanding across various platforms through adaptive scaling and efficient grid optimization.
  • The paper introduces enhanced dataset construction using human-collected annotations and GPT-4o synthesized visual prompts to improve training quality.
  • The paper reports high benchmark scores in elementary and multi-round tasks, showcasing robust performance and zero-shot cross-platform transferability.

Universal User Interface Understanding with Ferret-UI 2

Introduction to Ferret-UI 2

The study introduces Ferret-UI 2, a multimodal LLM (MLLM) designed for universal user interface (UI) understanding across multiple platforms, namely iPhone, Android, iPad, Webpage, and AppleTV. It builds on the foundation laid by Ferret-UI and introduces critical enhancements to overcome foundational challenges such as platform diversity, resolution variation, and data limitations. Ferret-UI 2 supports seamless UI interaction through single-step exchanges and improves high-resolution perception using adaptive scaling. The model also benefits from advanced training-data creation powered by GPT-4o with set-of-mark visual prompting (Figure 1).

Figure 1: Real examples of a single Ferret-UI 2 model interacting with four different platforms (iPhone, iPad, Webpage, and AppleTV) for UI understanding.

Dataset Construction and Task Generation

A significant advancement in Ferret-UI 2 lies in dataset construction (Figure 2). Central to this improvement is the use of human-collected annotations or HTML-parsed bounding boxes, rather than model-detected ones, which improves data quality. The "Core-set" dataset, drawn from diverse platform types, is filtered to ensure high annotation quality, excluding non-ASCII text and malformed bounding boxes. Task data generation covers elementary tasks (including referring and grounding) and advanced tasks synthesized by GPT-4o with visual prompting, supporting comprehensive UI understanding and user-centered interaction.

Figure 2: Illustration of the Core-set data generation pipeline.
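
The filtering step described above is straightforward to picture in code. The sketch below is a minimal, hypothetical illustration of that kind of annotation cleaning; the record fields and helper names are assumptions for illustration, not the paper's actual pipeline.

```python
# Hypothetical sketch of Core-set annotation filtering: drop non-ASCII labels
# and malformed bounding boxes. Field names are illustrative assumptions.
from typing import List, TypedDict


class Annotation(TypedDict):
    label: str           # UI widget text or type
    bbox: List[float]    # [x1, y1, x2, y2] in pixels


def is_ascii(text: str) -> bool:
    """Keep only annotations whose label is plain ASCII text."""
    return all(ord(ch) < 128 for ch in text)


def is_valid_bbox(bbox: List[float], width: int, height: int) -> bool:
    """Reject malformed boxes: wrong length, inverted corners, or out of bounds."""
    if len(bbox) != 4:
        return False
    x1, y1, x2, y2 = bbox
    return 0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height


def filter_annotations(anns: List[Annotation], width: int, height: int) -> List[Annotation]:
    """Filter one screenshot's annotations down to clean, well-formed entries."""
    return [a for a in anns
            if is_ascii(a["label"]) and is_valid_bbox(a["bbox"], width, height)]
```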

Advanced task data is generated by prompting GPT-4o with set-of-mark visual prompts: specific UI widgets are marked with minimalistic, corner-style bounding boxes paired with unique tags, which makes it easier for GPT-4o to reason about individual widgets and their interactions (Figure 3).

Figure 3: Example of set-of-mark visual prompting with generated advanced task training example.
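
To make the "corner-style bounding boxes with unique tags" idea concrete, here is a minimal sketch of how such marks could be drawn on a screenshot before sending it to GPT-4o. The styling choices (corner length, color, tag placement) are assumptions for illustration, not the paper's exact rendering.

```python
# Illustrative set-of-mark rendering: draw only the four corner ticks of each
# widget's box plus a numeric tag, instead of a full rectangle.
from PIL import Image, ImageDraw


def draw_corner_marks(image: Image.Image, boxes, corner: int = 12, color: str = "red"):
    """Mark each (x1, y1, x2, y2) box with corner ticks and an index tag."""
    draw = ImageDraw.Draw(image)
    for idx, (x1, y1, x2, y2) in enumerate(boxes, start=1):
        # Each corner gets a short horizontal and vertical tick pointing inward.
        for cx, cy, dx, dy in [(x1, y1, 1, 1), (x2, y1, -1, 1),
                               (x1, y2, 1, -1), (x2, y2, -1, -1)]:
            draw.line([(cx, cy), (cx + dx * corner, cy)], fill=color, width=3)
            draw.line([(cx, cy), (cx, cy + dy * corner)], fill=color, width=3)
        draw.text((x1 + 2, y1 + 2), str(idx), fill=color)  # unique tag near the box
    return image


# Usage sketch: the marked screenshot plus widget metadata is then passed to
# GPT-4o to synthesize advanced-task QA that refers to widgets by their tags.
# marked = draw_corner_marks(Image.open("screenshot.png"), widget_boxes)
```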

Model Architecture

Ferret-UI 2 builds on Ferret-UI with an updated architecture (Figure 4) featuring any-resolution capabilities for dynamic high-resolution encoding. Adaptive N-gridding improves local image feature extraction by minimizing aspect-ratio distortion. By choosing grid sizes appropriately, Ferret-UI 2 processes visual data efficiently while adhering to predefined inference-cost limits, preserving high-resolution support for interactions.

Figure 4: Overview of the Ferret-UI 2 model architecture, which allows for seamless UI understanding and user-centered single-step interactions with high-resolution support.

Adaptive N-gridding is a vital component: an algorithm selects the grid size for each screenshot by assessing how much the tiling would distort the image's resolution and aspect ratio.
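
The following is a minimal sketch, under stated assumptions, of such a grid-selection step: choose a (cols, rows) tiling whose aspect ratio best matches the screenshot, breaking ties toward more tiles, while keeping the tile count under a budget that stands in for the inference-cost limit. The scoring rule and defaults are illustrative assumptions, not the paper's exact algorithm.

```python
import math


def select_grid(img_w: int, img_h: int, max_tiles: int = 9):
    """Pick a (cols, rows) tiling whose aspect ratio best matches the image.

    Ties are broken toward more tiles (higher effective resolution), while
    cols * rows stays within the max_tiles inference-cost budget.
    """
    target = img_w / img_h
    best, best_dist = (1, 1), float("inf")
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            dist = abs(math.log(target * rows / cols))  # aspect-ratio distortion
            better = dist < best_dist - 1e-9
            same_but_bigger = abs(dist - best_dist) < 1e-9 and cols * rows > best[0] * best[1]
            if better or same_but_bigger:
                best, best_dist = (cols, rows), dist
    return best


# Example: a tall 1170 x 2532 iPhone screenshot yields (2, 4) under these
# defaults, i.e. the image is split into eight local tiles, each encoded at
# the base resolution of the image encoder.
```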

Experimental Evaluation

The Ferret-UI 2 model demonstrates superior performance across various benchmarks, indicating strong capability for UI understanding and interaction (Figure 5). Notably, it excels both in elementary tasks and in advanced multi-round perception and interaction QA tasks, and it transfers effectively across platforms, showcasing its adaptability and robustness.

Figure 5: Examples of visual prompting using GPT-4o to generate task data for Multi-Round Perception QA and Multi-Round Interaction QA.

Evaluation results show that Ferret-UI 2 achieves high scores in GPT-4o-based evaluations, outperforming existing models and confirming its efficacy in diverse UI environments.
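
For readers unfamiliar with GPT-4o-based scoring, the sketch below illustrates a generic LLM-as-judge setup of the kind implied here. The rubric prompt and 0-100 scale are hypothetical assumptions, not the paper's exact evaluation protocol.

```python
# Hypothetical GPT-4o-as-judge scoring of one advanced-task response.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = (
    "You are grading a UI-assistant answer. Given the question, the reference "
    "answer, and the model answer, reply with a single integer score from 0 to 100."
)


def judge(question: str, reference: str, prediction: str) -> int:
    """Ask GPT-4o to score one prediction against the reference answer."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question: {question}\n"
                                        f"Reference: {reference}\n"
                                        f"Model answer: {prediction}"},
        ],
    )
    return int(response.choices[0].message.content.strip())
```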

Cross-Platform Transferability

Ferret-UI 2's architecture allows for remarkable zero-shot cross-platform transferability, effectively learning from diverse content types and adapting to multiple platforms. This is evidenced by its consistent performance across various UI configurations, suggesting strong transfer learning capabilities that leverage similarities in data distribution and resolution among platforms.

Conclusions

Ferret-UI 2 establishes a robust framework for universal UI understanding across a wide range of platforms. Its advancements in dataset construction, adaptive scaling, and multimodal interaction provide significant improvements over prior models. The system's strong performance, coupled with its cross-platform capabilities, points to promising directions for building generalist agents for universal UI navigation and interaction.
