MobA: Multifaceted Memory-Enhanced Adaptive Planning for Efficient Mobile Task Automation

Published 17 Oct 2024 in cs.MA, cs.AI, cs.CL, and cs.HC | (2410.13757v3)

Abstract: Existing Multimodal LLM (MLLM)-based agents face significant challenges in handling complex GUI (Graphical User Interface) interactions on devices. These challenges arise from the dynamic and structured nature of GUI environments, which integrate text, images, and spatial relationships, as well as the variability in action spaces across different pages and tasks. To address these limitations, we propose MobA, a novel MLLM-based mobile assistant system. MobA introduces an adaptive planning module that incorporates a reflection mechanism for error recovery and dynamically adjusts plans to align with the real environment contexts and action module's execution capacity. Additionally, a multifaceted memory module provides comprehensive memory support to enhance adaptability and efficiency. We also present MobBench, a dataset designed for complex mobile interactions. Experimental results on MobBench and AndroidArena demonstrate MobA's ability to handle dynamic GUI environments and perform complex mobile tasks.


Summary

The paper introduces a two-level agent architecture that breaks down complex mobile tasks using MLLMs.
MobA employs a Global and Local Agent framework with task decomposition and a double-reflection mechanism to enhance decision-making.
Evaluated on MobBench, MobA achieves a 66.2% milestone score rate, outperforming the baseline mobile automation systems it is compared against.

Overview of "MobA: A Two-Level Agent System for Efficient Mobile Task Automation"

The paper introduces "MobA," an agent system that leverages multimodal LLMs (MLLMs) to enhance mobile task automation. MobA is structured around a two-level architecture comprising a Global Agent (GA) and a Local Agent (LA). This approach addresses the limitations encountered by traditional smart assistants and model-based screen agents, which often falter due to complex interfaces and inadequate decision-making capabilities.

Key Components and Methodology

Two-Level Agent Architecture: The Global Agent interprets commands and plans tasks by breaking them into simpler sub-tasks, whereas the Local Agent executes these sub-tasks via function calls. This division mirrors human cognitive processes: planning is separated from execution, which is intended to improve overall efficiency.
Task Decomposition and Execution: MobA employs a task planning pipeline with three stages: task decomposition, feasibility assessment, and result validation. Tasks are divided into sub-tasks, enabling the agent to handle complex commands through a structured, step-by-step approach. The paper credits this structure with improved task execution efficiency and completion rates.
Memory Module: MobA incorporates a multi-aspect memory system to enhance adaptability and reduce redundancy by learning from historical experiences. This includes not only task execution data but also user preferences and application-specific knowledge, providing a robust foundation for decision-making.
Double-Reflection Mechanism: This mechanism allows MobA to assess task feasibility before execution and evaluate success afterward, preventing ineffective actions and facilitating error correction.
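
The interplay of the four components above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the class names, the dictionary-based planner, and the toy actions are all hypothetical, and a lookup table stands in for the MLLM calls that MobA would make. The sketch shows the Global Agent decomposing a task only when the Local Agent judges it infeasible as a single action (the check before execution), and recording each outcome in memory after the action reports success or failure (the check after execution).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Memory:
    """Hypothetical memory store: remembers whether each (sub-)task succeeded."""
    outcomes: Dict[str, bool] = field(default_factory=dict)

    def record(self, task: str, success: bool) -> bool:
        self.outcomes[task] = success
        return success


class LocalAgent:
    """Executes a sub-task through a function call, if one is registered for it."""

    def __init__(self, actions: Dict[str, Callable[[], bool]]) -> None:
        self.actions = actions

    def is_feasible(self, task: str) -> bool:
        # First reflection (before execution): can this be done as one action?
        return task in self.actions

    def execute(self, task: str) -> bool:
        # Second reflection (after execution): the action reports its own success.
        return self.actions[task]()


class GlobalAgent:
    """Plans by splitting infeasible tasks into sub-tasks and delegating them."""

    def __init__(self, decompose: Callable[[str], List[str]],
                 local: LocalAgent, memory: Memory) -> None:
        self.decompose = decompose  # assumption: stands in for an MLLM planner
        self.local = local
        self.memory = memory

    def run(self, task: str, depth: int = 3) -> bool:
        if self.local.is_feasible(task):
            return self.memory.record(task, self.local.execute(task))
        subtasks = self.decompose(task) if depth > 0 else []
        if not subtasks:
            # Neither executable nor decomposable: give up on this branch.
            return self.memory.record(task, False)
        # all() stops at the first failing sub-task, so later ones are skipped.
        ok = all(self.run(sub, depth - 1) for sub in subtasks)
        return self.memory.record(task, ok)


# Toy usage: every name below is illustrative only.
log: List[str] = []


def make_action(name: str) -> Callable[[], bool]:
    def act() -> bool:
        log.append(name)
        return True
    return act


plans = {"send photo": ["open gallery", "pick photo", "tap share"]}
actions = {name: make_action(name) for name in plans["send photo"]}
agent = GlobalAgent(lambda t: plans.get(t, []), LocalAgent(actions), Memory())
result = agent.run("send photo")
```

Here `result` reflects whether every sub-task succeeded, and `agent.memory.outcomes` holds one entry per attempted task. A fuller version would use that history to skip or reorder steps, which this sketch does not attempt.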

Evaluation and Results

MobA is evaluated on "MobBench," a test set of 50 real-life tasks of varying complexity. It achieved a milestone score rate of 66.2%, which the paper reports as substantially higher than the baseline systems. The authors attribute this result to the two-level agent architecture and the use of MLLM capabilities in task planning and execution.
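
The summary does not spell out how the milestone score rate is computed. One plausible reading, stated here purely as an assumption, is that each task defines a number of milestones and the rate is the share of milestones reached across all tasks. The helper name and the toy numbers below are hypothetical and are not MobBench data.

```python
from typing import Iterable, Tuple


def milestone_score_rate(results: Iterable[Tuple[int, int]]) -> float:
    """Return achieved / available milestones over (achieved, available) pairs.

    Assumed definition for illustration; the paper may define it differently.
    """
    achieved = 0
    total = 0
    for got, available in results:
        if not 0 <= got <= available:
            raise ValueError(f"invalid milestone counts: {got}/{available}")
        achieved += got
        total += available
    if total == 0:
        raise ValueError("no milestones to score")
    return achieved / total


# Toy example with three tasks (made-up counts, not benchmark results).
toy_results = [(3, 4), (2, 2), (1, 4)]
rate = milestone_score_rate(toy_results)
print(f"{rate:.1%}")  # prints 60.0%
```

Under this reading, partially completed tasks still contribute to the score, which is why a milestone-based rate can differ from a plain task completion rate.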

Implications and Future Work

Theoretically, MobA's approach demonstrates how MLLMs can be effectively utilized in mobile automation tasks, providing a framework for intelligent agent systems that combine structured task decomposition with adaptive learning. Practically, MobA represents a significant advancement in mobile assistants, enhancing their ability to manage complex, real-world tasks.

Future developments could aim to optimize task decomposition algorithms, refine memory retrieval strategies, and enhance the system's capability to handle dynamic mobile environments. Furthermore, as MLLMs continue to evolve, their integration into systems like MobA could expand the potential of mobile assistants, providing more seamless and efficient user experiences.

In summary, MobA goes beyond traditional task automation systems by incorporating advanced reasoning, planning, and memory capabilities, setting a new standard for mobile task automation. This aligns with the growing need for more responsive and intelligent systems in mobile technology.