MemGUI-Bench: Memory-Centric GUI Benchmark
- MemGUI-Bench is a memory-centric benchmark that systematically evaluates memory retention and recall in mobile GUI agents using a structured taxonomy.
- It features a suite of 128 tasks with mirror-task pairs that assess both short- and long-term memory across dynamic mobile application scenarios.
- The framework employs innovative pass@k protocols and hierarchical metrics to reveal memory deficits and guide architectural improvements.
MemGUI-Bench is a memory-centric benchmark, task suite, and evaluation framework designed for rigorous assessment of autonomous mobile GUI agents in dynamic environments. It was developed in response to critical gaps identified in prevailing benchmarks, including low prevalence of memory-demanding tasks, insufficient support for multi-attempt learning protocols, and limited scalability of evaluation methodologies. MemGUI-Bench introduces a systematic memory taxonomy, a suite of 128 tasks emphasizing memory retention and recall, a comprehensive LLM-powered evaluation pipeline, and a set of hierarchical metrics for short-term fidelity, long-term learning, and efficiency. Experimental results across 11 state-of-the-art agents reveal significant memory deficits and inform actionable architectural recommendations for advancing memory competence in mobile GUI automation (Liu et al., 3 Feb 2026).
1. Memory Taxonomy and Agent Architectures
MemGUI-Bench defines memory in the mobile GUI context as the ability to retain, process, and utilize contextual and experiential information to enhance task performance over time. The taxonomy distinguishes short-term (in-session) and long-term (cross-session) memory:
- Short-Term Memory concerns temporary retention of intermediate results, UI states, or multi-step data (e.g., a verification code or product price). Architectures are classified as:
- Memory Agent: Deploys dedicated summarizer modules (e.g., T3A, M3A, Agent-S2).
- Action-Thought Pattern: Combines action and reasoning outputs (e.g., AppAgent, GUI-Owl).
- Multi-Turn Context: Concatenates dialogue history (e.g., UI-TARS).
- Rule-Based Aggregation: Employs hard-coded context rules (e.g., SeeAct).
- No Historical Context: Stateless decision-making (e.g., CogAgent).
- Long-Term Memory is further divided into:
- Success-Based Learning: Extracts reusable shortcuts from successful attempts (e.g., Mobile-Agent-E).
- Failure-Based Learning: Analyzes missteps for improved future avoidance (e.g., Agent-S2).
This taxonomy enables systematic agent characterization and supports targeted evaluation of memory mechanisms unique to mobile UI automation.
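The taxonomy above can be captured as a small data model for characterizing agents. This is an illustrative sketch only; the class and enum names are not part of the benchmark's API:

```python
from dataclasses import dataclass
from enum import Enum, auto


class ShortTermDesign(Enum):
    MEMORY_AGENT = auto()        # dedicated summarizer module (T3A, M3A, Agent-S2)
    ACTION_THOUGHT = auto()      # combined action + reasoning outputs (AppAgent, GUI-Owl)
    MULTI_TURN_CONTEXT = auto()  # concatenated dialogue history (UI-TARS)
    RULE_BASED = auto()          # hard-coded context rules (SeeAct)
    STATELESS = auto()           # no historical context (CogAgent)


class LongTermLearning(Enum):
    SUCCESS_BASED = auto()  # reusable shortcuts from successful attempts (Mobile-Agent-E)
    FAILURE_BASED = auto()  # misstep analysis for future avoidance (Agent-S2)


@dataclass(frozen=True)
class AgentMemoryProfile:
    """One agent's position in the memory taxonomy."""
    name: str
    short_term: ShortTermDesign
    long_term: frozenset  # LongTermLearning mechanisms; empty if none


agent_s2 = AgentMemoryProfile(
    "Agent-S2", ShortTermDesign.MEMORY_AGENT,
    frozenset({LongTermLearning.FAILURE_BASED}),
)
```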
2. Task Suite Composition and Memory Challenge Design
MemGUI-Bench comprises 128 rigorously curated tasks distributed across 26 real-world mobile applications. A total of 89.8% of tasks explicitly stress memory via cross-temporal (within-session) and cross-spatial (cross-app) requirements. The task suite is organized into 64 mirror-task pairs to enable cross-session learning assessment. Representative characteristics include:
- Average trajectory length: 36.2 golden steps (range: 3–160).
- Difficulty stratification: 37.5% easy (≤ 20 steps), 32.8% medium (21–40), 29.7% hard (> 40).
- Cross-app complexity: 28 single-app, 56 two-app, 34 three-app, 10 four-app tasks.
- Memory challenge stratification: 115 tasks are memory-intensive, with 13 standard baseline tasks utilized for computing the Memory-Task Proficiency Ratio (MTPR).
- Example: "Find and compare three webcams on Amazon, compute value scores in Joplin" paired with a structurally analogous task with different entities.
The structure and diversity of the task suite systematically probe both short- and long-term memory, distinguishing MemGUI-Bench from earlier benchmarks in mobile GUI automation.
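The stratification thresholds and the memory-intensive share follow directly from the counts above; a minimal sketch (function name is illustrative):

```python
def difficulty(golden_steps: int) -> str:
    """Stratify a task by golden-trajectory length, using the
    MemGUI-Bench thresholds: <=20 easy, 21-40 medium, >40 hard."""
    if golden_steps <= 20:
        return "easy"
    if golden_steps <= 40:
        return "medium"
    return "hard"


# 115 memory-intensive tasks out of 128 gives the quoted 89.8% share,
# and the 64 mirror pairs account for the full suite.
memory_share = round(115 / 128 * 100, 1)  # 89.8
mirror_pairs = 128 // 2                   # 64
```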
3. Pass@k Evaluation Protocol and Hierarchical Metrics
MemGUI-Bench introduces a pass@k evaluation protocol to quantify both immediate and longitudinal learning. The protocol includes:
- Single-Attempt Success Rate (SR): SR = N_success / N, where N_success is the number of tasks solved on the first attempt and N is the total number of tasks.
- Multi-Attempt Success Rate (pass@k SR): SR@k = N_k / N,
where N_k is the number of tasks solved within k attempts.
- Failure Recovery Rate (FRR): FRR = N_recovered / N_fail,
where N_fail is the number of tasks failed on the first attempt and N_recovered denotes those that first succeed on a later attempt j ≤ k.
Seven hierarchical metrics are used, spanning three principal dimensions:
- Short-Term Fidelity: SR, Information Retention Rate (IRR), MTPR.
- IRR for task i: IRR_i = c_i / u_i, where c_i is the count of correctly recalled information units and u_i the total units.
- Average IRR: IRR = (1/N) Σ_i IRR_i over all N evaluated tasks.
- MTPR comparing memory vs. baseline proficiency: MTPR = SR_memory / SR_baseline.
- Long-Term Learning: pass@k SR, FRR.
- Execution Efficiency: Average Step Ratio (agent steps relative to golden steps), Average Time per Step, Average Cost per Step.
This multi-faceted metric schema enables comparison of short-term recall, longitudinal improvement, and computational efficiency.
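The pass@k and retention metrics above reduce to simple aggregations over per-task attempt logs. A sketch under the assumption that each task's outcomes are recorded as a list of booleans, one per attempt (the log format is hypothetical):

```python
from typing import List


def sr_at_k(attempts: List[List[bool]], k: int) -> float:
    """pass@k success rate: share of tasks solved within the first k attempts."""
    solved = sum(any(task[:k]) for task in attempts)
    return solved / len(attempts)


def failure_recovery_rate(attempts: List[List[bool]]) -> float:
    """Share of first-attempt failures that succeed on a later attempt."""
    first_failures = [t for t in attempts if not t[0]]
    if not first_failures:
        return 0.0
    recovered = sum(any(t[1:]) for t in first_failures)
    return recovered / len(first_failures)


def avg_irr(recalled: List[int], total: List[int]) -> float:
    """Average Information Retention Rate: mean of c_i / u_i over tasks."""
    return sum(c / u for c, u in zip(recalled, total)) / len(recalled)


# Hypothetical attempt log: each inner list is one task's per-attempt outcomes.
log = [
    [True, True, True],     # solved immediately
    [False, True, True],    # recovered on attempt 2
    [False, False, False],  # never solved
    [False, False, True],   # recovered on attempt 3
]
```

On this log, SR@1 is 0.25, SR@3 is 0.75, and FRR is 2/3, since two of the three first-attempt failures eventually recover.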
4. MemGUI-Eval: Progressive Scrutiny Evaluation Pipeline
To address scalability and reliability limitations in prior evaluation methods, MemGUI-Eval deploys a three-stage, LLM-driven "Progressive Scrutiny" pipeline featuring four specialized roles (Triage Judge, Step Descriptor, Semantic Judge, and Visual Judge) plus an IRR Analyzer:
- Triage Judge: Screens limited artifacts (task, logs, three screenshots), deterministically issuing "Success" only on irrefutable evidence, else "Uncertain."
- Semantic Judge: Uses fine-grained, Step Descriptor-generated descriptions of the action and UI for each before/after screenshot pair. Conducts exhaustive requirement checks against a synthesis of the final state; ambiguities trigger explicit information requests or intermediate IRR analysis.
- Visual Judge: Engages only with minimal, query-driven historical screenshots. Delivers final binary decision; triggers IRR Analyzer for memory-specific failures.
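The control flow of the cascade can be sketched as follows; the judge callables stand in for LLM calls, and the verdict strings and signatures are illustrative assumptions, not the pipeline's actual interface:

```python
from typing import Callable


def progressive_scrutiny(trajectory,
                         triage: Callable,
                         semantic: Callable,
                         visual: Callable) -> bool:
    """Three-stage cascade: cheap judges decide early; only ambiguous
    cases escalate to the more expensive visual stage."""
    # Stage 1: triage on limited artifacts; "Success" only on
    # irrefutable evidence, otherwise "Uncertain".
    if triage(trajectory) == "Success":
        return True
    # Stage 2: semantic checks over step descriptions; may still defer.
    verdict = semantic(trajectory)
    if verdict in ("Success", "Failure"):
        return verdict == "Success"
    # Stage 3: visual judge delivers the final binary decision.
    return visual(trajectory) == "Success"


# Stub judges standing in for LLM calls (illustrative only):
easy_pass = progressive_scrutiny("traj", lambda t: "Success",
                                 lambda t: "Uncertain", lambda t: "Failure")
hard_case = progressive_scrutiny("traj", lambda t: "Uncertain",
                                 lambda t: "Uncertain", lambda t: "Success")
```

The design choice is the usual cost/accuracy cascade: most trajectories terminate at a cheap stage, which is what keeps the reported per-trajectory cost in the cents range.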
This progressive system achieves F1 ≈ 95–99% on both SPA-Bench and MemGUI-Bench at modest cost ($0.03–$0.07 per trajectory).
5. Experimental Findings and Agent Benchmarking
Eleven state-of-the-art agents were benchmarked, including explicit long-term memory architectures (Agent-S2, Mobile-Agent-E), framework-based systems (T3A, M3A, Mobile-Agent-V2, SeeAct, AppAgent), and end-to-end models (UI-Venus, UI-TARS, GUI-Owl, CogAgent). Key results:
- Single-Attempt Success Rate (SR@1) on memory tasks: 0–41.7%.
- Multi-Attempt Success Rate (SR@3): 0–49.2%.
- Framework vs. Model divide: Agent-workflow systems achieve 22.7–32.8% SR@1; end-to-end models only 0–6.2%.
- Benchmark-induced overestimation: Performance of agents such as Agent-S2 drops from 54.3% (AndroidWorld) to 27.3% (MemGUI-Bench), yielding a 4–10× gap as measured by MTPR (top agent MTPR ≈ 0.45; most < 0.1).
- Complexity penalty: SR drops by 16–40 percentage points from one-app to four-app tasks.
- Long-context advantage: Activating multi-turn context in M3A increases SR by 18.8 percentage points (32.8% to 51.6%).
- Long-term memory gain: Agent-S2 improves by 21.9 percentage points (27.3% → 49.2% SR@3), FRR = 21.5%.
- Efficiency trade-off: High-token agents (Agent-S2: 41,760 tokens/step) collapse under tight token budgets (0% performance), while M3A (12,960 tokens/step) degrades gracefully (47.7% → 21.9% SR@3).
These outcomes expose structural weaknesses in prevailing agent approaches, particularly in scaling memory and coping with multi-app task sequences.
6. Failure Modes and Diagnostic Taxonomy
Among 343 non-timeout failures, five primary memory-related failure modes account for 58.9% of cases on average:
- Partial Memory Hallucination (PMH): Recall of only part of the required items.
- Process Memory Hallucination (ProcMH): Loss of multi-step workflow context; premature conclusion.
- Output Memory Hallucination (OMH): Correct navigation but incomplete or wrong transcription.
- Knowledge Deficiency (KD): Fundamental misunderstanding outside memory scope (e.g., wrong app identification).
- Intent Misunderstanding (IM): Substep correctness but misaligned with user goals.
Representative failures include UI-TARS’s inefficient deletions (timeout), PMH on stock price recall, ProcMH on incomplete process, OMH in permission lists (M3A), and CogAgent’s lack of "wait" action.
7. Architectural Implications and Open Research Directions
Empirical findings support five actionable architectural principles for memory-proficient GUI agents:
- Multi-Granularity Memory Buffers: Dedicated slots for numerical, UI, and textual elements to prevent partial hallucinations.
- Hierarchical Task Decomposition: Persistent high-level and dynamic subgoals to address process drift.
- Strategic Long-Context Utilization: Selective context compression, fully leveraging large token windows (100K–1M tokens).
- Explicit Long-Term Memory Modules: Structured cross-session learning from both successes and failures.
- Hybrid Framework-Model Architectures: Combinations of lightweight end-to-end models for routine actions and framework-level memory controls for memory-intensive segments, optimizing deployment cost and capability.
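The first principle, multi-granularity memory buffers, can be sketched as a store with separate slots per modality, so that a numeric fact is never overwritten by a UI or text summary. All names here are hypothetical illustrations of the principle, not an implementation from the paper:

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class MemoryBuffer:
    """Illustrative multi-granularity short-term store: dedicated slots
    for numeric, UI-state, and free-text items."""
    numeric: Dict[str, float] = field(default_factory=dict)  # prices, codes, scores
    ui_state: Dict[str, str] = field(default_factory=dict)   # screen/widget snapshots
    text: Dict[str, str] = field(default_factory=dict)       # free-form notes

    def remember_price(self, item: str, price: float) -> None:
        self.numeric[f"price:{item}"] = price

    def recall_price(self, item: str) -> float:
        return self.numeric[f"price:{item}"]


buf = MemoryBuffer()
buf.remember_price("webcam_a", 49.99)
buf.text["goal"] = "compare three webcams, compute value scores in Joplin"
```

Keeping typed slots rather than a single concatenated scratchpad directly targets the PMH and OMH failure modes, where agents recall some but not all items or transcribe them incorrectly.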
MemGUI-Bench—including code, tasks, and evaluation infrastructure—is fully open-sourced and regularly maintained, offering a foundational resource for advancing memory-enabled mobile GUI agents toward human-level proficiency (Liu et al., 3 Feb 2026).