FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games

Published 1 Sep 2025 in cs.AI, cs.CL, and cs.CV | (2509.01052v1)

Abstract: GUI agents powered by LLMs show promise in interacting with diverse digital environments. Among these, video games offer a valuable testbed due to their varied interfaces, with adventure games posing additional challenges through complex, narrative-driven interactions. Existing game benchmarks, however, lack diversity and rarely evaluate agents on completing entire storylines. To address this, we introduce FlashAdventure, a benchmark of 34 Flash-based adventure games designed to test full story arc completion and tackle the observation-behavior gap: the challenge of remembering and acting on earlier gameplay information. We also propose CUA-as-a-Judge, an automated gameplay evaluator, and COAST, an agentic framework leveraging long-term clue memory to better plan and solve sequential tasks. Experiments show current GUI agents struggle with full story arcs, while COAST improves milestone completion by bridging the observation-behavior gap. Nonetheless, a marked discrepancy between humans and best-performing agents warrants continued research efforts to narrow this divide.

Summary

The paper introduces AdventureBench (presented as FlashAdventure in the paper's title and abstract), a benchmark of 34 Flash-based adventure games designed to evaluate GUI agents on solving complete narrative story arcs across diverse adventure game subgenres.
It presents the COAST framework that decomposes gameplay into a Seek-Map-Solve cycle and employs persistent clue memory to bridge long-term observation-behavior gaps.
Empirical results reveal that even state-of-the-art models struggle with planning, perception, and lateral reasoning, highlighting a significant performance gap with humans.

AdventureBench: A Comprehensive Benchmark for GUI Agents in Full-Arc Adventure Games

Motivation and Benchmark Design

The paper introduces AdventureBench, a large-scale benchmark specifically designed to evaluate the capabilities of GUI agents—particularly those powered by LLMs and VLMs—in solving full story arcs in classic adventure games. Unlike prior benchmarks, which are limited in both diversity and narrative scope, AdventureBench comprises 34 Flash-based adventure games spanning multiple subgenres (mystery/detective, hidden object, room escape, visual novel, simulation). The benchmark is constructed to stress-test agents' abilities in long-horizon planning, memory, and reasoning, with a particular focus on the observation-behavior gap: the challenge of acting on information observed many steps earlier.

Figure 1: AdventureBench consists of 34 Flash-based classic adventure games and supports automatic evaluation of the GUI agent using CUA-as-a-Judge.

The selection of Flash games is motivated by their compact, self-contained story arcs (typically solvable by humans in under two hours), which makes them tractable for both human and agent evaluation. The games are chosen to maximize diversity in both gameplay mechanics and cognitive demands, as evidenced by the inter- and intra-subgenre analyses.

The Observation-Behavior Gap

A central challenge highlighted by AdventureBench is the observation-behavior gap: the temporal lag between when an agent observes a clue and when it must act on it, often hundreds of steps later. Human gameplay data collected for the benchmark demonstrates that players routinely bridge gaps of 100–400+ steps between observation and action, especially in mystery and room escape games.

Figure 2: Comparison of gameplay progression across (a) VisEscape, (b) Cradle, and (c) AdventureBench. Prior benchmarks focus on short-term objectives or include short story arcs, limiting their ability to fully evaluate agents’ capacity to manage the long-term observation-behavior gap. In contrast, AdventureBench emphasizes completion of full story arcs involving long-term objectives, exemplified by suspect interrogations leading to a verdict.

Figure 3: Examples illustrating the long-term observation-behavior gap from human player demonstrations in Sherlock Holmes: The Tea Shop Murder Mystery. (1-a) The player acquired a newspaper after 118 steps. (1-b) The player gave the newspaper to a man and received a voucher after 217 steps, resulting in a step gap of 99 between acquiring the newspaper and receiving the voucher. (2-a) The player acquired a duster after 43 steps. (2-b) The player gave the duster to a woman, unlocking new suspects after 464 steps, with a step gap of 421 between acquiring the duster and unlocking new suspects.

This property is largely absent from prior benchmarks, which either focus on short-term tasks or artificially constructed environments with minimal narrative dependencies.
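The step gaps quoted in the Figure 3 caption follow from simple subtraction of the step at which a clue was observed from the step at which it was used. A minimal sketch of that bookkeeping is below; the `ClueUse` class and its field names are illustrative assumptions, not part of the benchmark's tooling, while the step counts come from the caption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClueUse:
    """One clue, with the step it was observed and the step it was acted on.

    Hypothetical helper type for illustration only.
    """
    clue: str
    observed_step: int
    acted_step: int

    @property
    def gap(self) -> int:
        # Observation-behavior gap, measured in player steps.
        return self.acted_step - self.observed_step


# Step counts taken from the Figure 3 caption
# (Sherlock Holmes: The Tea Shop Murder Mystery).
figure3_examples = [
    ClueUse("newspaper", observed_step=118, acted_step=217),
    ClueUse("duster", observed_step=43, acted_step=464),
]

gaps = {c.clue: c.gap for c in figure3_examples}
print(gaps)  # {'newspaper': 99, 'duster': 421}
```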

Automated Evaluation: CUA-as-a-Judge

Manual evaluation of agent progress in complex games is a major bottleneck. The paper introduces CUA-as-a-Judge, an automated evaluation agent built on top of Claude-3.7-Sonnet's computer-use capabilities. This agent interacts with the game environment post-episode, verifying milestone completion by executing GUI actions (e.g., opening inventory, checking scores) and comparing observed states to predefined success criteria.

Figure 4: CUA-as-a-Judge verifies game progress by interacting with the environment. Left: (1-a) The judge clicks the "Stats" button in Pico Sim Date to (1-b) verify the character's affection score. Right: (2-a) The judge clicks the "Notebook" item in Sherlock Holmes: The Tea Shop Murder Mystery to (2-b) count the number of suspects (5) identified by the player, confirming milestone completion.

Empirical validation shows that CUA-as-a-Judge achieves 94% agreement with human annotators, with near-perfect correlation on milestone completion rates, enabling scalable and reproducible evaluation.
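As a rough illustration of what an agreement figure like this measures, the sketch below computes the fraction of milestone verdicts on which an automated judge and a human annotator coincide. The function name and the label lists are hypothetical and are not data from the paper; the paper's exact agreement protocol may differ.

```python
def agreement_rate(judge: list[bool], human: list[bool]) -> float:
    """Fraction of milestones on which judge and human give the same verdict."""
    if len(judge) != len(human):
        raise ValueError("label lists must have the same length")
    if not judge:
        raise ValueError("need at least one label")
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(judge)


# Hypothetical verdicts for five milestones (not data from the paper).
judge_labels = [True, True, False, True, False]
human_labels = [True, True, False, False, False]

print(agreement_rate(judge_labels, human_labels))  # 0.8
```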

The COAST Framework: Bridging Long-Term Dependencies

To address the observation-behavior gap, the authors propose COAST (Clue-Oriented Agent for Sequential Tasks), a modular agentic framework that explicitly manages long-term clue memory and decomposes gameplay into a Seek-Map-Solve cycle:

Figure 5: Overview of COAST Framework with Seek-Map-Solve Cycle.

Clue Seeker: Proactively explores the environment to collect clues, storing them in a persistent memory buffer.
Clue Mapper: Periodically analyzes the memory and trajectory to generate plausible subtask hypotheses by matching clues to past observations, leveraging abductive and lateral reasoning.
Problem Solver: Executes subtasks derived from clue-observation mappings, updating the resolved goal set and memory.

This architecture is instantiated using Claude-3.7-Sonnet Computer-Use for perception and action, with all modules communicating via structured prompts and JSON-based memory representations. The clue memory is left unbounded, since in practice its total token count stays well within modern LLM context limits.
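The control flow of the Seek-Map-Solve cycle can be sketched in a few lines. This is a toy under stated assumptions, not the authors' implementation: in COAST each module is driven by an LLM through structured prompts, whereas here a lookup table (`uses`) stands in for the Mapper's reasoning, and the event list, the `MAP_EVERY` interval, and all class and function names are hypothetical.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Clue:
    name: str  # e.g. an item the agent picked up
    step: int  # step at which it was observed


@dataclass
class AgentState:
    clues: list[Clue] = field(default_factory=list)        # persistent clue memory
    observations: list[str] = field(default_factory=list)  # obstacles seen so far
    resolved: set[str] = field(default_factory=set)        # obstacles already solved


def seek(state: AgentState, step: int, event: tuple[str, str]) -> None:
    """Clue Seeker: record clues and obstacles encountered while exploring."""
    kind, name = event
    if kind == "clue":
        state.clues.append(Clue(name, step))
    elif kind == "obstacle":
        state.observations.append(name)


def map_clues(state: AgentState, uses: dict[str, str]) -> list[tuple[str, str]]:
    """Clue Mapper: propose (clue, obstacle) subtasks for unresolved obstacles.

    The `uses` table is a stand-in for LLM-based clue-observation matching.
    """
    return [
        (c.name, o)
        for c in state.clues
        for o in state.observations
        if uses.get(c.name) == o and o not in state.resolved
    ]


def solve(state: AgentState, subtasks: list[tuple[str, str]]) -> None:
    """Problem Solver: 'execute' each subtask and mark its obstacle resolved."""
    for _clue, obstacle in subtasks:
        state.resolved.add(obstacle)


# Hypothetical toy episode: one event per step.
events = [("clue", "key"), ("obstacle", "locked door"),
          ("clue", "note"), ("obstacle", "safe")]
uses = {"key": "locked door", "note": "safe"}
MAP_EVERY = 2  # run the Mapper and Solver every two steps (assumed interval)

state = AgentState()
for step, event in enumerate(events, start=1):
    seek(state, step, event)
    if step % MAP_EVERY == 0:
        solve(state, map_clues(state, uses))

# The clue memory serializes to JSON, mirroring the JSON-based memory idea.
memory_json = json.dumps([asdict(c) for c in state.clues])
print(sorted(state.resolved))  # ['locked door', 'safe']
print(memory_json)  # [{"name": "key", "step": 1}, {"name": "note", "step": 3}]
```

Because clues are never evicted from `state.clues`, a clue picked up early remains available to the Mapper no matter how many steps pass before the matching obstacle appears, which is the property the cycle relies on.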

Experimental Results and Analysis

Baseline Performance

A comprehensive evaluation is conducted across seven agent configurations, including proprietary end-to-end agents (Claude-3.7-Sonnet Computer-Use, OpenAI CUA), open-source models (UI-TARS-1.5-7B), and modular frameworks (Cradle, Agent S2) with various VLM backbones and GUI grounding modules.

Key findings:

All agents, including state-of-the-art proprietary models, exhibit near-zero success rates on full story arc completion, with milestone completion rates far below human baselines.
The only exceptions are two hidden object games, which are solved by some agents via visual pattern matching.
Human players achieve a 97% success rate and 100% milestone completion with an average of 1,142 steps per game.

Figure 6: Comparison of average milestone completion rates (MCR) across different game subgenres for three GUI agents.
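The two headline metrics can be illustrated with a short sketch. It assumes that a game's milestone completion rate is the fraction of its predefined milestones reached, and, as a simplification, that a game counts as a success when all of its milestones are reached; the paper's exact success criterion may differ. The per-game numbers are made up for illustration.

```python
def milestone_completion_rate(completed: int, total: int) -> float:
    """Fraction of a game's milestones that were reached."""
    if total <= 0:
        raise ValueError("a game needs at least one milestone")
    return completed / total


def summarize(results: list[tuple[int, int]]) -> dict[str, float]:
    """Aggregate per-game (completed, total) milestone counts.

    Assumption: a game is a 'success' when every milestone is completed.
    """
    mcrs = [milestone_completion_rate(c, t) for c, t in results]
    successes = [c == t for c, t in results]
    return {
        "mean_mcr": sum(mcrs) / len(mcrs),
        "success_rate": sum(successes) / len(successes),
    }


# Hypothetical results for four games (not data from the paper).
results = [(4, 4), (1, 4), (0, 5), (2, 4)]
summary = summarize(results)
print(summary)  # {'mean_mcr': 0.4375, 'success_rate': 0.25}
```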

Failure Modes

Three primary failure patterns are identified:

Weak Planning: Agents repeat actions, revisit locations, and fail to leverage past clues for future planning.
Poor Visual Perception: Agents misinterpret non-standard layouts, leading to missed interactions even in visually simple games.
Deficient Lateral Thinking: Agents struggle with creative, non-obvious subtask generation, often failing to connect clues to their eventual use.

COAST Improvements

COAST demonstrates measurable improvements:

Milestone completion rate increases by up to 2.78 percentage points over the best baseline.
Success rate increases by 5.88 percentage points, with the largest gains in subgenres with substantial observation-behavior gaps (mystery/detective, room escape).
Ablation studies confirm that all three modules (Seeker, Mapper, Solver) are necessary for optimal performance; removing the Mapper, in particular, degrades performance in clue-rich games.

However, COAST does not consistently outperform baselines in visual novel subgenres, where the observation-behavior gap is minimal and resource management dominates.
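To put the success-rate gain in concrete terms, the arithmetic below converts percentage points into a number of games. It assumes the success rate is the fraction of the 34 benchmark games completed; if the reported figure is averaged over multiple runs, the per-game reading would differ.

```python
NUM_GAMES = 34  # benchmark size stated in the summary


def games_for_gain(pp_gain: float, num_games: int) -> float:
    """Convert a gain in percentage points into an equivalent number of games."""
    return pp_gain / 100 * num_games


extra_games = games_for_gain(5.88, NUM_GAMES)
print(round(extra_games, 2))          # 2.0
print(round(2 / NUM_GAMES * 100, 2))  # 5.88
```

Under that assumption, a gain of 5.88 percentage points is consistent with two additional games completed out of 34.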

Failure Analysis

Manual inspection reveals that COAST mitigates planning and lateral thinking failures but does not address perception errors (inherited from the backbone VLM) or resource management issues in simulation-heavy games.

Practical and Theoretical Implications

Practical Implications:

Benchmarking: AdventureBench provides a scalable, diverse, and challenging testbed for GUI agents, with automated evaluation and a focus on long-horizon, narrative-driven tasks.
Agent Design: The COAST framework demonstrates the necessity of explicit long-term memory and structured subtask planning for progress in complex environments.
Evaluation: CUA-as-a-Judge enables reproducible, fine-grained assessment of agent progress, reducing reliance on costly human annotation.

Theoretical Implications:

The persistent gap between human and agent performance, even with advanced LLMs, highlights fundamental limitations in current architectures' ability to bridge long-term dependencies and perform abductive/lateral reasoning.
The knowing-doing gap is empirically demonstrated: even when LLMs may already hold knowledge about a game from pretraining (data contamination), they fail to translate that knowledge into effective action sequences.

Limitations and Future Directions

Manual Milestone Definition: While the process is lightweight, it is not fully scalable or genre-agnostic. Future work could automate milestone extraction via narrative flow analysis or structured story representations.
Genre Coverage: The benchmark is not directly applicable to fast-paced, reflex-oriented games (e.g., platformers), as CUA-as-a-Judge relies on discrete, verifiable milestones.
Memory Scalability: COAST currently stores all clues; as action horizons increase, memory management strategies (summarization, retrieval, forgetting) will become necessary.
API Cost: Large-scale evaluation with proprietary models is expensive, underscoring the need for efficient open-source agents.

Conclusion

AdventureBench establishes a new standard for evaluating GUI agents in complex, narrative-driven environments, exposing critical deficiencies in current LLM-based agents' planning, perception, and reasoning capabilities. The COAST framework offers a principled approach to bridging long-term dependencies, but a substantial gap to human-level performance remains. The benchmark, evaluation tools, and agentic insights provided by this work will inform the next generation of research on generalist, memory-augmented, and reasoning-capable agents.
