
OSWorld and AndroidWorld Benchmarks

Updated 7 February 2026
  • OSWorld-Verified is a comprehensive desktop benchmark featuring diverse, programmatically verifiable tasks that rigorously test GUI agent capabilities.
  • AndroidWorld is a dynamic mobile benchmark with parameterized tasks in real apps, challenging agents to plan and act under varying workflows.
  • Together, these benchmarks drive advances in generalist computer-use agents by employing scalable evaluation protocols and robust, reproducible validations.

OSWorld-Verified and AndroidWorld are high-fidelity, large-scale benchmarks designed for evaluating autonomous GUI agents across desktop and mobile environments, respectively. These benchmarks provide end-to-end platforms with hundreds of real-world, programmatically verifiable tasks, enabling rigorous, reproducible assessment of agents’ abilities to ground, plan, and act within actual operating systems and apps. Each presents distinct challenges: OSWorld-Verified focuses on task diversity and system heterogeneity in real desktop environments, while AndroidWorld emphasizes dynamic, parameterized mobile workflows in real Android apps. Together, they form the contemporary foundation for empirical advances in generalist computer-use agents.

1. Benchmark Design and Scope

OSWorld-Verified is the programmatically-verifiable subset of OSWorld, originally introduced by Xie et al. (Xie et al., 2024). OSWorld-Verified contains 369 tasks on Ubuntu (and optionally Windows/macOS) that span file management, system configuration, document editing, application control, and multi-app workflows. Each task is defined by an initial VM snapshot, setup scripts, and a deterministic evaluation function that inspects artifacts such as files, accessibility trees, or application states (Xie et al., 2024, Cui et al., 31 Jan 2026). The action space consists of discrete GUI operations (e.g., click, type, drag, hotkey), optionally extended with shell or code execution calls in some agents (Gonzalez-Pumariega et al., 2 Oct 2025, Cui et al., 31 Jan 2026).
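To make the evaluation-function idea concrete, here is a minimal sketch of a deterministic post-episode verifier; the task, file names, and checked content are hypothetical illustrations, not drawn from the actual benchmark code:

```python
# Hypothetical sketch of an OSWorld-style deterministic evaluation function.
# Task details (paths, expected content) are illustrative only.
from pathlib import Path

def verify_rename_task(vm_root: str) -> bool:
    """Return True iff the post-episode file system matches the goal state."""
    root = Path(vm_root)
    renamed = root / "Documents" / "report_final.txt"
    original = root / "Documents" / "report.txt"
    # Success requires the target file to exist with the expected content,
    # and the original file to be gone (renamed, not merely copied).
    return (renamed.exists()
            and not original.exists()
            and "Q4 summary" in renamed.read_text())
```

Because the verdict depends only on inspectable system state, the same check yields the same result on every re-run of the episode, which is what makes these tasks programmatically verifiable.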

AndroidWorld comprises 116 tasks covering 20 stock Android apps, including productivity, communication, navigation, and multimedia tasks (Rawles et al., 2024). Each task is a dynamic, parameterized template instantiated with specific inputs (e.g., contact names, dates) at runtime. Environment interaction uses ADB for event injection and system state queries. State observation is provided as high-resolution screenshots (2400×1080×3 RGB) and, in some variants, accessibility trees (Rawles et al., 2024). The action space aligns with native user gestures (tap, swipe, scroll, type) and device-level navigation.

Tasks in both benchmarks vary widely in complexity: from atomic actions (single dialog navigation) to long-horizon, multi-application workflows requiring persistent memory and context-aware planning. Parameter randomization and initial state shuffling ensure that agents cannot overfit to static templates (Rawles et al., 2024).
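The parameterized-template idea can be sketched as follows; the names and template API below are illustrative assumptions, not AndroidWorld's actual interface:

```python
# Minimal sketch of AndroidWorld-style parameterized task instantiation.
# The task template, contact list, and goal phrasing are hypothetical.
import random
from dataclasses import dataclass

@dataclass
class TaskInstance:
    goal: str
    params: dict

CONTACTS = ["Ana Silva", "Bo Chen", "Carla Diaz"]

def instantiate_send_message(seed: int) -> TaskInstance:
    """Draw random parameters per seed so agents cannot memorize a template."""
    rng = random.Random(seed)
    contact = rng.choice(CONTACTS)
    text = f"Meeting moved to {rng.randint(1, 12)}pm"
    return TaskInstance(
        goal=f"Send the message '{text}' to {contact} in the messaging app.",
        params={"contact": contact, "text": text},
    )
```

Seeding makes each instance reproducible for evaluation while the parameter space keeps the effective task pool effectively unbounded.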

2. Evaluation Protocols and Metrics

Task completion in OSWorld-Verified and AndroidWorld is measured by end-to-end success as determined by scripted verifiers that inspect post-episode system state rather than brittle pixel or semantic matching (Xie et al., 2024, Rawles et al., 2024). For each task $i$ with binary success indicator $s_i \in \{0, 1\}$, the primary metric is:

$$\textrm{Success Rate} = \frac{1}{N}\sum_{i=1}^N s_i \times 100\%$$

In AndroidWorld, a task may also carry a graded reward for multi-step goals; for a two-goal task:

$$r = \frac{\mathbb{1}[\textrm{goal}_1] + \mathbb{1}[\textrm{goal}_2]}{2}$$

Pass@$k$ metrics are reported when agents can attempt tasks multiple times per seed:

$$\mathrm{pass}@k = 1 - (1 - \mathrm{SR})^k$$
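The two headline metrics above translate directly into code; a minimal sketch:

```python
# Success rate over N tasks and the pass@k estimate assuming k independent
# attempts, each succeeding with per-attempt rate SR.
def success_rate(outcomes: list) -> float:
    """outcomes[i] is the 0/1 success indicator s_i for task i."""
    return 100.0 * sum(outcomes) / len(outcomes)

def pass_at_k(sr: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - sr) ** k
```

Note that the independence assumption behind pass@$k$ is optimistic when failures are systematic (e.g., a grounding error repeated on every attempt).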

Verifier accuracy is measured via precision, recall, F1, and overall agreement with human or ground-truth script labeling:

$$\mathrm{Precision} = \frac{\sum_i [\hat{R}_i=1 \wedge y_i=1]}{\sum_i [\hat{R}_i=1]}$$

$$\mathrm{Recall} = \frac{\sum_i [\hat{R}_i=1 \wedge y_i=1]}{\sum_i [y_i=1]}$$

$$\mathrm{F1} = \frac{2\,\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$$
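These verifier-agreement formulas amount to a few lines of code; a minimal sketch:

```python
# Verifier accuracy: compare predicted rewards R_hat against ground-truth
# labels y (both 0/1 lists of equal length).
def verifier_metrics(r_hat: list, y: list) -> dict:
    tp = sum(1 for p, t in zip(r_hat, y) if p == 1 and t == 1)
    precision = tp / max(sum(r_hat), 1)  # guard against no positive predictions
    recall = tp / max(sum(y), 1)         # guard against no positive labels
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return {"precision": precision, "recall": recall, "f1": f1}
```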

Test-time scaling is a central protocol: increasing the allowed number of steps or rollouts generally yields monotonic accuracy improvements, particularly for agents leveraging best-of-$N$ or read-only reproduction (Gonzalez-Pumariega et al., 2 Oct 2025, Cui et al., 31 Jan 2026). These protocols mitigate the effects of non-determinism, environment crashes, and task stochasticity.

3. Baseline Results and State-of-the-Art Performance

Recent studies provide comprehensive performance baselines. On OSWorld-Verified, the best-performing systems with full best-of-$N$ scaling approach human-level accuracy. On AndroidWorld, the strongest agents achieve success rates near or above 80%. Below is a synthesized performance summary (Pass@1 unless otherwise noted):

| Model/System | OSWorld-Verified (%) | AndroidWorld (%) |
|---|---|---|
| Human | 72.4 | 80.0 |
| Surfer 2 | 60.1 | 87.1 |
| UI-TARS-2 | 47.5 | 73.3 |
| Step-GUI-8B | 48.5 (Pass@3) | 80.2 (Pass@3) |
| Mobile-Agent-v3 | 37.7 | 73.3 |
| Ferret-UI Lite | 19.8 (50 steps) | 28.0 |

Notable scaling methods include Behavior Best-of-N (bBoN) (Gonzalez-Pumariega et al., 2 Oct 2025), achieving 69.9% on OSWorld-Verified with GPT-5 (N=10), and VAGEN, an agentic verifier that raises verification accuracy by 5–10 points on both benchmarks (Cui et al., 31 Jan 2026). UI-TARS-2 and MobileRL-9B set strong open-source records (Wang et al., 2 Sep 2025, Ye et al., 21 Aug 2025).

Performance on AndroidWorld tends to be higher than OSWorld-Verified for most agents, due to more regular mobile UI structure and shorter task horizons. Narrower agent architectures and smaller models significantly lag, especially on desktop, as seen with Ferret-UI Lite (Yang et al., 30 Sep 2025).

4. Methodological Innovations and Design Principles

Both benchmarks are distinguished by their formalization of real-system, open-ended tasks and programmatic outcome validation (Xie et al., 2024, Rawles et al., 2024, Cui et al., 31 Jan 2026). OSWorld-Verified embraces desktop task heterogeneity and partial observability, necessitating agents with robust GUI grounding, action composition, and operational knowledge. AndroidWorld is characterized by its dynamic, parameterized task generation system, yielding effectively infinite unique task instances, and by hermetic initialization and verification logic that guarantees reproducibility across random draws (Rawles et al., 2024).

Proactive agentic verification approaches such as VAGEN demonstrate superior verification through tool-based probing (e.g., shell, Python, GUI tool use), overcoming visual-evidence deficiencies in partially observable settings (Cui et al., 31 Jan 2026). Read-only majority voting and best-of-$N$ actor selection further stabilize evaluation and maximize sample efficiency.
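The aggregation pattern behind majority voting and best-of-$N$ selection can be sketched as follows; the data structures and names are illustrative assumptions, not the papers' actual implementations:

```python
# Sketch: aggregate repeated 0/1 verifier reads of each rollout's final state
# by majority vote, then select the best rollout by voted verdict.
from collections import Counter

def majority_vote(judgments: list) -> int:
    """Most common 0/1 verdict among repeated read-only verifier passes."""
    return Counter(judgments).most_common(1)[0][0]

def best_of_n(rollouts: dict) -> str:
    """Pick the rollout with the best voted verdict, breaking ties by
    vote share. rollouts maps rollout id -> list of 0/1 judgments."""
    def score(name):
        votes = rollouts[name]
        return (majority_vote(votes), sum(votes) / len(votes))
    return max(rollouts, key=score)
```

Because the verifier passes are read-only, repeating them does not perturb the environment state being judged.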

In contrast, earlier baseline or LLM-as-a-Judge paradigms rely on passive, screenshot-only analysis, which fails to robustly discriminate between alternate but correct execution paths or latent state achievement (Cui et al., 31 Jan 2026). The importance of execution-based, not representation-based, validation is repeatedly evidenced across ablation studies.

5. Analysis of Success and Failure Modes

Success in both benchmarks is highly sensitive to the quality of spatial grounding, stepwise contextual memory, and the agent’s ability to recover from intermediate errors. OSWorld-Verified agents most often fail due to inaccurate click localization, misinterpretation of task objectives, or inability to recover from modality or window state shifts (Xie et al., 2024, Yang et al., 30 Sep 2025). In AndroidWorld, although structural regularity aids agent performance, stochasticity in app data, emulator state, or parameterization can induce variance of up to 20+ percentage points for certain seeds (Rawles et al., 2024).

Progressive verification and reflective reasoning chains are effective: VAGEN’s multi-stage static, visual, and proactive validation schema allows rapid dismissal of trivial errors and direct investigation of hidden states (Cui et al., 31 Jan 2026). Behavior narrative construction in bBoN similarly guides model selection among divergent rollouts, allowing state-of-the-art sampling efficiency (Gonzalez-Pumariega et al., 2 Oct 2025).

Test-time scaling further reveals that most agents continue to gain accuracy, with diminishing returns, as step or rollout budget increases. Desktop tasks with longer horizons and greater state branching present greater challenges than mobile-native tasks (Andreux et al., 22 Oct 2025).

6. Benchmark Impact and Implications

OSWorld-Verified and AndroidWorld have catalyzed a shift toward realistic, multi-operating-system generalist agents and highlighted the necessity of robust reward modeling and verifiable evaluation. OSWorld-Verified’s suite of system-level, execution-checked tasks has become a canonical benchmark for desktop agent architecture and RL method development (Xie et al., 2024, Gonzalez-Pumariega et al., 2 Oct 2025). AndroidWorld fills the prior gap in mobile-agent evaluation by enforcing dynamic, system-level validation across real apps, supporting research in mobile-specific planning, UI grounding, and parameter robustness (Rawles et al., 2024).

The observation that “GUI tasks are easy to verify but hard to solve” remains central. Verifier sophistication—particularly agentic, interactive approaches—translates directly to downstream policy improvements via RL with verifiable rewards (RLVR) and test-time selection (Cui et al., 31 Jan 2026, Yan et al., 17 Dec 2025).

A plausible implication is that further unification of desktop, mobile, and web benchmarks under a dynamic, RL-compatible reward and scaling framework will become the standard for evaluating next-generation foundation agents.

7. Open Challenges and Future Directions

Despite rapid progress, desktop environments in OSWorld-Verified remain more brittle for agents due to high GUI heterogeneity, multi-window complexity, and long-horizon tasks (Xie et al., 2024, Andreux et al., 22 Oct 2025). Programmatic evaluation itself can yield “false negatives” when agents complete tasks via alternative valid strategies not covered by script logic, motivating ongoing work in more flexible, semantic verification (Cui et al., 31 Jan 2026).

AndroidWorld’s infinite parameterization exposes agents to rare and adversarial task configurations; mean-over-seeds and pass@k are now viewed as more robust reporting practices than single-seed results (Rawles et al., 2024).

A plausible implication is increased research attention to self-evolving, trajectory-level reward pipelines, agentic interactive verifiers, and unified sandboxed environments capable of spanning desktop, mobile, and web domains at scale (Yan et al., 17 Dec 2025, Ye et al., 21 Aug 2025). Expansion to handle non-verifiable, truly open-ended tasks remains an open research challenge.


References:

  • Xie et al., 2024
  • Rawles et al., 2024
  • Gonzalez-Pumariega et al., 2 Oct 2025
  • Andreux et al., 22 Oct 2025
  • Ye et al., 21 Aug 2025
  • Yang et al., 30 Sep 2025
  • Yan et al., 17 Dec 2025
  • Wang et al., 2 Sep 2025
  • Cui et al., 31 Jan 2026
