OSWorld-MCP Benchmark Suite
- OSWorld-MCP is a benchmark suite that evaluates multimodal computer-use agents by integrating GUI manipulations with MCP tool invocations in realistic desktop settings.
- It combines automated code generation, manual curation, and rigorous task validation across diverse application domains such as education, creativity, and productivity.
- The infrastructure emphasizes reproducibility and security through detailed workload annotations, performance metrics, and proactive threat mitigation strategies.
OSWorld-MCP is a benchmark and infrastructure suite designed for evaluating multimodal computer-use agents (CUAs) in realistic desktop environments. It explicitly targets the assessment of both GUI-manipulation skills and tool invocation (via the Model Context Protocol, MCP), aiming to provide a comprehensive, fair, and reproducible evaluation of next-generation AI agents’ digital tool-usage abilities. By combining automated code-generated tools, curated open-source MCP toolsets, detailed workload annotations, rigorous task validation, and a security-aware design, OSWorld-MCP sets a new standard for benchmarking the complex interplay of perception, reasoning, and tool co-ordination in multimodal agents (Jia et al., 28 Oct 2025, Yan et al., 9 Jun 2025).
1. Foundations and Design Principles
OSWorld-MCP builds on the Model Context Protocol (MCP), which specifies a machine-interpretable interface for exposing desktop application functionality to external agents. Each MCP server provides "tools"—named procedures described by structured JSON schemas—that agents can invoke, decoupling the agent’s action space from low-level GUI operations and mitigating brittleness to UI changes (Yan et al., 9 Jun 2025). The environment emphasizes "white-box" applications, where source code modifications allow the insertion of MCP servers and programmatic verification hooks, enabling robust task outcome assessment independent of visual layout.
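To make the "named procedures described by structured JSON schemas" concrete, the following is an illustrative (not normative) tool descriptor of the kind an MCP server exposes; the tool name `zotero_add_item` and its fields are invented for this example, though the `name`/`description`/`inputSchema` shape follows the MCP convention:

```python
import json

# Illustrative MCP tool descriptor: a named procedure exposed by a server,
# with its inputs described by a JSON Schema. Field values are hypothetical.
zotero_add_item = {
    "name": "zotero_add_item",
    "description": "Create a new bibliography item in the active Zotero library.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "itemType": {"type": "string", "enum": ["book", "journalArticle"]},
            "title": {"type": "string"},
            "creators": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["itemType", "title"],
    },
}

# An agent selects this tool by name and fills arguments against the schema,
# instead of locating and clicking the corresponding GUI controls.
print(json.dumps(zotero_add_item["inputSchema"]["required"]))
```

Because the agent targets the schema rather than pixel coordinates, the same invocation survives UI redesigns of the underlying application.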
Agents operate within a unified, containerized environment (Ubuntu/Windows/macOS VMs), interfacing either through traditional GUI actions (click, type, drag) or formally defined MCP tools. The hybrid benchmark structure covers tasks across education (Zotero, Anki, QGIS), creativity (OBS Studio, Blender), productivity (Zulip, Joplin, qBittorrent), and development (VS Code, FreeCAD) domains, reflecting the diversity of real-world digital workflows (Yan et al., 9 Jun 2025).
2. Benchmark Suite Construction and Tool Generation
The benchmark suite comprises 361 tasks spanning nine desktop applications (excluding eight Google Drive tasks for technical reasons). Each task specifies an end-use goal and is annotated with a difficulty level, intermediate key steps, and tool suitability (beneficial/non-beneficial). The action space for each agent at time step $t$ is $\mathcal{A}_t = \mathcal{A}_{\mathrm{GUI}} \cup \mathcal{A}_{\mathrm{MCP}}$, where $\mathcal{A}_{\mathrm{MCP}}$ consists of 158 curated MCP tools (Jia et al., 28 Oct 2025).
MCP tools are generated via a three-stage pipeline:
- Automated Code Generation: LLMs (OpenAI o3, CoAct style) are prompted with task descriptions, outputting Python/Node scripts.
- Filtering and Wrapping: Candidate scripts are tested for functional correctness, wrapped in a JSON-RPC MCP interface, and subjected to headless task validation.
- Manual Curation: 192 additional tools are sourced from public MCP registries; with duplicates removed, this yields a final set of 158 diverse and practically applicable tools across domains (e.g., VS Code: 20 tools, LibreOffice Calc: 30, OS-level shell/fs: 33; see Table 1).
| Application Domain | #Tools |
|---|---|
| VS Code | 20 |
| Google Chrome | 12 |
| LibreOffice Calc | 30 |
| LibreOffice Writer | 15 |
| LibreOffice Impress | 22 |
| OS-level (shell, fs) | 33 |
| VLC | 12 |
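The filtering-and-wrapping stage can be sketched as a minimal JSON-RPC shim around a generated script. This is a simplified illustration, not the benchmark's actual server code: real MCP servers speak the full protocol handshake over stdio or HTTP, and `add_numbers` stands in for an LLM-generated candidate script.

```python
import json

def add_numbers(a: int, b: int) -> int:
    """Stand-in for an LLM-generated candidate script."""
    return a + b

# Registry of wrapped candidate scripts, keyed by tool name.
TOOLS = {"add_numbers": add_numbers}

def handle_jsonrpc(raw: str) -> str:
    """Minimal JSON-RPC 2.0 dispatch of the kind used to wrap candidate
    scripts as MCP tools (simplified: no error objects, no handshake)."""
    req = json.loads(raw)
    tool = TOOLS[req["params"]["name"]]
    result = tool(**req["params"]["arguments"])
    return json.dumps({"jsonrpc": "2.0", "id": req["id"], "result": result})

# Headless validation then replays recorded requests against the wrapper.
resp = handle_jsonrpc(json.dumps({
    "jsonrpc": "2.0", "id": 1, "method": "tools/call",
    "params": {"name": "add_numbers", "arguments": {"a": 2, "b": 3}},
}))
print(resp)
```

Wrapping every candidate behind the same dispatch boundary is what lets the pipeline test functional correctness uniformly before a tool is admitted to the suite.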
133 tools were empirically demonstrated to enhance task efficiency, with support for multi-tool chaining required in 42% of relevant tasks (Jia et al., 28 Oct 2025).
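Multi-tool chaining, as required in those tasks, means the output of one tool invocation becomes the argument of the next. A toy sketch with invented tool names (`fs_find`, `calc_get_cell`) and in-memory stand-ins for the file system and spreadsheet:

```python
# Hypothetical two-step chain: an OS-level tool locates a spreadsheet,
# then a LibreOffice Calc tool reads a cell from it. Names and data
# structures are invented for illustration.
def fs_find(suffix, files):
    return [f for f in files if f.endswith(suffix)]

def calc_get_cell(path, cell, workbooks):
    return workbooks[path][cell]

files = ["report.odt", "budget.ods"]
workbooks = {"budget.ods": {"A1": 1200}}

# Chaining: the first tool's output feeds the second tool's argument.
matches = fs_find(".ods", files)
value = calc_get_cell(matches[0], "A1", workbooks)
print(value)
```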
3. Evaluation Protocols and Metrics
OSWorld-MCP tasks are run under strict dockerized environments enforcing clean state resets and hardware consistency (GPU passthrough as required). Each agent-task pair is executed within a maximum budget of 15 or 50 steps. Metrics are:
- Task Accuracy: $\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}[\text{task } i \text{ completed successfully}]$, over $N$ tasks.
- Tool Invocation Rate (TIR): fraction of tasks employing at least one MCP tool, partitioned by tool-beneficial vs. non-tool-beneficial status.
- Average Completion Steps (ACS): $\mathrm{ACS} = \frac{1}{N}\sum_{i=1}^{N} s_i$, with $s_i$ the number of steps taken in task $i$.
Key Step Completion Rate (KSCR) is used in hybrid system tests: $\mathrm{KSCR} = \frac{k_j}{K}$, where $k_j$ out of $K$ key intermediate steps are completed on run $j$ (Jia et al., 28 Oct 2025, Yan et al., 9 Jun 2025).
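The metrics above can be computed directly from per-task execution logs. The record fields below (`success`, `steps`, `used_mcp_tool`, `key_steps_done`) are illustrative names, not the benchmark's actual log schema:

```python
from dataclasses import dataclass

@dataclass
class RunRecord:
    # Per-task execution log; field names are illustrative.
    success: bool          # did the verifier accept the final state?
    steps: int             # actions consumed (GUI clicks + tool calls)
    used_mcp_tool: bool    # at least one MCP invocation in the trajectory
    key_steps_done: int    # intermediate key steps completed
    key_steps_total: int   # annotated key steps for the task

runs = [
    RunRecord(True, 9, True, 4, 4),
    RunRecord(False, 15, False, 1, 5),
    RunRecord(True, 6, True, 3, 3),
]

n = len(runs)
accuracy = sum(r.success for r in runs) / n                      # Acc
tir = sum(r.used_mcp_tool for r in runs) / n                     # TIR
acs = sum(r.steps for r in runs) / n                             # ACS
kscr = sum(r.key_steps_done / r.key_steps_total for r in runs) / n
print(accuracy, tir, acs, round(kscr, 3))
```

In practice TIR is additionally partitioned by the tool-beneficial annotation, which would simply filter `runs` before averaging.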
Verification eschews screenshot/file-diff approaches in favor of runtime hooks:
- Dynamic Binary Instrumentation (e.g., via Frida) for compiled apps, intercepting memory-resident function calls.
- Code Injection for interpreted/Electron apps, instrumenting in-app state transitions.
- API-driven Queries for apps with built-in state endpoints.
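For interpreted apps, the injected verification code amounts to wrapping an in-app function so every state transition is recorded, then judging task success against the recorded transitions rather than screenshots. A minimal sketch, with `save_note` standing in for a real in-app function:

```python
import functools

# Log of observed in-app state transitions (function name, args, result).
transitions = []

def instrument(fn):
    """Hook a function so each call is appended to the transition log."""
    @functools.wraps(fn)
    def hooked(*args, **kwargs):
        result = fn(*args, **kwargs)
        transitions.append((fn.__name__, args, result))
        return result
    return hooked

@instrument
def save_note(title):          # stand-in for an Electron/interpreted-app API
    return f"saved:{title}"

save_note("todo")

# Outcome check reads the semantic log, not pixels or file diffs.
task_succeeded = any(name == "save_note" for name, _, _ in transitions)
print(task_succeeded)
```

Dynamic binary instrumentation via Frida plays the analogous role for compiled apps, intercepting the memory-resident function instead of a Python-level one.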
4. Experimental Results and Performance Analysis
Extensive evaluations demonstrate that MCP tool invocation provides significant accuracy and efficiency gains. With a step limit of 15, OpenAI o3’s accuracy improves from 8.3% (GUI-only) to 20.4% (+MCP), and Claude 4 Sonnet from 30.2% to 35.3%. Tool invocation rates (TIR) remain low, peaking at 36.3% for Claude 4 Sonnet at 50 steps, highlighting persistent usability and composition challenges (Jia et al., 28 Oct 2025). The Hybrid configuration (GUI+MCP) in the original MCPWorld experiments generally achieves higher overall and hard-task accuracy compared to GUI-only or MCP-only baselines (Hybrid: 75.12%, GUI-only: 70.65%, MCP-only: 53.23% over 201 tasks) (Yan et al., 9 Jun 2025).
Performance improvement on “hard” tasks (+15.55% Hybrid vs. GUI-only) suggests that direct API access via MCP is essential for automating workflows not robustly addressable through GUIs alone. However, TIR’s modest values indicate a need for more sophisticated agent policies, particularly for tool selection and multi-step chaining (Jia et al., 28 Oct 2025).
5. Security Architecture and Threat Mitigation
OSWorld-MCP and the general MCP ecosystem are subject to a nontrivial threat landscape, comprising protocol design flaws (e.g., tool poisoning, shadowing), traditional web vulnerabilities (e.g., command injection, SSRF), and supply-chain threats (e.g., registry manipulation, version hijacking) (Brett, 28 Apr 2025, Wang et al., 27 Oct 2025, Li et al., 18 Oct 2025). The formal risk is modeled as a product of threat likelihood and impact, $\mathrm{Risk} = \mathrm{Likelihood} \times \mathrm{Impact}$, with quantitative analysis indicating a 0.77% aggregate server hijack probability in public MCP registries.
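One way such an aggregate probability can arise is as a weighted sum of per-category hijack likelihoods over the registry population. The category names, shares, and probabilities below are invented for illustration and are not the cited papers' actual figures or derivation; they are chosen only so the weighted sum lands near the reported 0.77%:

```python
# Illustrative aggregation (NOT the papers' exact derivation): each entry is
# (share of registry population, estimated hijack probability). All numbers
# are hypothetical.
categories = {
    "unsigned release artifacts": (0.05, 0.100),
    "name-squattable packages":   (0.10, 0.020),
    "stale maintainer accounts":  (0.04, 0.017),
}

aggregate = sum(share * p for share, p in categories.values())
print(f"{aggregate:.4f}")
```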
Mitigation strategies include:
- Multi-stage proactive server-side scanning (regex pattern matching, neural net classification, LLM arbitration).
- Agentic auditing pipelines issuing crafted edge-case MCP requests.
- Runtime policy enforcement via OPA/Rego DSLs, rate-limiting, and middleware-based signature checks.
- Zero-trust registries with trust score gating.
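The runtime-enforcement bullet can be sketched as a middleware gate that verifies a signature on tool metadata and applies a sliding-window rate limit before dispatch. This is a minimal stand-in for an OPA/Rego policy engine, with a placeholder HMAC secret rather than real registry key material:

```python
import hashlib
import hmac
import time

SECRET = b"registry-signing-key"   # placeholder; real deployments use managed keys
RATE_LIMIT = 3                     # max calls per 1-second window (illustrative)
calls = []                         # timestamps of recent accepted calls

def sign(payload: bytes) -> str:
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()

def enforce(payload: bytes, signature: str) -> bool:
    """Middleware-style gate: verify the metadata signature, then apply a
    crude sliding-window rate limit, before any tool dispatch happens."""
    if not hmac.compare_digest(sign(payload), signature):
        return False               # tampered or unsigned tool metadata
    now = time.monotonic()
    calls[:] = [t for t in calls if now - t < 1.0] + [now]
    return len(calls) <= RATE_LIMIT

meta = b'{"name": "fs_read", "version": "1.2.0"}'
ok = enforce(meta, sign(meta))     # valid signature -> admitted
bad = enforce(meta, "deadbeef")    # forged signature -> rejected
print(ok, bad)
```

A production deployment would externalize the decision to a policy engine (OPA evaluating Rego rules) so policies can change without redeploying the middleware.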
Defensive best practices for OSWorld-MCP deployments entail cryptographic signature verification on metadata, continuous runtime monitoring, enforced tool identity disambiguation, output filtering, and robust CI/CD integration of vulnerability scanning tools (Brett, 28 Apr 2025, Wang et al., 27 Oct 2025).
6. Dataset, Population Methodology, and Ecosystem Context
Population of OSWorld-MCP leverages large-scale MCP artifact datasets, notably MCPCorpus, which aggregates ≈14,000 MCP servers and ≈300 MCP clients, with detailed 26-field annotation per artifact (Lin et al., 30 Jun 2025). The pipeline encompasses registry crawling, GitHub metadata enrichment, de-duplication, normalization, and classification—combined with metrics for activity ratio, maintenance score, adoption rate, and security vulnerability severity.
| Attribute | Definition/Formula | Example Value |
|---|---|---|
| project_age | days since repository creation | 210 |
| activity_ratio | recent commit activity normalized by project age | 0.17 |
| star_rate | stars accrued relative to project age | 0.05 |
| maintenance_score | α/β-weighted combination of the above | 0.75 |
| security_vulnerability_score | aggregate severity of known vulnerabilities | 4.7 |
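The weighted maintenance score can be sketched as below; the weights, normalization, and exact inputs are assumptions for illustration, and MCPCorpus's actual formula may differ:

```python
# Hypothetical α/β weighting of activity and popularity into a maintenance
# score in [0, 1]; weights and clamping are illustrative assumptions.
def maintenance_score(activity_ratio, star_rate, alpha=0.7, beta=0.3):
    a = min(max(activity_ratio, 0.0), 1.0)   # clamp inputs to [0, 1]
    s = min(max(star_rate, 0.0), 1.0)
    return alpha * a + beta * s

print(round(maintenance_score(0.17, 0.05), 3))
```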
Adoption rates are high: from 1,200 servers in 2023 to ≈13,900 by Q2 2025. The median time since last commit is 14 days, and 60% of servers were updated in the last month. Python and TypeScript dominate, with REST/JSON the prevalent interface mechanism (Lin et al., 30 Jun 2025).
Daily CI-driven re-crawling, data normalization, and vulnerability/CVE scanning can be instrumented for continuous dataset evolution and security assurance.
7. Implications, Open Challenges, and Future Directions
OSWorld-MCP establishes the necessity of jointly evaluating tool invocation, GUI operation, and planning in digital agents, and provides a public foundation for comparative benchmarking and ecosystem research (Jia et al., 28 Oct 2025). Despite demonstrable benefits, bottlenecks persist in tool discovery, invocation rate optimization, and secure supply-chain management. Open avenues include:
- Expanding coverage to more applications, multi-turn, and multi-agent workflows.
- Advancing agent strategies for multi-tool composition.
- Extending security analysis to support human-in-the-loop and policy-driven validation (Yan et al., 9 Jun 2025, Wang et al., 27 Oct 2025).
A plausible implication is that, as the ecosystem matures, formal verification of tool provenance, metadata signing, and policy-driven runtime governance will become as essential as agent benchmarking itself. The ongoing evolution of OSWorld-MCP, in conjunction with MCPCorpus and systematic security programs, is likely to further the state-of-the-art in safe, robust, and generalizable computer-use agent automation.