
MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers

Published 28 Aug 2025 in cs.CL (arXiv:2508.20453v1)

Abstract: We introduce MCP-Bench, a benchmark for evaluating LLMs on realistic, multi-step tasks that demand tool use, cross-tool coordination, precise parameter control, and planning/reasoning for solving tasks. Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs to 28 representative live MCP servers spanning 250 tools across domains such as finance, traveling, scientific computing, and academic search. Unlike prior API-based benchmarks, each MCP server provides a set of complementary tools designed to work together, enabling the construction of authentic, multi-step tasks with rich input-output coupling. Tasks in MCP-Bench test agents' ability to retrieve relevant tools from fuzzy instructions without explicit tool names, plan multi-hop execution trajectories for complex objectives, ground responses in intermediate tool outputs, and orchestrate cross-domain workflows - capabilities not adequately evaluated by existing benchmarks that rely on explicit tool specifications, shallow few-step workflows, and isolated domain operations. We propose a multi-faceted evaluation framework covering tool-level schema understanding and usage, trajectory-level planning, and task completion. Experiments on 20 advanced LLMs reveal persistent challenges in MCP-Bench. Code and data: https://github.com/Accenture/mcp-bench.

Summary

  • The paper introduces MCP-Bench, a benchmark that evaluates LLM agents performing real-world multi-server tasks using a diverse ecosystem of 250 tools.
  • The methodology leverages an LLM-driven synthesis pipeline to generate complex tasks with dependency chains, fuzzy instructions, and quality filtering.
  • Empirical results show that strong models manage multi-server workflows efficiently, while long-horizon planning and cross-domain orchestration remain persistent challenges, especially for weaker models.

MCP-Bench: A Rigorous Benchmark for Tool-Using LLM Agents in Real-World Multi-Server Ecosystems

Motivation and Benchmark Design

MCP-Bench addresses critical deficiencies in prior tool-use benchmarks for LLM agents by leveraging the Model Context Protocol (MCP) to expose agents to a diverse, production-grade ecosystem of 28 live servers and 250 structured tools spanning domains such as finance, science, healthcare, travel, and research. Unlike API aggregation benchmarks (e.g., ToolBench, BFCL v3), which suffer from shallow, artificially stitched workflows and limited cross-tool compatibility, MCP-Bench enables authentic multi-step, multi-domain tasks with rich input-output coupling and realistic dependency chains. The benchmark is constructed via an LLM-driven synthesis pipeline that analyzes tool I/O signatures to discover dependency chains, generates natural language instructions, and applies rigorous quality filtering for solvability and utility. Each task is further rewritten into a fuzzy, instruction-minimal variant, omitting explicit tool names and steps to stress-test agents' retrieval and planning capabilities under ambiguity.

Figure 1: MCP-Bench connects LLM agents to real-world MCP servers exposing 250 structured tools across domains such as finance, science, and research. Tasks are generated via LLM-based synthesis, then executed by the agent through multi-turn tool invocations. Each execution trajectory is evaluated using a combination of rule-based checks and LLM-as-a-Judge scoring, assessing agent performance in tool schema understanding, multi-hop planning, and real-world adaptability.

MCP Server Ecosystem and Task Synthesis

The MCP server ecosystem in MCP-Bench is highly heterogeneous, with servers ranging from single-tool endpoints (e.g., FruityVice, Movie Recommender) to complex platforms such as BioMCP (35 tools), Scientific Computing (26 tools), and Medical Calculator (22 tools). Domains include media, finance, research, software development, travel, health, and more, ensuring broad coverage of real-world tool-use scenarios.

Figure 2: Overview of the MCP server ecosystem used in the benchmark, illustrating the diversity of domains and tool distributions.

Task synthesis proceeds in three stages: (1) dependency chain discovery, (2) automatic quality filtering, and (3) task description fuzzing. Dependency chains are identified by analyzing tool input/output relationships, enabling the construction of tasks with deep sequential and parallel dependencies, cross-server orchestration, and multi-goal objectives. Quality filtering ensures that only tasks with high solvability (≥9/10) and practical utility (≥5/10) are retained. Fuzzy task variants are generated to require agents to infer tool selection and execution strategies from context, rather than explicit instructions, thereby testing semantic retrieval and planning under ambiguity.
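
To make the filtering stage concrete, the sketch below applies the reported thresholds (solvability ≥ 9/10, utility ≥ 5/10) to synthesized task candidates. The `SynthesizedTask` structure and its field names are illustrative assumptions, not the paper's actual data model; in the real pipeline the scores would come from an LLM rater.

```python
from dataclasses import dataclass

@dataclass
class SynthesizedTask:
    description: str             # natural-language instruction
    dependency_chain: list[str]  # ordered tool names discovered from I/O matching
    solvability: float           # hypothetical LLM-rated score on a 0-10 scale
    utility: float               # hypothetical LLM-rated score on a 0-10 scale

def passes_quality_filter(task: SynthesizedTask,
                          min_solvability: float = 9.0,
                          min_utility: float = 5.0) -> bool:
    """Retain only tasks rated highly solvable and practically useful,
    mirroring the thresholds reported for MCP-Bench (>= 9/10 and >= 5/10)."""
    return task.solvability >= min_solvability and task.utility >= min_utility

# Example: a candidate below the solvability threshold is discarded.
candidate = SynthesizedTask(
    description="Compare hotel options near a conference venue and summarize trade-offs",
    dependency_chain=["search_hotels", "get_price", "summarize"],
    solvability=7.5,
    utility=6.0,
)
assert not passes_quality_filter(candidate)
```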

Formalization and Agent Execution Paradigm

MCP-Bench formalizes tool-using agent evaluation as a structured extension of the POMDP framework, with each task represented as $(\mathcal{S}, \mathcal{A}, \mathcal{O}, T, R, \mathcal{U}, \Sigma)$, where $\Sigma$ is the set of available MCP servers and $\mathcal{T} = \bigcup_i \mathcal{T}_i$ is the global tool set. Agents operate in a multi-round decision process, generating plans, executing parallel tool calls, compressing observations, and updating internal state at each round. The process continues for up to $T_{\max}$ rounds (20 in MCP-Bench) or until termination is signaled. This paradigm supports both intra-server and cross-server workflows, with explicit support for parallel execution and dependency management.
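
The multi-round paradigm can be summarized with a minimal loop like the one below, assuming a generic planner/executor interface; the `plan`, `execute_parallel`, and `compress` callables are hypothetical stand-ins, not the benchmark's actual API.

```python
from typing import Any, Callable

def run_agent(task: str,
              plan: Callable[[str, list[Any]], dict],
              execute_parallel: Callable[[list[dict]], list[Any]],
              compress: Callable[[list[Any]], Any],
              t_max: int = 20) -> list[Any]:
    """Multi-round decision loop: plan, issue parallel tool calls, compress
    observations, and update state until termination or t_max rounds."""
    state: list[Any] = []                  # accumulated (compressed) observations
    for _ in range(t_max):                 # T_max = 20 rounds in MCP-Bench
        step = plan(task, state)           # e.g. {"tool_calls": [...], "done": bool}
        if step.get("done"):
            break                          # agent signals termination
        observations = execute_parallel(step["tool_calls"])  # intra- or cross-server calls
        state.append(compress(observations))                 # keep the context window bounded
    return state
```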

Evaluation Framework

MCP-Bench employs a dual evaluation framework:

  • Rule-based metrics: Tool name validity, schema compliance, and execution success are computed from execution traces, penalizing hallucinations, malformed requests, and runtime failures (a minimal sketch of these checks follows this list).
  • LLM-as-a-Judge scoring: Strategic quality is assessed across three axes—task completion, tool usage, and planning effectiveness—using structured rubrics. Each axis is decomposed into sub-dimensions (e.g., fulfillment, grounding, appropriateness, parameter accuracy, dependency awareness, parallelism/efficiency), scored on a strict percentage-based system. Prompt shuffling and score averaging are applied to mitigate scoring bias and improve robustness.
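
As a rough illustration of how the rule-based checks might be computed from an execution trace, consider the sketch below. The trace-record layout and the use of the `jsonschema` package are assumptions for illustration, not the benchmark's actual implementation.

```python
from jsonschema import ValidationError, validate  # assumed dependency for schema checks

def rule_based_metrics(trace: list[dict], registry: dict[str, dict]) -> dict[str, float]:
    """Compute tool-name validity, schema compliance, and execution success over a
    trace of tool-call records, each assumed to look like
    {"tool": str, "arguments": dict, "error": str | None}; `registry` maps tool
    names to their JSON Schemas (an assumed layout)."""
    valid_name = schema_ok = executed = 0
    for call in trace:
        name_known = call["tool"] in registry            # penalize hallucinated tool names
        valid_name += name_known
        if name_known:
            try:
                validate(call["arguments"], registry[call["tool"]])  # catch malformed requests
                schema_ok += 1
            except ValidationError:
                pass
        executed += call.get("error") is None            # count runtime failures
    n = max(len(trace), 1)
    return {"tool_name_validity": valid_name / n,
            "schema_compliance": schema_ok / n,
            "execution_success": executed / n}
```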

Ablation studies demonstrate that prompt shuffling and score averaging reduce the coefficient of variation among LLM judges (from 16.8% to 15.1%) and improve human agreement scores (from 1.24 to 1.43 out of 2), substantiating the reliability of the evaluation pipeline.
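
The debiasing idea behind prompt shuffling and score averaging can be sketched as follows; the `judge` callable and the flat rubric representation are hypothetical, and the actual pipeline uses structured rubrics rather than a simple list of items.

```python
import random
import statistics
from typing import Callable, Sequence

def judge_with_shuffling(judge: Callable[[str, str, Sequence[str]], float],
                         task: str,
                         trajectory: str,
                         rubric_items: Sequence[str],
                         n_runs: int = 3) -> float:
    """Average scores over several judge calls, shuffling rubric-item order each
    run to reduce position bias in LLM-as-a-Judge scoring."""
    scores = []
    for seed in range(n_runs):
        items = list(rubric_items)
        random.Random(seed).shuffle(items)   # randomize rubric ordering per run
        scores.append(judge(task, trajectory, items))
    return statistics.mean(scores)
```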

Empirical Results and Analysis

Twenty advanced LLMs were evaluated on 104 MCP-Bench tasks, spanning both single-server and multi-server settings. Schema understanding and valid tool naming have largely converged across models, with top-tier systems (o3, gpt-5, gpt-oss-120b, qwen3-235b-a22b-2507, gpt-4o) exceeding 98% compliance. However, substantial gaps persist in higher-order reasoning and planning:

  • Overall scores: gpt-5 (0.749), o3 (0.715), and gpt-oss-120b (0.692) lead, reflecting robust planning and execution. Smaller models (e.g., llama-3-1-8b-instruct, 0.428) exhibit weak dependency awareness and parallelism.
  • Scalability: Strong models maintain stable performance across increasing server counts, while weaker models degrade, especially in multi-server scenarios. The main sources of decline are dependency management and parallel orchestration.
  • Resource efficiency: Smaller models require more rounds and tool calls (e.g., llama-3-1-8b-instruct: 17.3 rounds, 155.6 calls/task), whereas strong models (e.g., o3, gpt-4o) achieve comparable success with leaner execution (6–8 rounds, <40 calls/task).

These results indicate that basic execution fidelity is no longer the bottleneck; the key differentiators are long-horizon planning, cross-server orchestration, and robust adaptation to complex, ambiguous workflows.

Implementation Considerations

MCP-Bench is designed for extensibility and reproducibility. The benchmark is open-source (https://github.com/Accenture/mcp-bench), with all MCP server endpoints, tool schemas, and task generation pipelines documented. The evaluation framework is modular, supporting integration with new LLMs and custom judge models. For practical deployment, agents must implement robust schema validation, error handling, and context compression to manage long tool outputs and avoid context window overflow. The multi-round planning paradigm requires agents to maintain persistent state and reason over intermediate results, with explicit support for parallel execution and dependency tracking.
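
As one example of the context-compression concern, a minimal non-LLM fallback for keeping long tool outputs inside the context window might look like the following; this is an illustrative sketch, not the benchmark's actual summarization step.

```python
def compress_observation(text: str, max_chars: int = 4000) -> str:
    """Keep the head and tail of an oversized tool output so identifiers and
    summaries survive truncation; a real agent would typically use an LLM-based
    summarizer instead of simple truncation."""
    if len(text) <= max_chars:
        return text
    half = max_chars // 2
    return text[:half] + "\n...[truncated]...\n" + text[-half:]
```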

Implications and Future Directions

MCP-Bench exposes persistent weaknesses in current LLM agents, particularly in multi-hop planning, cross-domain orchestration, and evidence-based reasoning under fuzzy instructions. The benchmark provides a rigorous platform for advancing agentic capabilities, with implications for real-world deployment in domains such as healthcare, finance, scientific research, and travel planning. Future developments may include:

  • Expansion to additional MCP servers and domains, increasing task diversity and complexity.
  • Integration of multimodal tools (e.g., vision, audio) and real-time data sources.
  • Development of advanced agent architectures with explicit memory, hierarchical planning, and meta-reasoning capabilities.
  • Refinement of evaluation metrics to capture nuanced aspects of strategic reasoning, error recovery, and user-centric adaptability.

Conclusion

MCP-Bench establishes a new standard for evaluating tool-using LLM agents in realistic, ecosystem-based scenarios. By bridging the gap between isolated API benchmarks and production-grade multi-server environments, MCP-Bench enables comprehensive assessment of agentic reasoning, planning, and tool-use capabilities. The empirical results highlight that while schema understanding and execution fidelity have converged, robust planning and cross-domain orchestration remain open challenges. MCP-Bench provides the necessary infrastructure and methodology for driving progress in agentic LLM research and deployment.

Knowledge Gaps

Unresolved Gaps and Limitations

Below is a concise list of concrete knowledge gaps, limitations, and open questions that remain unresolved and could guide future research:

  • Benchmark stability and reproducibility: Live MCP servers can change over time (data drift, outages, rate limits, schema updates), but no server versioning, snapshotting, or replay logs are provided to guarantee reproducible runs.
  • Judge reliability and bias: The LLM-as-a-Judge (o4-mini) is a single-model evaluator; there is no inter-judge agreement analysis, cross-family judge comparison, or correlation with human judgments to validate scoring reliability and fairness.
  • Circularity and contamination risk: The same vendor family is used for task synthesis and judging, and many evaluated systems are from that vendor; the impact of this circularity on outcomes is not quantified.
  • Underspecified metric definitions: The paper claims rule-based checks for “dependency order/compliance,” but only defines three metrics (tool validity, schema compliance, execution success) formally; dependency compliance and parallelism efficiency lack explicit, reproducible formulas.
  • Overall score construction: How the multiple axes are combined into the final “Overall Score” (weights, normalization, aggregation) is not precisely defined, hindering interpretability and replication.
  • Run-to-run variance and statistical testing: Model sampling settings, seeds, and multi-run variance for agents (not just judges) are not reported; no confidence intervals or significance tests are presented.
  • Effect of compression policy: The summarization step π_compress can drop crucial details; there is no ablation on different compression strategies, lossiness, or its impact on planning and grounding.
  • Parallelism evaluation vs. capability: “Parallelism and efficiency” is scored by an LLM judge, but the agent runtime appears sequential; it is unclear whether true concurrency is supported or how efficiency is objectively measured.
  • Tool parameter correctness beyond schema: Rule-based checks only ensure format/schema validity; semantic correctness of parameters (e.g., correct units, ranges, constraints) is left to the judge with no grounded verification.
  • Limited task diversity and scale: Only 104 tasks (with max 20 steps) may be insufficient to expose long-horizon weaknesses; the benchmark does not explore very long plans, dynamic replanning, or workflows with more than three servers.
  • Representativeness of domains and operations: Many real-world scenarios (authentication, payments, irreversible side-effects like booking, rate limiting, quotas) are absent or underrepresented, limiting ecological validity.
  • Fuzzy-instruction fidelity: Fuzzy tasks preserve numeric details; real users often omit key parameters. The benchmark does not measure the agent’s ability to elicit missing information via clarification questions.
  • Real-user task realism: Tasks are LLM-synthesized with limited human curation; there is no evaluation on user-sourced tasks or an analysis of synthesis-induced biases.
  • Distractor selection rigor: Adding 10 distractor servers per task increases difficulty, but their semantic closeness to the target tools and the effect of varying distractor count/quality are unstudied.
  • Generalization to unseen tools/servers: There is no leave-one-server-out or unseen-schema evaluation to assess transfer beyond the specific MCP servers included.
  • Robustness to tool/output hazards: The benchmark does not test adversarial conditions (malformed schemas, prompt-injected outputs, noisy/contradictory results, flaky endpoints) or recovery strategies after failures.
  • Safety, privacy, and compliance: No metrics or scenarios test safe tool use (e.g., medical decision boundaries, PII handling, regulatory constraints) or the agent’s ability to refuse unsafe requests.
  • Cross-lingual robustness: Tasks and tool descriptions are English-centric; the benchmark does not evaluate multilingual instructions or localization differences in tool outputs.
  • Alternative valid trajectories: Judges are given the unfuzzed task and dependency analysis unavailable to agents, risking bias toward a single canonical plan; there is no explicit accommodation for scoring correct, alternative execution paths.
  • Agent design confounders: Only model families vary; agentic design choices (retrievers, memory modules, planners, tool-selection strategies) are not systematically ablated, making it hard to attribute gains to model vs. agent architecture.
  • Error taxonomy and qualitative insights: There is no systematic categorization of failure modes (e.g., retrieval errors, parameterization mistakes, grounding lapses), which limits actionable guidance for improving agents.
  • Temporal and geographic sensitivity: Tasks involving weather, maps, or finance are time/locale-sensitive; the benchmark does not control or report these contextual factors.
  • Accessibility and maintenance: Reliance on external “production-grade” servers raises availability and licensing concerns; long-term maintenance plans (server deprecation, replacements, mirrored endpoints) are not described.
  • Cost and latency: The benchmark does not measure computational cost, latency, or efficiency trade-offs, which are critical for practical deployment.
  • Multi-agent or tool-learning settings: The benchmark focuses on a single agent and fixed tool schemas; it does not explore collaborative agents, on-the-fly tool learning, or schema induction from sparse documentation.
