
W&D: Scaling Parallel Tool Calling for Efficient Deep Research Agents

Published 7 Feb 2026 in cs.AI | (2602.07359v1)

Abstract: Deep research agents have emerged as powerful tools for automating complex intellectual tasks through multi-step reasoning and web-based information seeking. While recent efforts have successfully enhanced these agents by scaling depth through increasing the number of sequential thinking and tool calls, the potential of scaling width via parallel tool calling remains largely unexplored. In this work, we propose the Wide and Deep research agent, a framework designed to investigate the behavior and performance of agents when scaling not only depth but also width via parallel tool calling. Unlike existing approaches that rely on complex multi-agent orchestration to parallelize workloads, our method leverages intrinsic parallel tool calling to facilitate effective coordination within a single reasoning step. We demonstrate that scaling width significantly improves performance on deep research benchmarks while reducing the number of turns required to obtain correct answers. Furthermore, we analyze the factors driving these improvements through case studies and explore various tool call schedulers to optimize parallel tool calling strategy. Our findings suggest that optimizing the trade-off between width and depth is a critical pathway toward high-efficiency deep research agents. Notably, without context management or other tricks, we obtain 62.2% accuracy with GPT-5-Medium on BrowseComp, surpassing the original 54.9% reported by GPT-5-High.

Summary

  • The paper introduces the W&D framework that enables parallel tool calling per reasoning step, reducing latency and token usage.
  • It demonstrates significant efficiency gains with a 35.9% reduction in processing cost and improved accuracy across multiple benchmarks.
  • Dynamic scheduling strategies, especially descending scheduler, enhance performance via redundant verification and effective query decomposition.

Scaling Parallel Tool Calling for Efficient Deep Research Agents: Analysis of the W&D Framework

Overview and Motivation

The paper "W&D: Scaling Parallel Tool Calling for Efficient Deep Research Agents" (2602.07359) introduces a novel approach for enhancing agentic systems tasked with complex, multi-step research and information-seeking workflows. While prior work predominantly focuses on scaling depth — increasing the number of sequential reasoning steps and tool calls — this study investigates the complementary dimension of width: issuing multiple tool calls simultaneously within a single reasoning step. The authors propose the Wide and Deep (W&D) agent framework, systematically evaluating the impact of parallel tool calling on both proprietary and open-source LLMs across challenging deep research benchmarks. The methodology is grounded in rigorous empirical analysis, including ablation studies on tool call schedulers and case studies exploring the roots of accuracy gains.

Figure 1: (a) Visual comparison of sequential (single) versus parallel tool calling traces in deep research agents. (b) Performance and iteration statistics showing consistent gains and efficiency from increasing the number of parallel tool calls per step.

Methodology: Parallel Tool Calling Formalization and Scheduling

The W&D agent extends traditional sequential agent traces by formalizing a parallel agent trace $\tau_{\text{par}}$, in which the LLM generates a set $\mathcal{A}^{\text{par}}_t$ of $m$ concurrent tool calls per step. These tool calls are executed in parallel, reducing wall-clock latency and token consumption due to fewer LLM reasoning steps per tool call.
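Under standard agent-trace notation, the contrast between the two trace types can be written as follows (a plausible rendering; the paper's exact formalism may differ in details):

```latex
% Sequential trace: one tool call a_t and observation o_t per reasoning step r_t
\tau_{\text{seq}} = (r_1, a_1, o_1, r_2, a_2, o_2, \dots)

% Parallel trace: a set of m concurrent calls per step, observed jointly
\mathcal{A}^{\text{par}}_t = \{a_t^{(1)}, \dots, a_t^{(m)}\}, \qquad
\mathcal{O}_t = \{o_t^{(1)}, \dots, o_t^{(m)}\}

\tau_{\text{par}} = (r_1, \mathcal{A}^{\text{par}}_1, \mathcal{O}_1,
                    r_2, \mathcal{A}^{\text{par}}_2, \mathcal{O}_2, \dots)
```

The key efficiency claim follows directly: each reasoning step $r_t$ is amortized over $m$ tool calls instead of one, so fewer LLM turns are needed for the same number of tool executions.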

To maintain precise control over tool call counts, the agent is prompted each iteration with a user message specifying the number of calls required, a strategy found to outperform static system instructions. The framework is fully compatible with modern proprietary LLMs (GPT-5, Gemini 3.0, Claude 4.5) and open-source models (DeepSeek-V3.2, Qwen3-235B), evaluated in the MCP-Universe agent environment with standardized tool access (search, scraping, Python interpreter).
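A single "wide" step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `search` tool, the tool-call dictionaries, and the thread-pool execution model are all assumptions standing in for the MCP-Universe tooling.

```python
# Sketch of one "wide" reasoning step: the LLM emits m tool calls in a
# single turn and they are executed concurrently. A per-turn user message
# (e.g. "Make exactly 3 tool calls this turn.") controls the width.
from concurrent.futures import ThreadPoolExecutor


def run_parallel_step(tool_calls, tools):
    """Execute all tool calls from one reasoning step concurrently."""
    def dispatch(call):
        name, args = call["name"], call["args"]
        return tools[name](**args)

    with ThreadPoolExecutor(max_workers=len(tool_calls)) as pool:
        # pool.map preserves call order, so observations align with calls.
        return list(pool.map(dispatch, tool_calls))


# Example: three independent searches issued in one step (a toy tool).
tools = {"search": lambda query: f"results for {query!r}"}
calls = [{"name": "search", "args": {"query": q}}
         for q in ["UN report 2024", "World Bank data", "IMF outlook"]]
observations = run_parallel_step(calls, tools)
print(observations[0])  # results for 'UN report 2024'
```

All observations return to the model in the same turn, which is what lets a single reasoning step compare and cross-check multiple sources.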

Empirical Results: Accuracy and Efficiency Gains

Parallel tool calling — scaling width — yields consistent improvements in accuracy and efficiency across all tested benchmarks. On the BrowseComp dataset, parallel tool calling with three tools per turn yields 68% accuracy, reducing average turns required (23.8 vs 45.7 for single tool calling) and lowering both computation and API costs; for example, a 35.9% reduction in total processing cost and a 40.6% reduction in wall-clock time are achieved for 100 samples.

Figure 2: BrowseComp performance: accuracy versus average number of turns (left), and accuracy versus number of parallel tools per turn (right); parallel tool calling reduces turn count and increases accuracy.


Figure 3: Top row: GPT-5-medium performance scaling across multiple benchmarks with parallel tool calling. Bottom row: Cross-model analysis in BrowseComp validating parallel tool calling benefits.

Crucially, these benefits generalize to multiple benchmarks (BrowseComp, HLE, GAIA) and models, although the effect is less pronounced in open-source models, implying the need for further training optimization for agentic behaviors.

A notable finding is highlighted: GPT-5-Medium achieves 62.2% accuracy on the full BrowseComp set using parallel tool calling, outperforming the original GPT-5-High result (54.9%) despite using no context management or other specialized tricks.

Analysis: Sources of Accuracy Gains in Parallel Tool Calling

Manual agent trace inspection identifies three key contributing factors to the performance improvement:

  • Credible Information Aggregation: Parallel tool calling expands the search scope, allowing reasoning steps to compare sources and select authoritative information, mitigating erroneous reliance on a single API or database.
  • Redundancy and Verification: Multiple tool calls serve as mutual verification points; if scraping or summarization tools return inconsistent data, the agent is able to detect unreliability and reinitiate searches, thus avoiding propagation of hallucinated or faulty data.
  • Query Decomposition: The agent can break complex queries into simpler, parallelizable sub-queries, improving retrieval effectiveness for constraint-heavy or multi-faceted information requests.
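The redundancy-and-verification mechanism above can be sketched as a simple agreement check over parallel observations. This is an illustrative sketch, not the paper's method: the agent in the paper performs this comparison implicitly within its reasoning step, whereas here it is made explicit as a majority vote.

```python
# Illustrative sketch: treating redundant parallel results as mutual
# verification points. If the parallel observations disagree, the answer
# is treated as unreliable and the agent should reinitiate searches.
from collections import Counter


def verify_by_redundancy(observations, min_agreement=2):
    """Return the majority answer if enough parallel calls agree, else None."""
    counts = Counter(observations)
    answer, votes = counts.most_common(1)[0]
    return answer if votes >= min_agreement else None


# Two scrapers agree; a third returned an inconsistent (possibly
# hallucinated) figure, which the majority vote screens out.
obs = ["population: 5.4M", "population: 5.4M", "population: 9.1M"]
print(verify_by_redundancy(obs))  # population: 5.4M

# No agreement at all -> signal the agent to re-search.
print(verify_by_redundancy(["A", "B", "C"]))  # None
```

With sequential (single) tool calling, no such cross-check is possible within a turn, which is why a faulty extraction can silently propagate into the final answer.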

Tool Call Scheduling: Balancing Exploration and Exploitation

Beyond constant-width parallel tool calling, the paper explores dynamic tool call scheduling strategies, denoting the number of calls per turn as $m_t$. Tested strategies include constant, ascending, descending, and automatic (LLM-decided) schedulers.
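The scheduler families can be sketched as simple functions of the turn index. The exact widths and ramp rates below are illustrative assumptions, not the paper's configurations:

```python
# Sketch of the scheduler families: m_t as a function of the turn index.
# max_width=3 and the ramp of one width step per 5 turns are assumptions.
def scheduler(strategy, turn, max_width=3):
    """Return m_t, the number of parallel tool calls for a given turn."""
    if strategy == "constant":
        return max_width
    if strategy == "ascending":       # start narrow, widen later
        return min(1 + turn // 5, max_width)
    if strategy == "descending":      # explore first, then narrow down
        return max(max_width - turn // 5, 1)
    raise ValueError("'automatic' width is decided by the LLM itself")


# Descending: wide early turns for exploration, single calls later.
widths = [scheduler("descending", t) for t in range(15)]
print(widths)  # [3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1]
```

The descending profile encodes the explore-then-exploit intuition directly: broad parallel search while the answer space is still open, narrowing to focused single calls once candidates are identified.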

Key results:

  • Descending scheduler (explore-then-exploit, more calls early, fewer later) achieves the highest accuracy (74% with 23.5 average turns), outperforming both constant and ascending strategies.
  • The Automatic scheduler (LLM-decided) does not surpass the descending schedule, indicating that current LLMs cannot yet balance tool call width autonomously.

    Figure 4: Average number of tool calls per turn for different tool call schedulers, illustrating adherence to specified strategies and effect on agent behavior.

Practical and Theoretical Implications

The study demonstrates that scaling tool execution width via parallel tool calling is a critical axis for agentic optimization, often overlooked compared to context or reasoning depth scaling. The efficiency gains (reduced API cost, latency) are substantial, and the increased accuracy is rooted in redundancy, source verification, and fine-grained decomposition — mechanisms not inherent to the sequential agent design.

Practically, integrating parallel tool execution into agent frameworks is straightforward, requiring only prompting modifications rather than complex orchestration or sub-agent architecture. The demonstrated performance gains suggest parallel tool calling should be adopted as a default for high-throughput, accuracy-sensitive agentic use cases.

Theoretically, the inability of current LLMs to autonomously optimize the width-depth trade-off highlights a gap in agentic control learning, potentially addressable through reinforcement learning or dedicated scheduling objective functions. As proprietary and open-source LLMs evolve, training paradigms should include agentic tool call scheduling and reasoning coordination, particularly for multi-tool and multi-path environments.

Conclusion

The W&D agent framework offers a robust empirical and theoretical foundation for scaling parallel tool calling in deep research agents. Parallel execution not only enhances efficiency but also accuracy, driven by improved credibility verification, redundancy against tool failures, and query decomposition. Dynamic tool call scheduling, especially explore-then-exploit descent, further boosts agent effectiveness, though current LLMs require explicit prompting to leverage these strategies. Future research should focus on autonomous scheduling and reasoning coordination, laying the groundwork for highly efficient, adaptive agentic systems in complex research tasks.


Explain it Like I'm 14

Overview

This paper introduces a new way to make “deep research agents” (AI systems that look things up on the web, think in steps, and write answers) faster and more accurate. The idea is called “Wide and Deep,” which means:

  • Deep: the agent can take many steps to think and use tools.
  • Wide: the agent can call several tools at the same time in one step (parallel tool calling), instead of just one tool per step.

The authors show that widening the agent—by making multiple tool calls in parallel—helps it find correct answers more quickly and cheaply, not just by adding more thinking steps.

Key Questions

The paper asks simple but important questions:

  • If the agent makes several tool calls at the same time, does it get better answers and finish faster?
  • How many parallel tool calls per step is best?
  • Why does parallel tool calling improve accuracy?
  • Does this approach work across different AI models and different types of research problems?
  • Can we plan the number of tool calls per step smartly, instead of keeping it fixed?

How They Did It (Methods)

Think of the agent like a careful student doing a research project:

  • It “reasons” (plans what to do next).
  • It uses “tools,” like:
    • A search engine to find pages.
    • A scraper that reads a webpage and summarizes the part you care about.
    • A code tool (a Python interpreter) to do calculations or process data.

To test their ideas, the authors:

  • Used a standard agent setup (MCP-Universe) with those tools.
  • Tested on three tough research benchmarks:
    • BrowseComp: challenges that require browsing the web.
    • HLE (Humanity’s Last Exam): difficult text-only questions.
    • GAIA: general assistant tasks (text-only subset).
  • Controlled how many tools the agent must call in each step (for example, 1, 2, 3, 5, or 8 tools at once). They did this by giving clear instructions in the agent’s prompt, so the agent consistently made the requested number of tool calls.
  • Measured accuracy (how often answers were correct), number of turns (steps), time, and cost.
  • Tried different “schedules” for tool calls per step, like always 3 tools, starting with 3 then reducing to 1, starting with 1 then increasing to 3, and letting the AI decide on its own.
  • Read through agent logs to see why parallel calls helped.

What They Found (Results)

Parallel tool calling made a big difference:

  • Better accuracy with fewer steps:
    • On BrowseComp, using 3 parallel tool calls per step improved accuracy and reduced the number of turns needed.
    • Example: To reach about 66–68% accuracy across 100 tasks, single-tool calling cost $102.5 and took 1,522.6 seconds. With 3 parallel tools per step, the agent reached 68% accuracy for $65.7 and 904.2 seconds. That's roughly a 36% cost reduction and a 41% time reduction.
  • Works across different models and benchmarks:
    • The gains showed up with multiple advanced AI models (like GPT-5-Medium, Gemini, and Claude) on different datasets.
    • On the full BrowseComp set, GPT-5-Medium reached 62.2% accuracy—higher than the reported 54.9% from GPT-5-High—without fancy “context management” tricks.
    • Open-source models saw smaller gains, which suggests they could be trained better to use parallel tool calls.
  • Smarter scheduling helps:
    • A “Descending” schedule (many tool calls at the beginning, fewer later) performed best, reaching 74% accuracy on BrowseComp, beating “always 3 tools” (68%).
    • An “Ascending” schedule (starting low, increasing later) did worst.
    • Letting the AI decide the number of calls (“Automatic”) did not beat the Descending schedule.
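The cost and time reductions quoted in the results above follow directly from the example figures, and can be checked in a couple of lines:

```python
# Reproducing the reported reductions from the example figures above:
# single-tool: $102.5 and 1,522.6 s; three parallel tools: $65.7 and 904.2 s.
cost_single, cost_parallel = 102.5, 65.7
time_single, time_parallel = 1522.6, 904.2

cost_reduction = (cost_single - cost_parallel) / cost_single
time_reduction = (time_single - time_parallel) / time_single
print(f"{cost_reduction:.1%} cost, {time_reduction:.1%} time")  # 35.9% cost, 40.6% time
```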

Why Parallel Tool Calling Improves Accuracy

These patterns explain the accuracy boost:

  • Wider exploration leads to better sources: Calling multiple searches at once surfaces more viewpoints and lets the agent pick more trustworthy sources (like official UN reports) instead of risky, unofficial ones.
  • Redundancy helps catch errors: If one tool fails or returns a made-up answer, different parallel calls can disagree, signaling the agent to re-check or search again.
  • Breaking complex queries into simpler pieces: Instead of one overloaded search, the agent issues several simple, focused queries in parallel, which makes search engines return more relevant results.

Why It Matters (Implications)

This work shows that the “width” of agent actions—how many tools it uses at once—matters as much as the “depth” (how many steps it takes). The practical impact is clear:

  • Faster, cheaper, and more accurate research agents: Useful for students, journalists, analysts, and scientists who need to gather and verify information quickly.
  • Better strategy design: Start broad (many parallel tool calls to explore), then narrow down (fewer calls to focus), which matches how good human researchers work.
  • Future training: Teach AI models to choose the right number of tool calls automatically, balancing speed, cost, and accuracy. This could unlock even stronger “deep research” abilities without complicated multi-agent systems.

In short, doing more at once—when done thoughtfully—helps research agents find the right answers faster and more reliably.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

The following are the key unresolved issues that future research could address to strengthen, generalize, and improve the reliability of the proposed parallel tool calling approach for deep research agents:

Experimental scope and evaluation rigor

  • Limited evaluation scope: most BrowseComp and HLE results are on the first 100 samples; GAIA and HLE are restricted to text-only subsets. Assess generalization on full datasets and multimodal settings, and report per-task-type breakdowns.
  • Lack of statistical significance: no confidence intervals, variance across runs, or significance tests; search results are non-deterministic. Run multiple seeds/reruns with randomized search ordering and report variability.
  • Metrics beyond accuracy: no systematic evaluation of citation quality, source credibility, justification quality, or error taxonomy. Add metrics for evidence quality, consistency, and calibration of confidence.
  • Incomplete cost and latency accounting: cost/latency reductions are reported on small subsets without comprehensive token- and wall-clock accounting across all components (LLM, search, scraping/summarization, retries). Provide a detailed cost model including tool latencies, failure retries, and rate-limit backoffs.

Parallel tool calling mechanics and aggregation

  • Aggregation strategy is implicit: parallel outputs are simply fed back to the LLM; no explicit deduplication, confidence scoring, contradiction detection, or citation ranking. Develop and evaluate structured aggregation modules (e.g., source voting, cross-checking, credibility weights).
  • When width hurts performance: accuracy declines at high width for large max-turn limits, but no systematic characterization of failure modes (e.g., distraction, context dilution, duplicate queries). Quantify conditions where extra parallel calls reduce quality and why.
  • Quality of tool arguments at higher width: no analysis of whether argument quality degrades as m increases (e.g., redundant or poorly decomposed queries). Measure per-call uniqueness, overlap, and downstream utility as a function of m.
  • Interaction with context size: parallel calls produce many observations per turn; the paper claims no context management but does not analyze truncation effects or token budgets. Evaluate context overflow, truncation policies, and their impact on correctness.

Scheduler design and learning

  • Heuristic scheduler only: Descending outperforms others, but the “Automatic” scheduler is a simple prompt rule. Explore learned schedulers (bandits, RL, dynamic programming) that optimize width-depth trade-offs conditioned on task state and budget.
  • Generalization of scheduler gains: scheduler comparison is shown on BrowseComp only. Test whether Descending (or learned policies) generalize to HLE/GAIA and to different toolsets and models.
  • Progress estimation is uncalibrated: the Automatic strategy relies on self-reported progress without calibration. Investigate progress predictors trained on outcome data and study their calibration and effect on scheduling decisions.
  • Budget-aware control: scheduling does not explicitly incorporate API budgets, rate limits, or expected tool latency distributions. Design schedulers that optimize under explicit cost/latency constraints.

Tools, pipelines, and environment dependencies

  • Single summarization model in scraping: reliance on Gemini-2.5-Flash for extraction introduces an unmeasured hallucination risk; no ablation of different summarizers, deterministic settings, or non-LLM extractors. Evaluate alternative extractors and quantify hallucination rates.
  • Tool diversity and composition: experiments focus on search/scrape (and code for HLE) without exploring richer toolsets (structured APIs, knowledge bases) or parallel mixes (e.g., search+scrape in the same step). Study cross-tool composition patterns and their effects.
  • Search engine dependency: only Google (via Serper) is used. Test Bing, OpenSearch backends, and domain-specific engines to assess robustness to engine choice and ranking.
  • Vendor-specific parallel-call semantics: parallel function-calling behaviors differ across providers. Document and compare vendor semantics, adherence, and failure modes to ensure reproducibility across platforms.

Robustness, reliability, and reproducibility

  • Adherence to call counts: the paper states calls are “very close” to specified m but provides no adherence rates or penalties for deviations. Quantify adherence, over/under-calling frequency, and its impact on outcomes.
  • Tool failure handling: case studies show resilience to scraping failures, but there is no quantitative evaluation of failure/retry policies (HTTP errors, captchas, rate limits). Measure failure rates, retry strategies, and robustness under adverse conditions.
  • Web drift and time sensitivity: no controls for changing web content over time. Include time-stamped caches or snapshot-based evaluation to separate algorithmic improvements from content variability.
  • Reproducibility of pipelines: many moving parts (LLM, search, scraping LLM, interpreter). Release detailed configs, versions, prompts, temperature settings, and seeds; quantify run-to-run variance.

Model dependence and training questions

  • Proprietary vs. open-source gap: open-source models show small gains with parallel calls, but the causes (training data, function-calling competence, planning) are not analyzed. Diagnose failure modes and identify training data/methods to improve parallel tool planning in open models.
  • No training to aggregate or schedule: the approach is purely prompted; no finetuning to improve query decomposition, aggregation, or scheduling. Study supervised/RL training signals (e.g., source reliability, redundancy penalties) to learn these skills.
  • Combining with parallel reasoning or multi-agent orchestration: the method is compared qualitatively but not combined with these alternatives. Explore hybrid systems that jointly parallelize tool calls and reasoning paths with principled aggregation.

Domain coverage and generalization

  • Beyond web information seeking: the method is not tested on domains like code synthesis with API calls, scientific literature querying, or structured retrieval tasks. Evaluate in non-web, API-centric, and domain-specific tool ecosystems.
  • Multimodal tasks are excluded: text-only subsets are used for HLE/GAIA; parallel tool calling for vision+text (e.g., document parsing, charts) remains an open area.

Efficiency and systems considerations

  • Parallelism limits and system constraints: experiments explore width up to 8 but do not quantify backend limits (rate limits, concurrent connections) or diminishing returns vs. infrastructure constraints. Characterize throughput/latency ceilings and adaptive throttling.
  • Token accounting per step: claims about amortized reasoning cost lack per-turn token breakdowns as m varies. Provide token-per-turn and total-token analyses, including tool-result ingestion costs.
  • Throughput vs. latency trade-offs: end-to-end latency is reported, but not the impact on throughput under shared budgets or batch settings. Evaluate in multi-task queues with rate-limited APIs.

Safety, compliance, and ethics

  • Compliance with site policies: parallel scraping may violate robots.txt or ToS; there is no discussion of compliance mechanisms or throttling policies. Define and evaluate policy-aware crawling strategies.
  • Misinformation and bias: more sources do not guarantee better truthfulness. Develop automated source credibility checks, bias detection, and adversarial robustness tests for malicious or SEO-gamed pages.
  • Code interpreter risks: for HLE, the interpreter is available but security/sandboxing and misuse potentials are not discussed. Specify sandboxing, resource limits, and audit logs.

These gaps suggest concrete directions: broaden and deepen evaluation with robust metrics and statistics; build explicit aggregation and verification modules; learn dynamic scheduling policies; diversify and ablate tool pipelines; stress-test robustness and reproducibility; and extend to new domains and modalities under realistic system and ethical constraints.

Practical Applications

Immediate Applications

Below are practical use cases that can be deployed now using the paper’s findings on parallel tool calling, width-depth trade-offs, and tool-call schedulers. Each item includes sector alignment, potential tools/products/workflows, and feasibility notes.

  • Enterprise market and competitive intelligence assistant
    • Sector: Finance, Enterprise, Software
    • What it does: Automates market scans and competitor profiling by issuing parallel searches and scrapes across company sites, filings, news, and analyst reports, then verifies and consolidates findings.
    • Tools/workflows: Search + scraping (e.g., Serper + Jina), Gemini-Flash summarization, “Descending” scheduler (explore-then-exploit), provenance logging; integrates with MCP-Universe or vendor APIs (OpenAI/Gemini/Claude).
    • Assumptions/dependencies: Parallel function calling support; API rate limits and quotas; access to paywalled sources via licenses; governance for source credibility; internal review for high-stakes outputs.
  • Legal/compliance research copilot
    • Sector: Legal, Policy, Risk
    • What it does: Parallel queries across statutes, case law, regulatory guidance; cross-verifies tool outputs to reduce reliance on single (possibly faulty) retrievals.
    • Tools/workflows: Query decomposition into narrower facets; redundant extractions for verification; “Descending” scheduler to front-load exploration; report templating.
    • Assumptions/dependencies: Licensed access (e.g., Westlaw/Lexis); strict human-in-the-loop; citation and provenance requirements; jurisdictional coverage; vendor parallel-call support.
  • Scientific and clinical literature review assistant
    • Sector: Healthcare, Academia
    • What it does: Conducts systematic evidence scans (PubMed, guidelines, preprints, registries) with parallel queries and redundancy checks to avoid hallucinated summaries.
    • Tools/workflows: Search + scraping + code interpreter (basic meta-analysis scripting), query decomposition over PICO facets; source credibility filters; staged exploration then narrowing.
    • Assumptions/dependencies: Access to biomedical databases; qualified expert validation; caution for clinical decision-making; data-use agreements and ethics compliance.
  • Customer support knowledge retrieval accelerator
    • Sector: Software, Customer Experience
    • What it does: Parallel searches across internal docs, tickets, FAQs, and forums; uses redundancy to detect unreliable tool outputs; returns consolidated answers faster.
    • Tools/workflows: Internal search connectors; scraping of docs; “Descending” scheduler to reduce time-to-first-correct answer; answer provenance in support portals.
    • Assumptions/dependencies: Access to internal repositories; concurrency controls; PII handling; content freshness mechanisms.
  • Investigative journalism research tool
    • Sector: Media
    • What it does: Parallel source gathering (news archives, official reports, databases); query decomposition for complex constraints; cross-source verification for credibility.
    • Tools/workflows: Explore-then-exploit scheduling; source scoring; automatic citation collection; MCP-compatible single-agent integration.
    • Assumptions/dependencies: Respect robots.txt and site terms; editorial standards; fact-check workflows; mixed paywalled/public sources.
  • Government procurement due diligence and policy analysis
    • Sector: Public Sector
    • What it does: Parallel OSINT on vendors, contracts, performance reports; verification via redundant tool calls to avoid acting on faulty extractions.
    • Tools/workflows: Search + scraping; structured report generation; “Descending” scheduler; audit trail for each source and tool call.
    • Assumptions/dependencies: Legal compliance (FOIA, open-data licenses); human oversight; access to public procurement portals; vendor API quotas.
  • E-commerce product comparison and buying assistant
    • Sector: Consumer, Retail
    • What it does: Parallel scraping of specification pages and reviews; query decomposition for price, warranty, and feature constraints; verification for inconsistent tool results.
    • Tools/workflows: Search + scraping; Gemini-Flash summarization; exploration-heavy early turns; final aggregation with transparent source links.
    • Assumptions/dependencies: Anti-scraping policies; dynamic site rendering; possible need for retailer APIs; privacy and consent for review data.
  • Travel planning and itinerary optimization
    • Sector: Consumer, Travel
    • What it does: Parallel searches for flights, hotels, attractions; decomposes queries (dates, neighborhoods, budgets) to improve recall and quality; verifies conflicting outputs.
    • Tools/workflows: Travel APIs + scraping; “Descending” scheduler; budget-aware prompts with countdown messages to control steps and costs.
    • Assumptions/dependencies: API access and rate limits; variability in pricing data; need for live availability; compliance with terms of service.
  • Software engineering research and knowledge mining
    • Sector: Software, DevOps
    • What it does: Parallel querying across code search, issues, PRs, design docs to identify patterns or prior art; verifies tool outputs before proposing solutions.
    • Tools/workflows: Connectors to code hosting and ticketing systems; query decomposition by component/module; provenance tags; “Descending” scheduler for efficient triage.
    • Assumptions/dependencies: Repository access permissions; confidentiality; integration maintenance; token budget awareness.
  • Data operations and web ingestion pipelines
    • Sector: Data Engineering
    • What it does: Parallel web scrapes and extractions for ETL; redundancy to detect extraction failures/hallucinations; improved throughput and lowered latency/costs.
    • Tools/workflows: Parallel tool-call orchestrator; verification layer; schema mapping; explore early, exploit later schedule.
    • Assumptions/dependencies: Concurrency controls; deduplication; licensing-compliant ingestion; observability and retry logic.
  • Agent cost and latency optimization in AI tooling
    • Sector: AI Platforms, MLOps
    • What it does: Reduces turns and token usage with parallel tool calling; integrates “Descending” scheduler to cut wall-clock time and API costs (e.g., ~40% time and ~36% cost reductions observed in BrowseComp scenarios).
    • Tools/workflows: Tool Call Scheduler SDK; countdown prompts; per-turn budget visibility; parallel-call caps per step.
    • Assumptions/dependencies: Vendor support for parallel calls; robust monitoring (tokens, latency, error rates); safe fallbacks when tools fail.
  • Education research assistant for students and librarians
    • Sector: Education
    • What it does: Parallel queries across academic search engines and repositories; uses redundancy to avoid faulty extractions; compiles annotated bibliographies.
    • Tools/workflows: Search + scraping; citation manager integration; explore-first scheduling; plagiarism detection integration.
    • Assumptions/dependencies: Access to academic indexes; institutional licenses; academic integrity policies; human supervision for grading or instruction.

Long-Term Applications

These use cases require additional research, scaling, or development (e.g., multimodality, RL training, standardization, or regulatory approvals) before broad deployment.

  • Autonomic width–depth scheduler via learning
    • Sector: AI Tooling, Research
    • What it could do: Train LLM agents (e.g., with RL) to dynamically decide the number of tool calls per turn (explore-then-exploit) instead of relying on fixed heuristics, optimizing cost/latency/accuracy jointly.
    • Tools/products: “Scheduler RL” module; reward shaping for credibility and efficiency; benchmarking harnesses (BrowseComp/HLE/GAIA).
    • Assumptions/dependencies: Suitable training data and signals; model updates to support adaptive control; safety constraints to prevent abuse of parallel calls.
  • Multimodal parallel tool calling for complex documents
    • Sector: Healthcare, Legal, Media
    • What it could do: Parallel parsing of PDFs, tables, images, audio, and video; cross-modal verification; improved evidence synthesis and compliance checks.
    • Tools/products: Multimodal LLM integration; document parsers; table extraction; citation tracking; provenance across modalities.
    • Assumptions/dependencies: Robust multimodal models; domain adapters; consent and licensing; evaluation datasets covering multimodality.
  • Parallel tool calling protocol standardization
    • Sector: Software, AI Infrastructure
    • What it could do: Define a vendor-agnostic spec (building on MCP) for parallel tool calls, scheduling, provenance, and verification, enabling consistent interoperability across agent frameworks.
    • Tools/products: “Parallel Tool Calling Protocol” (PTCP); open-source SDKs; conformance tests.
    • Assumptions/dependencies: Broad vendor adoption; governance and versioning; security and audit requirements.
  • Source credibility scoring and provenance ledger
    • Sector: Media, Policy, Academia
    • What it could do: Systematically score sources using redundancy, cross-consistency, and reputation; maintain a provenance ledger for auditability and reproducibility.
    • Tools/products: Credibility scoring engine; provenance store; UI for auditors/editors; integration with report generation.
    • Assumptions/dependencies: Trust frameworks; community-agreed scoring criteria; mitigation for bias; data retention policies.
  • Automated systematic review and meta-analysis pipelines
    • Sector: Academia, Healthcare
    • What it could do: End-to-end systematic reviews with parallel search, verification, data extraction, and code-based meta-analysis; reproducibility at scale.
    • Tools/products: ReviewOps platform; extraction validators; statistical modules; registered protocols and audit trails.
    • Assumptions/dependencies: Human oversight; registry integration (e.g., PROSPERO); ethical approvals; domain-specific guidelines.
  • Scalable OSINT automation with cross-lingual support
    • Sector: Security, Government
    • What it could do: Parallel multilingual queries across public datasets and platforms; verification to reduce false positives; risk scoring.
    • Tools/products: Cross-lingual search/scrape; translation and normalization; provenance-aware dashboards.
    • Assumptions/dependencies: Legal/ethical constraints; multilingual model quality; platform terms of service; robust deconfliction processes.
  • Enterprise knowledge orchestration platform
    • Sector: Enterprise IT
    • What it could do: Orchestrate internal/external sources with parallel tool calls, verification layers, and adaptive scheduling; unify research workflows across teams.
    • Tools/products: ResearchOps hub; policy-based governance; budget-aware scheduling; observability for tool-call health.
    • Assumptions/dependencies: Data security and access control; change management; integration with existing KM systems.
  • Curriculum and syllabus design assistants
    • Sector: Education
    • What it could do: Parallel retrieval of pedagogical resources, standards, and assessments; source verification; alignment with learning objectives.
    • Tools/products: Education content pipelines; alignment engines; citation and licensing management.
    • Assumptions/dependencies: Pedagogical oversight; licensing for educational materials; bias mitigation.
  • Training open-source LLMs for effective parallel tool use
    • Sector: AI Research, Open Source
    • What it could do: Improve open-source models’ ability to orchestrate parallel calls (which currently yields limited gains relative to proprietary models), including tool consistency, verification, and scheduling.
    • Tools/products: Datasets and curricula emphasizing parallel tool calling; evaluation suites; fine-tuning recipes.
    • Assumptions/dependencies: Compute budgets; community collaboration; synthetic data generation for complex research tasks.
  • Energy and climate market monitoring
    • Sector: Energy, Finance
    • What it could do: Parallel retrieval of operator data, emissions reports, policy updates; verification across sources; early warning analytics.
    • Tools/products: Sector connectors; unit normalization; query decomposition by instrument/region/time; provenance dashboards.
    • Assumptions/dependencies: Access to market/operator APIs; complex domain schemas; regulatory compliance.
  • Clinical decision support with verified evidence (cautious deployment)
    • Sector: Healthcare
    • What it could do: Long-term, tightly governed CDS leveraging parallel evidence synthesis with strict verification and provenance for safety-critical contexts.
    • Tools/products: Verified evidence layers; clinician UI with confidence and source scores; regulatory-grade audit trails.
    • Assumptions/dependencies: Regulatory approvals; rigorous validation; integration with EHR systems; clinical governance and liability frameworks.
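The explore-then-exploit scheduling idea referenced above (issue many parallel tool calls early, then narrow as the trace deepens) can be sketched as a turn-indexed width budget. The function names and the linear decay rule below are illustrative assumptions, not the paper's exact policy.

```python
from concurrent.futures import ThreadPoolExecutor


def descending_width(turn: int, max_width: int = 8, min_width: int = 1) -> int:
    """Explore-then-exploit: permit many parallel tool calls in early turns,
    tapering toward a single call as the trace deepens.
    The linear decay is an illustrative choice."""
    return max(min_width, max_width - turn)


def run_turn(tool_calls, turn: int, max_width: int = 8):
    """Execute up to descending_width(turn) tool calls concurrently and
    return their observations in submission order."""
    width = descending_width(turn, max_width)
    batch = tool_calls[:width]
    with ThreadPoolExecutor(max_workers=max(1, width)) as pool:
        return list(pool.map(lambda call: call(), batch))
```

With this schedule, turn 0 admits the full width budget while later turns are throttled toward sequential, exploitation-style calls; a learned scheduler would replace `descending_width` with a policy trained on cost/latency/accuracy signals.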

Glossary

  • Action A_t: The single tool call issued by the agent at step t in a sequential trace. "a typical LLM outputs a reasoning thought R_t and a single tool call A_t."
  • Agent trace: The ordered record of the agent’s reasoning steps, tool calls, and observations across an interaction. "\Cref{fig:main_figure} shows the difference in agent trace between the single tool calling and parallel tool calling."
  • Amortization (of reasoning cost): Reducing average computational cost by sharing or consolidating reasoning across multiple actions. "it amortizes the computational cost of reasoning; we condense m distinct reasoning traces into a single R_t, significantly reducing the number of decoding tokens and LLM generation latency."
  • BrowseComp: A benchmark for evaluating browsing and information-seeking agents. "We conduct our experiments across three widely adopted deep-research benchmarks: 1) BrowseComp \citep{wei2025browsecomp}, 2) Humanity's Last Exam (HLE) \citep{phan2025humanity}, and 3) GAIA \citep{mialon2023gaia}."
  • Chain-of-thought reasoning: A technique where models generate explicit intermediate reasoning steps before answering. "The work of~\citep{pan2025learning} proposes to replace serialized chain-of-thought reasoning with coordinated parallel reasoning to improve latency; however, it does not target the agent setting where tool calls are incurred."
  • Context management: Strategies to control, prune, or organize long interaction histories and information within the model’s context window. "Consequently, recent works have focused on scaling context length or implementing smarter context management strategies to handle deeper reasoning and more tool calls."
  • End-to-end latency: The total time from initiating an agent’s process to producing the final output, including thinking and tool execution. "In summary, parallel tool calling optimizes both the end-to-end latency of the agent rollout and the cost of LLM API usage by reducing both the iteration count and total token consumption."
  • Explore-then-exploit: A strategy that starts with broad exploration and later focuses on the most promising directions. "Ascending achieves the worst performance while Descending achieves the best, suggesting that the explore-then-exploit strategy contributes to better performance."
  • Function calling: The mechanism by which an LLM invokes external tools via structured function interfaces. "You have access to various specialized tools through function calling"
  • GAIA: A benchmark for general AI assistants covering diverse tasks. "We conduct our experiments across three widely adopted deep-research benchmarks: 1) BrowseComp \citep{wei2025browsecomp}, 2) Humanity's Last Exam (HLE) \citep{phan2025humanity}, and 3) GAIA \citep{mialon2023gaia}."
  • Hallucination: Fabrication of content by a model or tool despite insufficient or incorrect input. "The tool’s internal summarization model, however, hallucinated an answer based on this empty input."
  • Humanity's Last Exam (HLE): A challenging benchmark for evaluating research agents, used here via its text-only subset. "Because the evaluated models are not multimodal, we restrict our testing to the text-only subsets of HLE (2,158 samples) and GAIA (103 samples)."
  • JINA API: A web scraping service used to fetch and process webpage content. "We use JINA API for scraping"
  • LLM: A transformer-based model trained on large corpora to perform language tasks, often with tool-use capabilities. "State-of-the-art LLMs increasingly support the capability of parallel tool calling"
  • MCP-Universe: An agent framework based on the Model Context Protocol, used for benchmarking tool-enabled LLMs. "We adopt the agent framework from MCP-Universe~\citep{mcpuniverse} for evaluation."
  • Multimodal: Capable of processing multiple input types (e.g., text, images) rather than text-only. "Because the evaluated models are not multimodal, we restrict our testing to the text-only subsets of HLE (2,158 samples) and GAIA (103 samples)."
  • Observation O_t: The environment’s feedback returned to the agent after a tool call at step t. "The environment executes this call and returns an observation O_t."
  • Orchestration: Coordination logic for managing multiple agents or components to work together. "Unlike existing approaches that rely on complex multi-agent orchestration to parallelize workloads"
  • Parallel reasoning: Generating multiple reasoning paths concurrently and aggregating their outcomes. "Longcat~\citep{longcat2026flashthinking} introduced parallel reasoning to generate multiple reasoning paths within the same turn, aggregating the outcomes via summarization."
  • Parallel tool calling: Issuing multiple tool calls simultaneously within a single reasoning step. "Parallel tool calling extends this paradigm by allowing the agent to generate multiple tool calls simultaneously."
  • Parallel-agent reinforcement learning (PARL): A training approach where agents learn to decompose tasks into parallelizable sub-tasks via reinforcement learning. "Kimi-K2.5~\citep{moonshot2026kimik25} proposes an agent swarm framework and trains the model with the newly proposed parallel-agent reinforcement learning (PARL)."
  • Query decomposition: Splitting a complex search request into simpler sub-queries to improve retrieval effectiveness. "Observation 3. Parallel search enhances retrieval effectiveness through query decomposition."
  • Reasoning thought R_t: The model’s internal reasoning content at step t prior to tool calls or answers. "a typical LLM outputs a reasoning thought R_t and a single tool call A_t."
  • Serper API: A Google-based search API used by the agent for web queries. "We use Serper API for search"
  • Sub-agent: A specialized helper agent invoked by a main agent to perform sub-tasks or tool use. "In each iteration, the main agent reasons based on the current context without direct tool calling; instead, a sub-agent is invoked to use tools and solve sub-tasks."
  • Summarization LLM: An LLM used to condense scraped content and extract targeted information. "it scrapes the full content of the webpage and passes it, along with the query, to a summarization LLM to extract the specific information required."
  • Thinking mode: Model configuration controlling explicit reasoning behaviors; here disabled for a faster summary model. "We use Gemini-2.5-Flash (with thinking mode disabled) for the summary model due to its balance of affordability and effectiveness."
  • Token consumption: The number of tokens processed or generated by the LLM, affecting cost and latency. "In summary, parallel tool calling optimizes both the end-to-end latency of the agent rollout and the cost of LLM API usage by reducing both the iteration count and total token consumption."
  • Tool call scheduler: A policy that decides how many tool calls to issue per step to balance exploration and efficiency. "we present a preliminary exploration of different tool call schedulers for parallel tool calling."
  • Wall-clock time: Real elapsed time experienced during execution, including waiting for external tools. "because tool calls within the set \mathcal{A}_t are executed concurrently, the wall-clock time spent waiting for environment feedback is minimized."
  • Width-depth trade-off: Balancing parallel breadth (width) and sequential steps (depth) to optimize efficiency and accuracy. "Our findings suggest that optimizing the trade-off between width and depth is a critical pathway toward high-efficiency deep research agents."
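The glossary entries above (query decomposition, the action set \mathcal{A}_t, observations O_t) can be tied together in a minimal sketch: split a compound query into sub-queries and issue them as one parallel action set. The rule-based split and the `search_fn` callback are illustrative assumptions; in the paper's setting the LLM itself performs the decomposition.

```python
from concurrent.futures import ThreadPoolExecutor


def decompose(query: str) -> list:
    """Naive query decomposition: split a compound request on 'and' into
    simpler sub-queries. This rule-based split is only an illustration;
    an agent would have the LLM propose the sub-queries."""
    return [part.strip() for part in query.split(" and ") if part.strip()]


def parallel_search(query: str, search_fn) -> dict:
    """Issue all sub-queries as one parallel action set A_t and collect
    the observations O_t, keyed by sub-query."""
    subs = decompose(query)
    with ThreadPoolExecutor(max_workers=max(1, len(subs))) as pool:
        results = pool.map(search_fn, subs)
    return dict(zip(subs, results))
```

Because the sub-queries run concurrently, the wall-clock wait is bounded by the slowest search rather than the sum of all, which is the latency benefit the paper attributes to scaling width.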
