- The paper introduces DEEPSYNTH as a benchmark suite that rigorously evaluates deep, multi-step information synthesis from diverse global sources.
- It employs 120 expert-annotated tasks requiring multi-source navigation, planning, and structured tool use to reveal limitations in current LLMs and agentic systems.
- Empirical results show extremely low F1-scores and significant error cascades, underscoring the need for improved planning and synthesis architectures.
DEEPSYNTH is proposed as an agentic benchmarking suite that targets the capability of LLM-based agents to perform deep information synthesis across realistic, multi-step, and globally diverse tasks. Unlike prior benchmarks that center on shallow fact retrieval, synthetic information seeking, or single-region, single-language scenarios, DEEPSYNTH rigorously isolates the setting where correct task completion requires multi-source navigation, integration of structured and unstructured data, planning, multi-step reasoning, and robust tool use. Each task is designed to be unanswerable by parametric recall or surface-level reasoning, directly targeting limitations that persist in state-of-the-art LLMs and agentic systems despite their observed advances in modular tool integration and web interaction.
Benchmark Construction and Task Design
The DEEPSYNTH suite comprises 120 expert-annotated tasks, each sampled from 223 curated official data sources spanning 67 countries and 7 domains (e.g., socio-economics, finance, environment, science, transportation, politics). The hallmark of this benchmark is its rigorous design pipeline: annotators collect sources, hypothesize plausible insights, validate hypotheses with manual analysis, and finally construct verifiable questions paired with stepwise gold reasoning chains. Tasks are formulated such that the ground-truth answers require the agent to:
- Navigate and gather information from an average of 4.2 web pages,
- Integrate facts from one to fifteen documents/tables per task,
- Execute non-trivial operations, including correlation detection, anomaly discovery, relational ranking, and multi-step comparative or arithmetic synthesis,
- Output structured answers (often JSON), strictly verifiable and stable over time.
Tasks are designed to be robust to memorization and immune to contamination, ensuring answers are not directly retrievable from model pretraining data or through simple internet search. Region and domain diversity is a key attribute, resulting in complex, geopolitically non-uniform information spaces.
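Since each task demands a structured, strictly verifiable answer (often JSON), scoring can be automated. The sketch below illustrates one plausible way such a check could work; the function names, the recursive comparison strategy, and the numeric tolerance are assumptions for illustration, not details taken from the paper.

```python
import math


def match_value(pred, gold, rel_tol=1e-3):
    """Compare one predicted value against gold, allowing slight numeric drift."""
    if isinstance(gold, (int, float)) and isinstance(pred, (int, float)):
        return math.isclose(pred, gold, rel_tol=rel_tol)
    if isinstance(gold, list) and isinstance(pred, list):
        # Lists are order-sensitive (e.g., rankings must match position by position).
        return len(pred) == len(gold) and all(
            match_value(p, g, rel_tol) for p, g in zip(pred, gold))
    if isinstance(gold, dict) and isinstance(pred, dict):
        # Every gold field must be present, with no extra or missing keys.
        return set(pred) == set(gold) and all(
            match_value(pred[k], gold[k], rel_tol) for k in gold)
    return str(pred).strip().lower() == str(gold).strip().lower()


def verify_answer(pred_json, gold_json):
    """Strict structured check over the whole (already-parsed) JSON answer."""
    return match_value(pred_json, gold_json)
```

Recursing over the parsed JSON rather than comparing raw strings keeps the check stable under formatting differences while still rejecting any substantive deviation.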
Evaluation Protocol and Baselines
Evaluation is performed with both instruction-following LLMs (e.g., GPT-4.1, GPT-5.1, DeepSeek-R1, Gemini-Pro-2.5) and pipeline-based research agents (e.g., o3-deep-research, smolagents, OWL) that offer different combinations of tool support (e.g., code execution, simulated browsing/search, document processing). The assessment includes strict exact match (EM), string/numeric F1, and a soft LLM-as-a-judge metric accommodating minor semantic and numeric deviations. Multi-attempt best-of-N and self-consistency analyses quantify agent reliability and variance.
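For concreteness, the string-level metrics can be sketched as follows. This is a standard SQuAD-style exact match and token-overlap F1; the paper's exact normalization rules may differ, so treat this as an illustrative baseline implementation rather than the benchmark's official scorer.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation, and split into tokens."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def exact_match(pred, gold):
    """Strict EM: normalized token sequences must be identical."""
    return normalize(pred) == normalize(gold)


def token_f1(pred, gold):
    """Token-overlap F1: harmonic mean of precision and recall over tokens."""
    p, g = Counter(normalize(pred)), Counter(normalize(gold))
    overlap = sum((p & g).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(g.values())
    return 2 * precision * recall / (precision + recall)
```

Under this scoring, a prediction that recovers most of the gold tokens still earns partial F1 credit even when strict EM fails, which is why the paper can report nonzero F1 alongside near-zero EM.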
Empirical Results
The results indicate that current LLMs and agentic frameworks demonstrate severe limitations on DEEPSYNTH:
- F1-scores remain extremely low, with Gemini-Pro-2.5 leading among LLMs (F1 = 6.25), and o3-deep-research leading among agents (F1 = 8.97), the latter only solving three out of 120 tasks perfectly.
- Under strict EM, virtually no task is solved correctly by any model or agent.
- The LLM-judge scores, intended to be more permissive, remain consistently below 20, indicating that most tasks do not yield even partially correct outputs.
- Tool augmentation (web-search, code execution) offers modest improvement but does not close the core gap in multi-step synthesis.
- Performance degradation is substantial as the number of intermediate required steps increases, with severe output variance and reliability failures in self-consistency evaluation.
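The reliability analyses above can be made concrete with simple aggregation functions. These are illustrative sketches of best-of-N and self-consistency scoring, with function names and tie-breaking behavior chosen here rather than specified by the paper.

```python
from collections import Counter


def best_of_n(scores):
    """Best-of-N: report the top score across N independent attempts."""
    return max(scores)


def self_consistency(answers):
    """Majority vote over N sampled answers; ties go to the first-seen answer."""
    return Counter(answers).most_common(1)[0][0]


def agreement_rate(answers):
    """Fraction of samples agreeing with the modal answer, a reliability proxy."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)
```

A large gap between best-of-N and single-attempt scores, together with a low agreement rate, is exactly the high-variance, low-reliability pattern the paper reports.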
Analytical Insights
Ablation studies underscore the multifactor nature of the DEEPSYNTH challenge: ablating web search, code, or document processing tools each leads to performance losses, with search being the most critical. Providing models with intermediate steps (planning traces) significantly boosts performance, indicating that current systems are bottlenecked by the inability to plan and decompose complex tasks rather than solely by deficiencies in tool invocation or knowledge retrieval.
Region-specific analysis reveals a strong geographic bias: no tested agent solves any Africa-based task, highlighting insufficient coverage of, and generalization to, low-resource informational geographies. Error analysis shows that navigation and synthesis errors predominate, with early-step failures in multi-step chains producing near-total error cascades, exposing structural weaknesses in both retrieval and reasoning.
Comparison to Prior Benchmarks
DEEPSYNTH contrasts sharply with benchmarks such as GAIA [Mialon et al., 2023], BrowseComp [Wei et al., 2025], and AssistantBench [Yoran et al., 2024], all of which suffer from limited synthesis, synthetic or factoid orientation, or lack of multi-source and multi-regional coverage. DEEPSYNTH is unique in posing verifiable, multi-part synthesis objectives that are globally distributed and operationally relevant.
Implications and Future Directions
DEEPSYNTH formally demonstrates that information synthesis is a composite systems challenge, implicating the interplay of planning, dynamic web navigation, data integration, and multi-step logical reasoning. These findings call into question the reliability of current LLMs and agentic systems for robust, real-world analysis, especially in policy, scientific review, or automated report synthesis, absent substantial advances in structured tool-centric architectures, reasoning-trace validation, and improved handling of regionally diverse and noisy information ecosystems.
This benchmark sets a clear agenda for the research community: future developments should focus on robust agentic planning, faithful reasoning with intermediate state validation, region-aware synthesis, and explicit mitigation of error propagation in deep reasoning pipelines. It quantifies a significant open frontier, motivating efforts to bridge the synthesis gap between factual extraction and coherent multi-modal, multi-source insight generation.
Conclusion
DEEPSYNTH presents a comprehensive, high-complexity benchmark for deep information synthesis, denoting a substantial, rigorously quantified performance gap in current LLM and agentic architectures. By uniquely targeting realistic, multi-source, and geopolitically-diverse synthesis, DEEPSYNTH provides a foundation for evaluating and advancing the next generation of AI agents capable of trustworthy, tool-augmented, and genuinely insightful reasoning (2602.21143).