
Deep Research Bench II Evaluation

Updated 20 January 2026
  • Deep Research Bench II is a human-grounded benchmark for LLM-based deep research systems, defined as agents that perform multistep evidence retrieval, synthesis, and report generation.
  • It employs a rigorous LLM+human-in-the-loop pipeline, using 9,430 atomic rubrics to assess information recall, analysis, and presentation in complex research tasks.
  • Empirical results reveal that even top systems achieve below 50% overall rubric satisfaction, highlighting significant gaps in retrieval and reasoning compared to human experts.

Deep Research Bench II (DRB-II) is a rigorous, human-grounded benchmark for the evaluation of LLM-based deep research systems (DRS) capable of multistep evidence retrieval, synthesis, and report generation. The benchmark was developed to address critical limitations of previous evaluation protocols, specifically the inadequacy of coarsely defined or LLM-generated rubrics and the lack of interpretability and verifiability against human expert standards. DRB-II introduces a suite of 132 tasks spanning 22 real-world domains, with each system-generated report scored against over 9,400 fine-grained, binary, expert-derived criteria covering recall, analysis, and presentation. The framework reveals substantive gaps between current deep research agents and human-expert performance, motivating future research in retrieval, reasoning, and user-adaptive report generation (Li et al., 13 Jan 2026).

1. Scope, Task Design, and Domain Coverage

DRB-II operationalizes the assessment of deep research systems—defined as LLM-based agents that must autonomously (a) search for open-source evidence, (b) synthesize heterogeneous information into high-level insights, and (c) produce coherent investigative reports. The benchmark spans 132 grounded research scenarios (tasks) drawn from 22 domains, such as:

  • Science & Technology
  • Finance & Business
  • Software Development
  • Health
  • Education & Jobs
  • Law & Governance
  • Art & Design
  • Entertainment, Sports, Games, and more

Each task in DRB-II is directly seeded from an open-access, expert-written investigative or review article, ensuring that (i) the temporal scope and investigation requirements match genuine expert analyses, and (ii) there exist gold-standard answers that can ground fine-grained rubric extraction. This process guarantees coverage of the same real-world signals and decision boundaries that would be expected of a domain expert under real research demands (Li et al., 13 Jan 2026).

2. Rubric Construction and Validation Pipeline

A four-stage LLM+human-in-the-loop pipeline underpins DRB-II’s rubric development to ensure atomicity, specificity, verifiability, and alignment with human judgment (the iterative extraction loop is sketched after the list):

  1. LLM Extraction: Each article is reverse-engineered into a user-facing research prompt, with an LLM tasked to extract atomic, binary rubrics spanning three dimensions: Information Recall, Analysis, and Presentation.
  2. Self-Evaluation Iteration: The same LLM uses provisional rubric sets for auto-scoring. If rubric sets yield <90% accuracy against the source article on key dimensions, the rubric extraction is repeated, enforcing factual validity.
  3. Manual Revision: Trained annotators audit all rubrics, splitting compound statements, resolving ambiguity, removing redundancy, and enforcing explicit numerical verification.
  4. Expert Review & Refinement: Domain specialists apply over 400 human-hours to validate that all rubrics represent essential, verifiable, and domain-appropriate criteria.

The outcome is 9,430 atomic rubrics (mean 71 per task), each scored pass/fail, spanning the three dimensions as follows (an illustrative record format is sketched after the list):

  • Information Recall: coverage of essential, publicly available facts (e.g., stating the rise of the gold price over a specified period, or citing a required authoritative source).
  • Analysis: well-defined logic or inference from retrieved facts (e.g., explaining underlying causal drivers in the data, performing comparative or historical analysis).
  • Presentation: adherence to required structure, formatting, and citation discipline (e.g., inclusion of specified tables, correct use of reference lists) (Li et al., 13 Jan 2026).
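To make the rubric granularity concrete, the snippet below shows one way an atomic rubric item could be represented; the field names and example criteria are illustrative assumptions based on the examples above, not the benchmark's released data schema.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical representation of an atomic DRB-II rubric item; field names
# and example criteria are assumptions for illustration, not the released schema.

@dataclass
class Rubric:
    task_id: str
    dimension: str               # "InfoRecall" | "Analysis" | "Presentation"
    criterion: str               # a single, binary, verifiable statement
    passed: Optional[bool] = None  # filled in later by the LLM judge

examples = [
    Rubric("finance-gold-01", "InfoRecall",
           "States that the gold price rose over the specified period"),
    Rubric("finance-gold-01", "Analysis",
           "Explains at least one causal driver behind the price movement"),
    Rubric("finance-gold-01", "Presentation",
           "Includes the required comparison table and a reference list"),
]
```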

3. Evaluation Methodology and Metrics

Given a system-generated report, an LLM judge (Gemini 2.5-Pro) scores each rubric in batches of 50 at a time, yielding strictly binary (pass/fail) outcomes for all 9,430 items. Metrics are defined as follows (a minimal computation sketch follows the list):

  • Overall Rubric Satisfaction:

S = \frac{1}{N}\sum_{i=1}^{N} r_i, \quad r_i \in \{0,1\}, \quad N = 9430

  • Per-Dimension Scores:

S_{\mathrm{InfoRecall}} = \frac{1}{N_I}\sum_{i\in I} r_i, \quad S_{\mathrm{Analysis}} = \frac{1}{N_A}\sum_{i\in A} r_i, \quad S_{\mathrm{Presentation}} = \frac{1}{N_P}\sum_{i\in P} r_i

with average per-task cardinalities N_I ≈ 52.9, N_A ≈ 12.8, and N_P ≈ 5.7.

  • Task-level metrics: The above are also reported per task and aggregated across all 132 tasks.
  • Inter-rater reliability: In experiments, LLM-judge agreement with human annotation is ≈ 91.75% (accuracy) and 89.57% (F1) for batches of 50 rubrics (Li et al., 13 Jan 2026).
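A minimal sketch of the aggregation step, computing the overall score S and the per-dimension scores from binary judge outcomes, is given below; the (dimension, passed) layout is an assumption for illustration and may differ from the released evaluation code.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Minimal sketch of the DRB-II aggregation step, assuming the judge outcomes
# are available as (dimension, passed) pairs; the released code may differ.

def satisfaction_scores(outcomes: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute overall rubric satisfaction S and per-dimension scores
    (InfoRecall, Analysis, Presentation) from binary judge outcomes."""
    per_dim: Dict[str, List[bool]] = defaultdict(list)
    for dimension, passed in outcomes:
        per_dim[dimension].append(passed)

    scores = {dim: sum(results) / len(results) for dim, results in per_dim.items()}
    scores["Overall"] = sum(passed for _, passed in outcomes) / len(outcomes)
    return scores

# Example: three judged rubrics for one task.
print(satisfaction_scores([
    ("InfoRecall", True), ("Analysis", False), ("Presentation", True),
]))
# {'InfoRecall': 1.0, 'Analysis': 0.0, 'Presentation': 1.0, 'Overall': 0.666...}
```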

4. Empirical Results and Agent Performance

Eight state-of-the-art agents, including proprietary deep research models (OpenAI-GPT-o3 Deep Research, Gemini-3-Pro Deep Research, Doubao Deep Research, Qwen3-Max Deep Research, Perplexity Research, and others), were evaluated. Results are as follows:

| Model | InfoRecall (%) | Analysis (%) | Presentation (%) | Total Score (%) |
|---|---|---|---|---|
| OpenAI-GPT-o3 DR | 39.98 | 49.85 | 89.16 | 45.40 |
| Gemini-3-Pro DR | 39.09 | 48.94 | 91.85 | 44.60 |
| Gemini-2.5-Pro DR | 34.91 | 51.91 | 90.24 | 41.98 |
| Doubao DR | 34.83 | 49.43 | 83.51 | 40.99 |
| Qwen3-Max DR | 34.18 | 48.04 | 74.59 | 39.25 |
| Grok Deep Search | 33.52 | 42.50 | 91.42 | 39.23 |
| Perplexity Research | 33.05 | 44.47 | 79.34 | 38.58 |
| Tongyi Deep Research | 22.95 | 35.89 | 86.13 | 29.89 |

Salient points:

  • Even the top system passes fewer than half of the rubrics (Total Score ≲ 45%).
  • Information Recall is the weakest component (best ≈ 40%), indicating severe limitations in source coverage and retrieval.
  • Analysis peaks at ≈ 52%, reflecting persistent barriers in complex reasoning, synthesis, and higher-order inference.
  • Presentation is relatively robust (≈ 90%), evidencing mature formatting and reporting capabilities largely independent of content quality (Li et al., 13 Jan 2026).

5. Failure Modes and Diagnostic Insights

The DRB-II framework enables granular diagnosis of system weaknesses:

  • Retrieval Shortfalls: Agents regularly miss critical sources or underexploit available evidence, suppressing overall InfoRecall scores.
  • Reasoning Deficits: Even when sources are available, systems struggle to generate robust causal or comparative analyses, as measured by sub-50% Analysis satisfaction.
  • Hallucination Vulnerability: Without fine-grained rubrics, generative systems can produce plausible but unsupported claims; the atomicity of DRB-II’s binary rubrics ensures such errors are systematically detected.
  • Inter-dimension correlation: High presentation scores are decoupled from recall or analysis (i.e., well-structured but shallow or incomplete reports are common) (Li et al., 13 Jan 2026).

6. Implications for System Development and Future Research

Bridging the gap between current agent performance and human-expert standards requires:

  • Enhanced dynamic query planning: Multi-hop tool use, query refinement, and iterative search cycles must be systematically integrated.
  • Explicit provenance tracking: Incorporation of evidence graphs may help maintain granularity and tractability of supporting facts.
  • External symbolic/causal modules: Augmenting LLMs with dedicated modules for specific inference types or domain reasoning is proposed.
  • User-adaptive presentation: Future presentation metrics may incorporate models of audience background and target cognitive load.
  • Broadening of task coverage and metric refinement: The open-sourcing of all tasks, rubrics, and infrastructure at https://github.com/imlrz/DeepResearch-Bench-II is intended to support continual evolution of evaluation standards and drive innovation.

7. Distinction from Prior Benchmarks and Community Impact

DRB-II addresses critical deficiencies of earlier benchmarks, notably:

  • Benchmark provenance: All evaluation rubrics derive from expert-authored ground truth articles, not from LLM generations or author heuristics, increasing interpretability and trust.
  • Granularity and scope: Over 9,400 atomic pass/fail rubrics (vs. coarse or LLM-calibrated scores) enable reliable, interpretable, and transparent system comparison.
  • Comprehensive domain coverage: 22 domains and 132 diversified tasks, all mapped to real investigative practice, establish DRB-II as a comprehensive gold standard for web-scale, open-ended research evaluation (Li et al., 13 Jan 2026).

A plausible implication is that ongoing progress in LLM-based deep research systems will increasingly be measured against DRB-II’s human-grounded, fine-grained diagnostics, with competitive leaderboard performance expected to closely track architectural advances in retrieval, reasoning, synthesis, and user-centered report generation.
