DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

Published 13 Jun 2025 in cs.CL and cs.IR | (2506.11763v1)

Abstract: Deep Research Agents are a prominent category of LLM-based agents. By autonomously orchestrating multistep web exploration, targeted retrieval, and higher-order synthesis, they transform vast amounts of online information into analyst-grade, citation-rich reports--compressing hours of manual desk research into minutes. However, a comprehensive benchmark for systematically evaluating the capabilities of these agents remains absent. To bridge this gap, we present DeepResearch Bench, a benchmark consisting of 100 PhD-level research tasks, each meticulously crafted by domain experts across 22 distinct fields. Evaluating DRAs is inherently complex and labor-intensive. We therefore propose two novel methodologies that achieve strong alignment with human judgment. The first is a reference-based method with adaptive criteria to assess the quality of generated research reports. The other framework is introduced to evaluate DRA's information retrieval and collection capabilities by assessing its effective citation count and overall citation accuracy. We have open-sourced DeepResearch Bench and key components of these frameworks at https://github.com/Ayanami0730/deep_research_bench to accelerate the development of practical LLM-based agents.

Abstract PDF Upgrade to Chat

Summary

The paper introduces DeepResearch Bench, a benchmark that evaluates DRAs using novel RACE (report quality) and FACT (web retrieval) frameworks.
It employs a large dataset filtered from 96,147 queries to create 100 research tasks distributed across 22 domains.
Experimental analysis shows leading performance of models like Gemini-2.5-Pro and Claude-3.7-Sonnet, aligning closely with human judgments.

DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

This paper introduces DeepResearch Bench, a novel benchmark designed for evaluating the capabilities of Deep Research Agents (DRAs). The benchmark comprises 100 research tasks spanning 22 domains, crafted by domain experts to mirror real-world research requirements. The authors address the limitations of existing evaluation frameworks by proposing two new methodologies: RACE (Reference-based Adaptive Criteria-driven Evaluation) for assessing report quality and FACT (Factual Abundance and Citation Trustworthiness) for evaluating information retrieval and citation accuracy. The efficacy of these frameworks is validated through comprehensive experiments and human studies.

DeepResearch Bench and Data Collection

The construction of DeepResearch Bench involved a systematic approach to ensure its relevance and representativeness. The authors began by analyzing a dataset of 96,147 user queries to determine the distribution of research topics. These queries were filtered using DeepSeek-V3-0324 to identify those aligning with deep research requirements, resulting in 44,019 queries.

Figure 1: Overview of DeepResearch Bench including the distribution analysis and dataset construction pipeline (a), RACE (b), and FACT (c).

The queries were then classified into 22 distinct topic domains based on the WebOrganizer taxonomy, providing insights into real-world user demand for deep research within these areas. The benchmark tasks were subsequently created, balancing Chinese and English tasks, and refined by Ph.D. holders and senior practitioners. Examples of the task distributions are shown in (Figure 2) and (Figure 3).

Figure 2: Topic distribution of the filtered deep-research tasks, including a donut chart showing proportional share of each topic domain (left) and a bar chart listing the absolute task counts for all 22 domains (right).

Figure 3: Two example tasks from DeepResearch Bench.

Evaluation Methodologies: RACE and FACT

RACE: Report Quality Evaluation

The RACE framework addresses the shortcomings of traditional evaluation methods by dynamically generating task-specific weights and criteria. It leverages the LLM-as-a-Judge method, using Gemini-2.5-pro to assess report quality based on four dimensions: Comprehensiveness, Insight/Depth, Instruction-Following, and Readability.

The framework operates in three key stages: dynamic weight and adaptive criteria generation, reference-based scoring, and overall score calculation. By comparing target reports with high-quality references, RACE mitigates biases and ensures a more accurate evaluation of report quality.

FACT: Web Retrieval Evaluation

The FACT framework evaluates the factual grounding of report content and the agent's effectiveness in retrieving and utilizing web-based information. This framework employs a Judge LLM to extract statement-URL pairs from reports, deduplicate them, and assess whether the webpage content supports each statement. Citation Accuracy and Average Effective Citations per Task are the primary metrics used to gauge performance.

Experimental Results and Analysis

The authors evaluated several DRAs and LLMs with search tools using DeepResearch Bench and the RACE and FACT frameworks. The results, as shown in (Figure 4), indicate that Gemini-2.5-Pro Deep Research achieves leading overall performance in report quality, while also obtaining a high number of effective citations.

Figure 4: Overview of agent performance on DeepResearch Bench, including generated report quality scores across evaluation dimensions (left) and agent citation accuracy and average number of effective citations (right).

Claude-3.7-Sonnet with Search delivers impressive performance among the LLMs with search tools, even exceeding Grok Deeper Search. The performance of different models across various topics and languages evaluated by RACE are shown in (Figure 5).

Figure 5: Scores of different models across various topics and languages evaluated by RACE, where red indicates a higher score and blue indicates a lower score.

Further experiments validated the human consistency of the RACE framework, demonstrating a strong correlation between RACE scores and human judgments.

Conclusion

DeepResearch Bench presents a valuable resource for evaluating DRAs, offering a comprehensive benchmark and evaluation frameworks that align with human judgment. The results and analyses presented in this study provide insights into the strengths and weaknesses of current DRAs, paving the way for future advancements in AI agent systems. The authors acknowledge the limitations of the benchmark, including its scale and potential domain coverage bias, and outline plans for future improvements.

Markdown