AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

Published 28 Apr 2026 in cs.AI | (2604.25256v1)

Abstract: Autonomous scientific research has advanced significantly thanks to the development of AI agents. One key step in this process is finding the right scientific literature, whether to explore existing knowledge for a research problem, or to acquire evidence for verifying assumptions and supporting claims. To assess AI agents' capability in driving this process, we present AutoResearchBench, a dedicated benchmark for autonomous scientific literature discovery. AutoResearchBench consists of two complementary task types: (1) Deep Research, which requires tracking down a specific target paper through a progressive, multi-step probing process, and (2) Wide Research, which requires comprehensively collecting a set of papers satisfying given conditions. Compared to previous benchmarks on agentic web browsing, AutoResearchBench is distinguished along three dimensions: it is research-oriented, calling for in-depth comprehension of scientific concepts; literature-focused, demanding fine-grained utilization of detailed information; and open-ended, involving an unknown number of qualified papers and thus requiring deliberate reasoning and search throughout. These properties make AutoResearchBench uniquely suited for evaluating autonomous research capabilities, and extraordinarily challenging. Even the most powerful LLMs, despite having largely conquered general agentic web-browsing benchmarks such as BrowseComp, achieve only 9.39% accuracy on Deep Research and 9.31% IoU on Wide Research, while many other strong baselines fall below 5%. We publicly release the dataset, evaluation pipeline, and code at https://github.com/CherYou/AutoResearchBench to facilitate future research in this direction.

Authors (18)

Summary

The paper introduces AutoResearchBench, a benchmark for evaluating AI agent performance on complex scientific literature discovery tasks.
It defines two key task types—Deep Research and Wide Research—highlighting the need for multi-hop reasoning and precise evidence integration.
Experimental results expose a significant performance gap, with top models achieving only around 9% accuracy or IoU, underscoring the need for robust research frameworks.

AutoResearchBench: Advancing Evaluation for Agentic Scientific Literature Discovery

Motivation and Problem Setting

AutoResearchBench addresses the critical need for evaluating AI agent systems on rigorous scientific literature discovery tasks—capabilities foundational to autonomous research. The benchmark is explicitly tailored to capture the full complexity and demands of real-world scholarly exploration, where evidence is technical, often buried within full texts, and the set of valid results is open-ended and sometimes empty. Scientific literature search is not reducible to surface-level retrieval: it requires multi-hop reasoning, deep comprehension, and the ability to construct or exhaustively enumerate answer sets under uncertain constraints.

AutoResearchBench introduces two principal task types:

Deep Research: Identify the unique paper matching a conjunction of subtle, obfuscated constraints, or abstain when no such paper exists. The agent must perform multi-step reasoning and verification, often across full document context and citation chains.
Wide Research: Comprehensive recovery of an entire set of papers matching intricate conjunctions of scientific attributes, with no prior on set cardinality. Agents must maximize recall without sacrificing precision, under the pressure of unknown answer boundaries.

The benchmark is instantiated over an up-to-date arXiv-sourced full-text corpus exceeding three million papers, facilitating evaluation in realistic settings with genuine technical evidence.

Benchmark Construction and Methodology

The benchmark construction follows a two-stage, human-in-the-loop pipeline with model-augmented candidate generation, focusing on coverage, minimal sufficiency, and resistance to shortcut exploitation.

Figure 1: Overview of the benchmark construction pipeline for deep and wide research tasks.

For Deep Research, human annotators iteratively extract, refine, and paraphrase constraints from full paper context, eschewing headline clues, and rigorously verifying instance uniqueness. Negatives are constructed by perturbing constraints to ensure unsatisfiability. For Wide Research, large domain-specific candidate pools are synthesized, summarized, and filtered—then iteratively expanded using multiple LLMs and search tools, with rigorous human expert auditing to guarantee set completeness and label quality.

The result is 1,000 benchmark instances: 600 deep research (90% single-answer, 10% unsatisfiable) and 400 wide research (average valid set size 9.23).

Figure 2: Category distribution of deep and wide research tasks across major CS domains.

Task Evaluation Protocol and Metrics

AutoResearchBench employs a standardized ReAct-based agent framework, providing parity across open and closed-source LLMs as well as integrated end-to-end research agents. The evaluation is strictly contamination-resistant and leverages a unified DeepXiv toolchain, enabling direct, scalable search over full texts.
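As a rough illustration of what a ReAct-style loop looks like, the sketch below alternates a proposed action with a tool observation until the agent answers or a step budget runs out. Everything in it is hypothetical: `search_papers`, the two-paper toy corpus, and the scripted policy merely stand in for a full-text search tool and an LLM, and none of it reflects the benchmark's actual harness or the DeepXiv interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Toy corpus standing in for a full-text index (hypothetical data).
CORPUS = {
    "paper-a": "contrastive pretraining for retrieval with hard negatives",
    "paper-b": "graph neural networks for molecule property prediction",
}

def search_papers(query: str) -> list[str]:
    """Hypothetical search tool: ids of papers whose text contains every query term."""
    terms = query.lower().split()
    return sorted(pid for pid, text in CORPUS.items() if all(t in text for t in terms))

@dataclass
class Step:
    thought: str
    action: str               # "search" or "answer"
    argument: Optional[str]   # query string, paper id, or None to abstain

@dataclass
class Trajectory:
    steps: list = field(default_factory=list)
    observations: list = field(default_factory=list)
    answer: Optional[str] = None

def run_react(policy: Callable[[Trajectory], Step], max_steps: int = 8) -> Trajectory:
    """Alternate reasoning/action steps with tool observations until an answer is given."""
    traj = Trajectory()
    for _ in range(max_steps):
        step = policy(traj)
        traj.steps.append(step)
        if step.action == "answer":
            traj.answer = step.argument
            break
        traj.observations.append(search_papers(step.argument))
    return traj

def scripted_policy(traj: Trajectory) -> Step:
    """Stand-in for an LLM: search once, then answer with the sole hit or abstain."""
    if not traj.observations:
        return Step("Look for the retrieval paper.", "search", "retrieval hard negatives")
    hits = traj.observations[-1]
    return Step("Check that the hit is unique.", "answer", hits[0] if len(hits) == 1 else None)

trajectory = run_react(scripted_policy)
```

Keeping the policy as a plain callable means the same loop can wrap any model under an identical step budget and tool set, which is one way to obtain the kind of parity the evaluation aims for.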

Deep Research uses exact-match accuracy (matching the unique valid document, or abstaining when none exists).
Wide Research employs set-based intersection-over-union (IoU), rewarding both precision and recall on open-set enumeration.
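The two metrics above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released evaluation pipeline: paper identifiers are treated as plain strings, an unsatisfiable Deep Research instance is represented by `None`, and scoring two empty sets as a perfect match is a convention chosen here. For a predicted set P and a gold set G, IoU is |P ∩ G| / |P ∪ G|.

```python
from typing import Iterable, Optional

def deep_research_correct(predicted: Optional[str], gold: Optional[str]) -> bool:
    """Exact match on the paper id; gold=None marks an unsatisfiable
    instance (assumed encoding), where abstaining with None is correct."""
    return predicted == gold

def wide_research_iou(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """Set-based intersection-over-union between predicted and gold paper ids."""
    pred, ref = set(predicted), set(gold)
    union = pred | ref
    if not union:
        return 1.0  # assumption: two empty sets count as a perfect match
    return len(pred & ref) / len(union)

# Worked example: 2 shared papers out of 5 distinct ones gives IoU = 2/5.
example_iou = wide_research_iou(["a", "b", "c"], ["b", "c", "d", "e"])
```

Because the union sits in the denominator, both spurious predictions and missed papers lower the score, which is why IoU rewards precision and recall together.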

Emphasis is placed on interactive agent behavior, tool usage efficiency, and reasoning quality rather than static retrieval accuracy.

Experimental Results and Analysis

Main finding: All evaluated LLM agents—despite state-of-the-art performance on general web agent tasks—struggle severely on AutoResearchBench. The best closed-source model (Claude-Opus-4.6) achieves merely 9.39% accuracy for deep research; the best for wide research (Gemini-3.1-pro-preview) reaches only 9.31% IoU.

Figure 3: Illustration of complex trajectories required on deep and wide research instances, and performance collapse of flagship agents.

Most other models remain below 5% on both metrics. This represents an order-of-magnitude performance gap compared to previous web-based agent benchmarks, where top models regularly surpass 80%. Interactivity analysis reveals that simply increasing trajectory length or tool call count does not correlate with better outcomes; more steps often lead to redundant actions or illogical persistence.

Error analysis (see also Figure 4 in the paper) reveals persistent deficits:

Inadequate integration of fragmented or obfuscated technical evidence.
Failure to enforce strict logical conjunctions in constraint satisfaction.
Poor boundary discrimination and result-set completeness for wide research.
Limited impact from explicit Think-style reasoning: reflection steps often produce more computation without substantive knowledge gain.

Figure 5: IoU bucket and prediction coverage analysis for wide research, illustrating the challenge of exhaustively and precisely matching open answer sets.
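A bucket analysis of this kind can be approximated as follows. The bucket boundaries, the definition of coverage as the fraction of gold papers recovered, and the sample scores are all assumptions made for illustration; the paper's actual binning is not specified here.

```python
from collections import Counter
from typing import Iterable

def iou_bucket(iou: float) -> str:
    """Map a per-instance IoU score to a bucket label (hypothetical bin edges)."""
    if iou <= 0.0:
        return "0"
    for upper in (0.25, 0.5, 0.75, 1.0):
        if iou <= upper:
            return f"({upper - 0.25:g}, {upper:g}]"
    raise ValueError(f"IoU out of range: {iou}")

def coverage(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """Fraction of gold papers that appear in the prediction (assumed definition)."""
    pred, ref = set(predicted), set(gold)
    if not ref:
        return 1.0  # assumption: nothing to recover counts as full coverage
    return len(pred & ref) / len(ref)

# Made-up per-instance IoU scores, for illustration only.
sample_scores = [0.0, 0.1, 0.4, 0.0, 1.0]
histogram = Counter(iou_bucket(s) for s in sample_scores)
```

Separating coverage from IoU helps distinguish an agent that misses gold papers from one that finds them but pads its answer with false positives.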

Tool analysis confirms that a specialized academic search index (DeepXiv), with full-text coverage, is necessary; open-web search tools yield an even steeper collapse due to missing critical context.

Scaling and Ablation Studies

Scaling test-time compute (e.g., repeating trajectories and aggregating via pass@$k$ or oracle-best) offers limited gains, indicating that trajectory instability is only a partial bottleneck for deep research, while recall limitations dominate wide research.
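One plausible reading of these aggregations is sketched below: pass@$k$ counts a Deep Research instance as solved if any of its first $k$ trajectories is correct, and oracle-best takes the highest IoU among the first $k$ Wide Research trajectories. The paper's exact procedure may differ, and the run outcomes here are made up for illustration.

```python
from statistics import mean
from typing import Sequence

def pass_at_k(correct_flags: Sequence[bool], k: int) -> bool:
    """True if any of the first k repeated trajectories solved the instance."""
    return any(correct_flags[:k])

def oracle_best(iou_scores: Sequence[float], k: int) -> float:
    """Best IoU among the first k repeated trajectories (an upper bound, since
    it assumes an oracle that always picks the best run)."""
    return max(iou_scores[:k])

# Hypothetical outcomes: one row per instance, one column per repeated run.
deep_runs = [[False, False, True], [False, False, False]]
wide_runs = [[0.1, 0.3, 0.2], [0.0, 0.0, 0.5]]

pass_at_1 = mean(pass_at_k(r, 1) for r in deep_runs)
pass_at_3 = mean(pass_at_k(r, 3) for r in deep_runs)
oracle_at_1 = mean(oracle_best(r, 1) for r in wide_runs)
oracle_at_3 = mean(oracle_best(r, 3) for r in wide_runs)
```

If repeated sampling fixed most failures, these aggregates would rise sharply with $k$; only marginal gains point to errors that recur across trajectories rather than to run-to-run noise.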

Figure 6: Test-time scaling experiments across multiple models show only marginal improvement, reinforcing the difficulty of the benchmark.

Ablation studies on thinking modes demonstrate that more structured reasoning does not consistently improve set coverage or identification performance, implying a need for substantially more robust agentic scientific reasoning architectures.

Implications and Future Outlook

AutoResearchBench provides compelling evidence that modern LLMs and agent-based frameworks, effective in web and general knowledge settings, do not transfer to the domain of complex, technical, open-ended scientific literature search. The tasks' difficulty is not simply due to input/document length, but rather arises from the confluence of:

Long-horizon, multi-hop reasoning needs.
Full-text and citation-network evidence integration.
Open-world, underspecified, or unsatisfiable query scenarios.
The need to balance completeness and correctness with verifiable evidence.

Success on AutoResearchBench appears to require joint progress on retrieval-augmented, logic-aware agents, scalable document analysis, principled trajectory planning, memory management for iterative hypothesis refinement, and reinforcement learning with process supervision specialized for technical and exploratory search. The results suggest that performance on traditional agentic and RAG benchmarks does not generalize and is not predictive of future agentic research capabilities.

Moreover, the dataset and tools offer a practical foundation for benchmarking incremental advances in autonomous research assistance, diagnosis of LLM reasoning and verification capabilities, and longitudinal measurement of model progress as the field pursues higher-order AI scientific skills.

Conclusion

AutoResearchBench establishes a rigorous benchmark for agentic scientific discovery, filling a critical gap in the evaluation of autonomous research capabilities. By demonstrating a substantive gap between current LLM-agent systems and real-world research needs, and by providing a detailed, controlled infrastructure for further experimentation, it marks a new standard for evaluating and driving progress in AI-assisted literature exploration and autonomous academic agents (2604.25256).
