Deep Research Agents: A Systematic Examination And Roadmap

Published 22 Jun 2025 in cs.AI | (2506.18096v1)

Abstract: The rapid progress of LLMs has given rise to a new category of autonomous AI systems, referred to as Deep Research (DR) agents. These agents are designed to tackle complex, multi-turn informational research tasks by leveraging a combination of dynamic reasoning, adaptive long-horizon planning, multi-hop information retrieval, iterative tool use, and the generation of structured analytical reports. In this paper, we conduct a detailed analysis of the foundational technologies and architectural components that constitute Deep Research agents. We begin by reviewing information acquisition strategies, contrasting API-based retrieval methods with browser-based exploration. We then examine modular tool-use frameworks, including code execution, multimodal input processing, and the integration of Model Context Protocols (MCPs) to support extensibility and ecosystem development. To systematize existing approaches, we propose a taxonomy that differentiates between static and dynamic workflows, and we classify agent architectures based on planning strategies and agent composition, including single-agent and multi-agent configurations. We also provide a critical evaluation of current benchmarks, highlighting key limitations such as restricted access to external knowledge, sequential execution inefficiencies, and misalignment between evaluation metrics and the practical objectives of DR agents. Finally, we outline open challenges and promising directions for future research. A curated and continuously updated repository of DR agent research is available at: https://github.com/ai-agents-2030/awesome-deep-research-agent.


Summary

The paper surveys Deep Research agents, detailing their integration of LLM reasoning with dynamic retrieval and multi-agent protocols.
It systematically compares DR agent architectures, highlighting trade-offs between static and dynamic workflows and different planning strategies.
The study outlines future research directions, emphasizing challenges in retrieval scope, multimodal benchmarks, and agent interoperability.

Deep Research Agents: A Systematic Examination And Roadmap

This paper (2506.18096) presents a comprehensive survey and analysis of Deep Research (DR) agents, which are AI systems powered by LLMs and designed to tackle complex informational research tasks. The survey covers the foundational technologies, architectural components, benchmarks, and future research directions in the rapidly evolving field of DR agents.

Background and Foundational Concepts

The paper begins by reviewing the advances in reasoning capabilities of LLMs, particularly focusing on Chain-of-Thought (CoT) prompting and its variants. It highlights the limitations of current reasoning frameworks, such as hallucinations and static knowledge, which motivate the need for DR agents that integrate external information sources and real-time retrieval mechanisms. The survey also discusses the evolution of Retrieval-Augmented Generation (RAG) from static pipelines to agentic RAG, which incorporates iterative retrieval and dynamic workflow adjustments. It also emphasizes the importance of Model Context Protocol (MCP) and Agent-to-Agent (A2A) protocols for enabling interoperability and collaboration in multi-agent systems.
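To make the contrast between a static RAG pipeline and agentic RAG concrete, here is a minimal sketch. It is not taken from the paper: the toy corpus, the keyword-overlap retriever, and the `followup_stub` function (a stand-in for an LLM that proposes the next query) are all illustrative assumptions.

```python
from typing import Callable, List, Optional

# Toy corpus standing in for an external knowledge source (hypothetical data).
CORPUS = {
    "doc1": "MCP standardizes how agents connect to external tools.",
    "doc2": "A2A lets independent agents exchange tasks and artifacts.",
    "doc3": "Agentic RAG retrieves iteratively and refines its query.",
}

def tokens(text: str) -> set:
    """Lowercase the text and split it into a set of words."""
    return set(text.lower().replace(".", " ").replace("?", " ").split())

def retrieve(query: str, k: int = 1) -> List[str]:
    """Rank documents by naive keyword overlap with the query."""
    query_terms = tokens(query)
    scored = [(len(query_terms & tokens(text)), doc_id)
              for doc_id, text in CORPUS.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored[:k] if score > 0]

def static_rag(question: str) -> List[str]:
    """Static pipeline: a single retrieval pass, then generation."""
    return retrieve(question)

def agentic_rag(question: str,
                propose_followup: Callable[[str, List[str]], Optional[str]],
                max_steps: int = 3) -> List[str]:
    """Iterative retrieval: retrieve, inspect evidence, issue a follow-up query."""
    evidence: List[str] = []
    query: Optional[str] = question
    for _ in range(max_steps):
        for doc_id in retrieve(query):
            if doc_id not in evidence:
                evidence.append(doc_id)
        query = propose_followup(question, evidence)
        if query is None:  # the "agent" decides it has enough evidence
            break
    return evidence

def followup_stub(question: str, evidence: List[str]) -> Optional[str]:
    """Stand-in for an LLM call: ask about A2A once MCP has been covered."""
    if "doc1" in evidence and "doc2" not in evidence:
        return "A2A agents exchange tasks"
    return None
```

Calling `static_rag("How do agents connect to external tools?")` stops after one pass, while `agentic_rag` with the same question keeps retrieving until the follow-up function returns `None`. In a real system the follow-up decision would come from the model rather than a hard-coded rule.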

Core Components and Methodologies

The survey explores the core components that constitute DR agents (Figure 1):

Figure 1: A structural overview of a DR agent in a multi-agent architecture for ease of illustration.

It compares API-based and browser-based search engine integration, highlighting the trade-offs between efficiency and comprehensiveness. The paper also examines tool use capabilities, such as code interpreters, data analytics, and multimodal processing, which enable DR agents to interact with external environments and generate structured insights. The classification of DR agent architectures based on static versus dynamic workflows, planning strategies, and single-agent versus multi-agent configurations is systematically analyzed (Figure 2).

Figure 2: Comparison of DR Workflows: (1) Static vs. Dynamic Workflows: Static workflows rely on predefined task sequences, while dynamic workflows allow LLM-based task planning. (2) Planning Strategies: The three types are planning-only (direct planning without clarifying user intent), intent-to-planning (clarifying intent before planning), and unified intent-planning (generating a plan and requesting user confirmation). (3) Single-Agent vs. Multi-Agent: Dynamic workflows can be categorized into dynamic-multi-agent systems (tasks distributed across specialized agents) or dynamic-single-agent systems (an LRM autonomously updates and executes tasks).
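The three planning strategies in the caption differ mainly in when the user is consulted. The sketch below illustrates that difference under stated assumptions: `draft_plan` is a hypothetical placeholder for LLM-based planning, and `ask_user` is any callable that returns the user's reply. None of these names come from the paper.

```python
from enum import Enum
from typing import Callable, List

class PlanningStrategy(Enum):
    PLANNING_ONLY = "planning-only"
    INTENT_TO_PLANNING = "intent-to-planning"
    UNIFIED_INTENT_PLANNING = "unified intent-planning"

def draft_plan(request: str) -> List[str]:
    """Placeholder for LLM-based planning (hypothetical fixed template)."""
    return [f"search: {request}", f"analyze: {request}", f"report: {request}"]

def plan(request: str,
         strategy: PlanningStrategy,
         ask_user: Callable[[str], str]) -> List[str]:
    """Produce a task plan, consulting the user according to the strategy."""
    if strategy is PlanningStrategy.PLANNING_ONLY:
        # Plan directly from the prompt; the user is never asked anything.
        return draft_plan(request)
    if strategy is PlanningStrategy.INTENT_TO_PLANNING:
        # Clarify intent first, then plan with the clarified request.
        detail = ask_user(f"What should the research on '{request}' focus on?")
        return draft_plan(f"{request} ({detail})")
    # Unified intent-planning: draft a plan, then ask for confirmation.
    steps = draft_plan(request)
    answer = ask_user("Proposed plan: " + "; ".join(steps) + ". Confirm? (yes/no)")
    if answer.strip().lower() == "yes":
        return steps
    # Treat any other reply as feedback and re-draft with it.
    return draft_plan(f"{request} ({answer})")
```

A static workflow would correspond to hard-coding the returned list ahead of time; a dynamic workflow regenerates it at run time, as `draft_plan` does here in a very simplified form.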

The memory mechanisms employed by DR agents for long-context optimization, including context window expansion, intermediate step compression, and external structured storage, are explored. Furthermore, the paper discusses tuning methodologies such as prompt-driven structured generation, LLM-driven prompting, fine-tuning strategies, and reinforcement learning approaches aimed at optimizing agent performance. The survey also touches on non-parametric continual learning techniques that allow DR agents to self-evolve by dynamically adapting external tools and workflows.
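Two of the memory mechanisms mentioned above, intermediate step compression and external storage, can be combined in a few lines. This is a rough sketch, not a method from the paper: the `ResearchMemory` class, the character budget, and the truncation-based `summarize` function (standing in for a model-generated summary) are assumptions made for illustration.

```python
from typing import Dict, List

def summarize(text: str, limit: int = 40) -> str:
    """Stand-in for an LLM summary: keep only the first `limit` characters."""
    return text if len(text) <= limit else text[:limit].rstrip() + "..."

class ResearchMemory:
    """Keeps a size-bounded working context plus a full external store."""

    def __init__(self, max_chars: int = 200) -> None:
        self.max_chars = max_chars
        self.context: List[str] = []      # what would be sent to the model
        self.store: Dict[str, str] = {}   # external storage, never truncated

    def _size(self) -> int:
        return sum(len(entry) for entry in self.context)

    def add(self, key: str, text: str) -> None:
        """Record a finding, then compress older entries if over budget."""
        self.store[key] = text
        self.context.append(f"[{key}] {text}")
        self._compress()

    def _compress(self) -> None:
        # Summarize entries oldest-first until the context fits the budget
        # or every entry has been summarized once.
        i = 0
        while self._size() > self.max_chars and i < len(self.context):
            entry = self.context[i]
            key = entry[1:entry.index("]")]
            self.context[i] = f"[{key}] {summarize(self.store[key])}"
            i += 1

    def recall(self, key: str) -> str:
        """Fetch the full original text from external storage."""
        return self.store[key]
```

The design choice worth noting is that compression is lossy only for the working context: the full text stays retrievable through `recall`, which is the role external structured storage plays in the survey's description.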

Industrial Applications and Benchmarks

The survey provides an overview of several industrial applications of DR agents, including OpenAI Deep Research, Gemini Deep Research, Perplexity Deep Research, Grok DeepSearch, and Microsoft Copilot Researcher and Analyst. It highlights the key advancements and technological implementations of these systems. The benchmarks used for evaluating DR systems are categorized into question-answering and task execution scenarios, with a critical evaluation of their limitations and misalignment with the practical objectives of DR agents.

Challenges and Future Directions

The paper identifies several open challenges and promising directions for future research (Figure 3):

Figure 3: An overview of the evolution of DR agents over the years.

These include expanding retrieval scope beyond traditional methods, enabling asynchronous parallel execution, developing comprehensive multi-modal benchmarks, and optimizing multi-agent architectures for enhanced robustness and efficiency. The importance of addressing these challenges to enable DR agents to function as truly autonomous and adaptable research assistants is emphasized.

Conclusion

The survey concludes by summarizing the advancements in DR agents and highlighting their potential to transform complex research workflows and drive innovation across various domains. It emphasizes the need for continued research to address the identified challenges and unlock the full potential of DR agents.


Explain it Like I'm 14

Overview: What this paper is about

This paper is a guide to a new kind of AI system called “Deep Research (DR) agents.” Think of DR agents as smart, tireless research assistants. They don’t just answer simple questions—they plan multi-step investigations, search the web and databases in real time, use tools (like code, charts, and calculators), keep notes, and then write clear, structured reports. The paper explains how these agents work, compares different designs, reviews how they’re tested, and lays out a roadmap for making them better.

A living list of DR agent projects is here: https://github.com/ai-agents-2030/awesome-deep-research-agent

Goals and questions in simple terms

The authors set out to answer a few practical questions:

What exactly is a Deep Research agent, and how is it different from earlier AI systems?
How do these agents find information (through APIs or by browsing the web like a person)?
What tools can they use (coding, data analysis, images/video)?
How are these agents organized (one big agent vs. a team of specialized agents)?
How are they tested, and what do current tests miss?
What still doesn’t work well, and how can researchers improve these systems?

How the authors approached the topic

Instead of running one big experiment, this paper reviews many recent systems and stitches together what we’ve learned into a clear map of the field. Here’s the approach, explained with everyday analogies:

Two ways to collect information:
- API-based search: Like asking a librarian for specific, well-organized cards. It’s fast and structured but may miss content hidden behind interactive webpages.
- Browser-based search: Like walking the library yourself, opening books, clicking links, and scrolling. It finds richer, dynamic content but is slower and more complex.
Tool use:
- Code interpreter: The agent can run little programs (often Python) to calculate, clean data, or test ideas—like using a calculator and spreadsheet.
- Data analytics: Turning raw text and tables into charts, summaries, and basic statistics—like making a lab report.
- Multimodal: Handling not just text, but images, maps, PDFs, maybe audio/video—like studying with photos and diagrams, not just words.
Standard “plug and play” connections:
- Model Context Protocol (MCP): A common “plug” so different tools connect to agents safely and consistently.
- Agent-to-Agent (A2A): A shared “language” so different agents can talk, share tasks, and collaborate.
Workflows (how the agent plans and acts):
- Static workflow: A fixed recipe (step 1, step 2, step 3). Easy to follow but not flexible.
- Dynamic workflow: The agent adapts the steps as it learns more—like a detective changing the plan based on new clues.
- Planning styles:
  - Planning-only: The agent plans right away based on your prompt.
  - Intent-to-planning: The agent first asks clarifying questions, then plans.
  - Unified intent-planning: The agent drafts a plan and asks you to confirm or adjust it.
- Single agent vs. multi-agent:
  - Single agent: One very capable agent does all tasks.
  - Multi-agent: A manager agent divides work among specialist agents (like a team with roles).
Memory strategies (how agents handle long, messy research):
- Bigger “backpack” (longer context window): Feed more text into the AI at once. Simple but can be expensive.
- Summarizing (compression): Keep the important bits to save space—like concise notes.
- External storage: Save info in databases or files and fetch it later—like a personal filing cabinet or note system.
Training/tuning:
- Beyond prompting: Some systems fine-tune models or use reinforcement learning (RL) to reward good research behavior (accurate retrievals, useful plans, correct answers).
Benchmarks and testing:
- The paper reviews common tests (e.g., question-answering, multi-step tasks) and explains where they fall short (for example, limited access to real-time web info or poor fit with detailed research goals).
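The two ways of collecting information at the top of this list can be combined: try the fast, structured route first and fall back to the slower one only when needed. The sketch below shows that idea with in-memory dictionaries standing in for a search API and a rendered web page. All names and data here are hypothetical; a real agent would call an actual search service and drive an actual browser.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical stand-ins for the two retrieval routes.
API_INDEX = {"arxiv 2506.18096": "Survey of Deep Research agents (API snippet)."}
RENDERED_PAGES = {"live dashboard": "Content rendered after scrolling (browser text)."}

def api_search(query: str) -> Optional[str]:
    """Fast, structured lookup; returns None when the API has no answer."""
    return API_INDEX.get(query)

def browser_search(query: str) -> Optional[str]:
    """Slower route that can see dynamically rendered content."""
    return RENDERED_PAGES.get(query)

Backend = Tuple[str, Callable[[str], Optional[str]]]

def hybrid_retrieve(
    query: str,
    backends: Sequence[Backend] = (("api", api_search), ("browser", browser_search)),
) -> Tuple[str, Optional[str]]:
    """Try each backend in order and report which one produced the result."""
    for name, backend in backends:
        result = backend(query)
        if result is not None:
            return name, result
    return "none", None
```

For example, `hybrid_retrieve("live dashboard")` skips the API route because it has no entry and returns the browser result, tagged with the backend that found it so the source can be cited later.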

What they found and why it matters

Here are the main takeaways, put simply:

DR agents = LLM “thinking” + real-time information + tool use + planning + reporting. They’re designed for deep, multi-step research, not just quick answers.
Two complementary ways to retrieve info:
- APIs are fast and clean but may miss complex, interactive content.
- Browsers reach dynamic, real-world pages but are slower and trickier.
- The best systems often mix both.
Tool use is central:
- Running code, analyzing data, and handling images/PDFs lets agents move from raw facts to meaningful insights.
Clear taxonomy (map of designs):
- Static vs. dynamic workflows
- Three planning styles (plan now, clarify then plan, or plan + confirm)
- Single-agent vs. multi-agent architectures
Memory is a big deal:
- Agents need good note-taking and storage strategies to avoid drowning in text.
Current benchmarks have gaps:
- Many tests don’t allow full web access, assume slow, strictly sequential (one-step-at-a-time) execution, or use metrics that don’t match what real research needs (like report quality or source coverage).
Open challenges and roadmap:
- Improve information access (especially dynamic, real-time content).
- Run steps in parallel where possible (not just one-after-another).
- Build better multimodal tests (text + images + data).
- Make multi-agent coordination more robust and efficient.
- Use shared protocols (MCP, A2A) for smooth tool and agent collaboration.
- Support “non-parametric” continual learning (agents get better by upgrading tools, memory, and workflows without retraining the whole model).
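The roadmap item about running steps in parallel is easy to picture with Python's standard `asyncio` module. The sketch below is only an illustration: `fetch` simulates a slow retrieval with a short sleep, and the source names are made up.

```python
import asyncio
import time
from typing import Callable, List, Tuple

async def fetch(source: str, delay: float = 0.05) -> str:
    """Simulate an I/O-bound retrieval step (no real network access)."""
    await asyncio.sleep(delay)
    return f"result from {source}"

async def run_sequential(sources: List[str]) -> List[str]:
    """One-after-another execution: total time grows with the number of steps."""
    return [await fetch(source) for source in sources]

async def run_parallel(sources: List[str]) -> List[str]:
    """Concurrent execution: independent steps wait at the same time."""
    return list(await asyncio.gather(*(fetch(source) for source in sources)))

def timed(runner: Callable, sources: List[str]) -> Tuple[List[str], float]:
    """Run a coroutine function to completion and measure wall-clock time."""
    start = time.perf_counter()
    results = asyncio.run(runner(sources))
    return results, time.perf_counter() - start
```

Running `timed(run_sequential, sources)` and `timed(run_parallel, sources)` on the same list returns the same results in the same order, since `asyncio.gather` preserves input order, but the parallel version finishes in roughly the time of the slowest single step. This only helps when the steps do not depend on each other's output.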

Why this matters: As DR agents get better, they can help students, scientists, journalists, and analysts do trustworthy, up-to-date research faster and more thoroughly.

What this could mean for the future

If researchers follow this roadmap:

You’ll see AI research assistants that can handle complex, real-world tasks: tracking breaking news, analyzing studies, checking sources, and drafting reliable reports.
Teams of agents will collaborate more smoothly (thanks to A2A) and use tools more easily (thanks to MCP), reducing the time spent setting up custom integrations.
Benchmarks will better match real research work, pushing systems to be more accurate, explainable, and useful.
Agents will become more adaptable without expensive retraining, by upgrading their tools and memory systems.
Overall, this could make high-quality research more accessible—helping people make evidence-based decisions at school, in the lab, and at work.


Practical Applications

Immediate Applications

Below are applications that can be deployed now using the paper’s surveyed capabilities (dynamic planning, hybrid API+browser retrieval, code execution, multimodal processing, MCP/A2A integration, memory stores, and ReAct-style loops), as already demonstrated by systems such as OpenAI DR, Gemini DR, Grok DeepSearch, Perplexity DR, Manus/OpenManus, OWL, AutoGLM Rumination, Agent-R1, Search-R1, ReSearch, and DeepResearcher.

Enterprise market and competitive intelligence reports (sectors: software, e-commerce, media, finance)
- Workflow: Unified intent-planning to clarify goals; hybrid API retrieval (Google/Bing, company registries, news feeds) plus browser for press releases/blogs; memory (vector DB) to avoid redundant reads; code interpreter for charts and KPI tables; MCP connections to CRM/Confluence/Notion for distribution.
- Dependencies: API keys and rate limits; ToS/legal compliance for scraping; human review for high-stakes decisions; provenance/citation requirements.
Automated scientific literature reviews and research digests (sectors: academia, biotech, healthcare)
- Workflow: API retrieval (arXiv, PubMed, Semantic Scholar), browser PDF access, “Reason-in-Documents” compression for long texts, citation tracking, code interpreter for basic meta-analysis and plots; multi-agent division (retriever, summarizer, evidence weaver).
- Dependencies: Paywalls and institutional access; domain expert validation; consistent citation metadata; reproducibility guardrails.
Policy briefs and horizon scanning (sectors: public policy, government, NGOs)
- Workflow: Intent-to-planning for stakeholder alignment; browser-based retrieval of committee reports, white papers, and PDFs; iterative evidence triangulation; report generation with alternative scenarios and uncertainty annotations.
- Dependencies: Timeliness of sources; bias and coverage across regions/languages; clear citation trails; editorial oversight.
Due diligence and compliance monitoring (sectors: finance, legal, enterprise risk)
- Workflow: API calls to corporate filings (e.g., EDGAR), sanctions/PEP lists; browser for investigative reporting; tool-use for entity resolution and risk scoring; memory of prior cases; audit logs for traceability.
- Dependencies: PII handling and data privacy; regulated-use approvals; false positive mitigation; secure MCP connectors to internal KYC tools.
Customer support knowledge base upkeep and drift correction (sectors: software, telecom, fintech)
- Workflow: Scheduled agentic RAG runs over release notes/forums/docs; browser extraction; MCP to CMS for draft updates; human-in-the-loop approval; memory for change diffs.
- Dependencies: Content governance; CMS permissions; rate limits on community platforms; rollback mechanisms.
Consumer product decision assistant (sectors: daily life, e-commerce)
- Workflow: Unified intent-planning to capture user constraints; browser-based retrieval of reviews/spec sheets; price-history analysis via code interpreter; side-by-side comparisons with cited claims.
- Dependencies: Review spam/affiliate bias; ad-heavy pages; timely refresh; clear disclaimers.
Sales/account research packs (sectors: B2B software, services)
- Workflow: MCP connectors to CRM; search APIs plus browser for public signals; report generation (org charts, initiatives, tech stack) with citations; manager-agent loop for quality control.
- Dependencies: Platform ToS (e.g., LinkedIn signals); data privacy; internal policy compliance.
Grant and collaborator scouting (sectors: academia, research offices, non-profits)
- Workflow: Funding APIs and program pages; topic clustering via vector search; knowledge graph of PIs/institutions; email-ready summaries and deadlines.
- Dependencies: Regional disparities in data availability; link rot; eligibility nuances.
Equity research copilot for public companies (sectors: finance)
- Workflow: EDGAR/API retrieval of 10-K/Q, browser for earnings calls and analyst notes; code interpreter for ratio/sensitivity analysis; memory for longitudinal coverage; structured thesis with risk factors.
- Dependencies: Compliance and supervision rules; not investment advice; data licensing.
Standards/RFC and tech proposal synthesizer (sectors: software, networking)
- Workflow: Browser retrieval of RFCs/issue threads; code interpreter to run example snippets/tests; structured FAQs; cross-reference maps.
- Dependencies: Correctness verification; licensing of sample code; fast refresh of evolving standards.
OSINT triage for cyber threat intel (sectors: cybersecurity)
- Workflow: Browser-based exploration of repos, forums, CVE feeds; multi-agent enrichment (IOC extraction, clustering); daily intel briefs with confidence tags.
- Dependencies: False positives; ethics/legalities of forum scraping; need for analyst validation.
Course reading packs and study guides (sectors: education)
- Workflow: Intent-to-planning for learning objectives; retrieval from open courseware/texts; multimodal content inclusion (figures/tables); spaced review schedules stored in memory.
- Dependencies: Copyright and fair use; accessibility; age-appropriate filtering.
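Several workflows above divide work among specialized agents, for example the retriever, summarizer, and evidence weaver roles in the literature-review item. A minimal sketch of that manager-and-specialists pattern follows. The agent functions return placeholder strings instead of calling a model or a search service, and the `manager` function and the state dictionary layout are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]
Agent = Callable[[State], State]

def retriever(state: State) -> State:
    """Hypothetical retriever: fabricates two placeholder document titles."""
    docs = [f"paper about {state['topic']} #{i}" for i in (1, 2)]
    return {**state, "documents": docs}

def summarizer(state: State) -> State:
    """Hypothetical summarizer: one placeholder summary per document."""
    summaries = [f"summary of {doc}" for doc in state["documents"]]
    return {**state, "summaries": summaries}

def evidence_weaver(state: State) -> State:
    """Combine summaries into a numbered report with citation markers."""
    lines = [f"[{i}] {s}" for i, s in enumerate(state["summaries"], start=1)]
    return {**state, "report": "\n".join(lines)}

def manager(topic: str, agents: List[Tuple[str, Agent]]) -> State:
    """Run specialist agents in order and record who handled the task."""
    state: State = {"topic": topic, "trace": []}
    for name, agent in agents:
        state = agent(state)
        state["trace"].append(name)
    return state

PIPELINE: List[Tuple[str, Agent]] = [
    ("retriever", retriever),
    ("summarizer", summarizer),
    ("evidence_weaver", evidence_weaver),
]
```

Calling `manager("protein folding", PIPELINE)` yields a state whose `trace` lists the three roles in order. A dynamic multi-agent system would let the manager choose or reorder agents at run time rather than follow this fixed list.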

Long-Term Applications

The following require further research, scaling, standards, or integration maturity (e.g., asynchronous parallel execution, robust multimodal reasoning, secure computer-use, A2A ecosystems, continual learning, and aligned benchmarks), as identified in the paper’s roadmap and open challenges.

Autonomous “AI Scientist” loops from hypothesis to manuscript (sectors: academia, biotech, materials)
- Workflow: Dynamic multi-agent orchestration (planner, experiment designer, analyst, writer); MCP to ELNs/LIMS and lab robots; continual memory via knowledge graphs; RL-tuned single-agent reasoning within ReAct loops for end-to-end coherence.
- Dependencies: Safe lab integration; experimental design ethics; reproducibility standards; peer-reviewable transparency.
Regulatory-grade clinical evidence synthesis and guideline drafting (sectors: healthcare)
- Workflow: Continuous ingestion from PubMed/clinical trial registries; study appraisal pipelines; multimodal inclusion (figures, scans); uncertainty quantification and audit trails; periodic updates.
- Dependencies: Regulatory approvals; clinician oversight; liability; strong provenance and bias controls.
Autonomous computer-use for enterprise transactions (sectors: operations, procurement, finance)
- Workflow: Headless browser with authenticated sessions; secure MCP connectors to ERP/SaaS; A2A task handoffs; guardrails for action limits and approvals.
- Dependencies: Identity and access management; ToS compliance; robust rollback; SOC2/ISO controls.
Real-time regulatory monitoring and impact simulation (sectors: policy, finance, energy)
- Workflow: Event-driven, asynchronous retrieval; economic or power-system simulators via code interpreter; scenario planning with sensitivity analyses; alerting to stakeholders.
- Dependencies: Reliable real-time feeds; model calibration/validation; false alarm costs.
Cross-organization agent ecosystems via A2A (sectors: research consortia, supply chains, standards bodies)
- Workflow: Agent Cards for discovery, Tasks/Artifacts for collaboration; mixed-vendor agent swarms; provenance-preserving workflows; federated knowledge graphs.
- Dependencies: Interoperability standards; data-sharing agreements; governance and dispute resolution.
Multimodal deep research over video/audio/geospatial/satellite data (sectors: media, defense, agriculture, energy)
- Workflow: Multimodal LRMs; geospatial analytics in code interpreter; hybrid API+browser for sensor metadata; temporal memory for time-series reasoning.
- Dependencies: Compute cost; data licensing and export controls; domain-specific benchmarks.
Personalized lifelong learning researchers (sectors: education, corporate L&D)
- Workflow: Unified intent-planning with learner models; non-parametric continual learning of user preferences and gaps; multimodal learning plans; privacy-preserving on-device caches.
- Dependencies: Consent and privacy; drift detection; explainability.
Enterprise “knowledge OS” with continuous indexing and self-evolving tools (sectors: all)
- Workflow: Hybrid retrieval with non-parametric continual learning; auto-instantiation of MCP servers tailored to tasks (as in Alita); RL-based query refinement; knowledge-graph memory.
- Dependencies: Change management; data governance; cost control; tool supply-chain security.
Verified research outputs with formal evaluation and provenance signatures (sectors: software, legal, academia)
- Workflow: Structured CoT with redaction-safe traces; artifact hashing; deterministic replay; benchmarks aligned with DR goals (depth, coverage, tool-use).
- Dependencies: Community standards; infrastructure for notarization; privacy-preserving explainability.
Smart-city and infrastructure planning agents (sectors: government, energy, transport)
- Workflow: Asynchronous multi-source ingestion (sensors, reports, public forums); simulation models; multi-agent negotiation of trade-offs; public-ready briefs with alternatives.
- Dependencies: Data-sharing MoUs; fairness and community engagement; long-horizon accountability.
End-to-end robo-analysts producing compliant research dossiers (sectors: finance)
- Workflow: Continuous ingestion of filings/news; model-driven theses with stress tests; automated compliance checks; A2A collaboration with audit agents.
- Dependencies: Regulatory approval; rigorous supervision; market abuse safeguards.
ESG and environmental monitoring agents (sectors: energy, manufacturing, retail)
- Workflow: Web plus sensor APIs; claim verification against third-party data; periodic ESG narratives with quantified metrics and evidence links.
- Dependencies: Data quality and greenwashing detection; standard taxonomies; assurance processes.
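The "artifact hashing" idea in the verified-research-outputs item above can be sketched with the standard library. This is an integrity check only, not a cryptographic signature or notarization scheme, and the record layout (`report`, `sources`, `sha256`) is an assumption chosen for illustration.

```python
import hashlib
import json
from typing import Dict, List

def artifact_digest(artifact: Dict[str, object]) -> str:
    """Hash a canonical JSON encoding so equal content gives equal digests."""
    canonical = json.dumps(artifact, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def seal_report(report_text: str, sources: List[str]) -> Dict[str, object]:
    """Attach a SHA-256 digest covering the report text and its sources."""
    record: Dict[str, object] = {"report": report_text, "sources": sorted(sources)}
    return {**record, "sha256": artifact_digest(record)}

def verify(sealed: Dict[str, object]) -> bool:
    """Recompute the digest and compare it with the stored one."""
    record = {key: value for key, value in sealed.items() if key != "sha256"}
    return artifact_digest(record) == sealed["sha256"]
```

If either the report text or the source list changes after sealing, `verify` returns `False`. Proving who produced the artifact would additionally need a real signature scheme and key management, which this sketch does not attempt.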

Notes across applications: Many rely on hybrid API- and browser-based retrieval; code interpreters for analysis; memory via vector databases and knowledge graphs; MCP for tool interoperability; A2A for cross-agent collaboration; dynamic single- or multi-agent workflows; and RL-tuned reasoning. Feasibility hinges on secure tool invocation, provenance, evaluation alignment, compute cost, and human oversight in high-stakes domains.


Authors (12)
