DeepHalluBench: Hallucination Analysis

Updated 6 February 2026
  • DeepHalluBench is a process-centric benchmark that decomposes DRAs’ reasoning to identify both explicit and implicit hallucination errors during planning and summarization.
  • It employs the PIES taxonomy to categorize hallucinations, distinguishing errors such as action deviation, fabrication, and noise domination for precise error localization.
  • Benchmark results reveal significant disparities among DRAs, highlighting the need for early error correction, hybrid retrieval-verification strategies, and debiasing techniques.

DeepHalluBench is a process-centric benchmark for diagnosing, localizing, and quantifying hallucinations in Deep Research Agents (DRAs) by auditing their complete reasoning trajectories. Unlike previous outcome-focused evaluations, DeepHalluBench emphasizes the systemic propagation of errors across planning and summarization, offering fine-grained insights into the reliability and limitations of contemporary agentic research systems (Zhan et al., 30 Jan 2026).

1. The PIES Taxonomy of Hallucinations

DeepHalluBench operationalizes the PIES taxonomy, a two-dimensional categorization schema for hallucinations in DRAs. The first axis distinguishes functional components—Planning (the agent’s internal decomposition of queries into executable steps) and Summarization (the synthesis of observations and claims). The second axis categorizes error properties into Explicit (incorrect or fabricated content) vs. Implicit (neglect or omission of required constraints).

This yields four distinct hallucination types:

  • Explicit Planning (“Action Hallucination”): Manifesting as Action Deviation (contradicting intent), Action Redundancy (unnecessary repetition), and Action Propagation (errors inherited from upstream hallucinations).
  • Implicit Planning (“Restriction Neglect”): Silent omission of user-imposed constraints (e.g., failing to enforce location or date filters).
  • Explicit Summarization (“Claim Hallucination”): Fabrication (unsupported claims) and Misattribution (incorrect citation or evidence mapping).
  • Implicit Summarization (“Noise Domination”): Failing to surface essential evidence in the final output, with summaries dominated by irrelevant or generic content.

PIES is the first taxonomy to explicitly partition hallucinations along these axes, enabling error localization at all phases of a DRA’s research workflow.
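The two-axis structure above can be made concrete as a small lookup. This is an illustrative sketch only; the enum and dictionary names are our own, not identifiers from the benchmark's codebase.

```python
from enum import Enum

class Component(Enum):
    PLANNING = "planning"
    SUMMARIZATION = "summarization"

class ErrorProperty(Enum):
    EXPLICIT = "explicit"    # incorrect or fabricated content
    IMPLICIT = "implicit"    # neglect or omission of required constraints

# The four PIES cells, as named in the taxonomy above.
PIES = {
    (Component.PLANNING, ErrorProperty.EXPLICIT): "Action Hallucination",
    (Component.PLANNING, ErrorProperty.IMPLICIT): "Restriction Neglect",
    (Component.SUMMARIZATION, ErrorProperty.EXPLICIT): "Claim Hallucination",
    (Component.SUMMARIZATION, ErrorProperty.IMPLICIT): "Noise Domination",
}
```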

2. Evaluation Framework and Scoring Methodology

DeepHalluBench introduces an atomic decomposition of the research process for fine-grained auditability:

  • Atomic Sub-Queries: Each user-specified restriction is parsed as an independent clause.
  • Atomic Actions: Each planning step is considered a unitary, verifiable operation.
  • Atomic Claims: Each factual assertion in the summary, accompanied by its evidential scope.
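The three atomic units can be sketched as simple records. The field names below are assumptions for illustration; the paper specifies only what each unit represents, not a schema.

```python
from dataclasses import dataclass, field

@dataclass
class AtomicSubQuery:
    # One user-imposed restriction, parsed as an independent clause.
    clause: str

@dataclass
class AtomicAction:
    # One unitary, verifiable planning operation.
    description: str

@dataclass
class AtomicClaim:
    # One factual assertion in the summary, with its evidential scope
    # (e.g. the sources it cites).
    statement: str
    evidence: list = field(default_factory=list)
```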

Hallucination metrics are formally defined for each PIES axis:

  • Explicit Summarization (H_{es}): Fraction of atomic claims classified as fabrication or misattribution:

    H_{es} = \frac{|C_x| + |C_m|}{|C_{total}|}

  • Implicit Summarization (H_{is}): Penalty-weighted ratio for ignored, semantically clustered evidence:

    H_{is} = \frac{\sum P(c)}{P_{worst}}

  • Explicit Planning (H_{ep}): Frequency of action deviation, redundancy, and propagation:

    H_{ep} = \frac{|A_{dev}| + |A_{red}| + |A_{prop}|}{|A_{total}|}

  • Implicit Planning (H_{ip}): Fraction of atomic sub-queries not addressed by any plan step:

    H_{ip} = \frac{|Q_{total} \setminus Q_{executed}|}{|Q_{total}|}

  • Composite Score (H): Arithmetic mean of the four axes:

    H = \frac{H_{es} + H_{is} + H_{ep} + H_{ip}}{4}
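The four axis scores and their composite translate directly into code. A minimal sketch, assuming the counts and cluster penalties have already been produced by the audit stage:

```python
def h_es(n_fabricated, n_misattributed, n_claims):
    """Explicit summarization: fraction of hallucinated atomic claims."""
    return (n_fabricated + n_misattributed) / n_claims

def h_is(cluster_penalties, worst_penalty):
    """Implicit summarization: penalty-weighted ratio for ignored evidence clusters."""
    return sum(cluster_penalties) / worst_penalty

def h_ep(n_dev, n_red, n_prop, n_actions):
    """Explicit planning: rate of action deviation, redundancy, and propagation."""
    return (n_dev + n_red + n_prop) / n_actions

def h_ip(sub_queries, executed):
    """Implicit planning: fraction of sub-queries no plan step addresses."""
    return len(set(sub_queries) - set(executed)) / len(set(sub_queries))

def composite(es, is_, ep, ip):
    """Composite hallucination score H: arithmetic mean of the four axes."""
    return (es + is_ + ep + ip) / 4
```

Lower is better on every axis; an agent that addresses all sub-queries and fabricates nothing scores 0 on H_{ip} and H_{es}.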

The framework further incorporates adversarial “no-answer” tasks via atomic query perturbations—systematically altering queries until their constraint sets have empty intersections—probing the agent’s capacity for precise task refusal.
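The empty-intersection criterion behind the "no-answer" tasks can be checked mechanically. A sketch, where each constraint is represented by the (hypothetical) set of entities satisfying it:

```python
def is_unsolvable(candidate_sets):
    """A perturbed query is a valid 'no-answer' case when the entities
    satisfying its atomic constraints have an empty intersection."""
    sets = [set(s) for s in candidate_sets]
    common = set.intersection(*sets) if sets else set()
    return len(common) == 0

# Hypothetical example: before perturbation, entity "b" satisfies both
# constraints; after altering one constraint, no entity satisfies all.
solvable = [{"a", "b"}, {"b", "c"}]
perturbed = [{"a", "b"}, {"c", "d"}]
```

An agent with precise task-refusal behavior should decline `perturbed` while still answering `solvable`.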

3. DeepHalluBench Dataset Construction

DeepHalluBench is assembled through a multi-stage curation process:

  • Data is pooled from Mind2Web2 (130 open-ended search queries), ReportEval (100 open-ended research tasks), and BrowseComp (~1,200 close-ended queries).
  • Gemini Deep Research is used to generate agent trajectories, which are scored for hallucination propensities.
  • The 75 most hallucination-prone tasks are retained using these scores, balanced across sources and open/closed task types.
  • An additional 25 adversarial “no-answer” cases are constructed via atomic perturbations, yielding a corpus of 100 challenging research queries.
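The retention step above can be sketched as a capped greedy selection. The per-source cap is an assumption; the source states only that the 75 retained tasks are balanced across sources and open/closed task types.

```python
def select_prone_tasks(tasks, per_source_cap, total=75):
    """Retain the most hallucination-prone tasks, balanced across sources.
    `tasks` is a list of dicts with 'source' and 'score' keys, where a
    higher score means greater hallucination propensity."""
    chosen, counts = [], {}
    for t in sorted(tasks, key=lambda t: t["score"], reverse=True):
        if counts.get(t["source"], 0) < per_source_cap:
            chosen.append(t)
            counts[t["source"]] = counts.get(t["source"], 0) + 1
        if len(chosen) == total:
            break
    return chosen
```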

Domain diversity is enforced:

  • Coverage: 11 domains (Arts, Entertainment, Economy, Science, Health, Politics, Careers, Environment, History, Sports, Lifestyle), with no single area exceeding ∼20% of the total.
  • Selection Bias: High retention for long-tail, structurally vulnerable domains (up to 75% in Geography/Environment), lower for more rigid, evidence-rich domains (e.g., <20% in Economy).
  • Query Diversity: Balanced representation of close-ended, open-ended, and unsolvable queries, calibrated to maximize stress on both planning and summarization subsystems.

4. Benchmarking Results and Error Profiling

Six state-of-the-art DRAs were assessed on DeepHalluBench: OpenAI, Gemini, Perplexity, Qwen, Grok (proprietary), and Salesforce (open-source). Key findings:

| Agent      | Hallucination Score H (lower = better) | Notable Failure Modes |
| ---------- | -------------------------------------- | --------------------- |
| Qwen       | 0.149 | Strong sub-query adherence, low noise |
| OpenAI     | 0.155 | Moderate sub-query neglect, tendency for confident fabrication |
| Gemini     | 0.175 | Propagation failures, overconfident answer bias |
| Salesforce | 0.185 | Over-conservatism, misattribution dominance, late-stage collapse |
| Perplexity | 0.208 | High noise, moderate sub-query neglect |
| Grok       | 0.378 | Severe planning opacity, confident fabrication, high noise |

Explicit summarization errors (fabrication or misattribution) remain nonzero across all proprietary systems (fabrication ≈ 0.15 for OpenAI and Grok), while Salesforce’s “misattribution illusion” exceeds 0.20. Implicit summarization (noise domination) is elevated in Grok and Perplexity (≈ 0.33), with Qwen and Salesforce scoring best, though the latter does so at the cost of reduced recall. Explicit planning errors are generally infrequent (<5%), but error propagation in Gemini and action redundancy in OpenAI are notable. Qwen demonstrates the lowest implicit planning errors (≈ 0.11); Salesforce, Gemini, and Perplexity underperform (0.18–0.30).

Close-ended settings amplify hallucination visibility, with rigid constraints surfacing planning failures and summarization gaps. Agents diverge in rejection strategies for unsolvable queries: over-confident systems (Gemini, Grok) never reject, forcibly hallucinating; over-conservative systems (Qwen, Salesforce) reject answerable queries but succeed when rejection is warranted. OpenAI and Perplexity display moderate discernment, but no model exhibits robust intent extraction.

Statistical analysis highlights early error propagation in proprietary agents (>57% of root errors in early stages) and late-stage context collapse in Salesforce (>40% of errors at conclusion). Two cognitive biases are prominent: Anchor Effect (fixation on early retrieval, under-utilization of later results) and Homogeneity Bias (preferential use of repetitive clusters, neglect of novel or singleton chunks), the latter correlating with increased noise (r ≈ 0.49, p < 0.001 for Salesforce).

5. Root Causes and Diagnostic Insights

Dominant causes of unreliability in DRAs, as revealed by DeepHalluBench, are:

  • Hallucination Propagation: Initial fabrications in planning or summarization irrevocably contaminate downstream reasoning, yielding irretrievable failures.
  • Cognitive Attention Biases: Both the anchor effect and homogeneity bias blunt the capacity of DRAs to integrate corrective or heterogeneous evidence, elevating both explicit and implicit hallucination rates.

This suggests that architectural scaling or instruction tuning alone is insufficient; fundamental issues arise from process-level error compounding and biased attention distribution over retrieved evidence.

6. Architectural and Algorithmic Recommendations

Principal recommendations derived from DeepHalluBench analyses include:

  • Early-Stage Error Correction: Deploy “hallucination checkpoints” after each reasoning loop. Employ retrieve-then-verify strategies combining NLI and LLMs, with “claim gating” to block downstream planning on unsupported inferences.
  • Long-Context Debiasing: Adjust retrieval attention to emphasize new or singleton evidence. Implement scheduled memory refresh to cyclically review the full document set, countering anchor effect and memory stagnation.
  • Hybrid Retrieval-Verification Pipelines: Leverage coarse-to-fine retrieval (dense embeddings, reranking), followed by NLI-based filtering and LLM adjudication for ambiguous cases. Dynamically penalize noise via rank-based approximations of PworstP_{worst}.
  • Fallback and Query Handling: Calibrate refusal thresholds by sub-query coverage—not global uncertainty—differentiating over-conservatism from valid rejection. Initiate multi-hop re-planning if sub-queries are unsatisfied, rather than defaulting to final refusal.
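The "claim gating" idea from the first recommendation can be sketched as a simple filter. The threshold value and the `verify` interface are assumptions; any retrieve-then-verify scorer (e.g. an NLI model returning an entailment probability in [0, 1]) could back it.

```python
def claim_gate(claims, verify, threshold=0.5):
    """Block downstream planning on unsupported claims ('claim gating').
    `verify(claim)` returns a support score in [0, 1]; claims below the
    threshold are withheld from later reasoning steps."""
    supported, blocked = [], []
    for claim in claims:
        (supported if verify(claim) >= threshold else blocked).append(claim)
    return supported, blocked
```

In a full pipeline, only the `supported` list would flow into the next planning loop, while `blocked` claims trigger re-retrieval or refusal.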

These strategies target the structural weaknesses exposed by DeepHalluBench, aiming to lower error propagation, mitigate attention biases, and optimize the trade-off between coverage and precision in agentic research tasks.

7. Significance and Future Directions

DeepHalluBench marks a paradigm shift from terminal-output evaluations to process-aware, trajectory-level diagnostics for agentic systems. By decomposing research workflows and rigorously dissecting hallucination etiology, it surfaces subtle, compounding error modes fundamental to current DRA architectures. The benchmark establishes new standards for process transparency, error localization, and adversarial robustness, defining essential requirements for the next generation of reliable research agents. Data and code are available at https://github.com/yuhao-zhan/DeepHalluBench (Zhan et al., 30 Jan 2026).
