Tool-Use Hallucinations in LLM Agents
- Tool-use hallucinations are failures where LLM agents fabricate or misuse APIs, resulting in non-existent, misparameterized, or inappropriate tool calls.
- They are systematically categorized into subtypes—tool-selection, tool-usage, solvability, tool-induced myopia, and bypass—with benchmarks measuring metrics like R_NTA and step localization accuracy.
- Mitigation strategies such as reliability alignment, self-verification sampling, and curriculum learning reduce hallucination rates, though they involve trade-offs with task performance.
Tool-use hallucinations are a class of failure modes in LLM-based agents, manifesting when the model improperly invokes, fabricates, or misapplies external APIs or tools that either do not exist, are irrelevant, or are incorrectly parameterized. These errors compromise agent reliability, induce wasteful computation, and can trigger real-world side-effects incongruent with user intent or environmental affordances. Tool-use hallucinations have now been rigorously categorized, benchmarked, and mechanistically analyzed across multiple lines of research, with distinct causal factors and mitigation challenges.
1. Formal Definitions and Taxonomy
Tool-use hallucination encompasses several distinct error subtypes, each formally characterized within the agent workflow:
- Tool-selection hallucination: The LLM selects or calls a tool that is not relevant or appropriate to the prompt, or even fabricates a non-existent tool call. If T denotes the available toolset and y the model output, the hallucination indicator is H(y) = 1 if y calls a tool outside T or fabricates one (Yin et al., 27 Oct 2025, Xu et al., 2024, Healy et al., 8 Jan 2026).
- Tool-usage hallucination: The call references an appropriate tool but with malformed, missing, or fabricated parameters, or invokes the tool at an incorrect time (Xu et al., 2024, Healy et al., 8 Jan 2026).
- Solvability hallucination: The model incorrectly claims a query is solvable with given tools, leading to fabricated tool calls or plans (Zhang et al., 2024).
- Tool-induced myopia (TIM): The model resorts to substituting tool output for genuine multi-step reasoning, even when the tool call is correct, resulting in superficial solution paths lacking logical depth (Bayat et al., 14 Nov 2025).
- Tool-bypass error: The agent answers directly, simulating or inventing results instead of valid tool invocation (Healy et al., 8 Jan 2026).
Hierarchically, tool-use hallucinations are recognized as a subcategory of execution hallucinations within agent architectures, with distinct formalizations for single-step and multi-step (sequential) agent workflows (Lin et al., 23 Sep 2025, Liu et al., 11 Jan 2026).
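The membership-based indicator from the taxonomy above can be made concrete. The following is a minimal Python sketch (all names are illustrative, and the toolset is assumed to map each tool name to its required parameter names) that distinguishes tool-selection hallucinations (fabricated or unavailable tools) from tool-usage hallucinations (valid tool, malformed arguments):

```python
from dataclasses import dataclass
from enum import Enum, auto

class HallucinationType(Enum):
    TOOL_SELECTION = auto()  # fabricated or unavailable tool
    TOOL_USAGE = auto()      # valid tool, malformed/missing arguments
    NONE = auto()

@dataclass
class ToolCall:
    name: str
    args: dict

def classify_call(call: ToolCall, toolset: dict) -> HallucinationType:
    """toolset maps each available tool name to its set of required parameters."""
    if call.name not in toolset:
        # H(y) = 1: the call references a tool outside the available toolset T
        return HallucinationType.TOOL_SELECTION
    required = toolset[call.name]
    if not required <= call.args.keys():
        # right tool, but required parameters are missing or fabricated away
        return HallucinationType.TOOL_USAGE
    return HallucinationType.NONE
```

In practice the usage check would also validate argument types and values against the tool schema; this sketch only covers presence of required parameters.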
2. Benchmarks and Evaluation Methods
Multiple diagnostic benchmarks have been constructed to quantify and analyze tool-use hallucinations:
- SimpleToolHalluBench (Yin et al., 27 Oct 2025): Probes two failure modes—(i) No-Tool-Available (NTA): the toolset is empty, yet the model invokes a tool; (ii) Distractor-Tool (DT): only distractors are available, yet the model calls a distractor or invents the needed tool. Hallucination rates R_NTA and R_DT are reported as the fraction of trials containing any fabricated or inappropriate call.
- ToolBeHonest (ToolBH) (Zhang et al., 2024): Multi-level diagnostic—(1) solvability detection (EM), (2) solution planning (Progress Rate, PR), (3) missing-tool analysis (Matching Score, MS). Error-driving scenarios comprise missing necessary tools (MNT), potential tools (PT), and limited functionality tools (LFT).
- AgentHallu (Liu et al., 11 Jan 2026): Multi-step trajectory tracing; requires attribution of hallucinated steps (step localization accuracy) in complex task chains. Subcategories in Tool-Use: Missing Tool, Incorrect Argument, Parallel Conflict, Unnecessary Tool.
- StableToolBench (Xu et al., 2024): Focused on tool selection vs. usage hallucination, reporting overall and sample-level hallucination rates, benefit–cost utility, and cost ratios.
The current empirical consensus is that tool-use hallucinations are among the hardest agentic errors to detect and attribute, with even state-of-the-art models rarely exceeding 20% step localization accuracy in multi-step workflows (Liu et al., 11 Jan 2026).
| Benchmark | Key Metrics/Tasks | Specialized for |
|---|---|---|
| SimpleToolHalluBench | R_NTA, R_DT | Absent/distractor |
| ToolBeHonest | EM, PR, MS, HallucinationRate | Multi-level/cause |
| AgentHallu | Attribution, G-EVAL | Multi-step agents |
| StableToolBench | Hallucination, Utility, Ratio | Alignment/economics |
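Assuming benchmark trials are recorded with their condition (NTA or DT) and a per-trial hallucination flag, the R_NTA and R_DT metrics reduce to simple proportions; the record schema below is hypothetical, not the benchmarks' actual format:

```python
def hallucination_rates(trials):
    """Compute R_NTA and R_DT from a list of trial records.

    Each trial is a dict with:
      'condition'    -- 'NTA' (no tool available) or 'DT' (distractors only)
      'hallucinated' -- True if the model fabricated or inappropriately invoked a tool
    """
    rates = {}
    for cond in ("NTA", "DT"):
        subset = [t for t in trials if t["condition"] == cond]
        rates[f"R_{cond}"] = (
            sum(t["hallucinated"] for t in subset) / len(subset) if subset else 0.0
        )
    return rates
```

A correct agent behavior in either condition is refusal or clarification, so any tool invocation at all counts toward the rate.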
3. Mechanisms, Triggers, and Causal Factors
Mechanistic analyses and comprehensive surveys have identified convergent root causes:
- Representation Collapse: Reinforcement learning (RL) on reasoning tasks sharpens reasoning subspaces but collapses or distorts the tool-use representation, as measured by layer-wise Centered Kernel Alignment (CKA) drops in tool contexts—out-of-distribution tool inputs yield markedly lower early/mid-layer CKA than comparable math inputs (Yin et al., 27 Oct 2025). Divergences are amplified in late-layer residual streams at hallucination points.
- Over-generalization of Reasoning Heuristics: Obtaining rewards for confident, chain-of-thought reasoning trains models to over-apply “think-then-act” even when no tool is available, leading to systematic hallucination in NTA/DT cases (Yin et al., 27 Oct 2025).
- Shallow Pattern and Documentation Limitations: Imperfect or outdated tool signatures, lack of exposure to diverse tool call patterns, and incomplete internal tool knowledge cause models to hallucinate plausible but non-existent or misparameterized calls (Lin et al., 23 Sep 2025).
- Weak Solvability Awareness: Agents often lack the ability to abstain or detect unsolvable queries under the present toolset, resulting in fabricated plans or calls (Zhang et al., 2024).
- Tool-Induced Myopia: Tool access can short-circuit multi-step reasoning, shifting model errors from local arithmetic to global logical or strategic missteps, a phenomenon quantified as ‘TIM’ (Bayat et al., 14 Nov 2025).
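The representation-collapse analyses above rely on linear Centered Kernel Alignment between layer activations. A minimal numpy implementation of linear CKA for two activation matrices (samples × features), in the standard HSIC-normalized form, is:

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_samples, n_features).

    Returns a similarity in [0, 1]; 1.0 means the representations are
    identical up to isotropic scaling and orthogonal transformation.
    """
    # Center each representation across samples
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # HSIC numerator and per-representation normalizers (Frobenius norms)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return float(hsic / (norm_x * norm_y))
```

Applied layer-wise to activations from in-distribution vs. out-of-distribution tool prompts, a sharp CKA drop at early/mid layers is the signature the collapse analyses look for.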
4. Quantitative Performance and Empirical Insights
Empirical studies reveal characteristic patterns and quantifiable reliability–capability trade-offs:
- Performance-Hallucination Correlation: Reasoning RL elevates task performance and hallucination rates in tandem; e.g., as SynTool reward increases from 0.10 to 0.60, R_NTA rises from 70% to 95%, with hallucination rate scaling approximately linearly with reward (Yin et al., 27 Oct 2025).
- Model/Prompt Variants: Inference-time chain-of-thought and distilled reasoning models consistently inflate hallucination rates relative to direct-answer or base models (e.g., Qwen3-8B with thinking mode on: hallucination rate 56.8% vs. off: 36.2%) (Yin et al., 27 Oct 2025).
- Leaderboard Observations: Proprietary models (Gemini-1.5-Pro, GPT-4o) outperform best open-weight alternatives (Llama-3-70B) on complex benchmark composites, but long-form verbosity impairs open-weight planning accuracy (Zhang et al., 2024).
- Subcategory Difficulty: AgentHallu localization rates for tool-use hallucinations are 11.6% (proprietary) and 6.3% (open-source), substantially lower than for planning or retrieval hallucinations (Liu et al., 11 Jan 2026).
- TIM Effects: Tool-augmented models can gain up to 19.3 pp in answer accuracy while losing 41.5 pp in reasoning win rates, with higher tool call counts correlating to logic/assumption errors (+12 pp and +10 pp, respectively) and a steep drop in algebra/arithmetic blunders (–9 pp) (Bayat et al., 14 Nov 2025).
- Hallucination Mitigation-Capability Trade-off: Direct Preference Optimization (DPO) and prompting reduce hallucinations (e.g., from 90.2% to 55.8%), but at the steep expense of task success (SynTool reward drops from 0.45 to 0.34) (Yin et al., 27 Oct 2025).
5. Detection and Mitigation Strategies
A diversity of architectures and training regimes have been investigated for hallucination control:
- Reliability Alignment (Relign): Expands the agent’s action space to include indecisive moves—ChangeTools, TalkToUser—trained via supervised fine-tuning and DPO to seek clarification or replan instead of hallucinating. This combination lowers overall hallucination rates from 61.5% to 18.8%, shrinks tool usage by 76%, and boosts utility and reliability ratio by 43% (Xu et al., 2024).
- Internal Representation-Based Detection: Real-time, single-pass classifiers over LLM final-layer activations at function name/argument/closure tokens achieve up to 86.4% accuracy for hallucination detection with minimal computational overhead, outperforming multi-pass baselines (Healy et al., 8 Jan 2026).
- Self-Verification Sampling with Dynamics Modelling (DyMo+SVS): LLMs are equipped with an internal environment model that predicts the next state for a candidate tool call; at inference, candidates are scored on predicted success before execution, allowing the agent to refuse ill-posed calls. This approach raises success rates (+20 pts over SFT-only), improves performance on the irrelevance category by 18 pts, and reaches up to 94.5% precision among accepted calls (Guo et al., 3 Jun 2025).
- Curriculum, Contrastive, and Graph Learning: Exposing models to increasingly difficult tool usage, using pairwise reward on correct/incorrect calls, and structuring tool libraries as executable graphs help agents avoid learned “shortcuts” leading to hallucinations (Lin et al., 23 Sep 2025).
- Preference-Reinforcement and Prompting: DPO over solution pairs penalizing tool-only proofs and explicit prompting to treat tools as assistive evidence recover reasoning quality, but may marginally reduce answer accuracy (Bayat et al., 14 Nov 2025).
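A lightweight version of the internal-representation detector can be sketched as a mean-difference linear probe over final-layer activations. The probe form and the choice of which tokens to read activations from are assumptions here, not the published classifier:

```python
import numpy as np

def fit_probe(acts_halluc: np.ndarray, acts_clean: np.ndarray):
    """Fit a mean-difference linear probe on activation vectors.

    acts_halluc / acts_clean: (n_examples, hidden_dim) final-layer activations
    collected at function-name tokens of hallucinated vs. faithful tool calls
    (the extraction point is an assumption for illustration).
    """
    mu_h = acts_halluc.mean(axis=0)
    mu_c = acts_clean.mean(axis=0)
    w = mu_h - mu_c                      # separating direction
    b = -0.5 * (mu_h + mu_c) @ w         # threshold at the midpoint of the means
    return w, b

def is_hallucination(activation: np.ndarray, w: np.ndarray, b: float) -> bool:
    """Single-pass, real-time check: positive score flags a likely hallucination."""
    return float(activation @ w + b) > 0.0
```

Because the probe is a single dot product per token position, it adds negligible latency compared to multi-pass verification baselines, which is the practical appeal of activation-based detection.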
Mitigation strategies consistently face a trade-off: reducing hallucination rates almost always impacts task effectiveness or requires additional model complexity and latency.
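The self-verification gating idea can be illustrated in a few lines: candidate tool calls are scored by a stand-in for the learned dynamics model (here an arbitrary `predict_success` callable), and the agent abstains rather than executing when no candidate clears a confidence threshold. All names and the threshold value are illustrative:

```python
def select_tool_call(candidates, predict_success, threshold=0.5):
    """Gate candidate tool calls through a predicted-success score.

    candidates      -- list of candidate tool-call objects (any representation)
    predict_success -- callable returning an estimated success probability;
                       stands in for the internal environment/dynamics model
    threshold       -- minimum acceptable predicted success (illustrative value)

    Returns the best candidate, or None when the agent should abstain
    (e.g., ask the user for clarification) instead of hallucinating a call.
    """
    scored = sorted(candidates, key=predict_success, reverse=True)
    best = scored[0] if scored else None
    if best is None or predict_success(best) < threshold:
        return None  # refuse ill-posed calls rather than execute them
    return best
```

Note that the abstention branch is exactly what converts a would-be hallucinated execution into a recoverable interaction, at the cost of the extra scoring pass.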
6. Error Analysis, Outstanding Challenges, and Future Directions
Principal error modes are now well-mapped:
- Solvability hallucinations—misclassifying unsolvable queries as solvable—account for over 40% of deep planning errors (Zhang et al., 2024).
- Non-existent tools are generated primarily in missing-tool and potential-tool settings, especially in open-weight models.
- Parameter-level errors (incorrect argument, missing required argument) and bypassing tool invocation entirely are prevalent in real-world deployments (Healy et al., 8 Jan 2026).
- Hallucination hardcases: Attribution accuracy for tool-use hallucinations drops as multi-step trajectories lengthen and for subtle argument misspecification or “wrong execution” errors (Liu et al., 11 Jan 2026).
Key open questions and promising directions include:
- Joint reliability–capability objectives: Integrating abstention calibration and explicit hallucination penalties into reasoning RL and supervised objectives (Yin et al., 27 Oct 2025).
- Mechanistic interpretability: Further analysis of model circuits underlying tool selection and usage, including field-aware pooling of hidden activations and architectural modularization to insulate tool-reasoning subspaces (Healy et al., 8 Jan 2026, Yin et al., 27 Oct 2025).
- Benchmarks for cumulative and cross-modal hallucinations: Unified agent playgrounds measuring tool-use error accumulation and tracing cross-phase POMDP loops (Lin et al., 23 Sep 2025).
- Real-time verifiers and external tool checkers: Formal integration of runtime validation layers, as well as continual self-evolution to adapt to changing tool schemas (Xu et al., 2024, Lin et al., 23 Sep 2025).
7. Implications for Safe Agentic Systems
The proliferation of tool-use hallucinations in LLM-based agents demonstrates the inadequacy of scaling reasoning capabilities in isolation. Tool-enabled gains in outcome accuracy can mask brittle reasoning and unsafe execution, especially in safety-critical or open-ended environments. The current empirical and mechanistic evidence underscores the necessity for jointly aligned objectives, modular agent architectures, and multi-layered detection and gating to ensure agents act only within the boundaries of verifiable, executable, and contextually relevant tools (Yin et al., 27 Oct 2025, Xu et al., 2024, Bayat et al., 14 Nov 2025). Tool-use hallucinations thus remain a central reliability bottleneck for the deployment of robust, transparent, and trustworthy LLM-based agentic systems.