
Exception Taxonomy in Agentic Artifacts

Updated 14 February 2026
  • Exception taxonomy in agentic artifacts is a structured scheme identifying multi-phase failures, including reasoning, planning, and execution errors.
  • The TRAIL framework focuses on phase-based error attribution with detailed annotations, revealing a high prevalence of reasoning errors.
  • The SHIELDA taxonomy maps exceptions to specific artifacts to enable targeted diagnosis and practical recovery strategies.

Agentic artifacts—systems composed of LLMs and external tools executing complex workflows—routinely experience exceptions that span reasoning, planning, coordination, and execution phases. Exception taxonomies provide formal schemes to classify, analyze, and address such failures, enabling systematic debugging, tooling, and empirical evaluation. Two leading efforts, TRAIL and SHIELDA, have introduced detailed taxonomies for exception analysis in agentic workflows, structured around cognitive, architectural, and operational axes (Deshpande et al., 13 May 2025, Zhou et al., 11 Aug 2025).

1. The Motivating Complexity of Agentic Exceptions

Exception analysis in agentic artifacts is uniquely challenging due to their multi-phase, multi-tool, and often multi-agent nature. Unlike traditional software, where exceptions can often be localized to a line of code or explicit stack trace, agentic traces intertwine LLM cognition (reasoning steps, prompt engineering), distributed orchestration (subtasks, memory, context), and system-level operations (tool APIs, external environments). The interplay among these layers introduces new failure modes and necessitates robust, scalable taxonomies to support annotation, diagnosis, and recovery (Deshpande et al., 13 May 2025). A robust taxonomy must capture not only classic execution errors but also reasoning mistakes, planning flaws, and cross-artifact interactions (Zhou et al., 11 Aug 2025).

2. The TRAIL Taxonomy: Hierarchy and Error Classes

TRAIL partitions agentic exceptions into three root categories: Reasoning Errors (R), Planning & Coordination Errors (P), and System Execution Errors (S), collectively covering all observed failure modalities:

E = R \cup P \cup S, \quad R \cap P = R \cap S = P \cap S = \emptyset

1. Reasoning Errors (R): Internal cognitive failures of the LLM, encompassing:

  • Hallucinations (language-only and tool-related)
  • Information Processing (poor retrieval, misinterpretation)
  • Decision Making (incorrect problem identification, tool selection errors)
  • Output Generation (syntactic formatting errors, instruction non-compliance)

2. Planning & Coordination Errors (P): Failures in managing context, workflow, and subtask orchestration, including:

  • Context Handling Failures and Resource Abuse
  • Goal Deviation and Task Orchestration Errors

3. System Execution Errors (S): Failures arising from interaction with the external environment, such as:

  • Configuration Issues (tool definition, environment setup)
  • API and System Issues (HTTP error codes, service failures)
  • Resource Management (exhaustion, timeouts)

Each root is subdivided into fine-grained types; for example, Output Generation errors distinguish invalid structured outputs from instruction non-compliance. See the table below for an overview:

Category                    | Subcategory                | Example
Reasoning (R)               | Tool-Related Hallucination | Fictitious tool output field
Planning & Coordination (P) | Task Orchestration Error   | Patch generated before file read
System Execution (S)        | API Rate Limiting (429)    | Search API overloaded due to polling

These categories are operationalized in the TRAIL dataset, consisting of 148 traces (spanning both single- and multi-agent systems), yielding 841 annotated errors (Deshpande et al., 13 May 2025). Approximately 70% of errors fall into Reasoning, 20% into Planning & Coordination, and 8% into System Execution, with Output Generation (a Reasoning subcategory) contributing 42% of all errors. Notably, less frequent errors such as API/service failures are often high impact despite low prevalence.
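The span-level annotation and prevalence analysis described above can be sketched in a few lines. The error tuples and helper below are purely illustrative (the trace IDs and counts are invented, not drawn from the TRAIL dataset); only the three root codes R, P, and S come from the taxonomy itself.

```python
from collections import Counter

# Hypothetical span-level annotations: (trace_id, root_category, subcategory).
# Root codes follow TRAIL: R (Reasoning), P (Planning & Coordination),
# S (System Execution). The specific rows are made up for illustration.
annotated_errors = [
    ("trace_001", "R", "Output Generation"),
    ("trace_001", "R", "Tool-Related Hallucination"),
    ("trace_002", "P", "Task Orchestration Error"),
    ("trace_003", "S", "API Rate Limiting"),
    ("trace_003", "R", "Output Generation"),
]

def prevalence(errors):
    """Fraction of annotated errors falling under each root category."""
    roots = Counter(root for _, root, _ in errors)
    total = sum(roots.values())
    return {root: count / total for root, count in roots.items()}

print(prevalence(annotated_errors))  # {'R': 0.6, 'P': 0.2, 'S': 0.2}
```

Run over the actual 841 annotated errors, the same tally would reproduce the reported ~70/20/8 split across R, P, and S.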

3. The SHIELDA Taxonomy: Artifact-Centric Exception Enumeration

SHIELDA introduces an orthogonal, artifact-driven taxonomy over 12 agentic workflow components (“artifacts”), each paired with one or more exception types (36 total), and each exception tagged by workflow phase: Reasoning/Planning (RP), Execution (E), or both (RP/E). The formal structure is:

A = \{a_1, \ldots, a_{12}\}, \quad E(a) = \{e_{a,1}, \ldots, e_{a,n_a}\}, \quad \phi: e \mapsto \{\mathrm{RP}, \mathrm{E}, \mathrm{RP/E}\}

with artifacts including Goal, Context, Reasoning, Planning, Memory, Knowledge Base (KB), Model, Tool, Interface, Task Flow, Other Agent, and External System.

Key exception types include:

  • Goal: Ambiguous Goal (underspecified intent), Conflicting Goal
  • Memory: Poisoning, Outdated Memory, Misaligned Recall
  • Model: Token Limit Exceeded, Output Validation Failure, Output Handling Exception
  • Tool: Tool Invocation Exception, Tool Output Exception, Unavailable Tool
  • Other Agent: Communication Exception, Agent Conflict, Role Violation

A fragment of SHIELDA’s taxonomy table:

Artifact    | Exception Type            | Phase
Goal        | Ambiguous Goal            | RP
Model       | Output Validation Failure | E
Tool        | Tool Invocation Exception | E
Other Agent | Agent Conflict            | E

Each type is accompanied by a definition, illustrative example, and identified root cause. For instance, "Memory Poisoning" (RP/E) refers to malicious or misleading entries in the memory store, e.g., adversarially crafted demonstrations that induce unsafe plans, typically arising from unfiltered writes or adversarial input. Exception modalities in SHIELDA span intent ambiguity, context loss, model syntax errors, inter-agent protocol violations, and external attacks (Zhou et al., 11 Aug 2025).
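The artifact-phase matrix lends itself to a simple lookup structure. The sketch below encodes only the fragment shown above (plus Memory Poisoning, whose RP/E tag is stated in the text); the dictionary name and query helper are illustrative, not part of SHIELDA itself.

```python
from enum import Enum

class Phase(Enum):
    RP = "Reasoning/Planning"
    E = "Execution"
    RP_E = "Reasoning/Planning + Execution"

# Partial artifact -> {exception type: phase tag phi(e)} mapping,
# transcribed from the SHIELDA fragment above (5 of the 36 types).
SHIELDA_FRAGMENT = {
    "Goal":        {"Ambiguous Goal": Phase.RP},
    "Memory":      {"Memory Poisoning": Phase.RP_E},
    "Model":       {"Output Validation Failure": Phase.E},
    "Tool":        {"Tool Invocation Exception": Phase.E},
    "Other Agent": {"Agent Conflict": Phase.E},
}

def exceptions_in_phase(phase):
    """All (artifact, exception type) pairs tagged with the given phase."""
    return [(a, e) for a, excs in SHIELDA_FRAGMENT.items()
            for e, p in excs.items() if p is phase]

print(exceptions_in_phase(Phase.E))
```

Phase-scoped queries like this are the point of the artifact-phase matrix: an execution-time monitor can restrict itself to E-tagged exceptions, while a planner-side validator watches the RP-tagged ones.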

4. Comparative Structure, Granularity, and Coverage

TRAIL adopts a phase-oriented hierarchy (Reasoning, Planning, Execution), with subtypes reflecting error process and agent cognition, suited for span-level annotation and empirical prevalence analysis. SHIELDA, in contrast, employs an artifact-phase matrix, enabling attribution of exceptions to specific architectural components and phases. SHIELDA’s granularity (36 types across 12 artifacts) facilitates fine-grained root-cause classification, essential for linking exception handling to concrete workflow state and recovery strategies.

A plausible implication is that TRAIL excels in evaluating model-centric and cognitive error rates, especially in open-domain reasoning, while SHIELDA is tailored for cross-phase diagnostics and artifact-scoped handling patterns—each serving distinct but complementary purposes in end-to-end agentic workflow evaluation.

5. Statistical Prevalence and Empirical Impact

Within TRAIL’s 148 real-world traces (GAIA and SWE-Bench), the dataset annotates 841 errors (average 5.68 per trace, median 5), with high inter-annotator agreement at the span level (95%). Reasoning dominates error classes, but low-prevalence categories (API failures, resource exhaustion) exhibit disproportionate impact and thus are critical detection targets (Deshpande et al., 13 May 2025). SHIELDA does not present population statistics but provides artifact-by-artifact exception mapping, supporting targeted handling and escalation strategies (Zhou et al., 11 Aug 2025).

Impact analysis in TRAIL reveals that, although Output Generation errors are most common, approximately 44% are low impact; conversely, rare categories such as authentication or resource outages are often high impact—motivating both frequency- and severity-aware evaluation frameworks.
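A severity-aware evaluation of the kind motivated here can be sketched as a weighted count, where rare-but-critical categories carry larger weights. The weights below are invented for illustration; neither paper prescribes specific values.

```python
# Hypothetical severity weights: frequent-but-often-benign classes get
# small weights, rare-but-critical classes get large ones.
SEVERITY = {
    "Output Generation": 1.0,
    "Task Orchestration Error": 2.0,
    "Resource Exhaustion": 4.0,
    "API/Service Failure": 5.0,
}

def weighted_error_score(error_counts):
    """Combine per-category frequency with an assumed severity weight."""
    return sum(SEVERITY.get(cat, 1.0) * n for cat, n in error_counts.items())

# A frequent-but-benign profile and a rare-but-severe profile can score
# the same, which pure frequency counting would miss:
print(weighted_error_score({"Output Generation": 10}))   # 10.0
print(weighted_error_score({"API/Service Failure": 2}))  # 10.0
```

This is the simplest possible severity model; a fuller treatment would weight individual annotated spans by their labeled impact rather than by category.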

6. Applicability Across Domains and Systems

Both taxonomies demonstrate extensibility across workflow settings:

  • Single-Agent Systems (e.g., SWE-Bench): System Execution and Output Generation errors dominate, particularly in software engineering and code patching tasks.
  • Multi-Agent Systems (e.g., GAIA): Planning & Coordination failures (task orchestration, context handling) are frequent—often as agents hand off or synchronize subtasks.
  • Tool-Augmented Reasoning: Hallucinations may arise as either language-only or tied to external tool invocation; API and protocol issues surface in tool interactions.

Domains empirically covered include software engineering (code editing, patch creation) and open-world information retrieval (web search, document inspection). Every exception root and artifact appears in both single- and multi-agent settings; only relative frequencies differ (Deshpande et al., 13 May 2025, Zhou et al., 11 Aug 2025).

7. Formal Representation, Taxonomy Tables, and Figures

TRAIL formalizes its hierarchy as:

  • R = \{R_{1a}, R_{1b}, R_{2a}, R_{2b}, R_{3a}, R_{3b}, R_{4a}, R_{4b}\}
  • P = \{P_{1a}, P_{1b}, P_{2a}, P_{2b}\}
  • S = \{S_{1a}, S_{1b}, S_{2a}, S_{2b}, S_{2c}, S_{2d}, S_{3a}, S_{3b}\}

SHIELDA's structure is succinctly captured as a mapping A \to E(a), with each e \in E(a) tagged with phase \phi(e). Table representations in both frameworks organize artifacts and exceptions along with phase labels, supporting transparent annotation and downstream handling logic (Deshpande et al., 13 May 2025, Zhou et al., 11 Aug 2025). TRAIL's Figure 1 and SHIELDA's Table 1 provide canonical visualizations of these structures.
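The partition property from Section 2 can be checked mechanically against these subtype sets; the set literals below simply transcribe the three bullets above.

```python
# TRAIL subtype sets, transcribed from the formal hierarchy above.
R = {"R1a", "R1b", "R2a", "R2b", "R3a", "R3b", "R4a", "R4b"}
P = {"P1a", "P1b", "P2a", "P2b"}
S = {"S1a", "S1b", "S2a", "S2b", "S2c", "S2d", "S3a", "S3b"}
E = R | P | S

# The roots are pairwise disjoint and jointly exhaustive over E,
# i.e. E = R ∪ P ∪ S with R ∩ P = R ∩ S = P ∩ S = ∅.
assert R & P == R & S == P & S == set()
assert len(E) == len(R) + len(P) + len(S)

print(len(E))  # 20 fine-grained subtypes in total
```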


Exception taxonomies in LLM-driven agentic artifacts enable multi-granular failure localization, facilitate reproducible trace annotation, and underpin structured exception handling and recovery strategies. The TRAIL and SHIELDA frameworks together define the current state-of-the-art for exception analysis in this domain (Deshpande et al., 13 May 2025, Zhou et al., 11 Aug 2025).
