Automated Failure Analysis & Taxonomy
- Automated failure analysis and taxonomy is the systematic, algorithm-driven investigation and classification of system failures using techniques like specification mining and statistical inference.
- It integrates model-based reasoning, machine learning, natural language processing, and multimodal embedding alignment to efficiently detect and elucidate failure patterns.
- This approach facilitates the creation of detailed fault taxonomies that enhance diagnostics, optimize test coverage, and improve reliability as evidenced by high precision and recall in industrial studies.
Automated failure analysis and taxonomy refer to the systematic, algorithm-driven investigation and classification of failures in complex engineered and computational systems. This domain integrates model-driven reasoning, statistical inference, machine learning, specification mining, and natural language processing to both identify failure events and structure them into actionable taxonomies—thus enhancing diagnostics, test coverage, reliability engineering, and safety certification across diverse settings.
1. Foundations and Core Methodologies
Automated failure analysis leverages model-based, data-driven, and logical techniques to reduce the human effort required for root cause determination and classification. Core methodologies include:
- Specification Mining and Trace Analysis: Extraction of invariants—typically in expressive logics such as Signal Temporal Logic (STL)—from passing executions, followed by violation mining on failing traces. CPSDebug exemplifies this, employing STL-based mining and clustering to generate failure explanations in Simulink/Stateflow cyber-physical system (CPS) models (Bartocci et al., 2019).
- Probabilistic and Statistical Reasoning: Utilization of probabilistic models (e.g., Hidden Markov Models—HMMs) and information-theoretic criteria (e.g., KL-divergence suspiciousness metrics) to rank fault candidates. For example, in model checking, transitions are prioritized according to the Kullback-Leibler divergence between empirical error-trace distributions (for failures) and correct-trace distributions (for normal operations), while HMMs are used to compute confidence degrees in component-level simulation models (Ge et al., 2016).
- Machine-Learned Taxonomies and Fault Trees: Pipeline approaches that construct explainable fault trees from sensor data, discretizing continuous signals (e.g., via C4.5-style thresholding) and learning Boolean relationships and A-Nodes using greedy search and correlation metrics (e.g., ϕ-coefficient in LIFT) (Verkuil et al., 2022).
- Natural Language and Log-based Methods: Text and log abstraction, embedding, clustering, and regex/rule-based scoring for both grouping and categorizing log-derived failure causes (Abbas et al., 2023, Gao et al., 2024, Aïdasso, 16 Mar 2025). Large-scale log-based systems such as NCChecker parse industrial logs into structured events, score them against lookup tables normalized for class-imbalance, and provide interpretable predictions.
- Multimodal and Embedding Alignment: Recent frameworks align numerical time series (e.g., telemetry) with LLM latent spaces via cross-attention and semantic compression, enabling language-model-driven diagnostic reasoning over cloud or hybrid failure datasets (Park, 8 Jan 2026).
- Workflow and Iteration Analysis: Agentic and collaborative architectures for iterative error diagnosis—combining executor and expert agents for strategic oversight and course correction, informed by empirical taxonomies derived from large human-annotated datasets (Liu et al., 17 Sep 2025, Ma et al., 28 Sep 2025).
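The specification-mining idea in the first bullet can be sketched in a few lines: learn per-signal interval invariants from passing traces, then rank signals by how often a failing trace violates them. This is a deliberately minimal stand-in for tools like CPSDebug, which mine far richer STL formulas; the signal names, data shapes, and the simple min/max invariant form below are illustrative assumptions.

```python
# Minimal specification-mining sketch: learn per-signal interval
# invariants from passing traces, then rank signals by how often a
# failing trace violates them. Toy stand-in for STL mining (CPSDebug).

def mine_invariants(passing_traces):
    """passing_traces: list of {signal_name: [values]} dicts.
    Returns sig -> (lo, hi), the interval all passing runs satisfy."""
    inv = {}
    for trace in passing_traces:
        for sig, values in trace.items():
            lo, hi = min(values), max(values)
            if sig in inv:
                inv[sig] = (min(inv[sig][0], lo), max(inv[sig][1], hi))
            else:
                inv[sig] = (lo, hi)
    return inv

def rank_violations(invariants, failing_trace):
    """Order signals by the fraction of failing-run samples that leave
    the mined interval (the 'exclusive to failure' evidence)."""
    scores = {}
    for sig, values in failing_trace.items():
        lo, hi = invariants[sig]
        out = sum(1 for v in values if v < lo or v > hi)
        scores[sig] = out / len(values)
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Hypothetical telemetry: two passing runs, one failing run.
passing = [{"temp": [20, 22, 21], "rpm": [1000, 1100, 1050]},
           {"temp": [19, 23, 22], "rpm": [980, 1120, 1060]}]
failing = {"temp": [21, 22, 20], "rpm": [1500, 1600, 1010]}

inv = mine_invariants(passing)
ranking = rank_violations(inv, failing)
# 'rpm' is flagged first: most of its failing-run samples leave [980, 1120]
```

Real pipelines additionally cluster the violated invariants and map them back to model blocks; the interval form here is just the simplest invariant class that makes the pass/fail contrast concrete.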
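The KL-divergence suspiciousness metric from the statistical-reasoning bullet can likewise be sketched: estimate empirical transition distributions over error traces and correct traces, then score each transition by its pointwise contribution to the divergence, so transitions over-represented in failing runs rank highest. The exact formulation in Ge et al. (2016) differs in detail; this illustrates the general idea with invented state names.

```python
import math
from collections import Counter

def transition_dist(traces):
    """Empirical distribution over (state, next_state) transitions."""
    counts = Counter(t for trace in traces for t in zip(trace, trace[1:]))
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def suspiciousness(error_traces, correct_traces, eps=1e-9):
    """Per-transition contribution to KL(P_error || P_correct); eps
    smooths transitions never seen in correct runs."""
    p_err = transition_dist(error_traces)
    p_ok = transition_dist(correct_traces)
    scores = {t: p * math.log(p / p_ok.get(t, eps)) for t, p in p_err.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Hypothetical state sequences from model-checking runs.
errors = [["init", "lock", "fault"], ["init", "lock", "fault"]]
correct = [["init", "lock", "run"], ["init", "lock", "run"]]
ranked = suspiciousness(errors, correct)
# ("lock", "fault") tops the ranking: it never occurs in correct traces
```

Shared prefixes such as ("init", "lock") score near zero because they appear with similar frequency in both trace sets, which is exactly the behavior a suspiciousness metric should have.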
2. Taxonomy Construction and Structures
Failure taxonomies in automated analysis serve two purposes: guiding automated classification and structuring domain knowledge for testing and reporting. Principled taxonomy schemes are in use across several domains:
- Hierarchical CPS Fault Classes: Faults are grouped into sensor-layer (offset, drift, noise), actuator-layer (stuck-at, delay, saturation), logic/control (stateflow guard errors, missing transitions), and modeling/parameter (wrong initializations, unit mismatches). These classes support both fine-grained (block-level) and coarse-grained (fault class) evaluation (Bartocci et al., 2019).
- GMF Ontology for AI Failures: The “Goals–Methods–Failure-Causes” (GMF) structure provides a three-layer directed ontology—goal G, method M, failure cause F—annotated per incident and augmented with binary relations R_{GM}, R_{MF} (Pittaras et al., 2022).
- Flaky Job Failure Categories: In CI/CD environments, a stable 46-class taxonomy has been developed, covering timeout errors, network glitches, disk space failures, container errors, flaky assertions, and more, all mapped with regexes for automated log labeling (Aïdasso, 16 Mar 2025).
- LLM Application Failures: A 15-mode system-level taxonomy divides hidden LLM failures into Reasoning, Input & Context, and System & Operational layers (e.g., hallucinations, multi-step collapse, tool/API breakdown, version drift, cost-driven collapse) (Vinay, 25 Nov 2025).
- Agentic System Root Causes: For orchestrated LLM-based agent systems, failures are classified at agent (tool planning, prompt design), workflow (validation, dependencies, deadlocks), and platform (resource, service) levels (Ma et al., 28 Sep 2025).
- Postmortem and System Taxonomies from Open-Source News: Structured extraction of postmortem factors (timeline, impacted org, cause, fix, impact), Avizienis et al.’s fault axes (phase, boundary, nature, etc.), and system characteristics (IT/CPS layers, domain, consequence) enables aggregate analysis across thousands of public failures (Anandayuvaraj et al., 2024).
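The regex-mapped labeling used for the flaky-job taxonomy above reduces to a category-to-pattern lookup applied in order. The three categories and patterns below are invented placeholders in the spirit of that scheme, not entries from the actual 46-class taxonomy.

```python
import re

# Hypothetical category -> pattern table; the real taxonomy
# (Aïdasso, 16 Mar 2025) maps 46 curated classes this way.
CATEGORIES = [
    ("TestTimeout",   re.compile(r"timed? ?out after \d+", re.I)),
    ("NetworkGlitch", re.compile(r"connection (reset|refused)|dns", re.I)),
    ("DiskFull",      re.compile(r"no space left on device", re.I)),
]

def label(log_text):
    """Return the first matching failure category, else 'Unclassified'."""
    for name, pattern in CATEGORIES:
        if pattern.search(log_text):
            return name
    return "Unclassified"

label("ERROR: job timed out after 3600s")        # -> "TestTimeout"
label("write failed: No space left on device")   # -> "DiskFull"
label("segfault in module X")                    # -> "Unclassified"
```

Because the table is ordered, more specific patterns should precede generic ones; unclassified residue is what drives taxonomy refinement over time.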
The following table aggregates representative taxonomy structures across major application areas:
| Application Domain | Taxonomy Structure | Key Reference |
|---|---|---|
| CPS/Simulink/Stateflow | Sensor/Actuator/Logic/Modeling fault classes | (Bartocci et al., 2019) |
| Safety-Instrumented Sys. | Atomic input deviation → SIF-level DU/ST | (Jahanian et al., 2020) |
| Software Test Logs | Bug, Environment, Test, Third-party issue | (Gao et al., 2024) |
| CI/CD Flaky Jobs | 46 regex-defined error categories | (Aïdasso, 16 Mar 2025) |
| AI/ML/LLM Applications | Reasoning/Input/System (15 hidden modes) | (Vinay, 25 Nov 2025) |
| Agentic LLM Workflows | Agent/Workflow/Platform root cause | (Ma et al., 28 Sep 2025) |
| News/Postmortems | Postmortem, Fault, and System Characteristic | (Anandayuvaraj et al., 2024) |
3. Automated Pipelines and Evaluation
State-of-the-art automated failure analysis operates as a sequence of well-defined algorithmic stages:
- Input/Log Processing: Massive log or trace data is abstracted, cleaned, or segmented (NCChecker's log event abstraction; BERT/FastText/TF-IDF vectorization for LogGrouper).
- Feature Extraction & Invariant Mining: Invariants or predicates (e.g., STL formulas, Boolean features from C4.5 thresholding) are mined from nominal data.
- Violation/Pattern Matching: Candidate anomalies are ranked by exclusivity to failing instances, correlation with test outcomes, or alignment with known error signatures.
- Clustering & Grouping: Incidents or logs are clustered (DBSCAN, Agglomerative, Spectral) for taxonomy emergence and workload reduction (Abbas et al., 2023).
- Taxonomy-Guided Classification: Each failure or incident is mapped to its category using lookup tables, learned classifiers, or LLM-based decision routines.
- Propagation & Explanation: Explanatory traces are calculated via back-propagation in model graphs, providing root-cause localization and temporal-fault tracing (Bartocci et al., 2019).
- Retrieval and Multimodal Reasoning: Aligned embeddings connect time-series signatures to historical incident retrieval—in frameworks like TimeRAG (Park, 8 Jan 2026).
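The clustering-and-grouping stage above can be approximated, for illustration, with token-set similarity and a greedy single pass; production pipelines such as LogGrouper use learned embeddings (BERT/FastText/TF-IDF) with DBSCAN or agglomerative clustering instead. The normalization rule and threshold below are assumptions chosen to keep the sketch self-contained.

```python
def tokens(msg):
    """Crude log normalization: lowercase, drop digit-bearing tokens
    so IDs, hosts, and timestamps don't split otherwise-equal logs."""
    return frozenset(w.strip(".,:") for w in msg.lower().split()
                     if not any(c.isdigit() for c in w))

def jaccard(a, b):
    return len(a & b) / len(a | b)

def group_logs(messages, threshold=0.6):
    """Greedy single-pass grouping: attach each log to the first cluster
    whose representative is similar enough, else open a new cluster."""
    clusters = []  # list of (representative_tokens, [messages])
    for msg in messages:
        t = tokens(msg)
        for rep, members in clusters:
            if jaccard(t, rep) >= threshold:
                members.append(msg)
                break
        else:
            clusters.append((t, [msg]))
    return [members for _, members in clusters]

logs = [
    "connection reset by peer on host 10.0.0.1",
    "connection reset by peer on host 10.0.0.7",
    "disk quota exceeded for user build-42",
]
groups = group_logs(logs)
# two groups: the connection-reset pair merges, the disk error stays alone
```

Swapping the token-set similarity for an embedding distance recovers the structure of the real pipelines; the grouping logic itself is unchanged.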
Performance metrics for automated approaches include precision, recall, and F1 for cause-prediction, silhouette/C-H indices for clustering quality, user study metrics, and timing/resource usage for scalability. For example, in industrial deployment, clustering pipelines like LogGrouper achieved normalized silhouette coefficients around 0.49, while domain experts confirmed qualitative utility for root-cause triage (Abbas et al., 2023). NCChecker reports macro F1 of 71.9% and high efficiency (O(1) cause-prediction per event) (Gao et al., 2024).
4. Empirical Findings and Practical Impact
Automated analyses yield actionable insights into system vulnerabilities and failure patterns:
- Coverage and Precision: CPSDebug demonstrated 100% detection of injected faults, with the true fault class ranked top-1 in 82% of runs, precision 0.79, and recall 0.87 (Bartocci et al., 2019). In large-scale log abstractions, tools like FAIL attained F1 scores of 0.90 for news-incident classification and V-measure ≥0.98 for incident clustering (Anandayuvaraj et al., 2024).
- Taxonomy Granularity and Distribution: Targeted analyses reveal pronounced failure-class skew: job failures are dominated by TestTimeout, NetworkGlitch, and DiskFull, while the costliest ("Hot & Costly") clusters concentrate in only a few categories (Aïdasso, 16 Mar 2025).
- Phase-wise Failure Patterns: For automated code repair agents, phase-specific taxonomies (localization, repair, iteration/validation) expose systematic weaknesses, e.g., cognitive deadlocks and mislocalized patches in agentic architectures (Liu et al., 17 Sep 2025).
- Cost and Business Analytics: RFM (recency, frequency, monetary) scoring enables DevOps teams to triage failures by their real business impact, optimizing the response effort (Aïdasso, 16 Mar 2025).
- Iterative and Collaborative Correction: Architectures embedding expert-review agents or progressive taxonomic annotation can lift diagnostic and repair rates substantially (e.g., Expert–Executor models solving 22.2% of previously intractable issues) (Liu et al., 17 Sep 2025).
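The RFM triage mentioned above can be made concrete with a small scoring sketch: each failure category gets a recency, frequency, and monetary signal, and categories rank by their product. The field names, weighting, and per-category figures below are illustrative assumptions, not data from the cited study.

```python
from datetime import date

# Hypothetical per-category failure stats: last occurrence, run count,
# and accumulated compute cost (currency units). Illustrative data.
failures = {
    "TestTimeout":   {"last": date(2025, 3, 14), "count": 120, "cost": 900.0},
    "NetworkGlitch": {"last": date(2025, 3, 1),  "count": 45,  "cost": 200.0},
    "DiskFull":      {"last": date(2025, 2, 10), "count": 5,   "cost": 40.0},
}

def rfm_rank(failures, today=date(2025, 3, 15)):
    """Rank categories 'hot & costly' first: recent, frequent, expensive."""
    def score(stats):
        recency = 1.0 / (1 + (today - stats["last"]).days)  # newer -> higher
        return recency * stats["count"] * stats["cost"]
    return sorted(failures, key=lambda c: -score(failures[c]))

order = rfm_rank(failures)
# TestTimeout ranks first: it is the most recent, most frequent, and costliest
```

A real deployment would normalize the three signals (e.g., quantile scores) before combining them; the product form here simply makes the triage ordering reproducible.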
5. Strengths, Limitations, and Future Directions
Strengths:
- Automation enables root-cause localization and taxonomy-driven reporting at scale, reducing manual engineering labor (CPSDebug, FMR, NCChecker).
- Hierarchical and data-driven taxonomies raise situational awareness, inform test-case design, and prioritize risk mitigation.
Limitations:
- Scalability remains a challenge for invariant mining on large/high-dimensional models (Bartocci et al., 2019).
- Test suite diversity and log/event coverage are critical for high recall and to avoid spurious invariants or false positives (Bartocci et al., 2019).
- Automated diagnosis of multi-agent, LLM-driven workflows hits a performance ceiling: the best taxonomy-guided LLM benchmarks reach only 33.6% root-cause identification accuracy, highlighting the ongoing difficulty of fine-grained tracing in complex, nested systems (Ma et al., 28 Sep 2025).
Directions for Further Advancement:
- Extension of core mining and reasoning techniques to fully incremental or online learning settings.
- Development of formally verified foundational algorithms for certification-critical domains (e.g., FMR backward elimination (Jahanian et al., 2020)).
- Integration with real-time observability, cost tracking, and semantic performance monitoring (as recommended for LLM deployments (Vinay, 25 Nov 2025)).
- Community augmentation and crowd-sourcing for open, dynamic taxonomy expansion in rapidly evolving domains (AIID, open AI safety incidents (Pittaras et al., 2022)).
6. Domain-Specific Variants and Case Studies
Automated failure analysis and taxonomy are instantiated differently across fields:
- CPS and Safety-Critical Systems: Emphasis on invariant mining, fault-propagation rules, Boolean/quantitative cut-set analysis, integration with safety tools, and precise, minimal root-cause set generation (Bartocci et al., 2019, Jahanian et al., 2020).
- Cloud/DevOps: Multi-modal log and metric analysis, RAG-based diagnosis with embedding alignment, cost-driven clustering, and fine-grained label assignment (Gao et al., 2024, Aïdasso, 16 Mar 2025, Park, 8 Jan 2026).
- Software and LLM-centric Workflows: Failure-cause tracing within agentic or multi-agent systems, employing fine-grained failure-phase taxonomies and automated LLM benchmarks, with actionable guidelines for taxonomy-driven CI/CD validation (Ma et al., 28 Sep 2025, Vinay, 25 Nov 2025, Anandayuvaraj et al., 2024).
- Autonomous Vehicles and Adversarial Taxonomies: Factors internal (vehicle mechanics, software, consumables, driver) and external (lighting, weather, VRUs, infrastructure) are hierarchically classified; these drive scenario generation and corner-case coverage metrics (Saffary et al., 2024).
7. Taxonomy-Driven Automation: Prospects and Recommendations
Adoption of explicit, machine-actionable taxonomies enables:
- Systematic numerical evaluation (precision, recall, coverage, cost), facilitating real-time feedback and intervention.
- Automated root-cause triage, with escalating interventions mapped to taxonomy levels (platform, workflow, agent).
- Continuous evolution and refinement of taxonomies through clustering, retrieval augmentation, and expert feedback.
- Alignment of root-cause insight with test-case generation, safety certification, and design-for-reliability principles.
These developments collectively foster a move toward “computational debugging” and automated reliability assessment, rendering failure analysis both more rigorous and scalable across domains (Bartocci et al., 2019, Vinay, 25 Nov 2025, Anandayuvaraj et al., 2024).