- The paper introduces a novel goal-driven taxonomy that categorizes 135 RCA studies based on seven fundamental objectives.
- The paper reviews methods such as heuristic pruning and dimensionality reduction that reduce the computational cost of diagnosis.
- The paper discusses future directions including unified causal graph models and integration of LLMs to enable comprehensive incident management.
A Goal-Driven Survey on Root Cause Analysis
Root Cause Analysis (RCA) has become essential for managing incidents in large-scale cloud services and microservice architectures. Prior RCA surveys, however, typically organize related work by input data type (for example, metrics versus traces), an organization that reflects neither the goal-driven nature of RCA tasks nor the diversity of objectives in the field. This paper instead presents a framework that categorizes 135 RCA papers by the goals they pursue rather than by the data they consume, providing a structured overview of the RCA landscape.
Framework and Taxonomy of RCA Goals
Seven Fundamental Goals
Analyzing the key challenges of RCA yields seven fundamental goals that an ideal RCA system should achieve:
- Multi-dimensional Data Correlation: Fusing diverse telemetry data into a coherent analysis.
- Robustness: Ensuring the system functions effectively with noisy and incomplete data.
- Adaptive Learning: Allowing models to dynamically evolve in response to changing system architectures or workloads.
- Real-time Performance: Prioritizing computational efficiency to enable RCA during live incidents.
- Interpretability: Generating results that are understandable and meaningful to human operators.
- Multi-granularity: Enabling precise fault localization across various levels of abstraction.
- Actionability: Converting diagnostic findings into practical remedial actions.
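The seven goals above can serve as tags in a small index of the literature. The following is a minimal sketch of that idea, not the survey's own tooling: the `Goal` enum mirrors the list, while `index_by_goal` and the paper names `paper-A` and `paper-B` are hypothetical placeholders introduced for illustration.

```python
from collections import defaultdict
from enum import Enum


class Goal(Enum):
    """The seven RCA goals named in the taxonomy."""
    MULTI_DIMENSIONAL_DATA_CORRELATION = "Multi-dimensional Data Correlation"
    ROBUSTNESS = "Robustness"
    ADAPTIVE_LEARNING = "Adaptive Learning"
    REAL_TIME_PERFORMANCE = "Real-time Performance"
    INTERPRETABILITY = "Interpretability"
    MULTI_GRANULARITY = "Multi-granularity"
    ACTIONABILITY = "Actionability"


def index_by_goal(paper_goals):
    """Invert a {paper: set of goals} mapping into {goal: sorted list of papers}."""
    index = defaultdict(list)
    for paper, goals in paper_goals.items():
        for goal in goals:
            index[goal].append(paper)
    return {goal: sorted(papers) for goal, papers in index.items()}


# Placeholder entries (hypothetical, not taken from the survey).
tagged = {
    "paper-A": {Goal.ROBUSTNESS, Goal.INTERPRETABILITY},
    "paper-B": {Goal.INTERPRETABILITY},
}
by_goal = index_by_goal(tagged)
print(by_goal[Goal.INTERPRETABILITY])  # ['paper-A', 'paper-B']
```

Because one paper can pursue several goals, the mapping is many-to-many, which is the main practical difference from a taxonomy keyed on a single input data type.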
Figure 1: The structure of this survey. We first introduce the background, our formal framework, and survey methodology.
Survey Methodology
The survey categorizes and analyzes the literature systematically. Papers published between 2014 and 2025 were selected based on their alignment with the RCA goals. The analysis covers publication trends, research venues, and the scope of RCA research, and it emphasizes what distinguishes a goal-driven taxonomy from earlier ones.
Computational Optimization
- Heuristic Pruning: Methods like MicroHECL and TraceDiag reduce analysis complexity using domain-specific heuristics.
- Dimensionality Reduction: Restricting the analysis to the most indicative metrics or logs, so that less data has to be processed.
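Both ideas can be illustrated with a small sketch. It is not the algorithm of MicroHECL or TraceDiag: the pruning rule (expand only callees that are already flagged anomalous) and the variance-based metric selection are simplifying assumptions, and the service and metric names are hypothetical.

```python
from collections import deque
from statistics import pvariance


def prune_and_walk(call_graph, anomalous, entry):
    """Breadth-first walk from `entry` that follows only edges into anomalous services.

    Returns the visited services that have no anomalous callee, i.e. the
    candidates at the end of an anomaly chain.
    """
    if entry not in anomalous:
        return []
    visited = {entry}
    queue = deque([entry])
    candidates = []
    while queue:
        service = queue.popleft()
        # Heuristic pruning: healthy callees are never expanded.
        next_hops = [s for s in call_graph.get(service, []) if s in anomalous]
        if not next_hops:
            candidates.append(service)
        for s in next_hops:
            if s not in visited:
                visited.add(s)
                queue.append(s)
    return candidates


def top_k_metrics(series_by_metric, k):
    """Keep the k metrics with the highest variance (a crude dimensionality reduction)."""
    ranked = sorted(
        series_by_metric,
        key=lambda m: pvariance(series_by_metric[m]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical call graph: caller -> list of callees.
call_graph = {"gateway": ["cart", "search"], "cart": ["db", "cache"], "search": ["index"]}
anomalous = {"gateway", "cart", "db"}
print(prune_and_walk(call_graph, anomalous, "gateway"))  # ['db']

metrics = {"cpu": [1, 1, 1, 1], "latency": [10, 50, 10, 90], "mem": [5, 6, 5, 6]}
print(top_k_metrics(metrics, 2))  # ['latency', 'mem']
```

In this example the walk never touches `search`, `index`, or `cache`, which is where the saving comes from: the cost depends on the size of the anomalous subgraph rather than on the whole topology.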
Efficient Algorithmic Design
- Low-complexity algorithms designed for fast computation, as demonstrated by ShapleyIQ and Minesweeper.
Architectural Acceleration
- Leveraging parallel and distributed architectures to reduce latency, as seen in systems like TraceContrast and FacGraph.
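As a rough illustration of parallel analysis, the sketch below scores services concurrently with Python's standard `concurrent.futures` module. It does not describe how TraceContrast or FacGraph are built; the scoring function `anomaly_score`, the baselines, and the service names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def anomaly_score(latencies_ms, baseline_ms):
    """Hypothetical score: mean relative latency increase over a baseline."""
    mean = sum(latencies_ms) / len(latencies_ms)
    return (mean - baseline_ms) / baseline_ms


def rank_services_parallel(observations, baselines, max_workers=4):
    """Score every service concurrently and return (name, score) pairs, highest first."""
    names = list(observations)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so scores line up with names.
        scores = list(
            pool.map(lambda n: anomaly_score(observations[n], baselines[n]), names)
        )
    return sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)


# Hypothetical latency samples (ms) and per-service baselines.
observations = {"auth": [100, 100], "db": [300, 300], "cart": [150, 150]}
baselines = {"auth": 100, "db": 100, "cart": 100}
ranking = rank_services_parallel(observations, baselines)
print([name for name, _ in ranking])  # ['db', 'cart', 'auth']
```

Threads help in CPython mainly when each task waits on I/O, such as fetching telemetry from a store. For CPU-bound scoring, a process pool or distributed workers would be the closer analogue of the architectures mentioned above.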
Key Observations
Insights and Findings
- A significant portion of current research still focuses on pinpointing a single root cause rather than constructing comprehensive propagation graphs, highlighting the gap between current practices and the ideal RCA objectives.
- The increasing adoption of LLMs in RCA represents a shift towards achieving interpretability and actionability, advancing from basic fault localization to more meaningful diagnostics.
Figure 2: Procedure of incident management. The procedure starts with incident preparation, which involves techniques such as software testing, canary releases, and disaster recovery simulations that closely mimic real-world conditions.
Discussion and Future Work
Bridging the Gap
To transition from pinpointing root causes to generating comprehensive propagation graphs, significant advancements are needed in three main areas:
- Development of Rich Benchmarking Datasets: Datasets should include complete incident propagation graphs to facilitate model training and evaluation.
- Unified Models for Causal Graph Generation: Models that combine machine learning approaches, such as GNNs and LLMs, to generate coherent propagation paths rather than isolated root-cause labels.
- Integration with the Software Engineering Lifecycle: Establish feedback loops where RCA outputs directly inform code changes, improvements, and architectural decisions.
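The difference between a single root-cause label and a propagation graph can be made concrete with a small sketch. It assumes the graph is given as a list of directed `(cause, effect)` edges; the fault labels are hypothetical and the two helper functions are illustrative, not part of any surveyed system.

```python
def root_causes(edges):
    """Nodes that cause other faults but are not caused by anything in the graph."""
    sources = {src for src, _ in edges}
    targets = {dst for _, dst in edges}
    return sorted(sources - targets)


def propagation_paths(edges, start):
    """Enumerate all simple paths from `start` to nodes with no further effects (DFS)."""
    children = {}
    for src, dst in edges:
        children.setdefault(src, []).append(dst)

    paths = []

    def walk(node, path):
        # Skip nodes already on the path so that cycles cannot loop forever.
        next_nodes = [c for c in children.get(node, []) if c not in path]
        if not next_nodes:
            paths.append(path)
            return
        for child in next_nodes:
            walk(child, path + [child])

    walk(start, [start])
    return paths


# Hypothetical incident: each edge reads "cause -> effect".
edges = [
    ("db: disk full", "db: slow queries"),
    ("db: slow queries", "cart: timeouts"),
    ("db: slow queries", "checkout: timeouts"),
    ("cart: timeouts", "gateway: 5xx"),
]
print(root_causes(edges))  # ['db: disk full']
for path in propagation_paths(edges, "db: disk full"):
    print(" -> ".join(path))
```

A system that only pinpoints the root cause returns the first printed line. A benchmark with complete propagation graphs would also let the enumerated paths be evaluated, which is the kind of dataset the first item above calls for.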
Conclusion
This survey highlights current advances and outlines future directions for RCA research. By advocating a goal-driven perspective, it offers actionable insights for moving RCA practice beyond isolated fault localization towards comprehensive, automated incident management systems.