A Goal-Driven Survey on Root Cause Analysis

Published 22 Oct 2025 in cs.SE and cs.AI | (2510.19593v1)

Abstract: Root Cause Analysis (RCA) is a crucial aspect of incident management in large-scale cloud services. While the term root cause analysis or RCA has been widely used, different studies formulate the task differently. This is because the term "RCA" implicitly covers tasks with distinct underlying goals. For instance, the goal of localizing a faulty service for rapid triage is fundamentally different from identifying a specific functional bug for a definitive fix. However, previous surveys have largely overlooked these goal-based distinctions, conventionally categorizing papers by input data types (e.g., metric-based vs. trace-based methods). This leads to the grouping of works with disparate objectives, thereby obscuring the true progress and gaps in the field. Meanwhile, the typical audience of an RCA survey is either laymen who want to know the goals and big picture of the task or RCA researchers who want to figure out past research under the same task formulation. Thus, an RCA survey that organizes the related papers according to their goals is in high demand. To this end, this paper presents a goal-driven framework that effectively categorizes and integrates 135 papers on RCA in the context of cloud incident management based on their diverse goals, spanning the period from 2014 to 2025. In addition to the goal-driven categorization, it discusses the ultimate goal of all RCA papers as an umbrella covering different RCA formulations. Moreover, the paper discusses open challenges and future directions in RCA.

Summary

  • The paper introduces a novel goal-driven taxonomy that categorizes 135 RCA studies based on seven fundamental objectives.
  • The paper reviews techniques such as heuristic pruning and dimensionality reduction that lower the computational cost of RCA.
  • The paper discusses future directions including unified causal graph models and integration of LLMs to enable comprehensive incident management.

Root Cause Analysis (RCA) has become essential in managing incidents within large-scale cloud services and microservices architectures. However, prior surveys of RCA have traditionally organized related work by input data types, such as metrics versus traces, which inadequately reflects the goal-driven nature of these tasks and the diversity of objectives in the field. This paper presents a framework to categorize 135 papers on RCA, focusing not on input data types but on the inherent goals of these studies, thus providing a structured overview of the RCA landscape.

Framework and Taxonomy of RCA Goals

Seven Fundamental Goals

An analysis of the key challenges of RCA yields seven fundamental goals that an ideal RCA system should achieve:

  1. Multi-dimensional Data Correlation: Fusing diverse telemetry data into a coherent analysis.
  2. Robustness: Ensuring the system functions effectively with noisy and incomplete data.
  3. Adaptive Learning: Allowing models to dynamically evolve in response to changing system architectures or workloads.
  4. Real-time Performance: Prioritizing computational efficiency to enable RCA during live incidents.
  5. Interpretability: Generating results that are understandable and meaningful to human operators.
  6. Multi-granularity: Enabling precise fault localization across various levels of abstraction.
  7. Actionability: Converting diagnostic findings into practical remedial actions.

    Figure 1: The structure of this survey. We first introduce the background, our formal framework, and survey methodology.

Survey Methodology

The survey adopts a comprehensive approach to categorize and analyze the literature. Papers spanning from 2014 to 2025 were selected based on their alignment with RCA goals. This analysis includes publication trends, research venues, and the scope of RCA research, emphasizing the distinctiveness of a goal-driven taxonomy.

Computational Optimization

  • Heuristic Pruning: Methods like MicroHECL and TraceDiag reduce analysis complexity using domain-specific heuristics.
  • Dimensionality Reduction: Techniques are applied to focus on the most indicative metrics or logs, enabling efficient processing.
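
The two ideas above can be illustrated with a minimal sketch. It is not the algorithm of MicroHECL, TraceDiag, or any other surveyed system: the z-score ranking rule, the function names `top_k_metrics` and `pruned_search`, and the toy call graph are all assumptions made for illustration. The first function keeps only the metrics that deviate most from a baseline window, and the second walks a service call graph while skipping every branch that shows no anomaly.

```python
from statistics import mean, pstdev


def top_k_metrics(baseline, incident, k=2):
    """Rank metrics by how far the incident mean deviates from the baseline mean,
    measured in baseline standard deviations, and keep the k largest."""
    scores = {}
    for name, base_values in baseline.items():
        sigma = pstdev(base_values) or 1e-9  # avoid division by zero on flat metrics
        scores[name] = abs(mean(incident[name]) - mean(base_values)) / sigma
    return sorted(scores, key=scores.get, reverse=True)[:k]


def pruned_search(calls, anomalous, entry):
    """Walk the call graph from `entry`, expanding only anomalous downstream services.
    A service with no anomalous downstream dependency is reported as a candidate."""
    candidates, stack, seen = [], [entry], {entry}
    while stack:
        service = stack.pop()
        downstream = [s for s in calls.get(service, []) if s in anomalous]
        if not downstream:
            candidates.append(service)  # the anomaly does not propagate further
        for s in downstream:
            if s not in seen:
                seen.add(s)
                stack.append(s)
    return candidates


# Hypothetical telemetry and topology, for illustration only.
baseline = {"cpu": [10, 12, 11, 9], "latency": [100, 102, 98, 100], "mem": [50, 50, 51, 49]}
incident = {"cpu": [11, 10, 12, 11], "latency": [400, 420, 390, 410], "mem": [50, 51, 50, 49]}
calls = {"gateway": ["cart", "search"], "cart": ["db", "cache"], "search": ["index"]}
anomalous = {"gateway", "cart", "db"}

print(top_k_metrics(baseline, incident, k=1))
print(pruned_search(calls, anomalous, "gateway"))
```

In this toy setup the search never visits `search`, `index`, or `cache`, which is the point of pruning: the cost scales with the anomalous part of the graph rather than with the whole topology.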

Efficient Algorithmic Design

  • Low-complexity algorithms designed for fast computation, as demonstrated by ShapleyIQ and Minesweeper.

Architectural Acceleration

  • Leveraging parallel and distributed architectures to reduce latency, as seen in systems like TraceContrast and FacGraph.
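
As a rough sketch of the general idea, and not of how TraceContrast or FacGraph are actually built, per-service scoring can be fanned out over a worker pool because each service's score depends only on its own data. The scoring rule, the 500 ms threshold, and the function names below are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def anomaly_score(span_durations_ms, threshold_ms=500):
    """Fraction of spans slower than the threshold (a hypothetical scoring rule)."""
    if not span_durations_ms:
        return 0.0
    slow = sum(1 for d in span_durations_ms if d > threshold_ms)
    return slow / len(span_durations_ms)


def score_services_parallel(durations_by_service, max_workers=4):
    """Score each service independently on a worker pool, then rank by score."""
    services = list(durations_by_service)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so scores line up with `services`.
        scores = list(pool.map(anomaly_score, (durations_by_service[s] for s in services)))
    return sorted(zip(services, scores), key=lambda pair: pair[1], reverse=True)


# Hypothetical span durations in milliseconds.
durations = {"cart": [100, 900, 800, 700], "db": [50, 60], "search": [600, 100]}
print(score_services_parallel(durations))
```

A thread pool is used here only to keep the example self-contained. Because the per-service tasks share no state, the same structure maps onto process pools or distributed workers when the scoring step is CPU-bound.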

Key Observations

Insights and Findings

  • A significant portion of current research still focuses on pinpointing a single root cause rather than constructing comprehensive propagation graphs, highlighting the gap between current practices and the ideal RCA objectives.
  • The increasing adoption of LLMs in RCA represents a shift towards achieving interpretability and actionability, advancing from basic fault localization to more meaningful diagnostics.

    Figure 2: Procedure of incident management. It starts with incident preparation, which involves techniques such as software testing, canary releases, and disaster recovery simulations that closely mimic real-world conditions.

Discussion and Future Work

Bridging the Gap

To transition from pinpointing root causes to generating comprehensive propagation graphs, significant advancements are needed in three main areas:

  1. Development of Rich Benchmarking Datasets: Datasets should include complete incident propagation graphs to facilitate model training and evaluation.
  2. Unified Models for Causal Graph Generation: Integrate machine learning approaches, such as GNNs and LLMs, to move towards creating coherent propagation paths.
  3. Integration with the Software Engineering Lifecycle: Establish feedback loops where RCA outputs directly inform code changes, improvements, and architectural decisions.
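
To make the difference between a single root cause and a propagation graph concrete, the sketch below represents an incident as a list of cause-effect edges and derives both outputs from it. The edge format, the function names, and the example incident are assumptions for illustration; the survey does not prescribe a data structure, and the code assumes the graph is acyclic.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an adjacency map from (cause, effect) pairs and find the root nodes,
    that is, the events that no other event in the graph caused."""
    children = defaultdict(list)
    nodes, has_parent = set(), set()
    for cause, effect in edges:
        children[cause].append(effect)
        nodes.update((cause, effect))
        has_parent.add(effect)
    return children, sorted(nodes - has_parent)


def propagation_paths(children, node, path=None):
    """Enumerate every path from `node` to a terminal symptom (acyclic graphs only)."""
    path = (path or []) + [node]
    if not children.get(node):
        return [path]
    paths = []
    for nxt in children[node]:
        paths.extend(propagation_paths(children, nxt, path))
    return paths


# Hypothetical incident, for illustration only.
edges = [
    ("bad config push", "db connection pool exhausted"),
    ("db connection pool exhausted", "cart latency spike"),
    ("db connection pool exhausted", "checkout errors"),
]
children, roots = build_graph(edges)
print(roots)  # what single-root-cause localization would report
for root in roots:
    for p in propagation_paths(children, root):
        print(" -> ".join(p))  # what a propagation graph additionally explains
```

A benchmark dataset that labels only `roots` can evaluate fault localization, but evaluating the paths requires the full edge list to be annotated, which is the gap the first item above points to.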

Conclusion

This survey highlights current advancements and outlines future directions in RCA research. By advocating a goal-driven perspective, it provides actionable insights for advancing RCA practice beyond isolated fault localization towards comprehensive, automated incident management systems.