The Multi-Agent Fault Localization System Based on Monte Carlo Tree Search Approach

Published 30 Jul 2025 in cs.SE | (2507.22800v1)

Abstract: In real-world scenarios, the highly decoupled and flexible nature of microservices poses greater challenges to system reliability. The more frequent occurrence of incidents has created demand for Root Cause Analysis (RCA) methods that enable rapid identification of and recovery from incidents. LLMs provide a new path for quickly locating and recovering from incidents by leveraging their powerful generalization ability combined with expert experience. Current LLM-for-RCA frameworks are based on ideas like ReAct and Chain-of-Thought, but LLM hallucination and the propagating nature of anomalies often lead to incorrect localization results. Moreover, the massive amount of anomalous information generated in large, complex systems presents a huge challenge for the context window length of LLMs. To address these challenges, we propose KnowledgeMind, an innovative LLM multi-agent system based on Monte Carlo Tree Search and a knowledge base reward mechanism for standardized service-by-service reasoning. Compared to State-Of-The-Art (SOTA) LLM-for-RCA methods, our service-by-service exploration approach significantly reduces the burden on the maximum context window length, requiring only one-tenth of its size. Additionally, by incorporating a rule-based real-time reward mechanism, our method effectively mitigates hallucinations during the inference process. Compared to the SOTA LLM-for-RCA framework, our method achieves a 49.29% to 128.35% improvement in root cause localization accuracy.

Summary

The paper presents KnowledgeMind, a multi-agent system for fault localization that combines MCTS with a knowledge-base reward mechanism to reduce LLM hallucinations and context-window load.
The methodology uses a Fault Mining Tree where dedicated agents gather metrics, logs, and traces for precise root cause analysis.
Results show RCA accuracy improvements of 49.29% to 128.35% over SOTA LLM-based RCA methods, underscoring its potential in complex microservice environments.

This essay analyzes the paper "The Multi-Agent Fault Localization System Based on Monte Carlo Tree Search Approach." The paper describes a fault localization system for microservices built on Monte Carlo Tree Search (MCTS), with a particular focus on its multi-agent framework. Through this approach, the authors aim to overcome limitations of existing methods that rely on LLMs for root cause analysis (RCA).

Introduction

The paper addresses the challenge of identifying root causes of failures in complex microservice environments. Current systems, while incorporating LLMs, often suffer from hallucinations due to their probabilistic nature and the vast amounts of contextual data required. The microservice architecture features loosely coupled components that can create intricate dependencies and consequently propagate anomalies extensively throughout the system.

The authors propose "KnowledgeMind," a multi-agent system that uses MCTS to conduct RCA. It differs from standard LLM approaches in two ways: it reasons about one microservice at a time, which reduces the LLM context length required, and it uses a knowledge-base reward mechanism to counter hallucinations.

Framework Overview

KnowledgeMind traverses the microservice architecture service by service to analyze faults.

Figure 1: The Pipeline of KnowledgeMind.

Fault Mining and Multi-Agent System

Central to the framework is the construction of a "Fault Mining Tree" (FMT), in which service dependencies are explored through MCTS. Each node of the FMT represents a service, and the edges encode the interaction paths across the microservice environment. Agents dedicated to metrics, logs, and traces compile the relevant data at each exploration step, which reduces the room for hallucination.
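The tree structure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names FMTNode, expand, and CALLS, as well as the example call graph, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FMTNode:
    """One node of the tree: a single microservice under investigation."""
    service: str
    # repr/compare are disabled so parent <-> child links cannot recurse.
    parent: Optional["FMTNode"] = field(default=None, repr=False, compare=False)
    children: List["FMTNode"] = field(default_factory=list)

    def path(self) -> List[str]:
        """Services visited from the root down to this node."""
        names: List[str] = []
        node: Optional["FMTNode"] = self
        while node is not None:
            names.append(node.service)
            node = node.parent
        return names[::-1]


def expand(node: FMTNode, calls: Dict[str, List[str]]) -> List[FMTNode]:
    """Add one child per downstream service, skipping services already on the path."""
    on_path = set(node.path())
    for callee in calls.get(node.service, []):
        if callee not in on_path:
            node.children.append(FMTNode(callee, parent=node))
    return node.children


# Hypothetical call graph (caller -> callees), for illustration only.
CALLS = {
    "frontend": ["cart", "checkout"],
    "checkout": ["payment", "cart"],
    "cart": ["redis"],
}

root = FMTNode("frontend")
children = expand(root, CALLS)              # cart, checkout
grandchildren = expand(children[1], CALLS)  # payment and cart, under checkout
print([node.path() for node in grandchildren])
```

Each edge follows a caller-to-callee dependency, so a path from the root to any node is one candidate propagation chain for an anomaly.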

The process begins with an "Anomaly Alarm Agent" that signals potential issues, followed by "Metric," "Log," and "Trace Agents" that gather the information needed by the "Verifier Agent." The Verifier Agent collaborates with a "Knowledge Base Agent" to evaluate and validate findings against an expert-curated rule base.
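One way to picture the hand-off between these agents for a single service is the sketch below. It is an assumption-laden toy: the telemetry, the two rules, the thresholds, and all function names are hypothetical, and plain functions stand in for the LLM-backed agents.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Evidence:
    """What the metric, log, and trace agents collected for one service."""
    service: str
    metrics: Dict[str, float]
    logs: List[str]
    slow_callees: List[str]


# Hypothetical telemetry store, keyed by service name.
TELEMETRY = {
    "checkout": {"metrics": {"cpu_util": 0.35}, "logs": [], "slow_callees": ["payment"]},
    "payment": {"metrics": {"cpu_util": 0.97}, "logs": ["GC overhead high"], "slow_callees": []},
}

# Hypothetical expert rules: (fault label, predicate over the evidence).
RULES: List[Tuple[str, Callable[[Evidence], bool]]] = [
    ("cpu_exhaustion", lambda e: e.metrics.get("cpu_util", 0.0) > 0.9),
    ("memory_pressure", lambda e: any("OutOfMemory" in line for line in e.logs)),
]


def gather(service: str) -> Evidence:
    """Stand-in for the metric, log, and trace agents."""
    raw = TELEMETRY[service]
    return Evidence(service, raw["metrics"], raw["logs"], raw["slow_callees"])


def verify(evidence: Evidence) -> dict:
    """Stand-in for the verifier consulting the rule base."""
    matched = [label for label, predicate in RULES if predicate(evidence)]
    if matched:
        verdict = "root_cause_candidate"
    elif evidence.slow_callees:
        verdict = "propagated"  # anomaly is likely inherited from a downstream service
    else:
        verdict = "healthy"
    return {"service": evidence.service, "verdict": verdict, "matched_rules": matched}


for name in ("checkout", "payment"):
    print(verify(gather(name)))
```

Because the verdict comes from explicit rules rather than free-form generation, a conclusion that matches no rule cannot be reported as a root cause, which is the grounding effect the rule base is meant to provide.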

Figure 2: The Construction of Fault Mining Tree.

Monte Carlo Tree Search

Unlike previous LLM approaches that feed all anomaly data to the model at once, MCTS decomposes the task into weighted exploration paths (Figure 3). Exploring one service at a time reduces context length requirements, and the rule-based reward mechanism keeps the analysis grounded in validated knowledge from the "Knowledge Base."

Figure 3: The Procedure of Fault Reasoning Step-By-Step.
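The search loop can be sketched with a generic MCTS over a service call graph. This is not the paper's algorithm: the call graph and scores are hypothetical, the UCT constant of 1.4 is an arbitrary choice, and the usual random simulation step is replaced here by a direct rule-based score for the newly visited service, as an assumption about how a rule-based reward might plug in.

```python
import math
from typing import Dict, List, Optional

# Hypothetical, acyclic call graph (caller -> callees).
CALLS: Dict[str, List[str]] = {
    "frontend": ["cart", "checkout"],
    "checkout": ["payment"],
    "cart": [],
    "payment": [],
}

# Hypothetical stand-in for the knowledge-base reward: how strongly the
# rules flag each service as a root cause, on a 0-to-1 scale.
RULE_SCORE: Dict[str, float] = {"frontend": 0.1, "cart": 0.0, "checkout": 0.3, "payment": 1.0}


class Node:
    def __init__(self, service: str, parent: Optional["Node"] = None) -> None:
        self.service = service
        self.parent = parent
        self.children: List["Node"] = []
        self.untried: List[str] = list(CALLS.get(service, []))
        self.reward = RULE_SCORE.get(service, 0.0)
        self.visits = 0
        self.value = 0.0


def uct(child: Node, c: float = 1.4) -> float:
    """Upper confidence bound: mean reward plus an exploration bonus."""
    exploit = child.value / child.visits
    explore = c * math.sqrt(math.log(child.parent.visits) / child.visits)
    return exploit + explore


def search(root_service: str, iterations: int = 50) -> str:
    root = Node(root_service)
    for _ in range(iterations):
        node = root
        # Selection: descend through fully expanded nodes by UCT.
        while not node.untried and node.children:
            node = max(node.children, key=uct)
        # Expansion: open one unexplored downstream service, if any.
        if node.untried:
            child = Node(node.untried.pop(), parent=node)
            node.children.append(child)
            node = child
        # Evaluation: rule-based score instead of a random rollout.
        reward = node.reward
        # Backpropagation: update statistics up to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Report the explored service with the highest rule-based score.
    best, stack = root, [root]
    while stack:
        current = stack.pop()
        if current.reward > best.reward:
            best = current
        stack.extend(current.children)
    return best.service


print(search("frontend"))
```

Only one service is scored per iteration, so each step needs only that service's evidence. This is the property that keeps the per-step context small in a service-by-service search.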

Results and Evaluation

Empirical assessments on public datasets such as AIOPS 2022 indicate that KnowledgeMind substantially improves RCA accuracy, with gains of 49.29% to 128.35% over state-of-the-art LLM baselines. The existing methods discussed alongside it include mABC and RCAgent.
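A gain above 100% cannot be a difference in percentage points, since accuracy is capped at 100%, so the reported figures are presumably relative improvements. The sketch below shows that arithmetic under this assumption. The baseline accuracy of 0.30 is purely hypothetical and is not taken from the paper.

```python
def relative_improvement(baseline: float, ours: float) -> float:
    """Relative gain of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


def implied_accuracy(baseline: float, improvement_pct: float) -> float:
    """Accuracy implied by a relative gain over a given baseline."""
    return baseline * (1.0 + improvement_pct / 100.0)


# Hypothetical baseline accuracy; NOT a number from the paper.
HYPOTHETICAL_BASELINE = 0.30

for gain in (49.29, 128.35):
    print(gain, implied_accuracy(HYPOTHETICAL_BASELINE, gain))
```

Under this reading, a 128.35% gain means the new accuracy is a little more than double the baseline, whatever the baseline happens to be.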

Figure 4: The Overview of Microservice System.

The design supports both a supervised mode (with case libraries) and an unsupervised, rule-driven mode, which helps it adapt to different deployment environments. The MCTS-guided search also eases the constraints imposed by LLM token limits, which is essential for scaling RCA to large microservice ecosystems.

Implications and Future Work

KnowledgeMind could make RCA more precise and reliable by replacing single-pass LLM reasoning with a guided, verified exploration of the system. Its verification stages make it more computationally intensive, but this cost is offset by higher accuracy and by the ability to operate within limited LLM context windows.

Future directions involve integrating more dynamic adaptability strategies into agents to improve response times and enhance their efficacy in handling evolving system complexities. Furthermore, expanding the knowledge base with real-world case studies can bolster the system’s predictive capabilities.

Conclusion

The "Multi-Agent Fault Localization System Based on Monte Carlo Tree Search Approach" stands out in its systematic dismantling of traditional LLM constraints, heralding a more efficient, reliable, and adaptable system for pinpointing root causes in microservices. As microservice architectures continue to grow in complexity, systems like KnowledgeMind will be vital in maintaining and improving system reliability.
