Context Retrieval & Bug Detection Agents

Updated 29 January 2026
  • Context retrieval and bug detection agents are specialized AI systems that extract semantically relevant code slices using techniques like PDG slicing and embedding to enhance bug localization and repair.
  • They employ modular, agent-based architectures where components such as context slicers, LLM bug checkers, and graph encoders collaborate for precise detection and automated fix suggestions.
  • Empirical evaluations reveal that dual slicing and hybrid retrieval methods substantially improve accuracy and efficiency, underscoring the importance of principled context engineering.

Context retrieval and bug detection agents are specialized AI-driven systems designed to automate and improve the identification, localization, and repair of software bugs. These agents combine information retrieval, machine learning, and program analysis techniques to distill semantically relevant context from large codebases or bug report archives, providing a precise input for downstream tasks such as patch generation, triage, or clarification. Recent research emphasizes agentic workflows (modular, interacting components orchestrated for robust performance and scalability), with theoretical and empirical evidence that principled context engineering is central to the effectiveness of both bug detection and automated repair.

1. Formal Foundations: Context, Slicing, and Program Dependence

The effectiveness of program repair and bug detection is tightly linked to the quality and relevance of the context, i.e., the subset of code analyzed in relation to a candidate bug. The "Katana" system (Sintaha et al., 2022) formalizes this via program slicing. Given a program $P$ with set of statements $N$, a Program Dependence Graph, $\mathrm{PDG} = (N, E_c \cup E_d)$, is constructed, where $E_c$ encodes control dependencies (statements governed by the predicates of others) and $E_d$ encodes data dependencies (statements linked by variable definitions and uses). For a program point $p$ and variable set $V$, the backward slice,

$$\mathit{BSlice}(p,V) = \{\, n \in N \mid \exists\ \text{path } n \to \cdots \to p \text{ in PDG} \,\}$$

systematically captures just those statements potentially affecting $V$ at $p$. Katana introduces "dual slicing," the union of backward slices from the buggy and fixed statements, isolating the minimal context necessary to explain and learn the causality of a bug and its repair.
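The slice definition above amounts to a reverse reachability query on the PDG. The following is a minimal sketch under stated assumptions, not Katana's implementation: statements are identified by hypothetical labels (`s1` to `s6`), control and data edges are merged into one edge list, the slicing criterion is a single statement node (the variable set $V$ is not modelled separately), and the criterion itself is counted as part of its slice.

```python
from collections import defaultdict, deque

def backward_slice(edges, criterion):
    """Return all nodes with a dependence path to `criterion` (inclusive).

    `edges` holds (src, dst) pairs meaning "dst depends on src"; control
    edges (E_c) and data edges (E_d) are treated alike.
    """
    preds = defaultdict(set)
    for src, dst in edges:
        preds[dst].add(src)
    seen = {criterion}
    queue = deque([criterion])
    while queue:  # breadth-first walk against the edge direction
        node = queue.popleft()
        for pred in preds[node]:
            if pred not in seen:
                seen.add(pred)
                queue.append(pred)
    return seen

def dual_slice(buggy_edges, buggy_point, fixed_edges, fixed_point):
    """Union of the backward slices taken in the buggy and fixed versions."""
    return (backward_slice(buggy_edges, buggy_point)
            | backward_slice(fixed_edges, fixed_point))

# Toy program (hypothetical):
#   s1: x = read()    s2: y = 0      s3: if x > 0:
#   s4:     y = x     s5: z = 1      s6: print(y)
# The assumed fix changes s4 to `y = x + z`, adding a data edge s5 -> s4.
BUGGY_EDGES = [("s1", "s3"), ("s1", "s4"), ("s3", "s4"),
               ("s2", "s6"), ("s4", "s6")]
FIXED_EDGES = BUGGY_EDGES + [("s5", "s4")]

slice_s6 = backward_slice(BUGGY_EDGES, "s6")
dual = dual_slice(BUGGY_EDGES, "s4", FIXED_EDGES, "s4")
print(sorted(slice_s6))  # s5 is absent: it cannot affect s6 in the buggy version
print(sorted(dual))      # s5 appears only because the fixed statement uses z
```

In this toy case the dual slice picks up `s5` solely through the fixed version, which illustrates why taking the union of both slices can expose context that the buggy code alone would not.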

Similarly, BugScope applies interprocedural dependency-driven slicing, but learns, from labeled (buggy, non-buggy) example pairs, which program constructs ("seeds") and slice direction (forward or backward) maximize the separation between positive and negative cases. This learned retrieval strategy ensures that the extracted context consistently covers the pattern-defining dependence chains for complex bug classes (Guo et al., 21 Jul 2025).

2. Architectures of Agent-Based Context Retrieval and Bug Detection

Agentic designs are characterized by pipelines of specialized modules ("agents") communicating via shared state and explicit reasoning traces:

| System | Key Agents/Components | Retrieval Mechanism |
|---|---|---|
| Katana | Context slicer, GNN encoder, repair op | Dual program slicing (PDG-based) |
| BugScope | Retrieval strategy learner, LLM bug checker | Pattern-adaptive program slicing |
| CogniGent | Restructuring, retrieval, filtering, hypothesis, supervisor, explorer, observer | IR + call-graph + LLM causal reasoning |
| FixAgent | Context constructor, localizer, repairer, revisitor, crafter | Breadth-first code + documentation harvesting |
| LangGraph+ChromaDB | Multi-node graph, LLM agents | Vector embedding–based RAG from error logs |
| GenLoc | LLM core, code exploration tool suite | Embedding-guided file retrieval |
| AEGIS | Searcher, issue field extractor, code fragment explainer | BM25/embedding + LLM annotation |

Agents perform context distillation (via slicing, embedding-based retrieval, or reformulation), root-cause and hypothesis testing (call-graph traversal), rank-aggregation, and subsequent bug detection/repair (LLM, GNN, or discriminative models). Almost all modern agents employ an explicit or implicit feedback loop enabling self-correction, confirmation, and iterative improvement. FSM-driven workflow regulation, as in AEGIS, further constrains action sequences and supports multi-dimensional feedback integration (Wang et al., 2024).
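A small sketch may make the idea of FSM-regulated agents with shared state and a feedback loop more concrete. Everything here is hypothetical and is not taken from AEGIS or any other cited system: the two toy agents (`retrieve`, `detect`), the state keys, and the transition table `ALLOWED` are illustrative assumptions. The controller only accepts transitions listed in the table, and the detector can send control back to retrieval with a new query when it finds nothing.

```python
# Allowed transitions of the (hypothetical) workflow FSM.
ALLOWED = {
    "retrieve": {"detect"},
    "detect": {"retrieve", "done"},
}

def retrieve(state):
    # Toy retrieval: keep files whose path contains the current query term.
    state["context"] = [f for f in state["files"] if state["query"] in f]
    return "detect"

def detect(state):
    # Toy detector: flag context files that the error log marks as failing.
    state["suspects"] = [f for f in state["context"] if f in state["failing"]]
    if state["suspects"] or not state["fallback_queries"]:
        return "done"
    # Feedback loop: nothing found, so retry retrieval with the next query.
    state["query"] = state["fallback_queries"].pop(0)
    return "retrieve"

AGENTS = {"retrieve": retrieve, "detect": detect}

def run(state, start="retrieve", max_steps=10):
    """Drive the agents, rejecting any transition the FSM does not allow."""
    current, steps = start, 0
    while current != "done":
        if steps >= max_steps:
            raise RuntimeError("step budget exhausted")
        nxt = AGENTS[current](state)
        if nxt not in ALLOWED[current]:
            raise RuntimeError(f"illegal transition {current} -> {nxt}")
        state["trace"].append(f"{current}->{nxt}")  # explicit reasoning trace
        current, steps = nxt, steps + 1
    return state

final = run({
    "files": ["io/reader.py", "parse/lexer.py", "parse/parser.py"],
    "failing": {"parse/parser.py"},
    "query": "reader",
    "fallback_queries": ["parse"],
    "trace": [],
})
print(final["suspects"], final["trace"])
```

Keeping the transition table separate from the agents is one way to bound what an agent may do next, which is the kind of constraint the FSM-driven regulation described above is meant to provide.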

3. Retrieval Techniques: From Program Slicing to Dense Embedding

Retrieval mechanisms in state-of-the-art agents span a spectrum:

A. PDG-Based Program Slicing:

  • Katana uses AST/PDG slicing to extract the union of code lines on which the bug site and the fix site are control- or data-dependent (Sintaha et al., 2022).
  • BugScope learns the slice criterion and direction, driven by example anti-patterns, and applies interprocedural slicing to retrieve the minimal code necessary to trigger and recognize a bug (Guo et al., 21 Jul 2025).
  • The outcome is a drastic reduction in context size (e.g., Katana: mean lines per context 39 → 15) with a corresponding improvement in model accuracy.

B. Embedding and Vector Search:

C. Query Reformulation and Context Engineering:

  • LLM-powered extraction of identifiers, snippets, and stack traces feeds structured queries to BM25, as in the agents of Caumartin et al. (7 Dec 2025) and BLIZZARD (Rahman et al., 2018).
  • CogniGent (Samir et al., 18 Jan 2026) combines BM25 with call-graph influence (PageRank) and LLM-based relevance scoring; its agents manage dynamic windows and causal chains, pruning low-confidence branches.
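As a rough illustration of combining lexical relevance with call-graph influence, the sketch below scores files with Okapi BM25, computes PageRank over a toy call graph by power iteration, and fuses the two with a weighted sum. The fusion rule, the weight `alpha`, the file names, and the token lists are all assumptions made for illustration; the source does not specify how CogniGent combines these signals, and the LLM-based relevance scoring step is omitted.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document for the query terms."""
    n_docs = len(docs)
    avgdl = sum(len(toks) for toks in docs.values()) / n_docs
    df = Counter(term for toks in docs.values() for term in set(toks))
    scores = {}
    for name, toks in docs.items():
        tf = Counter(toks)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[name] = score
    return scores

def pagerank(graph, damping=0.85, iters=50):
    """Power-iteration PageRank; `graph` maps every node to its callees."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iters):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not graph[node])
        new = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for src in nodes:
            for dst in graph[src]:
                new[dst] += damping * rank[src] / len(graph[src])
        rank = new
    return rank

def fuse(bm25, pr, alpha=0.7):
    """Rank files by alpha * BM25 + (1 - alpha) * PageRank (max-normalized)."""
    def normalize(scores):
        top = max(scores.values()) or 1.0
        return {k: v / top for k, v in scores.items()}
    nb, npr = normalize(bm25), normalize(pr)
    return sorted(nb, key=lambda k: alpha * nb[k] + (1 - alpha) * npr[k],
                  reverse=True)

# Hypothetical corpus and call graph (cli.py calls parser.py, which calls lexer.py).
DOCS = {
    "parser.py": "parse token stream raise syntax error".split(),
    "lexer.py": "read chars emit token".split(),
    "cli.py": "parse args call main".split(),
}
CALLS = {"cli.py": ["parser.py"], "parser.py": ["lexer.py"], "lexer.py": []}

lexical = bm25_scores(["parse", "error"], DOCS)
influence = pagerank(CALLS)
ranking = fuse(lexical, influence)
print(ranking)
```

A weighted sum of normalized scores is only one of several possible fusion choices; rank-based schemes such as reciprocal rank fusion would fit the same interface.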

4. Downstream Bug Detection and Repair Workflows

Upon assembling the salient context, agents instantiate detection and repair models:

  • Graph Neural Networks (GNNs): Katana encodes code slices as AST graphs (with specialized value/call links) input to multi-layer GNNs, followed by classification or graph-edit prediction. Dual slicing achieves top-3 accuracy of 41.95%, exceeding previous approaches by up to 3.7× (Sintaha et al., 2022).
  • LLM Few-Shot Reasoning: BugScope synthesizes chain-of-thought prompt templates from exemplars, instantiates them per retrieved slice, and validates each bug candidate via LLM reasoning and lightweight reflection. Precision and recall both exceed 87% (Guo et al., 21 Jul 2025).
  • Causal Reasoning Agents: CogniGent generates hypotheses and explores the call graph depth-first (DFS), emulating human debugging (Click2Cause). Hypotheses are recursively tested, expanded, or pruned, and the Observer agent performs final validation. This delivers MAP improvements of up to 38.6% and MRR gains of 53.7% versus prior agents (Samir et al., 18 Jan 2026).
  • Hybrid Retrieval+Classification: In duplicate bug report detection, retrieval (SBERT) rapidly narrows candidate space, followed by RoBERTa for high-precision pairwise classification, achieving comparable accuracy at <10% of classification-only latency (Meng et al., 2024).
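The hybrid retrieval-plus-classification pattern in the last bullet can be sketched as a two-stage cascade: a cheap similarity search shortlists candidates, and an expensive pairwise classifier runs only on the shortlist. The code below is a structural sketch only. Bag-of-words cosine similarity stands in for SBERT embeddings, a Jaccard-overlap threshold stands in for the RoBERTa classifier, and the bug reports, the threshold, and `k` are invented for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Bag-of-words cosine similarity (a stand-in for embedding similarity)."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def find_duplicates(new_report, archive, classify, k=2):
    """Stage 1: shortlist the k most similar reports. Stage 2: classify them."""
    shortlist = sorted(archive, key=lambda r: cosine(new_report, r),
                       reverse=True)[:k]
    return [r for r in shortlist if classify(new_report, r)]

CLASSIFIER_CALLS = []  # records each expensive call so the saving is visible

def toy_classifier(a, b):
    """Stand-in for a costly pairwise model: Jaccard overlap of at least 0.5."""
    CLASSIFIER_CALLS.append((a, b))
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) >= 0.5

ARCHIVE = [
    "app crashes when opening settings page",
    "login button unresponsive on mobile",
    "crash on opening the settings screen",
    "typo in footer text",
]

duplicates = find_duplicates("app crashes opening settings", ARCHIVE,
                             toy_classifier, k=2)
print(duplicates, len(CLASSIFIER_CALLS))
```

Here the classifier is invoked twice rather than once per archived report, which is the mechanism behind the latency reduction reported for the hybrid approach; the actual savings depend on the models and on the choice of `k`.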

5. Evaluation Metrics and Comparative Performance

Systems are rigorously evaluated through:

| Metric | Definition |
|---|---|
| MAP@k | Mean Average Precision at rank $k$: reward for correct ranking of relevant files |
| MRR | Mean Reciprocal Rank: average inverse rank of first relevant prediction |
| Hit@k | Fraction of bugs with any correct file in top $k$ predictions |
| F1, Precision | Standard in detection: F1 = $2PR/(P+R)$, contextual recall/precision for slices |
| Bug Repair | Top-$k$ exact match of predicted patch to ground-truth fix (Katana) |
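The ranking metrics in the table can be computed per bug and then averaged. The sketch below shows one common formulation; note that normalizations for average precision at $k$ vary between papers, and dividing by $\min(|\text{relevant}|, k)$ is an assumption here rather than the convention of any cited system. The two example rankings are invented.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / i
    return 0.0

def average_precision_at_k(ranked, relevant, k):
    """Mean of precision@i over the ranks i <= k that hold a relevant item."""
    hits, total = 0, 0.0
    for i, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / i
    denom = min(len(relevant), k)  # assumed normalization, see lead-in
    return total / denom if denom else 0.0

def hit_at_k(ranked, relevant, k):
    """True if any relevant item appears in the top k."""
    return any(item in relevant for item in ranked[:k])

def f1(precision, recall):
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# Two hypothetical bugs: (ranked file predictions, ground-truth buggy files).
RUNS = [
    (["a.py", "b.py", "c.py"], {"b.py"}),
    (["x.py", "y.py", "z.py"], {"x.py", "z.py"}),
]

mrr = sum(reciprocal_rank(r, rel) for r, rel in RUNS) / len(RUNS)
map_at_3 = sum(average_precision_at_k(r, rel, 3) for r, rel in RUNS) / len(RUNS)
hit_at_1 = sum(hit_at_k(r, rel, 1) for r, rel in RUNS) / len(RUNS)
print(mrr, map_at_3, hit_at_1)  # 0.75, about 0.667, 0.5
```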

Experimental benchmarks, including Defects4J, SWE-bench, SIR, Long Code Arena, and benchmarks from [Ye et al.], confirm substantial gains of agentic and context-slicing methods (e.g., Katana: top-1 repair rate 28.31% vs. prior SOTA 24.67%; GenLoc: Accuracy@1 43.19% vs. best baseline 26.84%) (Sintaha et al., 2022, Asad et al., 1 Aug 2025).

6. Broader Methodological Implications and Limitations

The empirical consensus across multiple domains is that semantically precise, noise-minimized context retrieval underpins the success of modern bug detection and repair agents. Principled slicing (Katana, BugScope), intelligent embedding and search (CogniGent, GenLoc), and context-aware reformulation (BLIZZARD; Caumartin et al., 7 Dec 2025; Mukherjee et al., 2023) each help balance context size and noise against recall and precision.

Key points:

  • Program slicing guarantees semantic relevance, but incurs PDG construction cost (∼0.5 s per example); embedding-based methods scale better, but may lose structural details (Sintaha et al., 2022, Wang et al., 29 Jan 2025).
  • Learned context strategies (BugScope) systematically outperform naive windows, especially for subtle anti-patterns (e.g., variable misuse, non-local OOB).
  • Limitations include dependency on LLM extraction quality and context window limits, with diminishing returns at high $k$ in IR-based retrieval (Caumartin et al., 7 Dec 2025).
  • Task decoupling and workflow regulation (as in AEGIS, via FSMs) help prevent context drift and uncontrolled agent actions (Wang et al., 2024).

7. Outlook: Future Directions and Research Challenges

Emerging research themes include:

A plausible implication is that as agentic debugging systems mature, the separation of context retrieval and bug reasoning—coupled with adaptive, explainable, and regulated agent workflows—will define both the scalability and quality ceiling of automated program analysis, repair, and triage. Semantically rigorous context engineering is the principal enabler of this next generation of bug detection and repair agents.
