How and Why Agents Can Identify Bug-Introducing Commits

Published 31 Mar 2026 in cs.SE | (2603.29378v1)

Abstract: Śliwerski, Zimmermann, and Zeller (SZZ) just won the 2026 ACM SIGSOFT Impact Award for asking: When do changes induce fixes? Their paper from 2005 served as the foundation for a wide array of approaches aimed at identifying bug-introducing changes (or commits) from fix commits in software repositories. But even after two decades of progress, the best-performing approach from 2025 yields a modest increase of 10 percentage points in F1-score on the most popular Linux kernel dataset. In this paper, we uncover how and why LLM-based agents can substantially advance the state-of-the-art in identifying bug-introducing commits from fix commits. We propose a simple agentic workflow based on searching a set of candidate commits and find that it raises the F1-score from 0.64 to 0.81 on the most popular Linux kernel dataset, a bigger jump than between the original 2005 method (0.54) and the previous SOTA (0.64). We also uncover why agents are so successful: They derive short greppable patterns from the fix commit diff and message and use them to effectively search and find bug-introducing commits in large candidate sets. Finally, we also discuss how these insights might enable further progress in bug detection, root cause understanding, and repair.



Summary

The paper presents an innovative agentic approach using LLMs to identify bug-introducing commits, achieving an F1 improvement of at least 12 percentage points over state-of-the-art methods.
It integrates traditional SZZ filtering with LLM-based contextual analysis, combining diff and commit message insights to enhance fault localization.
Empirical evaluations on Linux kernel and GitHub projects demonstrate scalability and robust performance, paving the way for automated bug triage and root cause analysis.

Agent-Based Identification of Bug-Introducing Commits: Mechanisms, Effectiveness, and Impact

Introduction

Identifying bug-introducing commits (BICs) in large version control repositories remains a central problem in empirical software engineering, underpinning both academic studies and practical workflows such as vulnerability triage and root cause analysis. Despite two decades of incremental improvements on the SZZ algorithm and its close relatives, F1-scores have plateaued: on the most popular Linux kernel dataset, the previous state-of-the-art (SOTA) method reaches 0.64, only 10 percentage points above the original 2005 algorithm (0.54) [sliwerski_2005_original_szz] [tang_2025_llm4szz]. The paper "How and Why Agents Can Identify Bug-Introducing Commits" (2603.29378) proposes a different approach: LLM-based agents equipped with developer-like tools (file reading, grep, code navigation) search large candidate commit sets directly. The paper formalizes, implements, and evaluates this approach, demonstrating a substantial advance over prior art and providing ablation and failure analyses that explain its mechanisms, limitations, and future research directions.

SZZ-Agent: Agentic Approach and Empirical Gains

The SZZ-Agent framework integrates LLM-based agents into the BIC identification pipeline. Rather than relying solely on static program analysis, blame information, or end-to-end neural ranking, SZZ-Agent works in two stages: (1) standard SZZ line blame produces a filtered set of candidate BICs, and (2) an LLM agent evaluates these candidates and selects among them. When SZZ fails, the workflow falls back to a binary search over the full history. Crucially, at each step, the agent is given contextualized data including diffs, messages, and local code fragments of candidate commits.
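The control flow described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the three callables (`szz_candidates`, `agent_select`, `bug_present`) are hypothetical placeholders for the SZZ blame step and the LLM agent calls, and the binary-search predicate (the bug is absent before the BIC and present from it onward) is an assumption about how the fallback might work.

```python
from __future__ import annotations

from typing import Callable, Optional, Sequence


def szz_agent(
    fix_commit: str,
    history: Sequence[str],  # commits preceding the fix, oldest first
    szz_candidates: Callable[[str], list[str]],  # hypothetical: blame-based SZZ
    agent_select: Callable[[str, list[str]], Optional[str]],  # hypothetical: agent picks one or None
    bug_present: Callable[[str, str], bool],  # hypothetical: agent judges one commit
) -> Optional[str]:
    """Return the commit judged to introduce the bug, or None."""
    # Stage 1: standard SZZ narrows the history to a few blamed commits.
    candidates = szz_candidates(fix_commit)

    # Stage 2: the agent inspects the blamed candidates and selects one.
    if candidates:
        choice = agent_select(fix_commit, candidates)
        if choice is not None:
            return choice

    # Fallback (assumed semantics): binary search for the first commit
    # at which the agent judges the bug to be present.
    if not history or not bug_present(fix_commit, history[-1]):
        return None
    lo, hi = 0, len(history) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if bug_present(fix_commit, history[mid]):
            hi = mid  # bug already present: the BIC is at mid or earlier
        else:
            lo = mid + 1  # bug not yet present: the BIC is later
    return history[lo]
```

With stub callables, the fallback needs only a logarithmic number of agent judgments over the history, which is presumably why a binary search is attractive when SZZ yields no usable candidate.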

Experimental results on developer-annotated datasets derived from the Linux kernel [Lyu_2024_Linux_Kernel_Dataset] and diverse GitHub projects show a marked improvement. SZZ-Agent outperforms all SZZ variants and the previous SOTA LLM4SZZ [tang_2025_llm4szz] by at least 12 percentage points in F1-score (0.77 vs. 0.64 on DS_LINUX, with similar gains on other datasets). This leap cannot be attributed to training data leakage or to a mere model upgrade. Comprehensive ablations show that the agentic workflow is responsible for the effect: SZZ-Agent retains this gain when it uses the same LLM backbone as LLM4SZZ (Claude Opus 4.5).
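For readers unfamiliar with the metric, the F1-score is the harmonic mean of precision and recall. The sketch below computes it over sets of predicted and ground-truth BICs per fix commit. The micro-averaged aggregation and the dictionary layout are assumptions made for illustration; the paper's exact evaluation protocol is not reproduced here.

```python
from __future__ import annotations


def f1_score(predicted: dict[str, set[str]], truth: dict[str, set[str]]) -> float:
    """Micro-averaged F1 over all (fix commit, BIC) pairs (assumed protocol)."""
    # True positives: predicted BICs that match a ground-truth BIC for the same fix.
    tp = sum(len(predicted.get(fix, set()) & bics) for fix, bics in truth.items())
    n_pred = sum(len(p) for p in predicted.values())
    n_true = sum(len(t) for t in truth.values())
    if tp == 0:
        return 0.0
    precision = tp / n_pred
    recall = tp / n_true
    return 2 * precision * recall / (precision + recall)
```

Under this definition, a method that names one commit per fix and is right for half of the fixes has precision, recall, and F1 all equal to 0.5.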

Isolating the role of contextual inputs, experiments show that even providing only the fix commit message or only the diff enables the agent to surpass SOTA, but combining both yields the highest F1. This points to the agent’s ability to robustly infer bug-introducing patterns from partially redundant and complementary fix signals, suggesting improved fault localization and understanding compared to fixed-rule methods.

Simple-SZZ-Agent: Direct Selection over Large Search Spaces

A key empirical finding is that the binary search steps of SZZ-Agent are not necessary for SOTA performance or for cost containment. Motivated by an ablation on the candidate selection threshold, Simple-SZZ-Agent omits both SZZ-based filtering and binary search. Instead, it presents the agent with the entire candidate set extracted from file histories and directly asks for a selection. Despite the expanded search space (hundreds to thousands of candidates per fix), Simple-SZZ-Agent not only matches but often exceeds SZZ-Agent, achieving up to 0.86 F1-score on modern Linux kernel fixes.
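One plausible way to build such a candidate set from file histories is sketched below, using two standard git commands: `git diff-tree` to list the files a fix touches and `git log` restricted to those paths. The function names and the injected `run` callable are illustrative assumptions; the paper's actual extraction procedure may differ.

```python
from __future__ import annotations

import subprocess
from typing import Callable


def git_runner(repo: str) -> Callable[[list[str]], str]:
    """Return a function that runs git in `repo` and returns its stdout."""
    def run(args: list[str]) -> str:
        result = subprocess.run(
            ["git", "-C", repo, *args], check=True, capture_output=True, text=True
        )
        return result.stdout
    return run


def candidate_commits(fix_commit: str, run: Callable[[list[str]], str]) -> list[str]:
    """All commits before the fix that touched any file the fix touches."""
    # Files changed by the fix commit.
    out = run(["diff-tree", "--no-commit-id", "--name-only", "-r", fix_commit])
    files = [f for f in out.splitlines() if f]
    if not files:
        return []
    # History of those files, starting from the parent of the fix (newest first).
    log = run(["log", "--format=%H", f"{fix_commit}^", "--", *files])
    return log.split()
```

A call such as `candidate_commits(sha, git_runner("/path/to/repo"))` would return the candidate hashes. Passing the runner as a parameter keeps the function easy to exercise without a real repository.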

Detailed logging and analysis show that the agent predominantly interacts with file reading and grep tools; it does not exhaustively read candidates but instead synthesizes short greppable patterns (median 21 characters) by distilling the fix commit diff and message. These patterns target the code fragments most likely implicated in the introduction of the bug. The approach subsumes SZZ’s line-tracing while remaining flexible enough to handle pure-addition fixes, which traditional SZZ cannot trace. This mechanism allows the agent's cost and token usage to remain nearly independent of the candidate set size, scaling efficiently to large codebases.
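To make the idea of greppable patterns concrete, here is a crude rule-based stand-in. In the paper the patterns are synthesized by the LLM agent; the regex heuristic below (call-like tokens such as `name(` taken from hunk lines, including context lines) and both function names are assumptions chosen only to illustrate the search step.

```python
from __future__ import annotations

import re

# Heuristic (assumed): a call-like token, i.e. an identifier of 4+ characters followed by "(".
CALL = re.compile(r"[A-Za-z_]\w{3,}\(")


def distill_patterns(fix_diff: str, max_len: int = 40) -> list[str]:
    """Collect short call-like patterns from the hunk lines of a unified diff."""
    patterns: list[str] = []
    for line in fix_diff.splitlines():
        # Skip file headers; keep context (' '), added ('+') and removed ('-') lines.
        if line.startswith(("+++", "---")) or not line.startswith((" ", "+", "-")):
            continue
        for token in CALL.findall(line[1:]):
            if len(token) <= max_len and token not in patterns:
                patterns.append(token)
    return patterns


def commits_adding(pattern: str, candidate_diffs: dict[str, str]) -> list[str]:
    """Candidates whose diff adds a line containing the pattern."""
    hits = []
    for sha, diff in candidate_diffs.items():
        added = (
            l for l in diff.splitlines()
            if l.startswith("+") and not l.startswith("+++")
        )
        if any(pattern in l for l in added):
            hits.append(sha)
    return hits
```

Because context lines are included, this sketch still yields patterns for a pure-addition fix such as an inserted NULL check: the surrounding calls become the search keys. In a real repository, git's pickaxe option (`git log -S<string> -- <files>`) performs a comparable search for commits that add or remove occurrences of a string; whether the paper's agents use it is not stated here.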

The failure analysis reveals that the remaining errors are dominated by (1) missing ground-truth BIC labels or candidates (as when the bug was introduced in a file not touched by the fix commit), (2) ambiguous code histories, and (3) residual reasoning errors where the agent’s analysis capabilities are still limited.
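The first failure category implies a ceiling on recall: if the true BIC is not in the candidate set, no selection strategy can find it. A minimal sketch of how that ceiling could be measured is shown below; the function name and data layout are illustrative assumptions, not part of the paper.

```python
from __future__ import annotations


def candidate_coverage(
    truth: dict[str, set[str]], candidates: dict[str, list[str]]
) -> float:
    """Fraction of fixes for which at least one true BIC is among the candidates."""
    if not truth:
        return 0.0
    covered = sum(
        1 for fix, bics in truth.items() if bics & set(candidates.get(fix, []))
    )
    return covered / len(truth)
```

A coverage below 1.0 indicates fixes whose bug was introduced outside the searched file histories, which points at candidate extraction rather than agent reasoning as the limiting factor.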

Generalization Across LLMs and Agents

Evaluation with multiple open and closed agentic frameworks (Claude Code, OpenHands) and across a suite of LLMs (Claude Haiku/Sonnet/Opus 4.5, minimax-m2.5, glm-5) shows that Simple-SZZ-Agent's advantage generalizes. Even the resource-constrained (and cheap) configurations outperform all pre-agentic SZZ approaches by a wide margin, and the more advanced configurations yield the strongest empirical results to date.

Implications for Software Engineering Research and Practice

The findings of (2603.29378) carry both immediate and broad implications:

Agentic search as a dominant strategy: Simple agent integration (file I/O, grep/glob) into the search for BICs, paired with LLM inference, establishes a new baseline for this canonical problem, exposing the limitations of decades of static heuristic improvement.
Mechanistic transparency and extensibility: The emergence of short pattern synthesis as the core agentic behavior suggests systematizing “pattern distillation” as a primitive for codebase search, root cause analysis, and potentially for unsupervised bug classification.
Beyond BIC detection: The approach opens avenues in bug root cause explanation and general bug variant detection. The extracted patterns provide a succinct (and automatable) handle for identifying similar faults in related repositories, a capability of increasing importance for vulnerability tracking and automated triage.
Benchmarks and evaluation methodology: The high performance holds across LLMs, but the error analysis highlights the limitations of dataset coverage and the need for improved ground truth; this further underscores the importance of context-rich, developer-verified benchmarks.
Tooling and workflow integration: The combination of agentic code search with basic UNIX-like developer tools (grep, file reader) could be generalized to many tasks in empirical software engineering, offering a path toward modular, transparent, and effective automation rather than monolithic end-to-end LLMs.

Future Directions

Primary open problems include:

Handling cases where the true BIC lies outside the fix-touched files (requiring more global repository context or cross-file dependence analysis).
Systematic distillation of extracted patterns for knowledge transfer or multi-project analytics.
Integration with bug oracle outputs (e.g., crash logs, failing tests) to enable pre-fix localization and repair pipelines.
Extension of agentic workflows to more diverse languages, architectures, and codebases, and evaluation under adversarial or obfuscated repository history manipulations.

The empirical findings suggest that as LLM and agentic toolchains evolve, agent-based workflows are likely to become the foundation for automated program understanding and defect analysis across domains.

Conclusion

By reframing BIC identification via agentic workflows and demonstrating robust, generalizable improvements, (2603.29378) establishes agents with simple developer tools as the empirically superior paradigm for this long-standing software engineering task. The mechanism of pattern distillation and targeted search not only brings practical, scalable bug-introducing commit identification within reach, but also provides new primitives for related problems in root cause analysis, bug variant detection, and automated program repair. These findings are immediately relevant for researchers designing the next generation of intelligent program analysis tools, as well as for practitioners facing the demands of large-scale vulnerability triage and software maintenance.

References:

"How and Why Agents Can Identify Bug-Introducing Commits" (2603.29378)
Śliwerski, Zimmermann, and Zeller, "When do changes induce fixes?" [sliwerski_2005_original_szz]
Tang et al., "LLM4SZZ: Enhancing SZZ Algorithm with Context-Enhanced Assessment on LLMs" [tang_2025_llm4szz]
Lyu et al., "Evaluating SZZ Implementations: An Empirical Study on the Linux Kernel" [Lyu_2024_Linux_Kernel_Dataset]
Additional software engineering agent research (Jin et al. [jin_2025_agents_in_se_survey], Wang et al. [wang_2024_openhands], Yang et al. [yang_2024_sweagent], Zhang et al. [zhang_2024_autocoderover])