Agentic APR: LLM-Based Automated Repair

Updated 3 February 2026

Agentic APR is a dynamic approach that deploys autonomous LLM agents to iteratively explore and repair software bugs through feedback loops.
It integrates real-time static analysis and test execution signals to refine patches and enhance repair accuracy.
Empirical evaluations show scalable performance in industrial settings with improved bug-fix rates and efficient multi-hunk repair strategies.

Agentic Automated Program Repair (APR) leverages the autonomous planning and reasoning capabilities of LLMs to resolve software bugs via iterative interactions with real-world code, test infrastructure, and toolchains. Unlike traditional APR, which frames bug-fixing as a static prediction or templated transformation task, agentic APR instantiates the LLM as a tool-using agent within a tightly orchestrated feedback loop. This enables dynamic exploration of the repair search space, direct integration of testing and static analysis signals, and scalable deployment across complex, large-scale industrial codebases (Maddila et al., 24 Jul 2025).

1. The Agentic APR Workflow: ReAct Harness and Action Space

Agentic APR relies on a ReAct-style loop in which, at each discrete time step $t$, the agent maintains a trajectory of reasoning history: $\mathcal{H}_t = \bigl\{(\mathit{thought}_i,\mathit{action}_i,\mathit{observation}_i)\bigr\}_{i=1}^{t-1}$. The LLM processes a prompt augmented by this history, static analysis feedback, and test execution traces, and produces a tuple $(\mathit{thought}_t, \mathit{action}_t)$, where the action $a_t$ is selected from a formal action space $A = \{a_1,\dots,a_{15}\}$ (Maddila et al., 24 Jul 2025). Atomic actions include:

File operations: ReadFile, ReadDirectory, FindFile, GoToLine
Code search: SearchCode, SearchInFile, SearchClass, SearchMethod, SearchMethodInFile, SearchMethodInClass
Patch generation: Edit(path, instructions)
Code validation: RunTests(testSelectors), GetDiffDetails, GetTaskDetails
Termination: Exit(summary)

After each action, the resulting observation $o_t$ (e.g., file contents, a diff, or test results) is appended to the history, forming $\mathcal{H}_{t+1}$. Iteration continues until early success (the test suite passes), the agent terminates, or the step budget $T_{\max}$ is exhausted.
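The loop described above can be sketched in a few lines. This is a minimal illustration, not the harness from the cited work: the names `run_react_loop`, `scripted_policy`, and `fake_execute` are hypothetical, the policy is a fixed script standing in for an LLM call, and the convention that a passing test run returns the string "PASS" is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    thought: str
    action: str
    observation: str


def run_react_loop(
    policy: Callable[[List[Step]], Tuple[str, str]],
    execute: Callable[[str], str],
    max_steps: int,
) -> Tuple[str, List[Step]]:
    """Run thought/action/observation steps until success, exit, or budget."""
    history: List[Step] = []
    for _ in range(max_steps):
        # The policy (an LLM in practice) sees the full history so far.
        thought, action = policy(history)
        if action.startswith("Exit"):
            history.append(Step(thought, action, "terminated"))
            return "exit", history
        observation = execute(action)
        history.append(Step(thought, action, observation))
        # Early success: a test run that reports a pass (assumed "PASS" string).
        if action.startswith("RunTests") and observation == "PASS":
            return "solved", history
    return "budget_exhausted", history


def scripted_policy(history: List[Step]) -> Tuple[str, str]:
    # Hypothetical stand-in for the LLM: replay a fixed plan.
    script = [
        ("inspect the failing file", "ReadFile(src/app.py)"),
        ("apply a fix", "Edit(src/app.py, fix off-by-one)"),
        ("validate the patch", "RunTests(test_app)"),
    ]
    return script[min(len(history), len(script) - 1)]


def fake_execute(action: str) -> str:
    # Hypothetical tool layer: tests always pass, everything else returns "ok".
    return "PASS" if action.startswith("RunTests") else "ok"


status, history = run_react_loop(scripted_policy, fake_execute, max_steps=5)
short_status, short_history = run_react_loop(scripted_policy, fake_execute, max_steps=2)
```

With a budget of five steps the scripted run ends on the passing test; with a budget of two it stops before the test action is ever reached, which is the step-budget exit path.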

2. Feedback Mechanisms: Static Analysis, Test Execution, LLM-Judge

Crucial to agentic APR is the integration of neuro-symbolic feedback streams. For each edit, the agent immediately receives:

Static analysis report $f^{\mathrm{static}}_t$ (e.g., lint/build, type validation)
Test execution trace $f^{\mathrm{test}}_t$ (regression test pass/failures)

Prompt construction for iteration $t+1$ concatenates $\mathcal{H}_{t+1}$ with $f^{\mathrm{static}}_t$ and $f^{\mathrm{test}}_t$, ensuring that patch refinement is informed by validator outputs (Maddila et al., 24 Jul 2025). After the agent's loop, an additional LLM-as-a-Judge model acts as a binary classifier on generated patches, discarding those with a high probability of being unacceptable before human review. The judge is tuned empirically for high precision on bad-patch filtering (e.g., precision = 0.867) (Maddila et al., 24 Jul 2025, Cambronero et al., 3 Oct 2025).
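As an illustration of how validator output might be folded into the next prompt, the sketch below joins the history with the two feedback streams. The function name `build_prompt`, the section labels, and the plain-text layout are all assumptions made for this example; the cited work does not specify its prompt format here.

```python
from typing import List, Tuple


def build_prompt(
    task: str,
    history: List[Tuple[str, str, str]],
    static_feedback: str,
    test_feedback: str,
) -> str:
    """Concatenate task, history, and validator feedback into one prompt."""
    lines = [f"Task: {task}", "History:"]
    for i, (thought, action, observation) in enumerate(history, start=1):
        lines.append(f"[{i}] thought: {thought}")
        lines.append(f"[{i}] action: {action}")
        lines.append(f"[{i}] observation: {observation}")
    # Feedback from the most recent edit goes last so it is closest
    # to the model's next decision (a layout choice, not a requirement).
    lines.append("Static analysis feedback:")
    lines.append(static_feedback or "(none)")
    lines.append("Test execution feedback:")
    lines.append(test_feedback or "(none)")
    lines.append("Respond with the next thought and action.")
    return "\n".join(lines)


prompt = build_prompt(
    task="Fix failing test test_parse_empty",
    history=[("look at parser", "ReadFile(parser.py)", "def parse(s): ...")],
    static_feedback="parser.py:12: unused variable 'tmp'",
    test_feedback="test_parse_empty FAILED: IndexError",
)
```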

3. Performance Metrics and Ablation Insights

Quantitative assessment relies on several core metrics:

Solve Rate: […]
Error Rate: […]
Average feedback iterations: mean or median ReAct steps prior to fix or termination
Cost–Latency: LLM compute cost and end-to-end wall-clock time, parameterized by model and workflow choices
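A small sketch of how such metrics could be aggregated from per-bug run records follows. The definitions used are assumptions made for illustration, not the source's formulas: solve rate is taken as solved runs over attempted runs, and error rate as runs flagged with an error over attempted runs. The record fields and sample values are hypothetical.

```python
from statistics import mean, median
from typing import Dict, List


def summarize(runs: List[dict]) -> Dict[str, float]:
    """Aggregate per-run records into rate and iteration metrics."""
    n = len(runs)
    steps = [r["steps"] for r in runs]
    return {
        # Assumed definition: fraction of attempted bugs that were solved.
        "solve_rate": sum(r["solved"] for r in runs) / n,
        # Assumed definition: fraction of attempts that ended in an error.
        "error_rate": sum(r["errored"] for r in runs) / n,
        "mean_steps": mean(steps),
        "median_steps": median(steps),
    }


# Hypothetical run records, one per attempted bug.
runs = [
    {"solved": True, "errored": False, "steps": 8},
    {"solved": False, "errored": True, "steps": 3},
    {"solved": True, "errored": False, "steps": 12},
    {"solved": False, "errored": False, "steps": 20},
]

summary = summarize(runs)
```

Reporting both mean and median steps is useful because a few runs that exhaust the step budget can pull the mean well above the typical case.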

Benchmarks reveal that agentic APR approaches combining static and test-based feedback attain a […] single-run solve rate and up to […] in multiple runs, with an average of […] feedback iterations per fix (Maddila et al., 24 Jul 2025). Fine-tuning smaller LLMs (e.g., iCodeLlama-70B) delivers performance competitive with the larger public Llama-405B model, especially when leveraging detailed natural language instructions and advanced patch formats (search-replace) (Maddila et al., 24 Jul 2025).

4. Agentic APR in Enterprise and Industrial Benchmarks

Empirical studies at Google and elsewhere confirm that agentic APR scales to multi-language, multi-file, repository-wide contexts (Rondon et al., 13 Jan 2025, Maddila et al., 24 Jul 2025). The Passerine agent exhibits a […] plausible fix rate for machine-reported bugs and […] for human-reported issues on the GITS-Eval benchmark; semantic equivalence rates are […] and […], respectively (Rondon et al., 13 Jan 2025). Industrial bug distributions feature broader language diversity, sparser code-term density, and greater patch spread than open-source SWE-Bench datasets, necessitating enhanced fault localization, robust context management, and tailored toolsets.

Production deployments of agentic APR agents yield high review and acceptance rates: […] of generated diffs are reviewed, and […] of those reviewed are landed in the codebase, equivalent to […] of generated fixes (Maddila et al., 24 Jul 2025).
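The three rates in this statement are tied together by a simple product, which is why the share of generated fixes that land can be derived from the other two. Writing $N_{\text{generated}}$, $N_{\text{reviewed}}$, and $N_{\text{landed}}$ for the respective diff counts (symbols introduced here for illustration only):

```latex
\[
\frac{N_{\text{landed}}}{N_{\text{generated}}}
  = \frac{N_{\text{reviewed}}}{N_{\text{generated}}}
    \times
    \frac{N_{\text{landed}}}{N_{\text{reviewed}}}
\]
```

The identity holds as long as every landed diff was first reviewed, so that the landed set is a subset of the reviewed set.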

5. Noise Reduction and Patch Filtering

A central challenge in large-scale agentic repair is minimizing developer distraction from implausible or doomed patches. Dual LLM–based policies—Bug Abstention and Patch Validation—substantially improve signal quality (Cambronero et al., 3 Oct 2025). Bug abstention predicts non-fixable bugs, filtering attempts for an overall lift of 13 percentage points in filtered success@1; patch validation on output trajectories further boosts correct patch rates by 15 pp. Combined, these policies achieve an absolute increase of 39 pp in filtered success@1 among human-reported Google bugs (from 11% to 53%) (Cambronero et al., 3 Oct 2025). Patch validation utilizes confidence-based scoring and percentile-based classifiers, with minimal compute overhead for high-precision triage.
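To make the effect of the two policies concrete, the sketch below applies an abstention threshold and a validation threshold to a handful of attempts and compares success@1 before and after filtering. Everything here is hypothetical: the field names, the scores, the thresholds, and the simple cut-off rule are assumptions for illustration and do not reproduce the classifiers of the cited work.

```python
from typing import List


def success_at_1(attempts: List[dict]) -> float:
    """Fraction of attempts whose single shown patch is correct."""
    return sum(a["correct"] for a in attempts) / len(attempts) if attempts else 0.0


def apply_policies(
    attempts: List[dict], abstain_threshold: float, validation_threshold: float
) -> List[dict]:
    """Keep attempts that pass both the abstention and validation checks."""
    kept = []
    for a in attempts:
        # Bug abstention: skip bugs predicted to be unlikely to be fixable.
        if a["fixability"] < abstain_threshold:
            continue
        # Patch validation: drop patches with low confidence scores.
        if a["patch_confidence"] < validation_threshold:
            continue
        kept.append(a)
    return kept


# Hypothetical attempts with made-up scores and correctness labels.
attempts = [
    {"bug": "A", "fixability": 0.9, "patch_confidence": 0.8, "correct": True},
    {"bug": "B", "fixability": 0.2, "patch_confidence": 0.7, "correct": False},
    {"bug": "C", "fixability": 0.8, "patch_confidence": 0.3, "correct": False},
    {"bug": "D", "fixability": 0.7, "patch_confidence": 0.9, "correct": True},
    {"bug": "E", "fixability": 0.6, "patch_confidence": 0.6, "correct": False},
]

baseline = success_at_1(attempts)
kept = apply_policies(attempts, abstain_threshold=0.5, validation_threshold=0.5)
filtered = success_at_1(kept)
```

In this toy data the filtered rate rises because the denominator shrinks: fewer patches reach reviewers, and a larger share of those that do are correct. The trade-off is that an overly strict threshold could also discard correct patches.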

6. Advanced Workflows: Semantics Awareness, History, and Cogeneration

Recent agentic APR systems integrate semantic modalities (issue abstraction, code semantics, and execution semantics), which enable robust repair even for multi-line or edge-case bugs (Pabba et al., 19 Jun 2025). SemAgent's modular workflow merges execution trace analysis, natural-language issue abstraction, and code semantics mapping into a two-stage repair/review cycle, yielding a […] solve rate on SWEBench-Lite, with pronounced gains for complex issues.

History-aware agents (HAFixAgent) incorporate blame-derived diffs and function-body snapshots into agent context, improving multi-hunk bug fix rate by up to 212.3% over baseline agents (Shi et al., 2 Nov 2025).

Agentic APR pipelines increasingly merge fix and Bug Reproduction Test (BRT) generation within unified agent trajectories, eliminating the need for dual pipelines and improving reviewer trust. Cogeneration prompts that mandate both patch and test yield fix and BRT coverage as high as dedicated single-task agents, at equivalent efficiency (Cheng et al., 27 Jan 2026).
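One way to check a cogenerated patch-and-test pair is the fail-to-pass criterion: the reproduction test should fail on the buggy code and pass on the patched code. The sketch below assumes that criterion, which the text above does not spell out; the helper names and the toy off-by-one bug are hypothetical.

```python
from typing import Callable, List


def passes(test: Callable[[Callable], None], impl: Callable) -> bool:
    """Return True if the test runs against impl without an assertion failure."""
    try:
        test(impl)
        return True
    except AssertionError:
        return False


def brt_is_fail_to_pass(
    test: Callable[[Callable], None], buggy: Callable, fixed: Callable
) -> bool:
    """Assumed criterion: test fails before the fix and passes after it."""
    return (not passes(test, buggy)) and passes(test, fixed)


def buggy_last_index(items: List[int]) -> int:
    return len(items)  # off-by-one bug


def fixed_last_index(items: List[int]) -> int:
    return len(items) - 1


def brt(impl: Callable) -> None:
    # Reproduction test for the off-by-one bug.
    assert impl([10, 20, 30]) == 2


def trivial_test(impl: Callable) -> None:
    # A test that never exercises the bug, so it cannot reproduce it.
    assert impl([10, 20, 30]) >= 0


good_pair = brt_is_fail_to_pass(brt, buggy_last_index, fixed_last_index)
weak_pair = brt_is_fail_to_pass(trivial_test, buggy_last_index, fixed_last_index)
```

A test that passes on both versions, like `trivial_test` here, gives a reviewer no evidence that the patch addresses the reported bug, which is why the check requires the failure on the buggy code.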

7. Qualitative Themes and Deployment Experience

Direct feedback from software engineers highlights that agentic APR agents deliver substantial productivity benefits. Positive themes include rapid approval for straightforward fixes, explicit gratitude for time saved, and surprise at detection of otherwise unnoticed failures (surfacing CI gaps). Iterative fixes—even partial or incomplete patches—jumpstart human repair processes. Negative themes typically stem from test flakiness, reviewer unfamiliarity, missing validator integration, or environmental context gaps, leading to incremental improvements in agent orchestration (e.g., isolated test containers, diff enhancement, and richer infra-state injection) (Maddila et al., 24 Jul 2025).

8. Future Directions and Research Considerations

Continued progression in agentic APR depends on research across several focal areas:

Context management for large-scale, multi-hunk/multi-file repairs, including LLM-based summarization and windowing (Rondon et al., 13 Jan 2025)
Automated bug reproduction for human-reported issues, ensuring accurate test oracle construction (Cheng et al., 27 Jan 2026)
Guided diversity in patch generation via adversarial reasoning and multi-agent collaboration, mitigating overfitting and intent misalignment (Ye et al., 19 May 2025)
Cost–latency optimization, adaptive budgets, and selective testing to improve scalability
Cross-language support for polyglot monorepos and heterogeneous toolchains

Agentic Automated Program Repair, as instantiated by contemporary LLM-powered agents, constitutes a comprehensive neuro-symbolic framework for software bug resolution, bridging the gap between human-level reasoning, dynamic tool use, feedback-driven learning, and large-scale industrial applicability. The field is characterized by rigorous experimental validation, modular architecture, and integration into real-world developer workflows.