Beyond Final Code: A Process-Oriented Error Analysis of Software Development Agents in Real-World GitHub Scenarios

Published 16 Mar 2025 in cs.SE and cs.AI | (2503.12374v3)

Abstract: AI-driven software development has rapidly advanced with the emergence of software development agents that leverage LLMs to tackle complex, repository-level software engineering tasks. These agents go beyond just generation of final code; they engage in multi-step reasoning, utilize various tools for code modification and debugging, and interact with execution environments to diagnose and iteratively resolve issues. However, most existing evaluations focus primarily on static analyses of final code outputs, yielding limited insights into the agents' dynamic problem-solving processes. To fill this gap, we conduct an in-depth empirical study on 3,977 solving-phase trajectories and 3,931 testing-phase logs from 8 top-ranked agents evaluated on 500 GitHub issues in the SWE-Bench benchmark. Our exploratory analysis shows that Python execution errors during the issue resolution phase correlate with lower resolution rates and increased reasoning overheads. We have identified the most prevalent errors -- such as ModuleNotFoundError and TypeError -- and highlighted particularly challenging errors like OSError and database-related issues (e.g., IntegrityError) that demand significantly more debugging effort. Furthermore, we have discovered 3 bugs in the SWE-Bench platform that affect benchmark fairness and accuracy; these issues have been reported to and confirmed by the maintainers. To promote transparency and foster future research, we publicly share our datasets and analysis scripts.


Summary

The paper presents a process-oriented error analysis of AI-driven software development agents during real-world GitHub issue resolution.
The study analyzes agent trajectories and test logs from 500 GitHub issues across 12 Python repositories to identify recurring errors such as ModuleNotFoundError and TypeError.
The findings suggest proactive measures, including dependency checks and static analysis, to enhance agents’ error-handling capabilities in dynamic software engineering tasks.


Overview

The paper analyzes the errors that AI-driven software development agents, particularly those built on LLMs, encounter while resolving real-world GitHub issues. It examines the agents' dynamic problem-solving processes rather than only the final code outputs, providing deeper insight into their abilities and limitations in practical software engineering scenarios.

Figure 1: Study overview: solving-phase trajectories inform analyses of unexpected-error impact (RQ1), common-error prevalence (RQ2), and challenging-error identification (RQ3); testing-phase logs reveal testing errors and failures (RQ4).

Study Design

The study uses data from SWE-Bench Verified, which comprises 500 GitHub issues across 12 Python repositories. These issues were validated by professional engineers, which supports the benchmark's value as a representation of real-world challenges. Eight agents are selected because their trajectories capture detailed execution outputs, clearly delineate observations, and preserve unmodified contents. This detailed logging enables in-depth error analysis across the solving and testing phases.

Exploratory Error Analysis

The paper begins with an exploratory analysis of execution errors during the solving phase. It shows that Python execution failures correlate with more reasoning steps and lower resolution rates. ModuleNotFoundError, TypeError, and AttributeError are identified as prevalent, which points to dependency management and type handling as significant challenges.

Figure 2: Resolution Rate by Error Frequency
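A relationship like the one in Figure 2 could be computed by grouping tasks by how many execution errors occurred and taking the fraction resolved in each group. The following is a minimal sketch under that assumption; the function name and the toy records are hypothetical and are not data from the paper.

```python
from collections import defaultdict

def resolution_rate_by_error_count(tasks):
    """tasks: iterable of (error_count, resolved) pairs, one per task.
    Returns {error_count: fraction of tasks resolved}, sorted by error_count."""
    totals = defaultdict(int)
    resolved = defaultdict(int)
    for error_count, was_resolved in tasks:
        totals[error_count] += 1
        resolved[error_count] += int(was_resolved)
    return {n: resolved[n] / totals[n] for n in sorted(totals)}

# Made-up records for illustration only.
toy_tasks = [(0, True), (0, True), (0, False), (1, True), (1, False), (3, False)]
rates = resolution_rate_by_error_count(toy_tasks)
```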

Prevalent and Challenging Errors

The analysis identifies prevalent error types and categorizes them into Python built-in errors and custom-defined exceptions. Errors such as ModuleNotFoundError and TypeError are frequent, underscoring dependency and type-checking issues. Challenging errors, meaning those that recur within a task, are analyzed to understand their impact. OSError and database-related errors (e.g., IntegrityError) demand significantly more debugging effort; OSError in particular exhibits a high recurrence ratio, indicating that agents struggle to resolve failures rooted in system-level operations.
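The summary does not spell out how the recurrence ratio is computed, so the sketch below assumes one plausible reading: for each error type, the share of tasks in which it appears more than once among all tasks in which it appears at all. Both the definition and the name `recurrence_ratio` are assumptions for illustration.

```python
from collections import Counter

def recurrence_ratio(per_task_errors):
    """per_task_errors: list of lists of error names, one inner list per task.
    Returns {error_name: tasks with 2+ occurrences / tasks with 1+ occurrences}."""
    seen = Counter()      # tasks in which the error appears at least once
    recurred = Counter()  # tasks in which the error appears more than once
    for errors in per_task_errors:
        for name, occurrences in Counter(errors).items():
            seen[name] += 1
            if occurrences > 1:
                recurred[name] += 1
    return {name: recurred[name] / seen[name] for name in seen}
```

Under this reading, an error type that rarely goes away after its first appearance scores close to 1, while one that agents usually fix on the first attempt scores close to 0.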

Testing Phase and Cross-Phase Errors

Testing-phase analysis reveals that unresolved tasks primarily stem from Python execution failures. The paper identifies cross-phase errors, which persist from the solving phase into testing, and highlights stealthy errors like TypeError as persistent challenges. Additionally, manual investigation reveals failures not explicitly linked to Python errors, pointing to potential issues in the evaluation platform.
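One simple way to flag cross-phase errors is to intersect, per task, the error types seen in the solving phase with those seen in the testing phase. The sketch below assumes each phase has already been reduced to a mapping from task id to a set of error names; the function name and this data layout are hypothetical.

```python
def cross_phase_errors(solving, testing):
    """solving, testing: dicts mapping task id -> set of error names per phase.
    Returns {task id: error names that appear in both phases}."""
    shared = {}
    # Only tasks present in both phases can have cross-phase errors.
    for task_id in solving.keys() & testing.keys():
        common = solving[task_id] & testing[task_id]
        if common:
            shared[task_id] = common
    return shared
```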

Implications and Future Work

The research suggests various directions for future work, such as:

Developing error-prone benchmarks that focus on scenarios like database integrity and dependency challenges.
Enhancing agents' workflows with proactive error avoidance measures, like early dependency checks and static analysis tools.
Integrating retrieval-augmented generation approaches to improve error detection and recovery mechanisms.
Encouraging greener AI-driven software development by quantifying and optimizing energy costs of prolonged error resolution phases.
Exploring other benchmarks to validate the findings across diverse tasks beyond GitHub issue resolution.
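The early dependency check suggested above could look something like the following sketch, which parses a script with the standard `ast` module and reports imported top-level modules that `importlib.util.find_spec` cannot locate. This is an illustration of the idea, not a mechanism described in the paper; the function name is hypothetical, and relative imports are deliberately skipped.

```python
import ast
import importlib.util

def missing_top_level_modules(source: str) -> list[str]:
    """Return top-level modules imported by `source` that cannot be found."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 means an absolute import; relative imports are skipped.
            names.add(node.module.split(".")[0])
    return sorted(n for n in names if importlib.util.find_spec(n) is None)
```

An agent could run such a check before executing a script and install or work around any missing modules, rather than discovering a ModuleNotFoundError only at run time.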

Conclusion

The paper provides a process-oriented error analysis that reveals current challenges faced by software development agents and offers insights into improving their error-handling capabilities. By highlighting recurring errors and proposing proactive solutions, it aims to enhance agents' performance, reduce computational overhead, and promote sustainable software development practices. The study also identifies three bugs in the SWE-Bench platform, further underscoring the need for reliable and accurate evaluation frameworks.
