- The paper conducts an empirical study on LLMs completing bug-prone code, finding models like GPT-4 often replicate historical bugs, with a significantly lower correct completion rate compared to normal code.
- The study shows LLMs memorize and replicate nearly half of errors from historical bugs, and current post-processing techniques fail to substantially reduce these error rates.
- Findings highlight that LLMs struggle to distinguish historical bug patterns from correct code, underscoring the need for advances in model training and error correction before they can be relied on in practical software development.
The paper "LLMs are Bug Replicators: An Empirical Study on LLMs' Capability in Completing Bug-prone Code" analyzes how effectively LLMs complete code, focusing on scenarios involving historically buggy code. While LLMs have demonstrated significant prowess in code completion tasks, this investigation highlights the challenges these models face when confronted with bug-prone contexts.
Overview and Key Findings
The study involved several state-of-the-art LLMs, including OpenAI's GPT-4 series, CodeLlama, StarCoder, CodeGen, and Gemma. Using Defects4J, a well-established benchmark of real Java bugs, the authors constructed detailed empirical evaluations of how these models perform when tasked with completing bug-prone code.
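To make the setup concrete, a bug-prone completion task can be framed as follows: take a file that historically contained a bug, use the code preceding the bug location as the prompt, and compare the model's completion against both the original buggy line and the developer's fix. This is an illustrative sketch under that framing, not the paper's actual harness; `build_task` and its assumption of a single-line fix at the same line number are simplifications introduced here.

```python
from dataclasses import dataclass

@dataclass
class CompletionTask:
    prompt: str        # code preceding the historical bug location
    buggy_line: str    # what was originally written (the historical bug)
    fixed_line: str    # the corresponding line from the patched version

def build_task(buggy_source: str, fixed_source: str, bug_line_no: int) -> CompletionTask:
    """Split the buggy file at the bug location (1-indexed line number).

    Assumes a single-line bug fixed in place, so the buggy and fixed files
    align line-for-line at bug_line_no; real Defects4J patches can span
    multiple lines and shift line numbers.
    """
    buggy_lines = buggy_source.splitlines()
    fixed_lines = fixed_source.splitlines()
    prompt = "\n".join(buggy_lines[: bug_line_no - 1])
    return CompletionTask(
        prompt=prompt,
        buggy_line=buggy_lines[bug_line_no - 1].strip(),
        fixed_line=fixed_lines[bug_line_no - 1].strip(),
    )
```

A model completion can then be judged by whether it reproduces `fixed_line` (correct), `buggy_line` (bug replicated), or neither.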
Key performance metrics derived from this study include:
- Correct-to-Buggy Completion Ratio: In contexts where historical buggy patterns are prevalent, LLMs are roughly as likely to generate buggy code as correct code. GPT-4, for example, produced correct completions in only 12.27% of bug-prone cases, versus 29.85% on comparable normal code.
- Bug Memorization: Approximately 44.44% of the erroneous completions were identical to existing historical bugs, suggesting that models memorize and replicate past faults rather than learning to avoid or correct them.
- Code Construct Susceptibility: Certain programming constructs, such as method invocations, return statements, and conditional constructs (if statements), were identified as particularly prone to errors during LLM-generated completions.
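The two headline metrics above can be sketched as a small scoring routine. This is a hedged illustration of one plausible way to compute them, not the authors' evaluation code: `classify` and `summarize` are names introduced here, and a real study would use more careful normalization or semantic matching than exact string comparison.

```python
def classify(completion: str, fixed_line: str, buggy_line: str) -> str:
    """Label a single completion against the ground-truth fix and the historical bug."""
    norm = lambda s: " ".join(s.split())      # collapse whitespace differences
    c = norm(completion)
    if c == norm(fixed_line):
        return "correct"
    if c == norm(buggy_line):
        return "memorized_bug"                # reproduces the historical bug verbatim
    return "other_error"

def summarize(labels: list[str]) -> dict[str, float]:
    """Compute the correct-completion rate and the memorized share of errors."""
    n = len(labels)
    correct_rate = labels.count("correct") / n
    errors = [x for x in labels if x != "correct"]
    memorized = errors.count("memorized_bug") / len(errors) if errors else 0.0
    return {"correct_rate": correct_rate, "memorized_share_of_errors": memorized}
```

Under this framing, the paper's figures correspond to a `correct_rate` of about 0.12 for GPT-4 on bug-prone code and a `memorized_share_of_errors` of about 0.44.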
Limitations of Post-Processing Techniques
Although post-processing techniques were deployed to improve output consistency, they did not substantially reduce error rates. This finding underlines the insufficiency of current post-processing strategies in enhancing the reliability of LLMs for bug-prone code completion tasks.
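One naive post-processing strategy of the kind the finding calls into question is to screen candidate completions against known historical buggy lines and fall back to the next candidate on a match. The sketch below is a toy assumption of this report, not a technique from the paper; it also illustrates why such filters fall short, since they can only catch bugs that recur verbatim.

```python
def filter_known_bugs(candidates: list[str], known_buggy_lines: list[str]) -> str:
    """Return the first candidate that does not match a known historical bug.

    If every candidate matches a known bug, return the top-ranked one anyway.
    Matching is exact (modulo whitespace), so novel or paraphrased bugs pass
    straight through -- one reason simple filters barely move error rates.
    """
    norm = lambda s: " ".join(s.split())
    known = {norm(b) for b in known_buggy_lines}
    for cand in candidates:
        if norm(cand) not in known:
            return cand
    return candidates[0]
```

For example, with candidates ranked `["i <= n", "i < n"]` and `"i <= n"` recorded as a historical bug, the filter promotes the second candidate; but an off-by-one error phrased differently would survive the filter untouched.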
Implications and Future Research Directions
The implications of this research are manifold. Practical applications of LLMs in integrated development environments (IDEs) and real-world software development require models that can reliably distinguish historical bug patterns from correct code structures. The study's findings on memorization bias point to a pressing need for model architectures or training methodologies that rely less on data memorization and more on robust code understanding and error correction.
Furthermore, this research prompts speculation on the future direction of AI in software development. Enhanced methods for code representation learning, dynamic model training that adapts to evolving software patterns, and the integration of comprehensive debugging heuristics are potential pathways to mitigate the limitations identified.
Conclusion
This empirical study provides a critical understanding of the current capabilities and limitations of LLMs in bug-prone code completion tasks. While LLMs offer promising enhancements to coding efficiency, their propensity to replicate historical bugs presents a noteworthy challenge. The findings compel both academia and industry to pursue advancements in model training, error handling, and intelligent post-processing techniques, fostering more reliable deployment of LLMs in sophisticated coding environments. As AI continues to augment human capabilities in software engineering, it is imperative to address these challenges to fully realize the potential benefits that LLMs promise.