Test Wars: A Comparative Study of SBST, Symbolic Execution, and LLM-Based Approaches to Unit Test Generation

Published 17 Jan 2025 in cs.SE | (2501.10200v1)

Abstract: Generating tests automatically is a key and ongoing area of focus in software engineering research. The emergence of LLMs has opened up new opportunities, given their ability to perform a wide spectrum of tasks. However, the effectiveness of LLM-based approaches compared to traditional techniques such as search-based software testing (SBST) and symbolic execution remains uncertain. In this paper, we perform an extensive study of automatic test generation approaches based on three tools: EvoSuite for SBST, Kex for symbolic execution, and TestSpark for LLM-based test generation. We evaluate the tools' performance on the GitBug Java dataset and compare them using various execution-based and feature-based metrics. Our results show that while LLM-based test generation is promising, it falls behind traditional methods in terms of coverage. However, it significantly outperforms them in mutation scores, suggesting that LLMs provide a deeper semantic understanding of code. The LLM-based approach also performed worse than the SBST and symbolic execution-based approaches with respect to fault detection capabilities. Additionally, our feature-based analysis shows that all tools are primarily affected by the complexity and internal dependencies of the class under test (CUT), with LLM-based approaches being especially sensitive to the CUT size.


Summary

The paper demonstrates that SBST, symbolic execution, and LLM-based approaches offer distinct advantages, with LLM-generated tests achieving higher mutation scores but traditional methods leading in coverage and fault detection.
Evaluation on the GitBug Java dataset shows that factors like cyclomatic complexity and class dependencies significantly affect the performance of tools such as EvoSuite, Kex, and TestSpark.
The study underlines the potential for hybrid strategies that combine these methods to enhance overall unit test generation effectiveness.

Automatic Test Generation using SBST, Symbolic Execution, and LLM Approaches

In this paper, Abdullin et al. investigate the performance of three approaches for automatic unit test generation: Search-Based Software Testing (SBST), Symbolic Execution, and techniques based on LLMs. The study conducts a comparative analysis using three contemporary tools: EvoSuite, Kex, and TestSpark, evaluated on the GitBug Java dataset. The paper emphasizes the strengths and weaknesses of each approach and explores the potential for hybrid strategies in test generation.

Introduction to Test Generation Approaches

The paper elucidates the necessity of automated test generation in software engineering, citing the limitations of manual testing. It highlights the effectiveness of SBST, symbolic execution, and the emerging capabilities of LLM-based tools in generating unit tests. EvoSuite and Kex represent traditional methods, while TestSpark leverages LLMs to explore new frontiers in software testing. The authors aim to fill existing gaps in comparative studies which often neglect symbolic execution or involve potential data contamination risks.

Evaluation of LLMs in Test Generation

The research critically evaluates various LLMs used in TestSpark for test generation to identify the most effective model. It finds that ChatGPT-4o, with its larger context size, achieves the best results by producing more compilable tests and achieving higher coverage compared to other LLMs with smaller context windows.

Figure 1: Execution-metrics comparisons of the different LLMs in TestSpark.

Comparison of Tools for Test Generation

The comparative evaluation of EvoSuite, Kex, and TestSpark with ChatGPT-4o reveals distinct performance characteristics. EvoSuite and Kex show comparable line and branch coverage, and both are ahead of TestSpark on these metrics. TestSpark, in contrast, achieves higher mutation scores, which the authors take as suggesting a deeper semantic understanding of code. In fault detection, however, the traditional tools outperform the LLM-based approach substantially.

Figure 2: Comparison of different automatic test generation tools.
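The gap between coverage and mutation score described above can be illustrated with a small sketch. It is not taken from the paper: the function `is_adult`, the three hand-written mutants, and both test suites are hypothetical, and real mutation testing tools generate mutants automatically. The point is only that two suites can execute the same lines while differing sharply in how many mutants they kill.

```python
from typing import Callable, List

# Hypothetical function under test (not from the paper).
def is_adult(age: int) -> bool:
    return age >= 18

# Hand-written mutants: each one changes the behavior of is_adult slightly.
def mutant_gt(age: int) -> bool:
    return age > 18      # relational operator replaced: >= becomes >

def mutant_lt(age: int) -> bool:
    return age < 18      # relational operator replaced: >= becomes <

def mutant_true(age: int) -> bool:
    return True          # return value replaced with a constant

MUTANTS: List[Callable[[int], bool]] = [mutant_gt, mutant_lt, mutant_true]

def weak_suite(f: Callable[[int], bool]) -> bool:
    """Executes the only line of f, but asserts nothing about the result value."""
    return isinstance(f(30), bool)

def strong_suite(f: Callable[[int], bool]) -> bool:
    """Checks the boundary value and both sides of it."""
    return f(18) is True and f(17) is False and f(30) is True

def mutation_score(suite: Callable[[Callable[[int], bool]], bool],
                   mutants: List[Callable[[int], bool]]) -> float:
    """Fraction of mutants on which the suite fails, i.e. mutants killed."""
    killed = sum(1 for mutant in mutants if not suite(mutant))
    return killed / len(mutants)

# Both suites pass on the original code and cover its single line.
assert weak_suite(is_adult) and strong_suite(is_adult)

print(mutation_score(weak_suite, MUTANTS))    # 0.0
print(mutation_score(strong_suite, MUTANTS))  # 1.0
```

Both suites reach full line coverage of `is_adult`, yet only the suite with meaningful assertions kills the mutants, which is why the two metrics can rank tools differently.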

Correlation Between Code Features and Tool Performance

Analyzing correlations between code features and tool performance provides insights into the applicability of each method. Factors such as cyclomatic complexity, class dependencies, and lines of code influence performance differently across tools. TestSpark's reliance on textual information such as comments and Javadoc is also noted, in contrast to the bytecode-level analyses performed by EvoSuite and Kex.

Figure 3: Line coverage.
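A feature-based analysis of this kind can be sketched as a rank correlation between a code feature and a per-class metric. This summary does not state which coefficient the authors used, so Spearman's rank correlation is an assumption here, and the helper names `ranks`, `pearson`, and `spearman` are illustrative. The input values in the usage lines are synthetic and do not represent results from the study.

```python
from typing import List, Sequence

def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    norm_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (norm_x * norm_y)

def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Synthetic illustration only: one entry per class under test.
cyclomatic_complexity = [3, 8, 15, 22, 40]
line_coverage = [0.95, 0.90, 0.70, 0.65, 0.30]
print(spearman(cyclomatic_complexity, line_coverage))
```

A coefficient near -1 in such an analysis would indicate that coverage tends to drop as the feature grows, while a value near 0 would indicate little monotonic relationship.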

Implications for Hybrid Test Generation Strategies

The study suggests that combining the strengths of SBST, symbolic execution, and LLM-based methods could lead to more effective test generation tools. By leveraging the unique advantages of each approach in different scenarios, a hybrid strategy could address the weaknesses identified in isolated methods.

Conclusion

Abdullin et al.'s comparative study underscores the potential of integrating diverse test generation approaches to achieve more comprehensive software testing. The insights from their evaluation offer a path forward for developing hybrid tools that combine the strengths of each methodology. As software complexity grows, such hybrid strategies could prove valuable in maintaining high software quality.

Figure 4: EvoSuite.

Future Work

Future research could focus on designing hybrid test generation strategies that dynamically choose the best approach based on class characteristics and testing goals. Fine-grained orchestration between SBST, symbolic execution, and LLMs could enhance test effectiveness, coverage, and fault detection capabilities beyond what is currently achievable by individual methods.
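One way to picture such a dynamic choice is a simple rule-based dispatcher. The sketch below is hypothetical and is not proposed in the paper: the `ClassFeatures` type, the `candidate_tools` function, the goal names, and the `MAX_LOC_FOR_LLM` threshold are all assumptions made for illustration. The rules encode only the findings reported in this summary, namely that the LLM-based tool leads on mutation score, trails on coverage and fault detection, and is especially sensitive to class size.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ClassFeatures:
    """Features of the class under test; only size is used in this sketch."""
    lines_of_code: int

# Hypothetical cutoff, chosen arbitrarily for illustration (not from the paper).
MAX_LOC_FOR_LLM = 300

def candidate_tools(features: ClassFeatures, goal: str) -> List[str]:
    """Return the tools in the order a hybrid driver might try them."""
    traditional = ["EvoSuite", "Kex"]
    if goal == "mutation_score":
        # Prefer the LLM-based tool only when the class is small enough.
        if features.lines_of_code <= MAX_LOC_FOR_LLM:
            return ["TestSpark"] + traditional
        return traditional + ["TestSpark"]
    if goal in ("coverage", "fault_detection"):
        # The traditional tools led on these metrics in the study.
        return traditional + ["TestSpark"]
    raise ValueError(f"unknown goal: {goal}")

print(candidate_tools(ClassFeatures(lines_of_code=120), "mutation_score"))
print(candidate_tools(ClassFeatures(lines_of_code=900), "mutation_score"))
```

A real orchestrator would likely learn such rules from data and take further features into account, such as complexity and dependency counts, rather than rely on a single fixed threshold.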