
AgoneTest: Automated LLM Java Test Evaluation

Updated 18 January 2026
  • AgoneTest is an automated evaluation framework that generates and rigorously assesses Java unit tests produced by LLMs.
  • It employs a modular pipeline integrating prompt engineering, LLM invocation, and systematic evaluation using metrics like coverage, mutation, and test smells.
  • The framework enables direct comparison between LLM-generated and human-written test suites, addressing gaps in prior isolated or method-level studies.

AgoneTest is an automated evaluation framework designed for the generation and rigorous assessment of unit test suites produced by LLMs for Java projects. It operationalizes all phases of the test suite pipeline—from benchmark dataset construction to systematic evaluation of LLM-generated artifacts—enabling head-to-head comparison with human-engineered test suites using coverage, mutation, and test smell metrics. AgoneTest provides standardized, large-scale, project-level evaluation infrastructure to address the methodological gaps in prior work, which focused primarily on isolated or method-level scenarios and lacked automation in integration and assessment (Lops et al., 2024, Lops et al., 25 Nov 2025).

1. System Architecture and Pipeline

AgoneTest employs a closed-loop, modular pipeline that operationalizes every stage of test suite generation and evaluation. Its architecture consistently separates three core phases: (1) Strategy Configuration, (2) Automated Test/Prompt Generation, and (3) Strategy Evaluation.

  1. Strategy Configuration:
    • Sample Projects Selection: Projects are sampled from the Classes2Test dataset, yielding a list of focal Java classes for testing.
    • Parameter Elicitation: For each sampled project, build descriptors (Maven, Gradle) are parsed to extract the relevant testing framework (JUnit 4/5), Java version, and example class/test pairs for few-shot learning prompts (Lops et al., 2024).
  2. Prompt and Test Generation:
    • Prompt Engineering: AgoneTest supports both zero-shot (general instructions) and few-shot (with exemplars) prompt templates. Placeholders for the focal class and environmental parameters are filled automatically. Prompts are constructed to fit model token budgets (e.g., using tiktoken parsers).
    • LLM Invocation: Test classes are generated by querying LLMs (gpt-4-1106-preview, gpt-3.5-turbo, gpt-4o-mini, gemini-1.5-pro, llama3.1:70b) via standardized APIs (LiteLLM). Output is post-processed to remove non-code elements and rendered as a valid JUnit test class placed under src/test/java.
    • Post-processing: Non-compiling or runtime-failing test classes are filtered out before downstream evaluation.
  3. Integration and Evaluation:
    • Dependency Injection and Build: Project build files are augmented, if necessary, to include JaCoCo (coverage), PiTest (mutation testing), and tsDetect (test smell detection).
    • Testing and Metrics Harvesting: Only "green" (compiling and passing) test suites are evaluated. Metrics are aggregated into final reports for subsequent analysis (Lops et al., 2024, Lops et al., 25 Nov 2025).

A schematic (ASCII) pipeline from (Lops et al., 25 Nov 2025):

Strategy Config ──▶ Prompt Creator ──▶ LLM Test Generation ──▶ Test Integration & Compilation ──▶ Test Execution & Metrics ──▶ Report
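The three phases above can be sketched as a minimal driver loop. This is an illustrative reconstruction, not AgoneTest's actual code: the function names (`build_prompt`, `run_pipeline`) and the `StrategyConfig` fields are hypothetical stand-ins for the real LiteLLM- and Maven/Gradle-backed components.

```python
from dataclasses import dataclass

@dataclass
class StrategyConfig:
    # Parameters elicited from the project's build descriptor (phase 1).
    test_framework: str   # e.g. "junit5"
    java_version: str     # e.g. "17"
    prompt_style: str     # "zero-shot" or "few-shot"

def build_prompt(cfg: StrategyConfig, focal_class: str) -> str:
    # Phase 2: fill template placeholders with the focal class
    # and environmental parameters (simplified).
    return (f"Generate a {cfg.test_framework} test class for Java "
            f"{cfg.java_version}:\n<code>{focal_class}</code>")

def run_pipeline(cfg, focal_classes, generate, compile_and_run, collect_metrics):
    # Phases 2-3: generate a test class per focal class, keep only
    # "green" (compiling and passing) suites, then harvest metrics.
    reports = []
    for cls in focal_classes:
        test_src = generate(build_prompt(cfg, cls))
        if compile_and_run(test_src):          # filter to green suites
            reports.append(collect_metrics(test_src))
    return reports
```

The three callables (`generate`, `compile_and_run`, `collect_metrics`) mark the pipeline's pluggable seams: swapping the LLM, the build toolchain, or the metric harvester does not change the loop.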

2. Dataset Construction: Classes2Test Benchmark

AgoneTest is underpinned by the Classes2Test dataset, a benchmark designed for project-level, class-focused evaluation:

  • Origin and Mapping: Classes2Test extends Methods2Test by mapping Java classes ("classes under test") to their associated test classes at the repository level. The mapping procedure combines naming convention heuristics (e.g., MyClassTest) with AST-based reference validation (≥ 60% class reference dominance), discarding ambiguous associations.
  • Corpus Statistics (Lops et al., 2024, Lops et al., 25 Nov 2025):

| Property              | Value       |
|-----------------------|-------------|
| Unique Repositories   | 9,410       |
| Total Test Classes    | 147,473     |
| Avg. LOC/class        | 1,178       |
| Cyclomatic Complexity | 55.3        |
| Java Versions         | 8/11/17/21+ |
| Test Framework Split  | JUnit4: 55%, JUnit5: 41%, Other: 4% |

For experimental splits, random subsets were drawn to ensure diversity in codebase size and complexity (e.g., 10 repositories/94 classes in (Lops et al., 2024)).
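The class-to-test mapping described above combines a naming-convention check with AST-based reference dominance. A minimal sketch of that decision rule, assuming the class references have already been extracted from the test class's AST (the function name `maps_to` is hypothetical):

```python
def maps_to(test_class_name: str, focal_name: str, referenced: list) -> bool:
    """Decide whether a test class maps to a focal class under test."""
    # Naming-convention heuristic: MyClass <-> MyClassTest / TestMyClass.
    named_match = test_class_name in (focal_name + "Test", "Test" + focal_name)
    if not referenced:
        return False
    # AST-based validation: the focal class must account for at least
    # 60% of the class references found in the test class body.
    dominance = referenced.count(focal_name) / len(referenced)
    return named_match and dominance >= 0.60
```

Pairs failing either check are treated as ambiguous and discarded, which trades recall for a cleaner benchmark.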

3. Test Generation Methodology

AgoneTest standardizes the prompt-based unit test generation process using LLMs:

  • Prompt Engineering:
    • Zero-Shot: Prompts instruct the LLM to generate a comprehensive test class for a given Java class using relevant project context information (testing framework, Java version).
    • Few-Shot: Prompts provide an additional input example pair (Java class + test) to guide the model, appended to the target focal class (Lops et al., 2024, Lops et al., 25 Nov 2025).
    • Prompts employ <code>...</code> wrappers and "system" messages to constrain LLM output to code-only responses.
  • Generation and Filtering:
    • Each prompt-LLM combination is executed, with token quota checks performed beforehand. Outputs are parsed and filtered for errors (syntax, parsing, failed compilation, execution failure). Only successful ("green") test suites progress to metric computation (Lops et al., 2024).
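The prompt construction and token-budget check can be sketched as follows. The template wording and the `make_prompt` helper are illustrative assumptions; AgoneTest's real prompts use tiktoken-based counting, which the whitespace-split `count_tokens` default only roughly stands in for.

```python
FEW_SHOT_TEMPLATE = (
    "You are a Java testing assistant. Reply with code only.\n"
    "Example class:\n<code>{example_class}</code>\n"
    "Example test:\n<code>{example_test}</code>\n"
    "Now write a {framework} test class for:\n<code>{focal_class}</code>"
)

def make_prompt(focal_class, framework, example=None,
                count_tokens=None, budget=8000):
    # Few-shot if an exemplar (class, test) pair is given; zero-shot otherwise.
    if count_tokens is None:
        count_tokens = lambda s: len(s.split())  # crude tokenizer stand-in
    if example:
        prompt = FEW_SHOT_TEMPLATE.format(example_class=example[0],
                                          example_test=example[1],
                                          framework=framework,
                                          focal_class=focal_class)
    else:
        prompt = (f"Reply with code only. Write a {framework} test class "
                  f"for:\n<code>{focal_class}</code>")
    # Token-budget check performed before invoking the LLM.
    if count_tokens(prompt) > budget:
        raise ValueError("prompt exceeds model token budget")
    return prompt
```

The `<code>...</code>` wrappers and the code-only system instruction mirror the output-constraining conventions described above.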

4. Evaluation Metrics and Formalization

AgoneTest implements a standardized, multi-factor quality metric suite:

  • Coverage Metrics (JaCoCo-based):

    • Line Coverage:

    $C_{\text{line}} = \dfrac{L_c}{L_t}$

    where $L_t$ is the total number of lines in the SUT and $L_c$ the number of lines executed at least once.

    • Branch Coverage:

    $C_{\text{branch}} = \dfrac{B_c}{B_t}$

    where $B_t$ is the total number of branches and $B_c$ the number of covered branches.

    • Method Coverage:

    $C_{\text{method}} = \dfrac{M_c}{M_t}$

    where $M_t$ is the total number of methods and $M_c$ the number of invoked methods.

    • Instruction Coverage:

    $C_{\text{instr}} = \dfrac{I_c}{I_t}$

    where $I_t$ is the total number of bytecode instructions and $I_c$ the number executed.

  • Mutation Score (PiTest-based):

    $M = \dfrac{\mu_k}{\mu_t}$

    where $\mu_t$ is the total number of mutants and $\mu_k$ the number of mutants killed by test failures.

  • Test Smell Indicators (tsDetect):

    Quantifies 18 Java test code smells (Assertion Roulette, Conditional Test Logic, Eager Test, etc.) per class. Smell counts are aggregated over compiling suites as

    $\overline{s_k} = \dfrac{1}{N_{\text{comp}}} \sum_{i=1}^{N} \kappa_i\, s_{k,i}, \qquad N_{\text{comp}} = \sum_{i=1}^{N} \kappa_i, \qquad \kappa_i \in \{0, 1\},$

    where $\kappa_i = 1$ marks a green test suite.

  • Compilation Rate:

    $R_{\text{build}} = \dfrac{1}{N} \sum_{i=1}^{N} \kappa_i$

    tracks the fraction of generated test suites that successfully compile and pass initial execution (Lops et al., 25 Nov 2025).

Metric reporting is performed per model/prompt configuration, using aggregate means over green (passing and compiling) classes.
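The formulas above reduce to simple ratios plus a green-only mean. A minimal sketch, with hypothetical helper names (`coverage`, `mutation_score`, `aggregate`) that do not come from AgoneTest itself:

```python
def coverage(covered: int, total: int) -> float:
    # Generic ratio used by all four coverage metrics
    # (line, branch, method, instruction).
    return covered / total if total else 0.0

def mutation_score(killed: int, total_mutants: int) -> float:
    # M = mu_k / mu_t: fraction of mutants killed by test failures.
    return killed / total_mutants if total_mutants else 0.0

def aggregate(values, green_flags):
    """Mean of a metric over green suites only, plus the compile rate.

    green_flags[i] is kappa_i: 1 if suite i compiles and passes, else 0.
    """
    green_vals = [v for v, k in zip(values, green_flags) if k]
    mean_green = sum(green_vals) / len(green_vals) if green_vals else 0.0
    compile_rate = sum(green_flags) / len(green_flags)   # R_build
    return mean_green, compile_rate
```

Note that `aggregate` averages only over green suites, matching the $N_{\text{comp}}$ normalization, while the compile rate is normalized over all $N$ generated suites.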

5. Experimental Comparison and Results

AgoneTest enables model- and prompt-level benchmarking under realistic conditions. Key findings synthesize results from both (Lops et al., 2024, Lops et al., 25 Nov 2025):

| Model/Prompt                | Compile       | Green |
|-----------------------------|---------------|-------|
| gpt-3.5-turbo zero-shot     | 68.1%         | 38.3% |
| gpt-4 zero-shot             | 80.9%         | 30.9% |
| gpt-4o-mini zero-/few-shot  | 28.6% / 25.3% | –     |
| llama3.1:70b zero-/few-shot | 9.8% / 7.1%   | –     |
| Human                       | 100%          | 100%  |

| Model/Prompt            | Line Cov.  | Branch Cov. | Method Cov. | Mutation   |
|-------------------------|------------|-------------|-------------|------------|
| gpt-4 zero-shot         | 86.6%      | 77.7%       | 85.5%       | 54.6%      |
| gpt-3.5-turbo zero-shot | 77.7%      | 70.6%       | 84.8%       | 54.7%      |
| gemini-1.5-pro few-shot | 89.8%      | –           | 92.9%       | –          |
| llama3.1:70b few-shot   | –          | 79.8%       | –           | 89.2%      |
| Human                   | 76.6–73.2% | 80.9–48.7%  | 69.8–74.0%  | 69.1–40.4% |

For compiled test suites, LLMs can achieve coverage metrics comparable to or exceeding the human-written baseline, but mutation scores (a defect-detection proxy) generally remain lower for LLMs except for the best prompt-model configurations (e.g., llama3.1:70b few-shot, mutation score 89.2%) (Lops et al., 25 Nov 2025). Low compilation rates—dominated by missing imports, override/type errors, and syntax issues—are a principal limiting factor.

6. Limitations, Recommendations, and Extensions

Several methodological and practical limitations are identified:

  • Limited Compilation Success: The majority of LLM-generated test suites do not compile or fail to pass all tests, particularly for complex classes, constraining aggregate metric validity (Lops et al., 25 Nov 2025).
  • Proxy Metrics: Coverage, mutation, and smell quantities are necessary but not sufficient proxies for semantic correctness.
  • Dataset Bias: Restriction to open-source Java projects and specific build/test frameworks introduces generalizability constraints.
  • Data Leakage and Run Variance: LLMs may be pretrained on subsets of Classes2Test or correlated corpora; results are sensitive to prompt design and generation temperature.

Recommendations and roadmap for system evolution include:

  1. Automated Repair Loops: Error feedback-driven prompt refinements or LLM-based repair for compilation/runtime faults.
  2. Retrieval-Augmented and Mutation-Aware Prompting: Enriching prompt context with project and Javadoc retrieval; explicit assertion of mutant examples.
  3. Advanced Statistical Rigor: Integrating confidence intervals, paired hypothesis tests, and multi-run evaluations.
  4. Language and Paradigm Generalization: Extending the benchmark and tooling beyond Java and JUnit to other languages (e.g., Python, JavaScript) and testing strategies (property-based, integration testing).
  5. Comprehensive Metrics: Adding flaky test detection, test prioritization, and integration with fault localization and review frameworks.

AgoneTest thus provides a reusable, extensible platform for the systematic, reproducible, and quantitative evaluation of LLM-based test generation in real-world software contexts (Lops et al., 25 Nov 2025).

7. Context and Significance

AgoneTest addresses the methodological shortcomings of prior LLM test-gen studies, which frequently assess only single methods or small-scale examples. By supporting automated, class-level benchmarking with high-fidelity metrics across a vast and representative Java landscape, it clarifies the strengths and weaknesses of contemporary LLMs in software testing. Its modular architecture supports rapid experimentation with new LLMs, prompting strategies, and metric formulations, thereby advancing the research frontier in automated software quality assurance (Lops et al., 2024, Lops et al., 25 Nov 2025).
