ManySStuBs4J Benchmark: Java Bug Corpus
- The ManySStuBs4J benchmark is a large-scale, semantically annotated corpus of 153,652 single-statement Java bugs and fixes from over 1,000 open-source projects.
- It utilizes a rigorous AST-based methodology with 16 standardized SStuB templates to classify and validate bug fixes.
- The dataset underpins empirical evaluations in automated program repair, fault localization, and exposure-aware LLM code analysis.
The ManySStuBs4J benchmark is a large-scale, semantically annotated corpus of single-statement Java bugs (so-called "Simple, Stupid Bugs" or SStuBs) and their corresponding fixes, drawn from over a thousand open-source Java projects. Its design enables fine-grained empirical evaluation of automated program repair (APR), fault localization, and static analysis, as well as exposure-aware studies of code LLM behavior. The benchmark’s scope, construction methodology, template taxonomy, and emergent findings have made it a central reference for realism-focused program repair research and analysis of simple semantic defects in software engineering (Karampatsis et al., 2019, Mosolygó et al., 2021, Liu et al., 20 Sep 2025, Al-Kaswan et al., 15 Jan 2026).
1. Origin, Scope, and Goals
ManySStuBs4J ("Many Simple, Stupid Bugs for Java") was introduced to address the empirical gap in program repair research: the lack of statistically significant, realistic benchmark corpora of bugs that (a) occur naturally in the wild, (b) are localizable to single statements, and (c) can be unambiguously classified by template. The dataset encompasses 153,652 single-statement bug–fix changes mined from 1,000 popular open-source Java repositories, supplemented by a companion set of 25,539 fixes from the 100 most popular Maven projects. Each entry documents (i) the buggy line, (ii) the exact human-written fix, (iii) repository and commit-level provenance, and (iv) semantic “pattern” annotation if applicable.
The benchmark supports several explicit use cases: measuring the recall of repair approaches (especially template-based ones), facilitating studies on the empirical distribution and persistence of semantic bugs, and enabling robust comparisons of static analysis and machine learning-based quality assurance techniques (Karampatsis et al., 2019).
2. Dataset Construction and Annotation Methodology
Project selection for ManySStuBs4J relies on popularity-based ranking (z-score of forks plus stars) over two temporal snapshots of GHTorrent: April 2017 (small, Maven-constrained set) and January 2019 (large, unconstrained set). All commits are scanned; those matching "bug-fixing" heuristics in commit messages (e.g., matches for "bug," "fix," "issue," etc.) are considered candidate bug fixes (heuristic precision ≈ 96–97%). The subsequent filtering pipeline retains only modifications that introduce a single-statement change per diff hunk (AST-level, not line-based). Pure refactorings, such as identifier renamings, are filtered using static name-matching (Karampatsis et al., 2019).
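A simplified version of the commit-message heuristic might look like the following sketch; the exact keyword list and the tuning that yields the reported ≈96–97% precision are not reproduced here, so treat the pattern as an assumption:

```python
import re

# Illustrative keyword pattern; the original mining pipeline's heuristic
# may use a different keyword set and additional filters.
BUG_FIX_PATTERN = re.compile(
    r"\b(bug|fix(es|ed)?|issue|fault|defect|error)\b", re.IGNORECASE
)

def looks_like_bug_fix(commit_message: str) -> bool:
    """Return True if a commit message matches the bug-fixing heuristic."""
    return bool(BUG_FIX_PATTERN.search(commit_message))

print(looks_like_bug_fix("Fix NPE in parser when input is empty"))  # True
print(looks_like_bug_fix("Refactor: rename helper classes"))        # False
```

Candidates passing this filter would then proceed to the AST-level single-statement check described above.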
Each fix is then represented as a pair of AST subtrees, before and after the patch. The resulting collection is annotated by matching the edit to exactly one of 16 SStuB "templates," each formalized as a local AST mutation. Templates capture frequent semantic error classes such as identifier substitution, literal replacement, wrong function name, operator changes, or boolean condition tweaks. For each fix, the matching algorithm tests every template mutation, recording successful matches (see Section 3 for template formalizations).
3. Template Taxonomy and Formal Definitions
ManySStuBs4J’s semantic coverage rests on 16 templates, each specifying a minimal AST edit operation. Representative definitions:
- Change Identifier Used: One identifier is replaced by another of the same static type.
- Change Numeric Literal: A literal is changed without affecting the surrounding construct.
The taxonomy further distinguishes operator changes, argument swaps, boolean literal corrections, and signature modifications (e.g., adding or removing throws clauses), among others. In total, 33.47% of all single-statement fixes match at least one template.
| Template | Fixes (large set) | % of all fixes |
|---|---|---|
| Change Identifier Used | 22,668 | 14.75 |
| Wrong Function Name | 10,179 | 6.62 |
| Change Numeric Literal | 5,447 | 3.55 |
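The template-as-local-AST-mutation idea can be illustrated in miniature. The sketch below uses Python's own `ast` module on analogous Python snippets purely for illustration (the real corpus matches templates on Java ASTs, and the production matcher is far more complete); it checks whether two snippets differ only in a single numeric literal, i.e., the "Change Numeric Literal" template:

```python
import ast

def differs_only_in_numeric_literal(before: str, after: str) -> bool:
    """Simplified 'Change Numeric Literal' check: the two ASTs must have
    identical shapes, identical identifiers, and exactly one numeric
    constant that changed. Only Name ids and Constant values are compared
    in this illustration."""
    nodes_a = list(ast.walk(ast.parse(before)))
    nodes_b = list(ast.walk(ast.parse(after)))
    if len(nodes_a) != len(nodes_b):
        return False
    changes = 0
    for na, nb in zip(nodes_a, nodes_b):
        if type(na) is not type(nb):
            return False
        if isinstance(na, ast.Name) and na.id != nb.id:
            return False  # identifier change: a different template
        if isinstance(na, ast.Constant) and na.value != nb.value:
            if not (isinstance(na.value, (int, float))
                    and isinstance(nb.value, (int, float))):
                return False
            changes += 1
    return changes == 1

print(differs_only_in_numeric_literal("x = buf[10]", "x = buf[11]"))  # True
print(differs_only_in_numeric_literal("x = buf[10]", "y = buf[11]"))  # False
```

The dataset's annotation pipeline applies one such predicate per template and records every successful match for each fix.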
4. Dataset Format, Usage, and Empirical Distribution
Each benchmark instance is provided in a JSON lines format, containing the following fields: project identifier, file and commit hashes, start/end line numbers for the modified statement, source_before, source_after, serialized ASTs for before/after, and matched template label(s) (Karampatsis et al., 2019). A companion repository provides the mining scripts, matching pipeline, and (for the Maven subset) the infrastructure for compiling buggy/fixed versions.
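Consuming the corpus is straightforward given the JSON-lines layout. A minimal loading sketch follows; the field name `fixPattern` used for the template label is an assumption for illustration, so consult the released schema for the exact names:

```python
import json
from collections import Counter

def load_sstubs(path: str):
    """Yield one record per non-empty line of a ManySStuBs4J-style
    JSON lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def template_histogram(records):
    """Count fixes per matched template label; records lacking a label
    are bucketed as UNMATCHED. Field name 'fixPattern' is illustrative."""
    return Counter(r.get("fixPattern", "UNMATCHED") for r in records)

# Inline demo on hand-built records (stand-ins for file contents).
records = [{"fixPattern": "CHANGE_IDENTIFIER"},
           {"fixPattern": "CHANGE_NUMERIC_LITERAL"},
           {}]
print(template_histogram(records))
```

Grouping by template in this way reproduces distribution tables like the one in Section 3.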
Bug distribution exhibits a frequency of approximately one template-matching SStuB per 1,600–2,500 lines of code. The most frequent categories are identifier replacement, wrong function call, and numeric literal errors. These bugs are typically subtle, context-sensitive, and often invisible to standard static analysis (see Section 6).
5. Lifecycle Analysis and Persistence of SStuBs
A longitudinal analysis using ManySStuBs4J reveals that most SStuBs persist in codebases for extended periods before being fixed. The mean lifetime is approximately 240 days (median 58 days), with pronounced right skewness: a sizeable fraction of bugs remain unfixed for years. Only about 40% of SStuBs are repaired by the same author who introduced them; in these cases, the mean time to fix is 81 days (median 4 days), while for different-author fixes the mean is 349 days (median 136 days) (Mosolygó et al., 2021).
SStuBs are predominantly introduced in new code, and self-correction by the introducer leads to faster resolution. This fact suggests the potential benefit of immediate local review and lightweight linting during early development phases.
6. Evaluation as a Benchmark: Tool Performance and Use Cases
Benchmarking with ManySStuBs4J exposes significant limitations of conventional static analysis: SpotBugs recovers only 12% of SStuBs while emitting an extreme volume of false positives (over 200 million alerts), and PMD flags none with its default rule set. Precision is thus negligible and recall poor. These results are attributed to the "micro-semantic" nature of SStuBs—errors often undetectable by global or control-flow-based static analyses (Mosolygó et al., 2021).
Consequently, ManySStuBs4J has become pivotal for evaluating the recall and granularity of program repair tools, particularly template-based and LLM-based approaches. The dataset's scale, richness, and real-world provenance make it suitable for training, validation, and exposure-controlled evaluation in software defect studies and empirical software quality research (Liu et al., 20 Sep 2025, Al-Kaswan et al., 15 Jan 2026).
7. Application in LLM and Automated Repair Research
ManySStuBs4J underpins multiple recent advances in LLM-based APR and exposure-aware code model evaluation. The RelRepair system leverages project-specific signature and code snippet retrieval to enhance patch generation. In RelRepair's study, a stratified sample of 480 SStuBs (30 per template category) was used, with fix correctness determined by exact string match against the human-written patch. The fixed rate, i.e., the fraction of sampled bugs whose generated patch exactly matches the ground-truth fix, is 31.2% for ChatGPT alone, 37.7% with function signatures, and 48.3% (+17.1 pp) with code snippet retrieval, emphasizing the importance of grounding repairs in project context (Liu et al., 20 Sep 2025).
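The exact-match fixed-rate metric can be sketched as follows. The whitespace trimming below is an assumption added for robustness; the study reports plain exact string match:

```python
def fixed_rate(patches, ground_truth):
    """Fraction of generated patches that exactly match the human-written
    fix. Leading/trailing whitespace is stripped before comparison (an
    assumption; the original evaluation used exact string match)."""
    if len(patches) != len(ground_truth):
        raise ValueError("patches and ground_truth must align one-to-one")
    hits = sum(p.strip() == g.strip()
               for p, g in zip(patches, ground_truth))
    return hits / len(patches)

# One of two hypothetical patches matches its reference fix.
print(fixed_rate(["x = 1;", "y = 2;"], ["x = 1;", "y = 3;"]))  # 0.5
```

Because the criterion is exact match, the metric is a lower bound on semantic correctness: a semantically equivalent but textually different patch counts as a miss.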
Exposure-controlled analysis shows that LLMs are more likely to generate buggy code if they have "seen" the bug in training data (as determined using Data Portraits membership probing on the Stack-v2 corpus). For 16,899 processed SStuBs, when neither bug nor fix was present in the training data, bug reproduction rates by LLMs remained substantial (e.g., 39% for “bug without fix,” with exposure to the bug raising this to over 50%). Metrics such as minimum token probability reliably prefer fixes regardless of exposure, while others (arithmetic mean, Gini) flip preference under bug-only exposure (Al-Kaswan et al., 15 Jan 2026).
| Exposure | Bug w/o Fix | Fix w/o Bug | Mixed | No Match |
|---|---|---|---|---|
| Bug only | 52% | 4% | 23% | 21% |
| Fix only | 45% | 24% | 16% | 15% |
| Both seen | 49% | 21% | 16% | 14% |
| Neither seen | 39% | 11% | 27% | 23% |
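The four exposure categories in the table can be derived from per-snippet membership flags (e.g., from Data Portraits probing). The helper below is an illustrative reconstruction of that bucketing and of a per-bucket reproduction rate, not the study's code:

```python
from collections import Counter

def exposure_bucket(bug_seen: bool, fix_seen: bool) -> str:
    """Map membership flags for the buggy and fixed variants to the four
    exposure categories used in the table above."""
    if bug_seen and fix_seen:
        return "Both seen"
    if bug_seen:
        return "Bug only"
    if fix_seen:
        return "Fix only"
    return "Neither seen"

def reproduction_rate(results):
    """results: iterable of (bug_seen, fix_seen, reproduced_bug) triples.
    Returns the fraction of cases per bucket where the model reproduced
    the bug (illustrative aggregation)."""
    totals, hits = Counter(), Counter()
    for bug_seen, fix_seen, reproduced in results:
        bucket = exposure_bucket(bug_seen, fix_seen)
        totals[bucket] += 1
        hits[bucket] += bool(reproduced)
    return {b: hits[b] / totals[b] for b in totals}
```

Partitioning evaluation samples this way is what makes exposure-controlled comparisons, such as the 52% vs. 39% bug-reproduction gap above, possible.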
A plausible implication is that evaluation methodologies which do not account for exposure effects risk over- or underestimating LLM repair efficacy on such naturally occurring, single-statement bugs (Al-Kaswan et al., 15 Jan 2026).
The ManySStuBs4J benchmark represents a methodological advance in empirical software engineering, enabling high-fidelity, reproducible comparison of APR and code analysis systems on simple, yet pervasive, semantic bugs. Its coverage of realistic defect classes, detailed provenance, and adaptability for modern evaluation protocols have made it foundational in contemporary program repair and LLM-based code research (Karampatsis et al., 2019, Mosolygó et al., 2021, Liu et al., 20 Sep 2025, Al-Kaswan et al., 15 Jan 2026).