SWE-bench 500: Benchmark for Automated Patch Repair
- SWE-bench 500 is a curated benchmark suite comprising 500 verified Python issue-fixing tasks designed to evaluate automated software agents.
- It employs a test-suite–based validation protocol that reveals limitations like hidden semantic divergences and dataset contamination.
- Empirical analysis shows that leaderboard scores can overestimate true performance by up to 6 percentage points, pointing to the need for more robust evaluation methods.
SWE-bench 500, most often referred to as SWE-bench Verified, is a benchmark suite of 500 real-world issue-fixing tasks for evaluating automated software engineering agents, particularly on patching Python repositories based on natural language issue descriptions. It has become the de facto leaderboard for code-fixing agents, yet recent work highlights both its impact and its methodological limitations, along with the critical need for more robust evaluation protocols (Wang et al., 19 Mar 2025, Pan et al., 2024, Yu et al., 10 Jun 2025, Prathifkumar et al., 11 Dec 2025).
1. Dataset Composition and Construction
SWE-bench Verified is a human-filtered subset of 500 issues from the broader 2,294-task SWE-bench dataset, which was originally constructed by scraping ∼90,000 pull requests across 12 popular open-source Python repositories (e.g., Django, SymPy, Matplotlib, scikit-learn) (Jimenez et al., 2023). Tasks were selected by requiring that the PR referenced a GitHub issue and modified at least one test file. The final SWE-bench 500 subset was curated by OpenAI annotators to emphasize high-quality, well-specified, reproducible issues: ambiguous, noisy, and test-unrelated tasks were removed (Wang et al., 19 Mar 2025).
Each instance in SWE-bench 500 consists of:
- A repository snapshot at the buggy version,
- The corresponding GitHub issue statement (title, body, and comments),
- The test patch (new/modified tests added by the PR),
- The developer-written "oracle" patch that fixes the issue.
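Concretely, an instance can be modeled as a small record. The sketch below uses illustrative field names and values; the official dataset schema may differ:

```python
from dataclasses import dataclass

@dataclass
class SWEBenchInstance:
    """One issue-fixing task (field names are illustrative, not the official schema)."""
    repo: str          # e.g. "django/django"
    base_commit: str   # repository snapshot at the buggy version
    issue_text: str    # GitHub issue title, body, and comments
    test_patch: str    # unified diff adding/modifying the PR's tests
    gold_patch: str    # developer-written "oracle" fix

# Hypothetical example instance (all values invented for illustration).
example = SWEBenchInstance(
    repo="django/django",
    base_commit="abc123",
    issue_text="QuerySet.union() crashes when combined with ordering",
    test_patch="--- a/tests/queries/test_union.py\n+++ b/tests/queries/test_union.py",
    gold_patch="--- a/django/db/models/query.py\n+++ b/django/db/models/query.py",
)
```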
Tasks span diverse bug types, enhancements, and feature additions and cover typical Python library codebases. Issue descriptions average ∼195 words, and gold patches typically modify 1–2 files and 4–10 lines of code, with many tasks requiring cross-file or semantic reasoning spanning several functions (Jimenez et al., 2023, Prathifkumar et al., 11 Dec 2025).
2. Validation Protocol and Its Limitations
The validation protocol for SWE-bench 500 adopts a strictly test-suite–based approach:
- Check out the buggy version and apply the test patch (so that the failing "issue" tests are present).
- Apply a candidate patch (agent-generated).
- Run only those tests that were modified in the corresponding PR.
- If all those tests pass, mark the issue as "solved."
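The steps above can be sketched as a small driver. The `instance` keys and helper functions are assumptions for illustration; the official harness uses per-repository commands and containers:

```python
import subprocess

def run_pytest(test_paths):
    """Run only the given test files; True if all pass.
    (Hypothetical helper; the official harness uses per-repository test commands.)"""
    return subprocess.run(["pytest", "-q", *test_paths]).returncode == 0

def validate(instance, apply_patch, run_tests):
    """Return True ('solved') iff the PR-modified tests pass after the
    candidate patch is applied. Keys on `instance` are illustrative."""
    apply_patch(instance["test_patch"])        # make the failing issue tests present
    apply_patch(instance["candidate_patch"])   # apply the agent-generated patch
    return run_tests(instance["pr_modified_tests"])  # run ONLY PR-modified tests
```

Note that `validate` never sees tests outside the PR, which is exactly the non-exhaustiveness criticized above.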
This protocol is non-exhaustive: unmodified or non-PR tests are ignored, and the possibility of semantic divergences or silent regressions is not addressed (Wang et al., 19 Mar 2025, Yu et al., 10 Jun 2025). As a result, plausible patches that pass the small PR-modified suite may still fail to match the full intent of the fix or introduce regressions elsewhere in the repository. Empirically, re-running the entire developer test suite, not just the PR-modified tests, shows that 7.8% of plausible patches fail some untested behavior, causing a 4.5-percentage-point absolute drop in reported resolution rates.
Furthermore, the UTBoost framework demonstrates that over 5% of SWE-bench Verified instances have original test suites insufficient to distinguish truly correct from semantic-no-op patches, and over half the instances suffer from annotation or parser errors (Yu et al., 10 Jun 2025).
3. Faulty "Resolution": PatchDiff and Error Categorization
PatchDiff is an automated differential testing technique designed to expose semantic divergences between a candidate patch and the ground-truth developer patch (Wang et al., 19 Mar 2025). It operates in four stages:
- Identifies "target functions" using per-test call traces.
- Extracts minimal contextual code around these functions.
- Prompts an LLM to generate up to 10 pytest-format differentiating tests per function, focusing on tests that pass under one patch but fail under the other.
- Filters to retain only robust, non-flaky, genuinely differentiating tests.
No scalar "divergence score" is reported; it suffices that a differentiating test exists to count a plausible patch as behaviorally divergent.
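The final filtering stage can be sketched as a selection over boolean pass/fail runners. This is a hypothetical simplification of PatchDiff, not its actual implementation:

```python
def filter_differentiating(tests, run_candidate, run_gold, repeats=3):
    """Keep only robust differentiating tests: the outcome must be stable
    across repeated runs under each patch (non-flaky), and differ between
    the candidate and the gold patch. Hypothetical simplification of
    PatchDiff's filtering stage."""
    kept = []
    for t in tests:
        cand = [run_candidate(t) for _ in range(repeats)]
        gold = [run_gold(t) for _ in range(repeats)]
        stable = len(set(cand)) == 1 and len(set(gold)) == 1
        if stable and cand[0] != gold[0]:
            kept.append(t)  # passes under exactly one of the two patches
    return kept
```

A single surviving test suffices to flag the candidate patch as behaviorally divergent.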
Applying PatchDiff to SWE-bench 500 shows that 29.6% of plausible patches from state-of-the-art agents induce at least one behavioral divergence relative to the ground-truth patch. Manual inspection of a 30% sample of these "suspicious" patches finds that 28.6% are certainly incorrect, corresponding to an estimated 11.0% incorrectness rate among all plausible patches and inflating resolution rates by a further 6.2 percentage points.
The taxonomy of divergence reveals that:
- 46.8% are divergent implementations of the same intended change,
- 27.3% are supplementary semantic changes (the agent introduces extra behavior),
- 5.2% omit a required fix,
- 20.8% are completely unaligned edits.
4. Benchmarking Impact, Agent Scores, and State of the Art
SWE-bench 500 is the standard leaderboard for open-source and proprietary code agents. Fine-tuned open-weight agents with verifiers (e.g., OpenHands CodeActAgent with Qwen-2.5-32B plus a verifier) achieve up to 32.0% resolution rate under official conditions (Pan et al., 2024). However, re-analysis reveals that leaderboard scores can be inflated by 4–6 percentage points due to hidden divergences and insufficient validation (Wang et al., 19 Mar 2025, Yu et al., 10 Jun 2025).
Recent leaderboard corrections using UTBoost led to rank changes for nearly a quarter of agents: after rescoring with improved test suites and parsers, 24.4% of agent ranks were altered for SWE-bench Verified (Yu et al., 10 Jun 2025). A significant fraction of high-resolution agents likely overfitted to the benchmark or benefited from previously unrevealed insufficiencies.
5. Test-Set Contamination and Validity Concerns
Evidence from recent studies demonstrates that SWE-bench 500 is highly susceptible to dataset contamination due to its construction before common LLM pretraining cutoffs (Prathifkumar et al., 11 Dec 2025). Localization experiments show that LLMs can identify the correct files to edit at rates ~3-6× higher than on fresh (unseen) benchmarks, even with "ticket-only" (no code context) input—strongly suggesting memorization effects. For instance, the all-correct file localization accuracy on SWE-bench Verified reaches 65–76% with only issue text or file structure, compared to 8–21% on fresh benchmarks (BeetleBox, SWE-rebench) (Prathifkumar et al., 11 Dec 2025).
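The localization probe amounts to an all-correct accuracy metric: an instance counts only when every file touched by the gold patch appears in the model's prediction. A minimal sketch of that metric, with the exact scoring rule assumed rather than taken from the paper's code:

```python
def all_correct_localization(predictions, gold_files):
    """Fraction of instances where the predicted file set covers every file
    touched by the gold patch. (Assumed scoring rule, shown for illustration;
    the paper's exact criterion may differ, e.g. exact-set match.)"""
    hits = sum(set(g) <= set(p) for p, g in zip(predictions, gold_files))
    return hits / len(gold_files)
```

A large gap between this score on SWE-bench Verified and on post-cutoff benchmarks, holding the input format fixed, is the memorization signal described above.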
This empirical result implies that high leaderboard performance on SWE-bench 500 may reflect model recall, not true agent engineering or generalization. Benchmarks are now shifting toward continuous decontamination and post-cutoff issue injection.
6. Recommendations and Future Directions
The consensus from recent literature is that SWE-bench 500, while foundational, cannot be viewed as a foolproof or contamination-resistant oracle (Wang et al., 19 Mar 2025, Yu et al., 10 Jun 2025, Prathifkumar et al., 11 Dec 2025). Key recommendations include:
- Upgrade validation by always running the entire developer test suite rather than only PR-modified tests, as ~8% of "passed" patches mask regressions.
- Employ automated techniques like PatchDiff to detect semantic divergence and supplement test suites with focused differentiating tests, thereby systematically hardening validation.
- Correct leaderboard reporting by adopting practices such as automatic LLM-driven test-case augmentation (e.g., UTBoost), robust multi-line log parsing, and equivalence-based (intramorphic) patch comparison.
- Transition toward dynamic benchmarks (e.g., SWE-rebench) that continuously append post-cutoff issues, apply explicit decontamination, and quantify overlap with model training data (Prathifkumar et al., 11 Dec 2025).
- Recognize that test-centric validation alone is insufficient for semantic correctness; benchmarks must incorporate coverage augmentation and manual review for under-specified or ambiguous issues.
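Taken together, the first two recommendations define a stricter acceptance predicate than the current protocol. A minimal sketch, with all runner callables assumed rather than drawn from any official harness:

```python
def hardened_accept(run_pr_tests, run_full_suite, differentiating_tests):
    """Accept a candidate patch only if it passes the PR-modified tests,
    passes the entire developer suite (no silent regressions), and no
    differentiating test separates it from the gold patch. Illustrative
    predicate, not an official SWE-bench harness API."""
    if not run_pr_tests():
        return "not plausible"          # fails the current protocol already
    if not run_full_suite():
        return "regression"             # passed PR tests but broke other behavior
    if differentiating_tests:
        return "semantic divergence"    # a PatchDiff-style test separates the patches
    return "accepted"
```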
Unless these criteria are met, pass-rate metrics on SWE-bench 500 risk overestimating the true capabilities of software engineering agents, stalling methodological progress and misrepresenting actual advances in code synthesis. The incorporation of automated test diversification and semantic patch-differentiation now represents the minimum standard for reliable issue-solving evaluation on this canonical 500-task suite.