
SWE-Bench Leaderboards

Updated 8 February 2026
  • SWE-Bench leaderboards are standardized platforms that evaluate automated software repair systems using real-world GitHub issues.
  • They employ clear metrics like resolved rate and pass@k across diverse languages to measure accuracy and resource efficiency.
  • Their workflow involves Dockerized testing, strict metadata protocols, and continuous integration to ensure reproducible and robust benchmarking.

A SWE-Bench leaderboard is a standardized, public platform for evaluating, ranking, and comparing automated software repair systems—typically LLM–based agents—on realistic, repository-scale code-editing tasks derived from real-world GitHub issues and pull requests. These leaderboards implement rigorous, reproducible testing pipelines with curated datasets, formally defined metrics, submission protocols, and continuous integration, thereby serving as a primary means of measuring progress in automated program repair, agent system design, and LLM-driven software engineering. Over time, SWE-Bench leaderboards have expanded beyond Python and now encompass multilingual and stateful agent benchmarks, advanced multi-resource effectiveness metrics, and rigorous validations against data contamination and test insufficiency.

1. Origins, Scope, and Structure of SWE-Bench Leaderboards

SWE-Bench was introduced as an evaluation framework constructed from 2,294 real-world software engineering tasks, each pairing a natural-language GitHub issue with its corresponding code-modifying pull request, spanning 12 popular Python repositories (Jimenez et al., 2023). SWE-Bench leaderboards centralize evaluation by enforcing uniform task definitions: given an issue and a full codebase, a candidate system must generate a patch so that all designated failing tests now pass, without breaking previously passing tests. The main public leaderboards are the full SWE-Bench set, SWE-Bench Lite (a 300-instance subset selected for faster evaluation), and SWE-Bench Verified (a 500-instance, human-validated subset).

Each leaderboard instance is a single, self-contained “issue resolution” task. There is no explicit weighting—all instances contribute equally.

Extensions now include:

  • SWE-bench-java-verified: The first officially supported non-Python leaderboard, evaluating 91 curated Java issue-patch pairs with Dockerized build/test harnesses (Zan et al., 2024).
  • SWE-Bench++: An automated pipeline producing more than 11,100 repository-level tasks from 11 languages. The leaderboard evaluates a 1,782-task stratified subset, supporting multilingual, multi-paradigm comparison (Wang et al., 19 Dec 2025).
  • Ambiguous and Stateful Leaderboards: Newer benchmarks that assess agents’ ability to interact with ambiguous user instructions or persist user state across sessions (Zhou et al., 24 Oct 2025).

These leaderboards operate as static websites or web portals, with machine-readable submission formats and integration with CI/CD pipelines for automated evaluation.

2. Metrics, Scoring, and Evaluation Protocols

The principal ranking metric is the Resolved Rate (“Pass@1” or “precision”), defined for an evaluation set of $N$ issues as

$$\text{Precision} = \frac{\#\,\text{Resolved Issues}}{N} \times 100\%$$

where resolution requires that the patch passes all specified tests (Martinez et al., 4 Feb 2026, Martinez et al., 20 Jun 2025, Zan et al., 2024).
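The ranking score is thus a simple ratio over per-instance verdicts. A minimal sketch (the instance IDs and dictionary layout are illustrative, not an official schema):

```python
def resolved_rate(results):
    """Resolved rate (pass@1) as a percentage: the fraction of evaluation
    instances whose patch made all designated tests pass."""
    if not results:
        return 0.0
    resolved = sum(1 for passed in results.values() if passed)
    return 100.0 * resolved / len(results)

# Example: 3 of 4 instances resolved -> 75.0 (instance IDs are illustrative)
scores = {
    "repo_a__issue-101": True,
    "repo_b__issue-202": True,
    "repo_c__issue-303": False,
    "repo_d__issue-404": True,
}
print(resolved_rate(scores))  # 75.0
```

Because every instance contributes equally, the score is just the unweighted mean of per-instance pass/fail outcomes.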

Secondary or context-specific metrics include:

  • Apply Rate: Fraction of issues where the generated patch can be applied without error.
  • Pass@k: For code-generation benchmarks sampling $k$ completions per issue, the proportion of tasks for which any candidate passes the entire test suite (e.g., pass@10 in SWE-Bench++) (Wang et al., 19 Dec 2025).
  • Recall@k, Hit-All@k, Hit-Any@k: For localization or retrieval subtasks (i.e., predicting edited files/classes given an issue description), Recall@k and similar metrics quantify retrieval accuracy (Prathifkumar et al., 11 Dec 2025).
  • Effectiveness AUCs: In SWE-Effi, resource-aware area-under-curve scores (e.g., token-, time-, cost-bounded effectiveness) capture the trade-off between accuracy and resource expenditure (Fan et al., 11 Sep 2025).
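For pass@k, a widely used estimator is the unbiased formulation popularized by earlier code-generation benchmarks: given $n$ samples per task of which $c$ pass, it estimates the probability that at least one of $k$ drawn samples passes. Whether a given SWE-Bench leaderboard uses this exact estimator or a plain empirical count is an assumption here; the sketch below shows the unbiased version:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n candidates (c of which pass) is passing."""
    if n - c < k:
        return 1.0  # fewer failing candidates than draws: a pass is certain
    return 1.0 - comb(n - c, k) / comb(n, k)

# Aggregate over tasks: the mean of per-task pass@k estimates.
per_task = [(10, 3), (10, 0), (10, 10)]  # (n samples, c passing) per task
k = 5
print(sum(pass_at_k(n, c, k) for n, c in per_task) / len(per_task))
```

Computing per-task estimates first and then averaging avoids the bias of simply pooling all samples across tasks.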

Certain leaderboards implement auxiliary metrics beyond these. Top-line leaderboard tables list methods, core metrics, and sometimes breakdowns by repository or language.

3. Submission Workflow and Maintenance

Leaderboard entries must adhere to well-defined submission schemas, including:

  • Metadata: Model/agent name, LLM version, training details, compute resources, date, and (optionally) cost (Zan et al., 2024).
  • Per-issue Results: JSON or tabular outputs denoting which issues are resolved, including pass/fail for each instance.
  • Aggregate Score: Summary files recording the core resolved count, fraction, and submission timestamp.
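A submission bundle along these lines might look as follows; the field names and file layout here are hypothetical illustrations, not the official schema of any particular leaderboard:

```python
import json

# Hypothetical per-issue results file: one verdict per instance ID.
results = {
    "instance_0001": {"resolved": True,  "patch_applied": True},
    "instance_0002": {"resolved": False, "patch_applied": True},
    "instance_0003": {"resolved": False, "patch_applied": False},
}

# Hypothetical aggregate summary derived from the per-issue results.
resolved = sum(1 for r in results.values() if r["resolved"])
summary = {
    "agent": "example-agent-v1",   # metadata: model/agent name (illustrative)
    "resolved_count": resolved,
    "resolved_rate": round(100.0 * resolved / len(results), 2),
    "timestamp": "2026-02-08T00:00:00Z",
}
print(json.dumps(summary, indent=2))
```

Keeping the aggregate score derivable from the per-issue file is what allows maintainers to re-verify submissions mechanically.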

Typical submission protocol:

  1. Prepare agent patches/code and run evaluations in a Dockerized container against the provided evaluation harness (language-specific: e.g., Maven/Gradle for Java).
  2. Aggregate raw results and score via supplied scripts.
  3. Submit outputs (results, summary, metadata) via pull request to the leaderboard’s public repository.
  4. Continuous integration is triggered: the maintainers’ infrastructure re-runs evaluations, checks output consistency, validates schema, and merges accepted results into the live leaderboard.
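The consistency check in step 4 can be sketched as a script that recomputes the aggregate from the per-issue file and rejects mismatches. This is a simplified illustration; real harnesses also re-run the full Dockerized evaluation rather than trusting submitted verdicts:

```python
def validate_submission(results, summary):
    """Check that the claimed aggregate matches the per-issue results.
    Returns a list of schema/consistency errors (empty if acceptable)."""
    errors = []
    for key in ("agent", "resolved_count", "resolved_rate"):
        if key not in summary:
            errors.append(f"summary missing required field: {key}")
    if not errors:
        actual = sum(1 for r in results.values() if r.get("resolved"))
        if summary["resolved_count"] != actual:
            errors.append(f"resolved_count {summary['resolved_count']} "
                          f"!= recomputed {actual}")
        expected_rate = round(100.0 * actual / len(results), 2)
        if abs(summary["resolved_rate"] - expected_rate) > 0.01:
            errors.append("resolved_rate inconsistent with per-issue results")
    return errors

results = {"a": {"resolved": True}, "b": {"resolved": False}}
claim = {"agent": "x", "resolved_count": 2, "resolved_rate": 100.0}
print(validate_submission(results, claim))  # inflated claim -> two errors
```

A consistent submission (`resolved_count` 1, `resolved_rate` 50.0 for this example) would return an empty error list and be merged into the live leaderboard.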

For multi-language or expanding leaderboards, new language splits require a catalog of issue instances, build commands, and appropriate Docker toolchains (Zan et al., 2024, Wang et al., 19 Dec 2025).

4. System Architectures and Performance Trends

System design among submissions is highly heterogeneous (Martinez et al., 20 Jun 2025, Martinez et al., 4 Feb 2026):

  • Agentless/workflow approaches: Fixed, human-authored pipelines with single-round verifiable steps.
  • SWE-Agent frameworks: Multi-turn, emergently autonomous systems capable of file localization, patch generation, test execution, and iterative refinement.
  • Hybrid and multi-agent systems: Ensembling, LLM-as-judge, retrieval-augmented, or tool-integrated designs.

Dominant trends:

  • Proprietary LLMs (notably the Anthropic Claude family, GPT-4/5, Gemini) have consistently topped leaderboards, particularly in Verified splits (Martinez et al., 4 Feb 2026).
  • Industry teams, ranging from startups to large publicly traded companies, account for the majority of top-performing submissions; academia and single-developer efforts remain visible but less prevalent at the highest ranks (Martinez et al., 20 Jun 2025, Martinez et al., 4 Feb 2026).
  • Overfitting (patches passing available tests but semantically incorrect) is prevalent even in human-audited splits, indicating the need for stronger oracle design and expanded validation (Yu et al., 10 Jun 2025).
  • Open-source LLMs (Qwen, LLaMA, Mistral) are increasingly competitive when leveraged in ensemble or pipeline components but remain generally behind best-in-class proprietary models (Martinez et al., 4 Feb 2026, Martinez et al., 20 Jun 2025).

Architecture classification (example from (Martinez et al., 20 Jun 2025)):

| Group | Description | Max Precision (Verified) |
|-------|-------------|--------------------------|
| G1 | No agent, fixed workflow | 50.8% |
| G4 | Scaffolded, single-agent | 70.8% |
| G6 | Emergent control, single-agent | 73.2% |
| G7 | Emergent control, multi-agent | 62.2% |

No single workflow dominates; high-performing systems typically blend elements (retrieval, orchestration, self-critique).

5. Benchmark Integrity: Contamination, Test Sufficiency, and Best Practices

Multiple studies have demonstrated that leaderboard accuracy can be inflated by contamination and incomplete test oracles:

  • Contamination: SWE-Bench-Verified overlaps significantly with LLM pretraining data: models are 3–6× more accurate in localizing bug locations on this benchmark than on held-out or decontaminated sets (Prathifkumar et al., 11 Dec 2025). Reported contamination rates reach 8–10%. This suggests that apparent generalization may instead reflect training recall.
  • Test Suite Insufficiency: The UTBoost framework reveals that ~41% of Lite and 24% of Verified leaderboard entries were mis-scored due to inadequate or incorrectly parsed test suites, affecting up to 345 unique patch assessments (Yu et al., 10 Jun 2025). Intramorphic augmentation with automated tests corrects both false positives and ranking jitter.
  • Methodological Guidance: Maintain explicit versioning, perform continuous re-minimization (e.g., with BISS (Matricon et al., 8 Sep 2025)) when variants change, require metadata documentation, and rotate test sets to mitigate memorization and saturation.
  • Leaderboard Evolution: Recent recommendations include freezing contaminated public splits and transitioning to live, rotating, or decontaminated benchmarks (e.g., SWE-Bench++, SWE-rebench), with explicit reporting of contamination rates and confidence intervals (Prathifkumar et al., 11 Dec 2025, Wang et al., 19 Dec 2025).

6. Recent Extensions: Multilingual, Resource-Constrained, and Stateful Leaderboards

The SWE-Bench ecosystem has broadened to support:

  • Multilingual and Non-Python Leaderboards: SWE-bench-java and SWE-Bench++ enable code agent evaluation on Java, C/C++, Rust, Go, and others, reporting language-specific pass@k metrics (Zan et al., 2024, Wang et al., 19 Dec 2025).
  • Resource-Constrained Leaderboards: SWE-Effi re-ranks leading agents not only by accuracy but by token, time, and cost effectiveness, surfacing the phenomenon of “token snowball” and “expensive failures” in agentic frameworks (Fan et al., 11 Sep 2025).
  • User/Interaction-Aware Tracks: Ambiguous and stateful SWE-Bench variants benchmark the capacity of systems such as ToM-SWE to model and persist user intent, with significant task success and satisfaction improvements over non-ToM agents (Zhou et al., 24 Oct 2025).
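The resource-aware AUC idea behind SWE-Effi can be illustrated with a budget sweep: for each token budget $B$, count the fraction of instances resolved within $B$ tokens, then integrate that curve over budgets. This is a simplified sketch and does not reproduce SWE-Effi's exact normalization:

```python
def effectiveness_auc(runs, max_budget):
    """Normalized area under the 'fraction resolved within token budget B'
    curve, swept from 0 to max_budget.
    `runs` is a list of (tokens_used, resolved) pairs, one per instance."""
    step = max_budget // 100 or 1          # ~100 sample points on the sweep
    budgets = range(0, max_budget + 1, step)
    total = 0.0
    for b in budgets:
        frac = sum(1 for tok, ok in runs if ok and tok <= b) / len(runs)
        total += frac
    return total / len(budgets)            # mean curve height in [0, 1]

# Illustrative per-instance outcomes: (tokens spent, resolved?)
runs = [(1_000, True), (5_000, True), (20_000, False), (50_000, True)]
print(round(effectiveness_auc(runs, 50_000), 3))
```

Under this score, an agent that resolves instances cheaply outranks one with the same resolved rate that burns far more tokens per fix, which is exactly the "expensive failures" phenomenon such metrics are designed to surface.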

These developments reflect a shift from static, Python-centric, accuracy-only leaderboards toward dynamic, multidimensional, agent- and environment-aware evaluation frameworks for AI-powered software engineering agents.

7. Representative Leaderboard Snapshots and Key Results

Example: SWE-Bench Verified (Python), “Pass@1” Leaderboard (2025)

| Rank | Agent/System | Score (%) |
|------|--------------|-----------|
| 1 | EPAM AI/Run Dev Agent (Claude 4 Sonnet) | 76.8 |
| 2 | Trae Agent (Claude 3.7 Sonnet + Opus + GPT-5 + Gem2.5) | 75.2 |
| 3 | Bytedance TRAE LLM Agent (Claude 4 Opus) | 73.2 |

Example: SWE-bench-java-verified (Java), “Resolved Rate” (2024)

| Model | Resolved Rate (%) |
|-------|-------------------|
| DeepSeek-V2-0628 | 9.89 |
| DeepSeekCoder-V2-0724 | 7.69 |
| GPT-4o | 6.59 |

Example: SWE-Bench++ (Multilingual), “pass@10” (2025)

| Model | Overall pass@10 (%) |
|-------|---------------------|
| claude-sonnet-4.5 | 36.20 |
| gpt-5-2025-08-07 | 34.57 |
| gemini-2.5-pro | 24.92 |
| gpt-4o | 16.89 |

A plausible implication is that as leaderboards expand in language and evaluation rigor, the definition of “state-of-the-art” becomes sensitive not only to raw LLM capability but also to resource constraints, test oracle quality, benchmark contamination, and the agent-system’s architecture. The prevailing best practice is to report both accuracy and resource effectiveness, alongside strict documentation and contamination audits, to ensure leaderboard scores reflect real progress in automated software engineering (Wang et al., 19 Dec 2025, Yu et al., 10 Jun 2025, Fan et al., 11 Sep 2025).
