SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale

Published 27 Feb 2026 in cs.SE and cs.CL | (2602.23866v1)

Abstract: Software engineering agents (SWE) are improving rapidly, with recent gains largely driven by reinforcement learning (RL). However, RL training is constrained by the scarcity of large-scale task collections with reproducible execution environments and reliable test suites. Although a growing number of benchmarks have emerged, datasets suitable for training remain limited in scale and diversity or often target a limited set of high-resource language ecosystems. We introduce SWE-rebench V2, a language-agnostic automated pipeline for harvesting executable real-world SWE tasks and constructing RL training environments at scale. The pipeline synthesizes repository-specific installation and test procedures via an interactive setup agent, and filters unsound instances using an ensemble of LLM judges, validated against human-verified SWE-bench annotations. Using this pipeline, we construct a dataset of 32,000+ tasks spanning 20 languages and 3,600+ repositories, with pre-built images for reproducible execution. To further scale training data, we additionally release 120,000+ tasks with installation instructions, fail-to-pass tests and rich metadata, where the problem statement is generated based on the original pull request description. We validate the collected instances through a diagnostic study that covers a subset of tasks in five programming languages across seven popular models, and provide instance-level metadata that flags common confounders such as overly restrictive tests and underspecified descriptions. We release the datasets, the collection and execution code, and associated artifacts to enable large-scale training of SWE agents across diverse languages and repositories.

Summary

  • The paper introduces a fully automated pipeline that harvests, validates, and enriches over 32,000 containerized tasks and 120,000+ PR-derived tasks across 20 programming languages.
  • The methodology integrates an interactive LLM-driven setup synthesis with ensemble quality filtering, achieving pass@10 rates up to 62.7% for cross-language configurations.
  • The study offers practical insights for training robust RL-based software engineering agents using rich metadata and iterative diagnostics for curriculum design.

SWE-rebench V2: Scalable, Language-Agnostic Task Collection for SWE Agent Training

Motivation and Contributions

SWE-rebench V2 addresses the core limitation in training and evaluation of software engineering (SWE) agents: the lack of large-scale, reproducible, diverse, and language-agnostic interactive environments suitable for reinforcement learning (RL). By moving beyond Python-centric benchmarks and evaluation-only corpora, this pipeline establishes a unified, automated approach for collecting and validating real-world issue-resolution tasks spanning 20 programming languages and thousands of open-source repositories.

The paper’s central contributions are:

  • Introduction and public release of a fully automated pipeline capable of harvesting, building, and verifying over 32,000 containerized SWE tasks from 3,600+ repositories, with installation/test recipes, pre-built environments, and instance-level diagnostics.
  • A PR-derived extension yielding an additional 120,000+ tasks, broadening training substrate coverage beyond issue-linked PRs via synthesized problem statements.
  • Per-instance metadata generation, including confounder labels (e.g., overly restrictive tests or underspecified descriptions), PR category, and difficulty.
  • Ablative studies quantifying yield and bottlenecks at each step (setup synthesis, clarity filtering) across languages, and model-based diagnosis of both pipeline and SWE agent performance.

Pipeline Architecture and Methodology

The pipeline is constructed as a language-agnostic but extensible workflow, using modular templates for base Docker images, test runners, and log parsers. It incorporates an interactive LLM-driven agent for setup synthesis and leverages an ensemble of LLMs for quality filtering, with their outputs calibrated against human-verified ground truth from prior benchmark datasets.
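The modular-template design can be sketched as a per-language registry: one entry per ecosystem bundling a base image, a test-runner command, and a log parser. This is an illustrative sketch only; the field names and image tags below are assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass

@dataclass
class EnvTemplate:
    base_image: str   # language-specific base Docker image
    test_runner: str  # command template used to run the test suite
    log_parser: str   # parser that maps raw test logs to pass/fail verdicts

# One template per language ecosystem; onboarding a new language
# means registering one more entry rather than rewriting the pipeline.
TEMPLATES = {
    "python": EnvTemplate("python:3.11-slim", "pytest -x", "pytest_parser"),
    "go":     EnvTemplate("golang:1.22",      "go test ./...", "go_parser"),
    "rust":   EnvTemplate("rust:1.78",        "cargo test", "cargo_parser"),
}

def template_for(language: str) -> EnvTemplate:
    """Pick the base template for a repository's primary language."""
    try:
        return TEMPLATES[language]
    except KeyError:
        raise ValueError(f"no template registered for {language!r}")
```

The interactive setup agent would then start from such a template and iteratively patch the installation/test commands until the suite runs.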

Main Stages:

  1. Mining and Filtering: Large-scale PR and issue histories are harvested from GitHub Archive, emphasizing permissive-licensed repositories and ensuring candidate PRs both reference issues and modify or add tests. Star and issue-count thresholds are tuned to balance yield against diversity across language resource levels.
  2. Setup Synthesis: For each repository, base Docker images are auto-generated per language/toolchain, while installation and test instructions are inferred and iteratively debugged by an interactive agent. This approach notably outperforms non-interactive (one-shot) pipelines in cross-ecosystem settings, especially as language/toolchain complexity increases.
  3. Execution-based Validation: Patch/test execution is fully containerized. Paired fail/pass test runs validate candidate environments, ensuring tasks present actionable fail-to-pass signals for RL training/evaluation.
  4. Quality Filtering: Issue clarity and alignment are filtered via ensemble LLM annotation, prioritizing high-precision exclusion of underspecified instances. Prompts and model ensembles are selected based on maximizing alignment with human-labeled ground-truth.
  5. Metadata Enrichment: Each task is annotated with diagnostic codes reflecting environment and test pathologies, task type, difficulty, and explicit interface signatures. These support downstream stratified training, task selection, and curriculum design.
  6. PR-based Corpus Expansion: For repositories with successful setups, PRs not directly linked to issues are incorporated by synthesizing task statements from PR text and diffs, expanding training coverage.
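The fail-to-pass check in stage 3 can be sketched as a before/after test run around the gold patch; a task is kept only if its target tests fail pre-patch and pass post-patch. The function and callback names below are illustrative, assuming a containerized runner that reports per-test pass/fail.

```python
def is_valid_instance(run_tests, apply_patch, fail_to_pass_tests):
    """Validate a candidate task via paired fail/pass test runs.

    run_tests: callable returning {test_name: passed} from a container run.
    apply_patch: callable applying the gold patch inside the container.
    fail_to_pass_tests: names of tests that must flip from fail to pass.
    """
    before = run_tests()
    # If any target test already passes, the task carries no training signal.
    if any(before.get(t, False) for t in fail_to_pass_tests):
        return False
    apply_patch()
    after = run_tests()
    # All target tests must pass once the gold patch is applied.
    return all(after.get(t, False) for t in fail_to_pass_tests)
```

This paired execution is what guarantees each retained instance exposes an actionable reward signal for RL.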

Quantitative Results and Experimental Findings

The release comprises 32,000+ rigorously validated tasks, with wide programming language coverage (led by Python and Go, followed by significant representation from JS/TS/Rust/Scala). Additionally, 120,000+ PR-derived tasks are provided for large-scale learning. Distributional analysis confirms broad coverage across difficulty (with a significant heavy tail for hard/complex issues) and PR categories (e.g., bugs, regression, integration, UI, documentation, security).

Key experimental observations:

  • Setup Synthesis: Interactive agentic pipelines, particularly those leveraging models such as Qwen3-Coder-480B-A35B-Instruct, substantially outperform non-interactive baselines (pass@10 up to 62.7% vs. 15.7%) on cross-language configuration tasks. Increased agent retries further improve setup yield.
  • Quality Filtering: LLM ensemble consensus using highly precision-oriented prompts achieves strong alignment with human-labeled clarity assessments (precision up to 0.83 for verified-e prompts). Averaged ensemble outputs improve robustness compared to single-model annotation.
  • Agent Baseline Performance: On a 300-task, five-language diagnostic set, state-of-the-art frontier models (e.g., Claude Opus-4.5, GLM-4.7) achieve pass@1 rates up to 36.1% (Python) and 15–28% (Go, Rust, JS), confirming the dataset presents considerable challenge for current agents, with substantial variance by language and substantial headroom for model improvement.
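The ensemble clarity filter can be sketched as averaged binary verdicts with a precision-oriented drop threshold: each judge model flags an instance as underspecified or not, and the instance is excluded only when the averaged vote crosses the threshold. The threshold value below is an assumption for illustration, not a number from the paper.

```python
def keep_instance(judge_verdicts, drop_threshold=0.5):
    """judge_verdicts: list of 0/1 flags, where 1 = judged underspecified.

    Averaging over judges smooths single-model noise; raising
    drop_threshold makes exclusion more precise but less aggressive.
    """
    if not judge_verdicts:
        return True  # no judgments available: keep by default
    mean_flag = sum(judge_verdicts) / len(judge_verdicts)
    return mean_flag < drop_threshold
```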

Implications and Limitations

Theoretical Implications

The crucial finding is that high-confidence, containerized, language-agnostic SWE agent environments can be generated at scale without per-instance human intervention, provided that iterative, agentic setup synthesis is combined with LLM-based filtering and diagnostics. The pipeline also demonstrates the feasibility and importance of aligning automated verification with domain-specific ground truth via regular diagnostic feedback and instance-level metadata.

Practical Implications

SWE-rebench V2 provides a practical foundation for RL-based and other agent learning at repository scale, supporting both curriculum learning (via metadata-based filtering) and robustness training (via progressive inclusion of noisy/confounded tasks). Its extensible nature offers a path for onboarding novel languages, incorporating long-horizon, multi-service tasks, and coupling with richer reward signals beyond test-based criteria.
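Metadata-based curriculum construction could look like the sketch below: early training stages admit only clean, easy tasks, then progressively relax the difficulty cap and confounder exclusion. The field names ("difficulty", "confounders") are hypothetical stand-ins for the released metadata schema.

```python
def curriculum_stage(tasks, max_difficulty, allow_confounded=False):
    """Select tasks for one training stage from per-instance metadata.

    tasks: list of dicts with "difficulty" (int) and "confounders" (list).
    Start with clean, easy instances; later stages raise max_difficulty
    and set allow_confounded=True for robustness training.
    """
    selected = []
    for task in tasks:
        if task["difficulty"] > max_difficulty:
            continue
        if task["confounders"] and not allow_confounded:
            continue
        selected.append(task)
    return selected
```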

Limitations and Future Challenges

  • Environment Complexity: The current approach is Docker/container-centric, not supporting projects requiring complex distributed multi-service setups (e.g., microservices, DB/service integration). Extending support to such systems is an identified target for subsequent research.
  • No Agent Training Ablations: While instance-level diagnostics are provided, no ablation studies assess the effect on downstream agent performance of different filtering/curriculum strategies. Future work should quantify these effects.
  • Noise and Pathology: Inevitably, automated pipelines introduce a level of reward noise and environmental pathology. The authors address this with extensive diagnostics and filtering, but outliers remain, especially in long-tail repositories or for exotic toolchains.

Conclusion

SWE-rebench V2 establishes a new paradigm for large-scale, language-agnostic interactive task dataset collection for SWE agent research. Its public release—of both executable, containerized environments and large PR-derived task corpora, with granular metadata—removes a primary bottleneck for training and evaluating RL-based software engineering agents. This resource will catalyze research on multilingual, robust, and curriculum-aware agent learning, and informs future directions in environment automation, cross-ecosystem RL, and complex software agent benchmarking (2602.23866).
