SWE-smith Pipeline: Scalable Automation
- The SWE-smith pipeline is a fully automated system that synthesizes hundreds to thousands of bug-and-fix Python tasks per codebase, vastly reducing manual labor.
- It constructs a single shared, reusable execution environment per repository, so Docker images scale with repositories rather than task instances, and requires that at least 80% of a repository's existing tests pass.
- Data from the pipeline underpins state-of-the-art open-source models such as the 32B-parameter SWE-agent-LM-32B and the DockSmith environment-construction agent, driving significant performance improvements.
The SWE-smith pipeline is a fully automated, scalable system for generating, validating, and packaging large quantities of high-quality software engineering (SWE) agent training data and execution environments. Motivated by persistent data scarcity and prohibitive manual labor requirements in prior SWE LLM efforts, SWE-smith synthesizes hundreds to thousands of “bug-and-fix” Python tasks per codebase, each bundled with a reusable, validated execution environment. This design supports the training and evaluation of advanced SWE agents, yielding datasets and task coverage orders of magnitude beyond earlier works. The pipeline has directly enabled several state-of-the-art open-source results, most notably the 32B-parameter SWE-agent-LM-32B model and the DockSmith environment-construction agent, and underpins a family of scalable agentic environment construction and data curation methodologies for AI-driven software engineering (Yang et al., 30 Apr 2025, Zhang et al., 31 Jan 2026).
1. Objectives and Motivation
SWE-smith targets three primary bottlenecks in scaling training for software engineering LLMs: limited data volume (existing datasets contain at most thousands of instances from a handful of repositories), the high cost and manual effort of environment construction/validation, and unsustainable storage requirements posed by per-instance containerization. SWE-smith inverts traditional PR-centric curation by first constructing a shared executable environment per-repo, then synthesizing a large set of task instances that break or fix existing tests, reducing human hours and storage needs by up to 500× relative to per-instance isolation (Yang et al., 30 Apr 2025). The pipeline thus unlocks high-throughput generation of programmatically validated, execution-grounded tasks for LMs, supporting data-driven advances in both supervised and RL-based agent training.
2. Pipeline Stages
The SWE-smith pipeline is logically divided into three key stages: a) environment construction, b) task/bug synthesis, and c) instance validation and filtering.
2.a Environment Construction
Given a cloned Python repository at a designated commit, SWE-smith’s environment builder (env_builder) discovers installation and test commands, selects package and interpreter versions, and validates that at least 80% of the repository’s tests pass. The environment specification encodes installation commands $C_{\text{install}}$, discovered test commands $C_{\text{test}}$, and the explicit Python dependency graph $G = (V, E)$, where $V$ is the set of installed packages and $E$ captures dependency edges as inferred from package manifests and `pip freeze`. The resulting environment is instantiated via a single, reusable Dockerfile, shared by all synthetic tasks for the given repo/commit, yielding $O(1)$ rather than $O(n)$ image scaling (Yang et al., 30 Apr 2025).
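The validation gate and dependency-graph extraction can be sketched as follows; the function names and the manifest format are illustrative assumptions, not env_builder's actual API:

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of tests that passed (test name -> pass/fail)."""
    return sum(results.values()) / len(results) if results else 0.0

def environment_ok(results: dict[str, bool], threshold: float = 0.8) -> bool:
    """Accept the environment only if at least `threshold` of the
    repository's existing tests pass (SWE-smith requires 80%)."""
    return pass_rate(results) >= threshold

def dependency_graph(manifest: dict[str, list[str]]) -> tuple[set, set]:
    """Build the package dependency graph: vertices are installed
    packages, edges point from a package to each of its dependencies.
    `manifest` stands in for data parsed from package manifests and
    `pip freeze` output."""
    vertices = set(manifest)
    edges = {(pkg, dep) for pkg, deps in manifest.items()
             for dep in deps if dep in vertices}
    return vertices, edges
```

Because the gate operates on test outcomes rather than raw output, the same check applies regardless of which test runner env_builder discovers.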
2.b Synthetic Task Generation
For each execution environment, SWE-smith synthesizes a set of candidate bug-inducing patches $\{p_1, \dots, p_m\}$, each of which is validated by its effect on existing tests. Synthesis leverages four strategies:
- LM Modify: Prompting an LLM to insert subtle bugs into function/class bodies.
- LM Rewrite: Having the LLM re-implement entities from type signatures.
- Procedural Mod: Applying AST-based code transformations (e.g., inverting conditions, reference swaps).
- PR Mirror: Inverting diffs from actual PRs using LLMs.
For source $s$ with passing test suite $T_{\text{pass}}(s)$, a patch $p$ is retained if $T_{\text{fail}}(p) \neq \emptyset$, where $T_{\text{fail}}(p) \subseteq T_{\text{pass}}(s)$ denotes the previously passing tests that fail once $p$ is applied. Task complexity is diversified by combining up to $k$ entity-level patches within and across files, using conflict-free merges (Yang et al., 30 Apr 2025).
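A minimal sketch of one procedural transformation (condition inversion) together with the retention check, using Python's `ast` module; SWE-smith's actual modifiers cover more operations, and `retain`'s set-based arguments are illustrative:

```python
import ast

class InvertConditions(ast.NodeTransformer):
    """Procedural bug injection: wrap every `if` test in `not (...)`."""
    def visit_If(self, node: ast.If) -> ast.If:
        self.generic_visit(node)
        node.test = ast.UnaryOp(op=ast.Not(), operand=node.test)
        return node

def inject_bug(source: str) -> str:
    """Return the source with all `if` conditions inverted."""
    tree = InvertConditions().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)

def retain(passing_before: set, failing_after: set) -> bool:
    """Keep a candidate patch only if it breaks at least one
    previously passing test, and nothing outside that suite."""
    return bool(failing_after) and failing_after <= passing_before
```

Running the injected code against the existing suite then yields the failing-test set that decides retention.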
2.c Instance Filtering and Selection
Instance selection proceeds by deduplication (removal of identical diffs), yield filtering (enforcing bounds on patch size and failure counts), and difficulty scoring using lightweight classifiers. Each patch $p_i$ is assigned a predicted difficulty score $d_i$, normalized within-repo (Yang et al., 30 Apr 2025). Top instances per repository are then sampled to ensure a balanced distribution of bug types and challenge.
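The selection pass can be sketched as below; the field names, bounds, and ranking order are illustrative assumptions, with the difficulty classifier replaced by a precomputed score:

```python
import hashlib

def select_instances(instances, max_lines=50, max_fails=20, top_n=500):
    """Deduplicate by diff hash, drop out-of-bound patches, then rank
    by difficulty and keep the top_n per repository. Each instance is
    a dict with 'diff', 'lines_edited', 'n_fail', and 'difficulty'
    (a classifier-predicted score, assumed normalized within-repo)."""
    seen, kept = set(), []
    for inst in instances:
        digest = hashlib.sha256(inst["diff"].encode()).hexdigest()
        if digest in seen:          # deduplication: identical diffs
            continue
        seen.add(digest)
        if inst["lines_edited"] <= max_lines and inst["n_fail"] <= max_fails:
            kept.append(inst)       # yield filtering: size/failure bounds
    kept.sort(key=lambda i: i["difficulty"], reverse=True)
    return kept[:top_n]
```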
3. Scaling, Dataset Composition, and Automation
Applied to 128 real-world Python repositories (excluding SWE-bench’s 12), the SWE-smith pipeline produces over 50,000 fully validated training tasks. Table 1 in (Yang et al., 30 Apr 2025) reports yields, cost, and small/large-patch statistics by synthesis strategy:
| Strategy | Yield % | #Instances | Median #T_fail | Median Lines Edited |
|---|---|---|---|---|
| LM Modify | 56 | 17,887 | 4 | 3 |
| LM Rewrite | 35 | 4,173 | 4 | 24 |
| Procedural | 40 | 15,641 | 7 | 5 |
| PR Mirror | 34 | 2,344 | 3 | 14 |
| Combine files/modules | 97 | 10,092 | 15 | 11 |
| Total | — | 50,137 | 6 | 5 |
The typical repo yields a median 381 tasks (IQR 157–652), demonstrating high scalability. The entire pipeline, including build, test, and patch validation, is controlled by scripts requiring minimal operator intervention (∼20 human hours; 128 Docker images; ∼295 GB total image storage) (Yang et al., 30 Apr 2025).
4. Execution-Grounded Agentic Environment Construction
A critical application of SWE-smith data is large-scale agentic Docker environment construction, as demonstrated in DockSmith (Zhang et al., 31 Jan 2026). Here, agent-based workflows—comprising context retrieval, Dockerfile patching, eval-script synthesis, and iterative test/error analysis—operate in a multi-agent repair loop augmented by a loop-detection controller and cross-task memory. Each trajectory state includes the repository snapshot, Dockerfile, eval script, failure signature, and controller history, with actions including Dockerfile patch, eval-script patch, build+test, exemplar retrieval, or diversification.
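The loop-detection controller's core decision can be sketched as follows; the window size and the action names are assumptions, not DockSmith's actual interface:

```python
def is_looping(failure_history: list, window: int = 3) -> bool:
    """Flag a non-progressive loop: the same failure signature has
    recurred for the last `window` build+test attempts."""
    recent = failure_history[-window:]
    return len(recent) == window and len(set(recent)) == 1

def next_action(failure_history: list) -> str:
    """Diversify the agent's choice once a loop is detected; otherwise
    continue the normal patch-and-retest repair step."""
    return "diversify" if is_looping(failure_history) else "patch_and_retest"
```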
A cross-task success memory stores verified (Dockerfile, script) pairs keyed by manifest-derived context signatures. On each new task, memory retrieval supplies the top-$k$ exemplars to the LLM, accelerating convergence. The loop-detection controller diversifies agent choice upon detecting non-progressive loops (Zhang et al., 31 Jan 2026). Empirically, the DockSmith agent achieves 39.72% Fail-to-Pass and 58.28% Commit Rate on Multi-Docker-Eval (Zhang et al., 31 Jan 2026).
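Cross-task memory retrieval can be sketched as a similarity lookup; Jaccard overlap over signature sets is an illustrative stand-in for DockSmith's actual matching:

```python
def retrieve_exemplars(memory: dict, signature: frozenset, k: int = 3) -> list:
    """Return the stored (Dockerfile, eval-script) pairs whose
    manifest-derived context signatures best match the new task.
    `memory` maps frozenset signatures -> (dockerfile, script)."""
    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0
    ranked = sorted(memory.items(),
                    key=lambda item: jaccard(item[0], signature),
                    reverse=True)
    return [pair for _, pair in ranked[:k]]
```

Keying on manifest-derived signatures lets a new repository reuse build recipes from previously solved tasks with similar toolchains.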
5. Evaluation Methodologies and Performance Metrics
SWE-smith deploys rigorous, execution-based validation for both synthetic tasks and the agents trained on them. The standard Pass@$k$ resolved metric is computed as

$$\text{Pass@}k = \frac{1}{N} \sum_{i=1}^{N} \left[ 1 - \frac{\binom{n - c_i}{k}}{\binom{n}{k}} \right],$$

where $N$ is the number of tasks, $n$ is the number of independent samples per task, and $c_i$ is the number of successful completions for task $i$. In the single-attempt scenario ($n = k = 1$), Pass@1 reduces to the resolved fraction (Yang et al., 30 Apr 2025).
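The Pass@k metric can be computed directly with the standard unbiased estimator:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased per-task Pass@k: probability that at least one of k
    samples, drawn without replacement from n attempts, is among the
    c successful completions."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so some sample must succeed
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(successes: list, n: int, k: int) -> float:
    """Average Pass@k over N tasks given per-task success counts c_i."""
    return sum(pass_at_k(n, c, k) for c in successes) / len(successes)
```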
For environment agents, Multi-Docker-Eval adopts Fail-to-Pass (the fraction of tasks successfully transitioned from failing to passing test state) and Commit Rate (fraction yielding a terminal solution) as primary outcome metrics (Zhang et al., 31 Jan 2026).
6. Comparative Impact and Open Source
SWE-smith has established a new scale for open-source SWE agent research, yielding the largest and most diverse public dataset for this domain. Models trained on SWE-smith data, such as SWE-agent-LM-32B, achieve 40.2% Pass@1 on SWE-bench Verified—outperforming prior open models by over 7 percentage points (Yang et al., 30 Apr 2025). Container-level storage innovations and incremental filtering have brought terabyte-scale requirements down to hundreds of gigabytes. The pipeline, artifacts, and pre-built environments are publicly available, lowering the barrier for reproducible evaluation, further experimentation, and RL fine-tuning with minimal manual overhead (Yang et al., 30 Apr 2025).
The SWE-smith methodology has directly influenced subsequent pipelines such as Skywork-SWE and RepoForge, and forms the environment/task synthesis backbone for advanced agentic benchmarks—most notably enabling robust multi-agent Docker build environments as seen in DockSmith (Zhang et al., 31 Jan 2026) and rapid, parallel RL fine-tuning in RepoForge (Chen et al., 3 Aug 2025).