SWE-smith Pipeline: Scalable Automation
- The SWE-smith pipeline is a fully automated system that synthesizes hundreds to thousands of bug-and-fix Python tasks per codebase, vastly reducing manual labor.
- It constructs a single shared, reusable execution environment per repository, so Docker images scale with repositories rather than task instances, and requires that at least 80% of a repository's existing tests pass.
- Data from the pipeline underpins state-of-the-art open-source models such as the 32B-parameter SWE-agent-LM-32B and the DockSmith environment-construction agent, driving significant performance improvements.
The SWE-smith pipeline is a fully automated, scalable system for generating, validating, and packaging large quantities of high-quality software engineering (SWE) agent training data and execution environments. Motivated by persistent data scarcity and prohibitive manual labor requirements in prior SWE LLM efforts, SWE-smith synthesizes hundreds to thousands of “bug-and-fix” Python tasks per codebase, each bundled with a reusable, validated execution environment. This design supports the training and evaluation of advanced SWE agents, yielding datasets and task coverage orders of magnitude beyond earlier works. The pipeline has directly enabled several state-of-the-art open-source results, most notably the 32B-parameter SWE-agent-LM-32B model and the DockSmith environment-construction agent, and underpins a family of scalable agentic environment construction and data curation methodologies for AI-driven software engineering (Yang et al., 30 Apr 2025, Zhang et al., 31 Jan 2026).
1. Objectives and Motivation
SWE-smith targets three primary bottlenecks in scaling training for software engineering LLMs: limited data volume (existing datasets contain at most thousands of instances from a handful of repositories), the high cost and manual effort of environment construction/validation, and unsustainable storage requirements posed by per-instance containerization. SWE-smith inverts traditional PR-centric curation by first constructing a shared executable environment per-repo, then synthesizing a large set of task instances that break or fix existing tests, reducing human hours and storage needs by up to 500× relative to per-instance isolation (Yang et al., 30 Apr 2025). The pipeline thus unlocks high-throughput generation of programmatically validated, execution-grounded tasks for LMs, supporting data-driven advances in both supervised and RL-based agent training.
2. Pipeline Stages
The SWE-smith pipeline is logically divided into three key stages: a) environment construction, b) task/bug synthesis, and c) instance validation and filtering.
2.a Environment Construction
Given a cloned Python repository at a designated commit, SWE-smith’s environment builder (env_builder) discovers installation and test commands, selects package and interpreter versions, and validates that at least 80% of the repository’s tests pass. The environment specification encodes installation commands $C_{\text{install}}$, discovered test commands $C_{\text{test}}$, and the explicit Python dependency graph $G = (V, E)$, where $V$ is the set of installed packages and $E$ captures dependency edges as inferred from package manifests and `pip freeze`. The resulting environment is instantiated via a single, reusable Dockerfile, shared by all synthetic tasks for the given repo/commit, yielding $O(1)$ rather than $O(n)$ image scaling (Yang et al., 30 Apr 2025).
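The validation gate and dependency-graph extraction can be sketched as follows; the function names and the manifest format are illustrative assumptions, not env_builder's actual API:

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of tests that passed (test name -> pass/fail)."""
    return sum(results.values()) / len(results) if results else 0.0

def environment_ok(results: dict[str, bool], threshold: float = 0.8) -> bool:
    """Accept the environment only if at least `threshold` of the
    repository's existing tests pass (SWE-smith requires 80%)."""
    return pass_rate(results) >= threshold

def dependency_graph(manifest: dict[str, list[str]]) -> tuple[set, set]:
    """Build the package dependency graph: vertices are installed
    packages, edges point from a package to each of its dependencies.
    `manifest` stands in for data parsed from package manifests and
    `pip freeze` output."""
    vertices = set(manifest)
    edges = {(pkg, dep) for pkg, deps in manifest.items()
             for dep in deps if dep in vertices}
    return vertices, edges
```

Because the gate operates on test outcomes rather than raw output, the same check applies regardless of which test runner env_builder discovers.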
2.b Synthetic Task Generation
For each execution environment, SWE-smith synthesizes a set of candidate bug-inducing patches $\{p_1, \dots, p_m\}$, each of which is validated by its effect on existing tests. Synthesis leverages four strategies:
- LM Modify: Prompting an LLM to insert subtle bugs into function/class bodies.
- LM Rewrite: Having the LLM re-implement entities from type signatures.
- Procedural Mod: Applying AST-based code transformations (e.g., inverting conditions, reference swaps).
- PR Mirror: Inverting diffs from actual PRs using LLMs.
For source $s$ with passing test suite $T_{\text{pass}}(s)$, a patch $p$ is retained if $T_{\text{fail}}(p) \neq \emptyset$, where $T_{\text{fail}}(p) \subseteq T_{\text{pass}}(s)$ denotes the previously passing tests that fail once $p$ is applied. Task complexity is diversified by combining up to $k$ entity-level patches within and across files, using conflict-free merges (Yang et al., 30 Apr 2025).
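A minimal sketch of one procedural transformation (condition inversion) together with the retention check, using Python's `ast` module; SWE-smith's actual modifiers cover more operations, and `retain`'s set-based arguments are illustrative:

```python
import ast

class InvertConditions(ast.NodeTransformer):
    """Procedural bug injection: wrap every `if` test in `not (...)`."""
    def visit_If(self, node: ast.If) -> ast.If:
        self.generic_visit(node)
        node.test = ast.UnaryOp(op=ast.Not(), operand=node.test)
        return node

def inject_bug(source: str) -> str:
    """Return the source with all `if` conditions inverted."""
    tree = InvertConditions().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)

def retain(passing_before: set, failing_after: set) -> bool:
    """Keep a candidate patch only if it breaks at least one
    previously passing test, and nothing outside that suite."""
    return bool(failing_after) and failing_after <= passing_before
```

Running the injected code against the existing suite then yields the failing-test set that decides retention.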
2.c Instance Filtering and Selection
Instance selection proceeds by deduplication (removal of identical diffs), yield filtering (enforcing bounds on patch size and failure counts), and difficulty scoring using lightweight classifiers. Each patch $p_i$ is assigned a predicted difficulty score $d_i$, normalized within-repo (Yang et al., 30 Apr 2025). Top instances per repository are then sampled to ensure a balanced distribution of bug types and challenge.
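The selection pass can be sketched as below; the field names, bounds, and ranking order are illustrative assumptions, with the difficulty classifier replaced by a precomputed score:

```python
import hashlib

def select_instances(instances, max_lines=50, max_fails=20, top_n=500):
    """Deduplicate by diff hash, drop out-of-bound patches, then rank
    by difficulty and keep the top_n per repository. Each instance is
    a dict with 'diff', 'lines_edited', 'n_fail', and 'difficulty'
    (a classifier-predicted score, assumed normalized within-repo)."""
    seen, kept = set(), []
    for inst in instances:
        digest = hashlib.sha256(inst["diff"].encode()).hexdigest()
        if digest in seen:          # deduplication: identical diffs
            continue
        seen.add(digest)
        if inst["lines_edited"] <= max_lines and inst["n_fail"] <= max_fails:
            kept.append(inst)       # yield filtering: size/failure bounds
    kept.sort(key=lambda i: i["difficulty"], reverse=True)
    return kept[:top_n]
```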
3. Scaling, Dataset Composition, and Automation
Applied to 128 real-world Python repositories (excluding SWE-bench’s 12), the SWE-smith pipeline produces over 50,000 fully validated training tasks. Table 1 in (Yang et al., 30 Apr 2025) reports yields, cost, and small/large-patch statistics by synthesis strategy:
| Strategy | Yield % | #Instances | Median #T_fail | Median Lines Edited |
|---|---|---|---|---|
| LM Modify | 56 | 17,887 | 4 | 3 |
| LM Rewrite | 35 | 4,173 | 4 | 24 |
| Procedural | 40 | 15,641 | 7 | 5 |
| PR Mirror | 34 | 2,344 | 3 | 14 |
| Combine files/modules | 97 | 10,092 | 15 | 11 |
| Total | — | 50,137 | 6 | 5 |
The typical repo yields a median 381 tasks (IQR 157–652), demonstrating high scalability. The entire pipeline, including build, test, and patch validation, is controlled by scripts requiring minimal operator intervention (∼20 human hours; 128 Docker images; ∼295 GB total image storage) (Yang et al., 30 Apr 2025).
4. Execution-Grounded Agentic Environment Construction
A critical application of SWE-smith data is large-scale agentic Docker environment construction, as demonstrated in DockSmith (Zhang et al., 31 Jan 2026). Here, agent-based workflows—comprising context retrieval, Dockerfile patching, eval-script synthesis, and iterative test/error analysis—operate in a multi-agent repair loop augmented by a loop-detection controller and cross-task memory. Each trajectory state includes the repository snapshot, Dockerfile, eval script, failure signature, and controller history, with actions including Dockerfile patch, eval-script patch, build+test, exemplar retrieval, or diversification.
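The loop-detection controller's core decision can be sketched as follows; the window size and the action names are assumptions, not DockSmith's actual interface:

```python
def is_looping(failure_history: list, window: int = 3) -> bool:
    """Flag a non-progressive loop: the same failure signature has
    recurred for the last `window` build+test attempts."""
    recent = failure_history[-window:]
    return len(recent) == window and len(set(recent)) == 1

def next_action(failure_history: list) -> str:
    """Diversify the agent's choice once a loop is detected; otherwise
    continue the normal patch-and-retest repair step."""
    return "diversify" if is_looping(failure_history) else "patch_and_retest"
```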
A cross-task success memory stores verified (Dockerfile, script) pairs keyed by manifest-derived context signatures. On each new task, memory retrieval supplies the top-$k$ exemplars to the LLM, accelerating convergence. The loop-detection controller diversifies agent choice upon detecting non-progressive loops (Zhang et al., 31 Jan 2026). Empirically, the DockSmith agent achieves 39.72% Fail-to-Pass and 58.28% Commit Rate on Multi-Docker-Eval (Zhang et al., 31 Jan 2026).
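Cross-task memory retrieval can be sketched as a similarity lookup; Jaccard overlap over signature sets is an illustrative stand-in for DockSmith's actual matching:

```python
def retrieve_exemplars(memory: dict, signature: frozenset, k: int = 3) -> list:
    """Return the stored (Dockerfile, eval-script) pairs whose
    manifest-derived context signatures best match the new task.
    `memory` maps frozenset signatures -> (dockerfile, script)."""
    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0
    ranked = sorted(memory.items(),
                    key=lambda item: jaccard(item[0], signature),
                    reverse=True)
    return [pair for _, pair in ranked[:k]]
```

Keying on manifest-derived signatures lets a new repository reuse build recipes from previously solved tasks with similar toolchains.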
5. Evaluation Methodologies and Performance Metrics
SWE-smith deploys rigorous, execution-based validation for both synthetic tasks and the agents trained on them. The standard Pass@$k$ resolved metric is computed as

$$\text{Pass@}k = \frac{1}{N} \sum_{i=1}^{N} \left[ 1 - \frac{\binom{n - c_i}{k}}{\binom{n}{k}} \right],$$

where $N$ is the number of tasks, $n$ is the number of independent samples per task, and $c_i$ is the number of successful completions for task $i$. In the single-attempt scenario ($n = k = 1$), Pass@1 reduces to the resolved fraction (Yang et al., 30 Apr 2025).
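The Pass@k metric can be computed directly with the standard unbiased estimator:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased per-task Pass@k: probability that at least one of k
    samples, drawn without replacement from n attempts, is among the
    c successful completions."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so some sample must succeed
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(successes: list, n: int, k: int) -> float:
    """Average Pass@k over N tasks given per-task success counts c_i."""
    return sum(pass_at_k(n, c, k) for c in successes) / len(successes)
```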
For environment agents, Multi-Docker-Eval adopts Fail-to-Pass (the fraction of tasks successfully transitioned from failing to passing test state) and Commit Rate (fraction yielding a terminal solution) as primary outcome metrics (Zhang et al., 31 Jan 2026).
6. Comparative Impact and Open Source
SWE-smith has established a new scale for open-source SWE agent research, yielding the largest and most diverse public dataset for this domain. Models trained on SWE-smith data, such as SWE-agent-LM-32B, achieve 40.2% Pass@1 on SWE-bench Verified—outperforming prior open models by over 7 percentage points (Yang et al., 30 Apr 2025). Container-level storage innovations and incremental filtering have brought terabyte-scale requirements down to hundreds of gigabytes. The pipeline, artifacts, and pre-built environments are publicly available, lowering the barrier for reproducible evaluation, further experimentation, and RL fine-tuning with minimal manual overhead (Yang et al., 30 Apr 2025).
The SWE-smith methodology has directly influenced subsequent pipelines such as Skywork-SWE and RepoForge, and forms the environment/task synthesis backbone for advanced agentic benchmarks—most notably enabling robust multi-agent Docker build environments as seen in DockSmith (Zhang et al., 31 Jan 2026) and rapid, parallel RL fine-tuning in RepoForge (Chen et al., 3 Aug 2025).