SWE-Factory-Style Pipeline

Updated 7 February 2026
  • SWE-Factory-style pipeline is an end-to-end modular workflow that automatically extracts, constructs, validates, and updates software engineering tasks from open-source repositories.
  • It employs micro-service stages and multi-agent orchestration to ensure reproducibility, scalability, and rigorous pass/fail validation with low data contamination.
  • This systematic approach efficiently generates high-fidelity benchmarks for LLM-based agents, improving training outcomes and performance evaluation.

A Software Engineering Factory-style Pipeline (“SWE-Factory-style pipeline”) is an end-to-end, modular, fully automated workflow for extracting, constructing, validating, and continuously updating software engineering (SWE) tasks and benchmarks from raw open-source repository activity. The paradigm has become central in large-scale evaluation, training, and benchmarking of LLM-based software engineering agents because it guarantees scalability, reproducibility, low data contamination, and robust pass/fail validation. These pipelines are architected as a series of atomic micro-service stages, each with tightly defined input/output contracts, orchestration logic, and automated error-handling, emulating an assembly line in an industrial factory. This systematic approach enables the efficient, regular generation of high-fidelity datasets and benchmarks for agent-centric code tasks, from issue resolution to environment reconstruction, performance optimization, and continuous deployment (Zhang et al., 29 May 2025, Guo et al., 12 Jun 2025).

1. Pipeline Architecture and Sequential Staging

The core structure of a SWE-Factory-style pipeline is a modular chain of stages, designed for composability and scalability. In representative systems such as SWE-bench-Live (“RepoLaunch”), SWE-Factory, and similar pipelines, the canonical flow involves the following major components:

  1. Issue/Data Ingestion: An automated scheduler (e.g., cron/Airflow) queries public repositories, mines for qualifying code contributions (commits, PRs, or issues), and produces raw candidate instances. Filtering criteria include language, license, activity thresholds, and metadata such as recency or presence of test-editing changes (Zhang et al., 29 May 2025, Badertdinov et al., 26 May 2025, Zeng et al., 24 Jun 2025).
  2. Candidate Construction: Automated filtering removes instances that do not meet task requirements (e.g., no test edits, excessive code churn, lack of test coverage), extracting the relevant context: issue body, PR diff, and metadata. These are queued as candidate tasks for further validation.
  3. Automated Environment Setup: Using agentic environment synthesis modules (often LLM-driven), the pipeline constructs a clean, fully reproducible runtime environment. Specialized agents extract signals from project configuration, select compatible base images, generate or patch Dockerfiles, and iteratively build the container with ReAct tool-use and error recovery. Success is validated by running the test pipeline to completion (Zhang et al., 29 May 2025, Guo et al., 12 Jun 2025, Zhang et al., 31 Jan 2026).
  4. Patch/Test Validation: For each candidate, the “golden” patch is applied to the base environment. Tests are executed before and after the patch (often using exit-code-based grading or standard output parsing) to verify at least one FAIL_TO_PASS transition and zero regressions. Flaky tests are excluded by repeated runs (Zhang et al., 29 May 2025, Guo et al., 12 Jun 2025, Badertdinov et al., 26 May 2025).
  5. Continuous Updating: The pipeline persistently polls for new events (e.g., monthly/daily) and integrates new valid instances into the public dataset, container registry, and dashboard interface. Automated versioning and metadata tracking ensure data lineage and contamination resistance (Zhang et al., 29 May 2025, Badertdinov et al., 26 May 2025, Zeng et al., 24 Jun 2025).

The data flow is strictly unidirectional, and each stage outputs well-typed artifacts (e.g., Docker image, test command, validation logs), facilitating atomicity, reproducibility, and downstream integration.
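The staged, artifact-typed flow can be sketched as a pair of typed records passed between stage functions. This is a minimal sketch: the field names and the injected `build`/`run_tests` callables are illustrative assumptions, not the cited systems' actual interfaces.

```python
# Sketch of the staged flow as typed artifacts passed between stage
# functions. Field names and the injected `build`/`run_tests` callables are
# illustrative assumptions, not the cited systems' actual interfaces.
from dataclasses import dataclass

@dataclass
class Candidate:
    repo: str
    issue_body: str
    patch: str

@dataclass
class ValidatedInstance:
    candidate: Candidate
    image_tag: str
    fail_to_pass: list

def ingest(events):
    """Stages 1-2: filter raw repository events into candidate tasks."""
    return [Candidate(e["repo"], e["issue"], e["patch"])
            for e in events if e.get("edits_tests")]

def validate(candidate, build, run_tests):
    """Stages 3-4: build the environment, then require a FAIL_TO_PASS flip
    with no PASS_TO_PASS regressions."""
    image = build(candidate.repo)
    before = run_tests(image, patched=False)   # tests passing pre-patch
    after = run_tests(image, patched=True)     # tests passing post-patch
    flipped = [t for t in after if t not in before]
    regressions = [t for t in before if t not in after]
    if flipped and not regressions:
        return ValidatedInstance(candidate, image, flipped)
    return None  # reject: no FAIL_TO_PASS transition, or a regression
```

Because the container build and test runner are injected, each stage can be exercised in isolation with stubs, which is what makes the per-stage input/output contracts checkable.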

2. Pipeline Algorithms, Multi-Agent Orchestration, and Environment Synthesis

A distinctive feature of SWE-Factory-style workflows is the heavy reliance on multi-agent, iterative orchestration for environment construction and validation. Common agent roles include:

  • Repository Explorer: Gathers static and dynamic signals (dependencies, test entrypoints, build scripts) via active repo inspection.
  • Environment Manager: Synthesizes or updates Dockerfile/environment specs, reusing prior (memory pool) solutions when possible (Guo et al., 12 Jun 2025, Zhang et al., 31 Jan 2026).
  • Test Manager/Eval Script Agent: Constructs and verifies test execution scripts, ensuring standardized result grading via appended exit code markers.
  • Test Analyst: Builds images, applies patches, executes tests, and parses errors, providing actionable feedback for repair in the agent loop.

Orchestration is generally framed as a loop with early stopping upon validation success or intervention upon failure signals (e.g., syntax error, install failure, test error). Loop-detection controllers and cross-task memory (for demonstration/solution reuse) have been shown to improve convergence and avoid wasted computational cycles (Zhang et al., 31 Jan 2026).
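The loop-with-early-stopping controller can be sketched as follows; the agent interfaces, role names, and failure-signal strings below are illustrative assumptions rather than any cited system's API:

```python
# Minimal sketch of the orchestration loop with early stopping, loop
# detection, and failure-signal routing. Agent interfaces, role names, and
# the failure-signal strings are illustrative assumptions, not a cited API.
FAILURE_SIGNALS = ("SyntaxError", "install failed", "test error")

def orchestrate(agents, container, max_steps=20, window=3):
    history = []
    for _ in range(max_steps):
        action = agents["environment_manager"].propose(container.state())
        observation = container.exec(action)
        history.append(action)
        # Loop detection: the same action repeated `window` times in a row
        # means the agent is stuck; abort instead of burning the step budget.
        if history[-window:].count(action) == window:
            return "stuck"
        if any(signal in observation for signal in FAILURE_SIGNALS):
            agents["test_analyst"].diagnose(observation)  # feedback for repair
            continue
        if agents["test_manager"].tests_pass(container):
            return "success"  # early stop on validation success
    return "budget_exhausted"
```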

Pseudocode patterns, such as those for candidate extraction and environment orchestration, exemplify the principle of tight micro-service boundaries and error propagation:

def build_environment(repo, base_commit):
    # Stage 3: agentic environment synthesis (ReAct-style loop).
    image = select_base_image(repo)
    container = Docker.run(image)
    container.exec("git checkout " + base_commit)
    for _ in range(MAX_STEPS):
        thought = setup_agent.think(container.state())
        action = setup_agent.decide(thought)
        observation = container.exec(action.command)
        setup_agent.observe(observation)
        if setup_agent.believes_tests_pass():
            break
    if not setup_agent.believes_tests_pass():
        return FAILURE
    # Independent verification: discover and run the test command.
    test_cmd = verify_agent.discover_test_command(container)
    results = container.exec(test_cmd)
    if verify_agent.passes(results):
        return Docker.commit(container)  # snapshot the validated environment
    return FAILURE
(Zhang et al., 29 May 2025)

3. Validation Modalities, Grading, and Contamination Resistance

Pipeline validation leverages deterministic, language-agnostic criteria. By enforcing that every ground-truth patch demonstrates a true FAIL_TO_PASS test outcome (i.e., at least one test fails before and passes after patch application) and zero regressions (PASS_TO_PASS stability), the process guarantees high data fidelity. SWE-Factory introduced a universally applicable, exit-code-based grading scheme: explicit markers appended to test scripts denote pass/fail outcomes, obviating fragile log parsing and achieving a precision of 0.92 and recall of 1.00 against manual inspection (Guo et al., 12 Jun 2025).
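The exit-code grading idea can be sketched in a few lines. The marker string below is illustrative, not SWE-Factory's actual marker:

```python
# Sketch of exit-code grading: the harness appends an echo of `$?` (the last
# command's exit status in POSIX shells) to the test command, so grading
# reduces to scanning for one marker line instead of parsing
# framework-specific logs. The marker string is an illustrative assumption.
MARKER = "SWE_EXIT_CODE="

def wrap(test_cmd):
    return f'{test_cmd}; echo "{MARKER}$?"'

def grade(log):
    for line in log.splitlines():
        if line.startswith(MARKER):
            return int(line[len(MARKER):]) == 0  # 0 means the suite passed
    return False  # marker missing: count as failure, never as success
```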

Key formulas for common metrics:

  • Resolved rate:

\text{resolved\_rate} = \frac{\#\text{instances\_resolved}}{\#\text{total\_instances}} \times 100\%

  • Patch apply rate:

\text{apply\_rate} = \frac{\#\text{syntactically\_valid\_patches}}{\#\text{total\_attempts}} \times 100\%

  • Localization rate:

\text{loc\_rate} = \frac{\#\text{patches\_touching\_same\_files\_as\_gold}}{\#\text{total\_instances}} \times 100\%
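The three metrics can be computed together from per-instance harness records. The record fields below are illustrative assumptions; here each instance corresponds to one attempt, so all denominators are the instance count, and "touching the same files as gold" is relaxed to a nonempty file overlap:

```python
# Compute resolved/apply/localization rates over per-instance records.
# Record fields are illustrative assumptions about the harness output; one
# attempt per instance, so every denominator is len(records).
def score(records):
    n = len(records)
    resolved = sum(r["resolved"] for r in records)
    applied = sum(r["patch_applied"] for r in records)
    # Localization relaxed to: the patch touches at least one gold file.
    localized = sum(bool(set(r["pred_files"]) & set(r["gold_files"]))
                    for r in records)
    return {
        "resolved_rate": 100.0 * resolved / n,
        "apply_rate": 100.0 * applied / n,
        "loc_rate": 100.0 * localized / n,
    }
```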

Contamination resistance is achieved by restricting data collection to issues and PRs that postdate all model releases under evaluation, avoiding overlap with pretraining corpora. Continuous updating, along with careful timestamp queries and release cutoff dates, further enforces this property (Zhang et al., 29 May 2025, Badertdinov et al., 26 May 2025).
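The cutoff filter amounts to keeping only instances created after the most recent knowledge cutoff of any evaluated model. The model names and cutoff dates below are placeholders, not real release data:

```python
# Sketch of the contamination filter: only instances created after the most
# recent knowledge cutoff of any evaluated model are admitted. Model names
# and cutoff dates below are placeholders, not real release data.
from datetime import datetime, timezone

MODEL_CUTOFFS = {
    "model-a": datetime(2025, 3, 1, tzinfo=timezone.utc),
    "model-b": datetime(2025, 6, 1, tzinfo=timezone.utc),
}

def contamination_safe(instances, cutoffs=MODEL_CUTOFFS):
    threshold = max(cutoffs.values())  # the strictest (latest) cutoff wins
    return [inst for inst in instances if inst["created_at"] > threshold]
```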

4. Cost, Efficiency, and Infrastructure Optimization

Cost and efficiency are critical to practical deployment at scale. SWE-Factory and RepoForge both report per-instance validation costs on the order of $0.024–$0.045 with current LLM agent frameworks (e.g., Gemini-2.5-flash and GPT-4.1-mini), with coverage rates of ≈33–40% after validation (Guo et al., 12 Jun 2025). RepoForge reduces per-environment image storage by approximately 14× (1.4 GB → 102 MB) using layer de-duplication and minimal dependency inference, supporting thousands of environments over a few hundred base images (Chen et al., 3 Aug 2025).

Distributed execution harnesses using architectures like Ray enable >70% faster evaluation by pooling and reusing already-built containers, scaling up to 64 workers with near-linear speedup (Chen et al., 3 Aug 2025). In production-class settings, versioning is maintained by tagging dataset/output artifacts and registering images in public repositories or object stores, supporting rollback and full lineage tracing (Badertdinov et al., 26 May 2025, Zeng et al., 24 Jun 2025).
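The pooled-container pattern behind this speedup can be sketched with the standard library's thread pool standing in for Ray workers; the names and the container stub are illustrative:

```python
# Pooled, distributed evaluation: containers are built once per image and
# reused across tasks. ThreadPoolExecutor stands in for Ray workers here;
# `get_container` is a placeholder for a real Docker startup call.
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

@lru_cache(maxsize=None)        # container reuse: each image starts once
def get_container(image_tag):
    return f"container:{image_tag}"  # placeholder for Docker startup

def evaluate(task):
    container = get_container(task["image"])  # pooled lookup, not a rebuild
    return task["id"], f"evaluated in {container}"

def run_all(tasks, workers=8):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(evaluate, tasks))
```

In a production harness the cache would be replaced by a registry of warm Ray actors, but the economics are the same: amortizing container startup across tasks is where most of the reported speedup comes from.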

5. Generalization and Adaptation to SWE and Safety Domains

SWE-Factory-style pipelines generalize beyond bug-fixing tasks, forming the backbone of benchmarks for performance optimization (SWE-fficiency), safety-case management (Safety Factories), and environment construction (DockSmith). Adaptation involves replacing/augmenting the task formalism and validation layer, e.g.:

  • Performance benchmarking: Add workload annotation and automated speedup measurement with statistical significance tests. Only instances with >2σ improvement and test-passing patches are admitted (Ma et al., 8 Nov 2025).
  • Safety engineering (Safety Factories): Extend traditional pipelines with safety artifact modeling—formal, machine-processable safety cases, claim-argument-evidence structures, and consistency checks run alongside unit and integration tests. Artifacts span hazard logs, formal safety goals, and live documentation. Safety builds serve as release gates, requiring all safety criteria to be met before deploying (Cârlan et al., 10 Sep 2025).
  • Environment construction: Treat as a first-class agentic task, using multi-agent orchestration, acceptance shaping, loop detection, and cross-task success memory to maximize build success, generalizability, and transfer of environment reasoning skills to downstream tasks (Zhang et al., 31 Jan 2026).
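The performance-benchmark admission rule above can be sketched as a simple significance gate; the 2σ threshold follows the rule quoted above, while the sampling scheme is a simplification of ours:

```python
# Sketch of the >2-sigma admission rule for performance instances: accept a
# patch only if its mean runtime beats the baseline mean by more than two
# standard deviations of the baseline measurements. The sampling scheme is
# an illustrative simplification, not the cited benchmark's exact procedure.
import statistics

def admits(baseline_runs, patched_runs, n_sigma=2.0):
    mu = statistics.mean(baseline_runs)
    sigma = statistics.stdev(baseline_runs)
    improvement = mu - statistics.mean(patched_runs)
    return improvement > n_sigma * sigma
```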

By decomposing the workflow into single-responsibility, composable micro-services—task construction, environment setup, validation, continuous integration, and publishing—SWE-Factory-style pipelines support the reproducible, scalable assembly of diverse SWE benchmarks, agent datasets, and real-world evaluation harnesses (Zhang et al., 29 May 2025, Guo et al., 12 Jun 2025, Badertdinov et al., 26 May 2025).

6. Empirical Results and Impact

Adoption of the SWE-Factory paradigm has materially improved both the scale and quality of SWE benchmarks and agent training datasets. For example, SWE-rebench yielded over 21,000 interactive Python-based tasks (Badertdinov et al., 26 May 2025), while Skywork-SWE curated 10,169 validated problem instances and observed a systematic scaling law: performance grows linearly with log(data size), without saturation below 8,000+ trajectories (Zeng et al., 24 Jun 2025).

Empirical measurements consistently show that strictly validated, contamination-resistant tasks obtained via a SWE-Factory pipeline are significantly harder for current LLM-based agents compared to legacy, static datasets; performance gaps of 10–20% absolute are typical (Zhang et al., 29 May 2025, Badertdinov et al., 26 May 2025). In post-training pipelines (e.g., SWE-Master), modular factory staging—teacher data synthesis, long-horizon SFT, RL with execution feedback, test-time scaling—enables open-source models to reach 61.4–70.8% resolve rate, ∼10× improvement over base models (Song et al., 3 Feb 2026).

Moreover, the industrial paradigm is now extending to long-horizon, memory-efficient reasoning (as in the context-tool “Cat” pipeline), further raising the bar for agentic autonomy on real-world SWE (Liu et al., 26 Dec 2025).

7. Lessons Learned, Best Practices, and Open Challenges

Key lessons from deployed SWE-Factory-style systems include:

  • Aggressive environment reuse (memory pools, base-image registries) significantly reduces cost and compute.
  • Exit-code and container-based validation yields high robustness across languages and ecosystems.
  • Multi-agent decomposition (context→Dockerfile→eval→analysis) aligns with human divisional labor, enabling modular debugging and extension (Zhang et al., 31 Jan 2026).
  • Continuous updating and live integration with source control systems enforce contamination resistance, but require rigorous version management and metadata integrity.
  • Automation of artifact grading, task difficulty, and clarity assessment via LLMs and static analysis tools can introduce misclassification errors, necessitating periodic manual spot-checks.
  • Limitations persist in handling complex or heavily interdependent codebases (notably large C++/Python monorepos, or cases with insufficient test coverage).
  • Generalization to new domains requires careful adaptation of validation and scoring mechanisms.

A plausible implication is that future research on agentic code reasoning, performance optimization, and software safety will increasingly rely on large-scale, continuously updated, and deterministically validated pipelines in the SWE-Factory paradigm, with modular architectures supporting rapid adaptation to emerging agent capabilities and evaluation domains (Zhang et al., 29 May 2025, Guo et al., 12 Jun 2025, Badertdinov et al., 26 May 2025, Cârlan et al., 10 Sep 2025, Liu et al., 26 Dec 2025).
