SWE-Lego: Hybrid Dataset for Bug Resolution

Updated 11 January 2026

SWE-Lego is a large-scale, hybrid software engineering dataset that integrates real-world and synthetic Python bug scenarios with expert-generated trajectories.
The dataset employs a standardized JSON schema and rigorous Docker-based validation to ensure reproducible and high-quality task resolution data.
SWE-Lego enables curriculum learning through empirical difficulty stratification and drives state-of-the-art performance in automated software issue resolving benchmarks.

SWE-Lego is a large-scale, hybrid software engineering dataset engineered to advance supervised fine-tuning (SFT) methodologies for automated software issue resolving. It stands out for its integration of executable real-world and synthetic tasks, rigorously validated expert trajectories, and detailed annotation schemas designed for model-driven program repair and long-horizon reasoning tasks in Python-centric software repositories. Developed within the framework established by "SWE-Lego: Pushing the Limits of Supervised Fine-tuning for Software Issue Resolving," it forms the foundation for state-of-the-art open-source SWE models evaluated on demanding benchmarks such as SWE-bench Verified (Tao et al., 4 Jan 2026).

1. Dataset Composition and Hybrid Construction

SWE-Lego comprises a total of 32,119 executable task instances, partitioned as 18,409 real-world issues and 13,710 synthetic tasks. Each instance corresponds to a discrete Python software bug resolution scenario. Supervision is provided via teacher-generated action trajectories, with Qwen3-Coder-480B-A35B-Instruct used to generate up to 100 interaction steps per task. Post-hoc execution-based validation yields 14,110 fully-resolved (passing all tests) and an additional 4,000 semi-resolved trajectories (perfect file localization, not all tests pass), resulting in 18,110 validated expert trajectories, denoted as “18k” in summary metrics. Validation is performed by replaying trajectories in Docker sandbox environments with stringent audit against test manipulation (“Git hacking”).

Task Subset	Instance Count	Trajectories (Validated)
Real-world	18,409
Synthetic	13,710
Total Tasks	32,119
Resolved		14,110
Semi-Resolved		4,000
Total Trajectories		18,110

The real-world segment is derived from an automated fork of SWE-rebench, filtered for license compatibility, successful Docker builds, and quality-controlled pull requests. Synthetic tasks are generated in situ via the SWE-smith methodology, leveraging (a) LLM-Rewrite to induce realistic logic perturbations, and (b) AST-Reformulation for randomized lower-level code mutations. This combination yields both breadth and depth in scenario variability and bug typology.

2. Data Schema, Annotation Procedure, and Task Representation

Each SWE-Lego task adheres to a standardized JSON schema, encompassing the following elements:

problem_statement: Natural-language bug description (GitHub issue for real data; LLM-generated for synthetic).
FAIL_TO_PASS / PASS_TO_PASS: Explicit listing of test files/IDs reflecting behavioral change or invariance due to patch application.
golden_patch: Unified diff, representing the ground-truth bug fix (human PR for real, diff-inversion for synthetic).
image_name: Docker image tag guaranteeing sandbox reproducibility (unique per real instance; shared for synthetics per base commit).
language: Python by construction.

Expert trajectories are sequences of tool-oriented actions (execute_bash, str_replace_editor, think, finish), interleaved with environmental observations. Post-rollout, each trajectory is tagged as resolved (all tests pass, no tests modified), semi-resolved (100% buggy file recall, incomplete test pass), or unresolved, with clear recycling criteria for semi-resolved traces.

3. Difficulty Annotation, Empirical Difficulty Stratification, and Curriculum Integration

Task difficulty is operationalized via empirical trajectory length (number of agent-environment turns). The observed correlation (Pearson $r = -0.95$ ) between trajectory length and instance resolution rate underpins a stratification into three tiers:

Easy: 0–50 turns
Medium: 50–70 turns
Hard: 70–100 turns

A three-stage curriculum learning process is adopted: Stage 1 (Easy), Stage 2 (Easy+Medium), and Stage 3 (all tiers), with retention of earlier-stage data to mitigate catastrophic forgetting. This structure aligns error typology progression with model adaptation, informed by analyses showing sequential failure modes (from “Failed to Reproduce” to logic/localization errors).

4. Quality Control, Preprocessing, and Utilization in SFT

Dataset integrity is maintained through rigorous controls:

Repository gatekeeping: Only codebases building in Docker and passing self-tests are considered.
Git history sanitization: Post-issue commits (real) and all history (synthetic) are pruned to prevent information leakage.
Tool error correction: Automatic parsing and clipping for malformed editor calls.
Ablation-pruning: Removal of task_tracker tool after observed error amplification.
Trajectory filtering: Exclusion of test-modifying resolutions, evidenced to improve SFT resolve rate from 40.4% to 41.0%. Semi-resolved trajectory recycling yields an additional +1.2% downstream performance increase.

No fixed data splits are specified within the dataset; the SFT pipeline utilizes all validated trajectories, with downstream assessment on the SWE-bench Verified set. Each trajectory is serialized as a prompt–response JSONL file with step-level error masking, then ingested into a LLaMA-Factory-based SFT regime (up to 128k context tokens via RoPE scaling, e.g., YaRN), with 4 epochs of AdamW, cosine schedule, batch size 64, and tailored learning rates.

5. Bug and Issue-Type Diversity Statistics

SWE-Lego exhibits broad bug-type coverage over ten principal categories (Table A.1, Appendix), with distinct real-synthetic emphases:

Bug Category	Real (%)	Synthetic (%)
API/signature mismatch	26.6	12.8
Logic/conditional bug	29.9	49.4
Input validation/boundary error	16.1	5.9
Constructor/inheritance contract break	2.5	7.5
Missing import/symbol error	8.3	19.3

Real-world data are enriched for interface and boundary-related bugs, while synthetic instances disproportionately target internal logic and structure. This hybrid composition is intended to ensure robustness of learned models to the diverse error modalities encountered in practical bug resolution scenarios.

6. Significance, Benchmarking, and Downstream Applications

SWE-Lego’s scale (32k tasks, 18k validated expert trajectories), executable sandbox fidelity (3k+ repositories), and curation practices (code provenance control, tool-action normalizations, stratified curriculum) collectively establish it as a rigorous resource for advancing open-source SFT in program repair. Models fine-tuned solely on SWE-Lego achieve state-of-the-art resolve rates (SWE-Lego-Qwen3-8B: 42.2%; SWE-Lego-Qwen3-32B: 52.6%), with further improvements observed under test-time scaling and auto-verification (e.g., TTS@16: 8B from 42.2% to 49.6%, 32B from 52.6% to 58.8% on SWE-bench Verified). These results demonstrate that the dataset, in conjunction with careful curation and structured SFT, can drive robust, scalable, and reproducible advances in automated software issue resolution workflows (Tao et al., 4 Jan 2026).

Markdown Report Issue Upgrade to Chat

References (1)

SWE-Lego: Pushing the Limits of Supervised Fine-tuning for Software Issue Resolving (2026)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to SWE-Lego Dataset.