SWE-Lego: Hybrid Dataset for Bug Resolution
- SWE-Lego is a large-scale, hybrid software engineering dataset that integrates real-world and synthetic Python bug scenarios with expert-generated trajectories.
- The dataset employs a standardized JSON schema and rigorous Docker-based validation to ensure reproducible and high-quality task resolution data.
- SWE-Lego enables curriculum learning through empirical difficulty stratification and drives state-of-the-art performance in automated software issue resolving benchmarks.
SWE-Lego is a large-scale, hybrid software engineering dataset engineered to advance supervised fine-tuning (SFT) methodologies for automated software issue resolving. It stands out for its integration of executable real-world and synthetic tasks, rigorously validated expert trajectories, and detailed annotation schemas designed for model-driven program repair and long-horizon reasoning tasks in Python-centric software repositories. Developed within the framework established by "SWE-Lego: Pushing the Limits of Supervised Fine-tuning for Software Issue Resolving," it forms the foundation for state-of-the-art open-source SWE models evaluated on demanding benchmarks such as SWE-bench Verified (Tao et al., 4 Jan 2026).
1. Dataset Composition and Hybrid Construction
SWE-Lego comprises a total of 32,119 executable task instances, partitioned as 18,409 real-world issues and 13,710 synthetic tasks. Each instance corresponds to a discrete Python software bug resolution scenario. Supervision is provided via teacher-generated action trajectories, with Qwen3-Coder-480B-A35B-Instruct used to generate up to 100 interaction steps per task. Post-hoc execution-based validation yields 14,110 fully-resolved (passing all tests) and an additional 4,000 semi-resolved trajectories (perfect file localization, not all tests pass), resulting in 18,110 validated expert trajectories, denoted as “18k” in summary metrics. Validation is performed by replaying trajectories in Docker sandbox environments with stringent audit against test manipulation (“Git hacking”).
| Task Subset | Instance Count | Trajectories (Validated) |
|---|---|---|
| Real-world | 18,409 | |
| Synthetic | 13,710 | |
| Total Tasks | 32,119 | |
| Resolved | 14,110 | |
| Semi-Resolved | 4,000 | |
| Total Trajectories | 18,110 |
The real-world segment is derived from an automated fork of SWE-rebench, filtered for license compatibility, successful Docker builds, and quality-controlled pull requests. Synthetic tasks are generated in situ via the SWE-smith methodology, leveraging (a) LLM-Rewrite to induce realistic logic perturbations, and (b) AST-Reformulation for randomized lower-level code mutations. This combination yields both breadth and depth in scenario variability and bug typology.
2. Data Schema, Annotation Procedure, and Task Representation
Each SWE-Lego task adheres to a standardized JSON schema, encompassing the following elements:
- problem_statement: Natural-language bug description (GitHub issue for real data; LLM-generated for synthetic).
- FAIL_TO_PASS / PASS_TO_PASS: Explicit listing of test files/IDs reflecting behavioral change or invariance due to patch application.
- golden_patch: Unified diff, representing the ground-truth bug fix (human PR for real, diff-inversion for synthetic).
- image_name: Docker image tag guaranteeing sandbox reproducibility (unique per real instance; shared for synthetics per base commit).
- language: Python by construction.
Expert trajectories are sequences of tool-oriented actions (execute_bash, str_replace_editor, think, finish), interleaved with environmental observations. Post-rollout, each trajectory is tagged as resolved (all tests pass, no tests modified), semi-resolved (100% buggy file recall, incomplete test pass), or unresolved, with clear recycling criteria for semi-resolved traces.
3. Difficulty Annotation, Empirical Difficulty Stratification, and Curriculum Integration
Task difficulty is operationalized via empirical trajectory length (number of agent-environment turns). The observed correlation (Pearson ) between trajectory length and instance resolution rate underpins a stratification into three tiers:
- Easy: 0–50 turns
- Medium: 50–70 turns
- Hard: 70–100 turns
A three-stage curriculum learning process is adopted: Stage 1 (Easy), Stage 2 (Easy+Medium), and Stage 3 (all tiers), with retention of earlier-stage data to mitigate catastrophic forgetting. This structure aligns error typology progression with model adaptation, informed by analyses showing sequential failure modes (from “Failed to Reproduce” to logic/localization errors).
4. Quality Control, Preprocessing, and Utilization in SFT
Dataset integrity is maintained through rigorous controls:
- Repository gatekeeping: Only codebases building in Docker and passing self-tests are considered.
- Git history sanitization: Post-issue commits (real) and all history (synthetic) are pruned to prevent information leakage.
- Tool error correction: Automatic parsing and clipping for malformed editor calls.
- Ablation-pruning: Removal of task_tracker tool after observed error amplification.
- Trajectory filtering: Exclusion of test-modifying resolutions, evidenced to improve SFT resolve rate from 40.4% to 41.0%. Semi-resolved trajectory recycling yields an additional +1.2% downstream performance increase.
No fixed data splits are specified within the dataset; the SFT pipeline utilizes all validated trajectories, with downstream assessment on the SWE-bench Verified set. Each trajectory is serialized as a prompt–response JSONL file with step-level error masking, then ingested into a LLaMA-Factory-based SFT regime (up to 128k context tokens via RoPE scaling, e.g., YaRN), with 4 epochs of AdamW, cosine schedule, batch size 64, and tailored learning rates.
5. Bug and Issue-Type Diversity Statistics
SWE-Lego exhibits broad bug-type coverage over ten principal categories (Table A.1, Appendix), with distinct real-synthetic emphases:
| Bug Category | Real (%) | Synthetic (%) |
|---|---|---|
| API/signature mismatch | 26.6 | 12.8 |
| Logic/conditional bug | 29.9 | 49.4 |
| Input validation/boundary error | 16.1 | 5.9 |
| Constructor/inheritance contract break | 2.5 | 7.5 |
| Missing import/symbol error | 8.3 | 19.3 |
Real-world data are enriched for interface and boundary-related bugs, while synthetic instances disproportionately target internal logic and structure. This hybrid composition is intended to ensure robustness of learned models to the diverse error modalities encountered in practical bug resolution scenarios.
6. Significance, Benchmarking, and Downstream Applications
SWE-Lego’s scale (32k tasks, 18k validated expert trajectories), executable sandbox fidelity (3k+ repositories), and curation practices (code provenance control, tool-action normalizations, stratified curriculum) collectively establish it as a rigorous resource for advancing open-source SFT in program repair. Models fine-tuned solely on SWE-Lego achieve state-of-the-art resolve rates (SWE-Lego-Qwen3-8B: 42.2%; SWE-Lego-Qwen3-32B: 52.6%), with further improvements observed under test-time scaling and auto-verification (e.g., TTS@16: 8B from 42.2% to 49.6%, 32B from 52.6% to 58.8% on SWE-bench Verified). These results demonstrate that the dataset, in conjunction with careful curation and structured SFT, can drive robust, scalable, and reproducible advances in automated software issue resolution workflows (Tao et al., 4 Jan 2026).