SWE-Lego: Modular Software Engineering
- SWE-Lego is a modular framework that leverages supervised fine-tuning, reinforcement learning, and Lego metaphors for effective software and agent reasoning.
- It integrates real-world datasets from pull requests with synthetic bug injections to achieve state-of-the-art performance in automated software issue resolution.
- Through curriculum-based training, test-time scaling, and verifier-guided rollouts, SWE-Lego improves accuracy and efficiency in complex, compositional tasks.
SWE-Lego refers to a suite of approaches, datasets, and frameworks leveraging supervised learning, reinforcement learning, interactive simulation, and/or physical metaphor to address complex challenges in software engineering and reasoning—often using “Lego” as a conceptual or practical scaffold. Contemporary research spans automated software issue resolving via LLMs (SWE-Lego; Tao et al., 4 Jan 2026), role- and symmetry-aware multi-agent control (LEGO GNNs; Wang et al., 17 Sep 2025), interactive visual reasoning with Lego analogues (LTRON; Walsman et al., 2022), and pedagogical uses of Lego metaphors for teaching software quality assurance (Morales-Trujillo, 2021). The following sections synthesize these lines of work under the unifying theme of modular, compositional, and curriculum-driven approaches for software and agent reasoning.
1. Supervised Fine-Tuning for Software Issue Resolving: SWE-Lego Framework
SWE-Lego (Tao et al., 4 Jan 2026) introduces a supervised fine-tuning (SFT) methodology designed to achieve state-of-the-art results on the SWE-bench Verified benchmark for software engineering issue resolution, using solely SFT without reliance on complex pipelines involving intermediate pretraining or reinforcement learning. The framework is constructed on three pillars:
a) SWE-Lego Dataset Construction
The dataset comprises 32,100 high-quality task instances and 18,100 validated agent trajectories, integrating both real and synthetic examples:
- Real-world tasks: 18,400 instances derived from merged GitHub pull requests, including issue descriptions, full Dockerized pre-merge code snapshots, PR diffs as golden patches, and partitioned test sets (FAIL-TO-PASS and PASS-TO-PASS). Each task affects on average 3.7 files, 9.5 hunks, and 138 lines.
- Synthetic tasks: 13,700 instances via SWE-smith bug injection (LLM re-writes based on docstrings/signatures; AST-level random transformations). All synthetic bugs per repository share a single sandbox, touch on average 1 file, 1.3 hunks, and 18.8 lines.
- Trajectory generation: The teacher agent is Qwen3-Coder-480B with OpenHands, capped at 100 interaction turns. Three "hygiene" interventions ensure trajectory quality: sanitizing commit history, correcting malformed tool calls, and restricting the set of allowed agent tools.
- Validation criteria: Trajectories are labeled fully resolved (all tests pass, no regressions) or semi-resolved (perfect buggy file localization, not all tests pass). The released set contains 14,100 fully resolved and 4,000 semi-resolved validated trajectories.
b) Refined SFT: Step-level Error Masking and Curriculum
SWE-Lego introduces two critical refinements to the standard SFT protocol:
- Step-level error masking: During SFT, token-level loss gradients for agent actions taken in any turn yielding an environment error are masked out. This targets the problem of erroneous action reinforcement in conventional SFT.
- Difficulty-based curriculum: Tasks are bucketed by agent trajectory length (turns): Easy (0–50), Medium (51–70), Hard (71–100). SFT proceeds in three curriculum stages, sequentially introducing higher-difficulty tasks. Empirical analysis shows a strong negative correlation (r = –0.95) between trajectory length and resolve rate.
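A minimal sketch of these two refinements, assuming per-turn token counts and environment-error flags are available (function names are hypothetical):

```python
# Illustrative sketch (not the released training code).

def token_loss_weights(turns):
    """turns: list of (num_tokens, had_env_error) per agent turn.
    Tokens from turns that raised an environment error get zero loss weight."""
    weights = []
    for num_tokens, had_env_error in turns:
        w = 0.0 if had_env_error else 1.0   # step-level error masking
        weights.extend([w] * num_tokens)
    return weights

def difficulty_bucket(num_turns: int) -> str:
    """Curriculum bucket by trajectory length, per the paper's cutoffs."""
    if num_turns <= 50:
        return "easy"
    if num_turns <= 70:
        return "medium"
    return "hard"
```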
c) Test-Time Scaling (TTS) via Verifier-Guided Rollouts
SWE-Lego further augments SFT performance at inference through parallel and sequential scaling:
- Sequential scaling: Increasing the maximum number of agent turns to 100–140 yields most of the gains; beyond that, further increases plateau due to truncation.
- Parallel scaling: Generate K rollouts per test case, then select the best candidate via a trained verifier. The generative verifier prompts the model to judge resolution success ("yes"/"no") and uses the output probability ratio as a confidence score.
- Empirical boost: For the 8B model, resolve rate improves from 42.2% (vanilla SFT) to 49.6% (TTS@16); for 32B, from 52.6% to 58.8%.
| Model | SFT Only | TTS@16 (Verifier) |
|---|---|---|
| SWE-Lego-Qwen3-8B | 42.2% | 49.6% |
| SWE-Lego-Qwen3-32B | 52.6% | 58.8% |
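The parallel-scaling selection step can be sketched as a best-of-K pick over verifier scores; the toy scorer here is a stand-in for the actual verifier model:

```python
# Hedged sketch of verifier-guided parallel test-time scaling: generate K
# candidate patches, score each, keep the highest-confidence one.

def best_of_k(rollouts, verifier_score):
    """rollouts: list of candidate patches; verifier_score: patch -> float."""
    return max(rollouts, key=verifier_score)

# Toy usage with fake scores (real scores would come from the verifier):
scores = {"patch_a": 0.31, "patch_b": 0.87, "patch_c": 0.55}
chosen = best_of_k(list(scores), scores.get)
```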
Significant ablation findings include that scaling the hybrid dataset (real + synthetic) yields a +25.6-point boost (from 23.2% to 48.8% with Qwen3-32B), while error masking and curriculum refinements contribute +3.8 points, and TTS with generative verifiers adds a further +6.2 points absolute (Tao et al., 4 Jan 2026).
2. Modular Datasets and Task Synthesis
SWE-Lego's dataset design reflects the modular, compositional ethos underlying the Lego metaphor:
- Base plate curation: 3,251 Python OSS repositories (via SWE-rebench) are Dockerized and filtered for build- and test-viability.
- Hybrid augmentation: Synthetic data augments the sparse but deep real PR-based tasks, growing the pool of validated trajectories from ~5K to ~14K when five synthetic instances are added per repository; model resolve rates improve monotonically as synthetic coverage increases.
- Tooling constraints: Strict sandboxing, tool whitelisting ({execute_bash, str_replace_editor, think, finish}), and post-hoc patch validation enforce the integrity and reproducibility of the dataset.
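The tool-whitelisting constraint can be illustrated with a small validation helper (a sketch; the step/trajectory schema is an assumption):

```python
# Minimal sketch of the tool-whitelisting check: flag any trajectory step
# that calls a tool outside the allowed set listed above.
ALLOWED_TOOLS = {"execute_bash", "str_replace_editor", "think", "finish"}

def violating_steps(trajectory):
    """trajectory: list of dicts with a 'tool' key; returns offending indices."""
    return [i for i, step in enumerate(trajectory)
            if step["tool"] not in ALLOWED_TOOLS]
```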
This setup enables SFT-only methods to reach or exceed the performance of hybrid RL+SFT models at comparable model sizes.
3. Test-Time Scaling and Verifier Paradigms
Test-Time Scaling (TTS) elevates SWE-Lego beyond standard SFT limits by leveraging agent diversity and confidence-driven selection:
- Verifier architectures:
- Regressive: Binary classifier head fine-tuned with BCE loss.
- Generative: Next-token prediction over "yes"/"no" verdicts, with generative scores directly aligned to pretraining.
- Training regime: Verifiers are trained on 18K trajectories (5K resolved, 13K unresolved) using Qwen3-8B or Qwen3-Coder-30B-A3B backbones.
- Verifier performance: Generative verifiers consistently outperform both regressive versions and comparator baselines (e.g., OpenHands-Critic-32B) for K>1, yielding superior scaling properties for parallel rollouts.
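A minimal sketch of the generative verifier's confidence score, assuming access to the raw logits of the "yes" and "no" verdict tokens at the judgment position:

```python
# Sketch: confidence as the softmax-normalized "yes" mass over the two
# verdict tokens, i.e. p(yes) / (p(yes) + p(no)).
import math

def verifier_confidence(yes_logit: float, no_logit: float) -> float:
    m = max(yes_logit, no_logit)        # subtract max to stabilize the exps
    ey = math.exp(yes_logit - m)
    en = math.exp(no_logit - m)
    return ey / (ey + en)
```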
This staged inference protocol is directly responsible for observable gains under fixed latency or resource budgets.
4. Modular Reasoning and Equivariant Architectures in Agent Control
The principles motivating SWE-Lego appear in other research utilizing "LEGO" as a modular, compositional framework for agent reasoning:
- LEGO GNNs for MARL (Wang et al., 17 Sep 2025): The Local-Canonicalization Equivariant Graph Neural Network framework introduces modularity via role-based GNNs that encode both permutation symmetry and Euclidean transformations, supporting generalization to variable agent populations and geometric arenas. The canonicalization operator strips global pose, while the GNN architecture pools over per-role agent graphs. This design principle achieves marked gains in sample efficiency, zero-shot team scaling, and robustness in both simulated and real-world drone experiments.
- LTRON interactive assembly (Walsman et al., 2022): Modular sequence-to-sequence architectures ("StudNet-A/B") process visual observations of decomposed and reassembled Lego structures, with performance bottlenecked not by single-frame perception, but by long-horizon, compositional planning and memory—an outcome directly mirroring the chunked, composable structure of both real and synthetic Lego scenes.
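As a toy 2D illustration of the canonicalization idea sketched for LEGO GNNs (a deliberate simplification of the actual operator, with hypothetical names):

```python
# Toy sketch of local canonicalization: each agent re-expresses teammate
# positions in its own frame, stripping global translation and rotation.
import math

def canonicalize(positions, ego_index, heading):
    """positions: list of (x, y) points; returns positions relative to the
    ego agent, rotated so the ego heading points along +x."""
    ex, ey = positions[ego_index]
    c, s = math.cos(-heading), math.sin(-heading)
    return [((x - ex) * c - (y - ey) * s,
             (x - ex) * s + (y - ey) * c) for x, y in positions]
```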
A plausible implication is that modular, curriculum-based training and equivariant architectures are advantageous for complex, structured software- and agent-reasoning tasks characterized by compositionality and analogical transfer.
5. Pedagogical Applications and Quality Assurance via Lego Metaphors
The Lego metaphor extends into software engineering education:
- KUALI-Brick Activity (Morales-Trujillo, 2021):
- Assigning teams to design, build, assess, and improve artifacts under explicit quality attribute and process constraints.
- Role rotation (builder, assessor), peer review analogues, and the application of ISO/IEC 25010-aligned metrics—defect density, efficiency, build throughput—directly in the Lego domain.
- Iterative reflection, Plan–Do–Check–Act cycles, and process tailoring (e.g., task-oriented, role-oriented, and self-organized approaches).
- Outcomes: Participants report strong engagement (mean fun rating of 4.96/5), reinforced learning (4.28/5), and an enhanced grasp of abstract quality principles.
6. Broader Implications and Future Research Directions
The conceptual and methodological innovations associated with SWE-Lego highlight the following themes:
- Synthetic-real data complementarity: Hybridization is preferable to reliance on either deep, scarce real tasks or shallow, abundant synthetic alone.
- Curriculum learning and difficulty scheduling: Proper pacing across difficulty buckets measurably boosts generalization and convergence rate for SFT-based systems.
- Verifier-guided inference: Generative verifier-based selection under parallel rollouts is a simple yet powerful mechanism to exploit model stochasticity at inference, extensible to other structured domains beyond software issue resolution.
- Modular, compositional action spaces: Both in agent-based and visual reasoning contexts, modular representations—mirrored metaphorically by Lego's physical blocks and implemented concretely in model design—support robust, transferable reasoning.
A plausible implication is that future advances in automated software reasoning, multi-agent coordination, and interactive learning will further exploit the modular, compositional, and curriculum-driven insights crystallized by SWE-Lego and related frameworks (Tao et al., 4 Jan 2026, Wang et al., 17 Sep 2025, Walsman et al., 2022, Morales-Trujillo, 2021).