Skywork-SWE-32B: 32B Code Repair LLM
- Skywork-SWE-32B is an open-source, 32-billion-parameter large language model engineered for automated Python code repair, validated on the SWE-bench Verified benchmark.
- It leverages a fully automated multi-stage data curation pipeline to generate diverse, execution-validated training trajectories from real-world GitHub Python repositories.
- Empirical evaluations demonstrate log-linear improvements in pass@1 performance, reaching up to 47.0% with test-time scaling techniques such as best-of-8 sampling.
Skywork-SWE-32B is an open-source, 32-billion-parameter LLM agent for automated software engineering, particularly focused on code repair in real-world Python repositories. Fine-tuned from Qwen-2.5-Coder-32B-Instruct using multi-turn, execution-grounded trajectories, Skywork-SWE-32B establishes new empirical records on the SWE-bench Verified benchmark for its parameter class. This outcome is made possible by a data-centric approach: the model leverages a highly automated pipeline for dataset construction, amassing breadth and diversity significantly beyond prior works and providing sustained performance gains as the dataset size increases. Its release, along with containerized runtime environments and evaluation tools, is intended to facilitate reproducible research and further advances in LLM-based software engineering agents (Yang et al., 30 Apr 2025, Zeng et al., 24 Jun 2025).
1. Model Architecture and Parameterization
Skywork-SWE-32B employs the Qwen-2.5-Coder-32B-Instruct as its base, a transformer decoder specialized for code and natural text. The key architectural traits are as follows:
- Transformer stack: 64 decoder layers, each with self-attention and a position-wise feed-forward network.
- Model (hidden) dimension: 5,120.
- Attention mechanism: 40 query heads with 8 key-value heads (grouped-query attention, head dimension 128) and rotary positional embeddings.
- Feed-forward inner dimension: 27,648.
- Parameter count: approximately 32 billion.
- All parameters, including attention and feed-forward modules, are fine-tuned under mixed precision (bfloat16).
Inference is performed in the OpenHands agent framework (v0.32.0), with support for multi-turn task decomposition over long contexts (up to 32k tokens) (Zeng et al., 24 Jun 2025).
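The rotary positional embeddings noted above encode position by rotating pairs of query/key features; a minimal numpy sketch of the standard RoPE transform (illustrative only, not the model's actual kernels):

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary positional embedding to x of shape (seq_len, head_dim)."""
    seq_len, dim = x.shape  # dim must be even
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)   # (dim/2,) frequencies
    angles = np.outer(np.arange(seq_len), inv_freq)    # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                    # interleaved feature pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                 # 2-D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because each feature pair is rotated by an angle proportional to its position, inner products between rotated queries and keys depend only on relative offsets, which is what makes RoPE attractive for long contexts.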
2. Automated Data Curation Pipeline and Dataset Characteristics
The model’s data regime is built on a three-stage, fully automated pipeline:
- Stage A: Data Collection and Pre-filtering
- GitHub API mining yields Python repositories, filtered by a minimum star count and screened for overlap with SWE-bench Verified.
- PRs are selected only if they are merged, reference closed or fixed issues, and touch files with 'test' in their path.
- Repositories are reverted to their base commits and installed via unified setup scripts; only successful installations are retained.
- Stage B: Execution-based Validation and Runtime Environments
- Each candidate instance is encapsulated as a Docker image containing an OS base, a Conda Python 3.9 environment, the repository, its dependencies, and install/test scripts.
- Test validation: a task is kept only if at least one test transitions from failing (before the patch) to passing (after the gold patch), i.e., a FAIL_TO_PASS transition.
- Final dataset: 10,169 verified Python task instances from 2,531 repositories, each with natural language task specification and an isolated runtime environment.
- Stage C: Agent Trajectory Generation
- Up to 100-turn OpenHands rollouts are executed using proprietary LLMs for patch proposal and refinement.
- Only successful agent trajectories whose final predictions pass all tests are retained.
- 8,209 consistent trajectories are used for supervised fine-tuning (Zeng et al., 24 Jun 2025).
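The Stage B keep/discard rule above can be expressed as a small predicate over per-test outcomes before and after applying the gold patch (a schematic sketch; the pipeline's actual internals are not published in this form):

```python
def classify_tests(before, after):
    """before/after: dicts mapping test id -> True (passed) / False (failed)."""
    fail_to_pass = [t for t in after if after[t] and not before.get(t, False)]
    pass_to_pass = [t for t in after if after[t] and before.get(t, False)]
    return fail_to_pass, pass_to_pass

def keep_instance(before, after):
    # Stage B rule: retain the task only if at least one test flips
    # from failing to passing under the gold patch
    fail_to_pass, _ = classify_tests(before, after)
    return len(fail_to_pass) > 0
```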
This dataset is notable for its repository and temporal diversity (tasks span 2013–2025; long-tailed repo instance counts), its granularity (task-level Docker containers), and fine validation (unit-test transitions). Each data point encodes issue text, patch metadata, FAIL_TO_PASS and PASS_TO_PASS test labels, and other provenance details.
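A data point as described above can be pictured as a record like the following (field names are hypothetical illustrations; the released dataset defines its own schema):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SWETaskInstance:
    repo: str                 # e.g. "user/project"
    base_commit: str          # commit the agent starts from
    problem_statement: str    # natural-language issue text
    gold_patch: str           # reference diff resolving the issue
    fail_to_pass: List[str]   # tests that must flip failing -> passing
    pass_to_pass: List[str]   # tests that must keep passing
    created_at: str = ""      # provenance metadata
```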
3. Dataset Scaling Laws and Empirical Performance
A key finding is sustained performance scaling as a function of dataset size. On SWE-bench Verified, the pass@1 metric (the fraction of tasks solved on the first attempt) increases approximately logarithmically with the number of training trajectories N (i.e., pass@1 ≈ a·log N + b for fitted constants a, b), exhibiting no saturation even at the full set of 8,209 trajectories; a power-law fit also closely tracks the data. Empirically, Skywork-SWE-32B achieves 38.0% pass@1 (single rollout, no verifier, 100-turn interaction budget) and, with test-time scaling via best-of-8 sampling and critic scoring, 47.0% pass@1. This surpasses previous open-source 32B code agents such as Qwen-2.5-Coder-32B (6.4%), SWE-Gym-32B (20.6%), SWE-Dev-32B (36.6%), and SWE-smith-LM-32B (40.2%) in the same evaluation setting (Zeng et al., 24 Jun 2025, Yang et al., 30 Apr 2025).
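The reported log-linear trend can be checked by regressing pass@1 on log N; a short numpy sketch (the data below are synthetic, not the paper's measurements):

```python
import numpy as np

def fit_log_linear(n_trajectories, pass_at_1):
    # least-squares fit of: pass@1 = a * ln(N) + b
    a, b = np.polyfit(np.log(np.asarray(n_trajectories, dtype=float)),
                      np.asarray(pass_at_1, dtype=float), 1)
    return a, b
```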
4. Training Regimen and Test-Time Augmentation
Supervised fine-tuning is performed as follows:
- Framework: TorchTune (PyTorch).
- Hardware: 8× NVIDIA H800 GPUs.
- Precision: mixed (bfloat16).
- Optimizer: AdamW (weight decay 0.01), cosine learning rate schedule (peak 5 × 10⁻⁵).
- Batch: one trajectory per GPU (gradient accumulation for effective batch 32).
- Epochs: three over the full trajectory set.
- No use of curriculum learning or LoRA/adapters; all weights are updated.
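The cosine schedule with the stated 5 × 10⁻⁵ peak can be written down directly (a plain-Python sketch; the actual run uses TorchTune's scheduler, and warmup details are not specified here):

```python
import math

def cosine_lr(step, total_steps, peak_lr=5e-5, min_lr=0.0):
    # cosine decay from peak_lr at step 0 down to min_lr at total_steps
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```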
Test-time scaling comprises:
- Best-of-N Sampling: 8 independent rollouts per instance, temperature 0.8, with OpenHands critic selecting the highest-quality patch.
- Iteration Scaling: allowing up to 100 turns per agent rollout, with pass@1 increasing with higher turn limits (notably, 28.2% at 10 turns, 38.0% at 100 turns) (Zeng et al., 24 Jun 2025).
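Best-of-N selection reduces to sampling N candidate patches and keeping the one the critic scores highest. A schematic sketch, where `rollout` and `critic_score` are hypothetical stand-ins for the OpenHands agent and critic:

```python
def best_of_n(instance, rollout, critic_score, n=8):
    # generate n independent candidate patches, then return the
    # highest-scoring candidate according to the critic
    candidates = [rollout(instance) for _ in range(n)]
    return max(candidates, key=critic_score)
```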
5. Evaluation Methodology and Comparative Analysis
Evaluation is performed on SWE-bench Verified, a benchmark of repository-level bug-fixing tasks. The pass@k metric is used: for N benchmark instances and k attempts per instance, pass@k is the fraction of instances for which at least one of the k attempted patches passes all tests.
Skywork-SWE-32B’s 47.0% pass@1 (with test-time scaling) positions it above prior open-source and many proprietary systems in the same setting. In the single-rollout regime, its 38.0% pass@1 is competitive with GPT-4o (33.2%) and Gemini-2.0-Flash (20.0%). Analysis shows that even powerful LLMs obtain only 15–20% “zero-shot” rollout success on newly curated Skywork-SWE tasks, illustrating the benchmark’s rigor and the importance of high-quality, execution-validated trajectories (Zeng et al., 24 Jun 2025).
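For completeness, the standard unbiased pass@k estimator (introduced with the HumanEval benchmark) for n samples of which c are correct is shown below; SWE-bench leaderboards typically report pass@1 as a plain fraction of solved instances:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of P(at least one of k sampled attempts is correct),
    given n total attempts per instance of which c passed."""
    if n - c < k:
        return 1.0  # fewer failures than draws: some draw must be correct
    return 1.0 - comb(n - c, k) / comb(n, k)
```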
6. Model Release, Dependencies, and Reproducibility
The Skywork-SWE-32B model checkpoint is released under Apache 2.0 (https://huggingface.co/skywork-ai/skywork-swe-32b), accompanied by all code, dataset builds, Docker recipes for test environments, and evaluation commands. Key dependencies include Python 3.9, Docker, Conda, pytest, hypothesis, mock, setuptools, and OpenHands (v0.32.0).
A standard procedure involves:
```bash
git clone https://github.com/skywork-ai/skywork-swe-32b
cd skywork-swe-32b
docker build -t skywork-swe:myproj \
  --build-arg REPO=https://github.com/user/myproj \
  --build-arg COMMIT=<base_commit> \
  -f Dockerfile.instance .
openhands eval \
  --model skywork-swe-32b \
  --framework openhands \
  --tasks swe-bench-verified \
  --max-rollouts 8 \
  --max-turns 100
cd runtime-images/myproj
pytest --maxfail=1 --disable-warnings -q
```
7. Significance and Implications
The Skywork-SWE-32B project demonstrates that systematic scaling of dataset breadth and diversity, coupled with execution-grounded validation and multi-turn agent fine-tuning, yields consistent, log-linear improvements in LLM-based code repair performance, without observable saturation. The methodology provides a scalable template for future data-driven work on LLM software engineering agents. A plausible implication is that further increases in instance count and trajectory quality are likely to yield continuing improvements, pending architectural or algorithmic advances. The comprehensive release of code, Docker environments, and model weights lowers the barrier to entry for academic and industrial experimentation in automated software engineering (Yang et al., 30 Apr 2025, Zeng et al., 24 Jun 2025).