Skywork-SWE: LLM Software Engineering Dataset
- Skywork-SWE is a large-scale dataset, model, and evaluation framework that automates the curation of over 10,000 real-world Python task instances in software engineering.
- It employs a three-stage automated pipeline—from repository filtering to Docker-based test validation—to generate high-fidelity, reproducible LLM agent trajectories.
- Quantitative analysis reveals that each doubling of training data yields a 3–4 percentage-point pass@1 improvement, and the resulting model sets state-of-the-art results among Qwen2.5-Coder-32B-based models on SWE-bench Verified.
Skywork-SWE is a large-scale dataset, model, and evaluation framework designed for advancing LLM agents in software engineering (SWE). It establishes a new paradigm by providing an automated, scalable data-curation pipeline, resulting in over 10,000 real-world, runtime-validated Python task instances, and demonstrates quantitative data scaling laws for LLM SWE capabilities. Skywork-SWE sets state-of-the-art (SOTA) performance for Qwen2.5-Coder-32B-based models on the SWE-bench Verified benchmark while enabling reproducibility and transparent benchmarking through public release of checkpoints and environments (Zeng et al., 24 Jun 2025).
1. Automated Pipeline for SWE Data Curation
Skywork-SWE introduces an incremental, automated, three-stage pipeline optimized for both volume and quality of task instances:
- Stage A: Code File Filtering & Repository Selection. From approximately 151,000 GitHub repositories (with stars > 0 and excluding SWE-bench Verified repositories), the pipeline retains 8,472 repositories with complete metadata. It collects 146,568 merged pull requests (PRs) that close a linked issue and touch files whose path or name contains “test” or “testing.”
- Stage B: Environment Image Creation & Automated Unit-Test Validation. Each PR is checked out at its base commit, and a unified install command (python=3.9; apt-get install -y …; pip install -r requirements.txt …; pytest …) is used to build three-layered Docker images (base, environment, instance) for environment caching. Two rounds of tests are run: (1) before (“empty”) and (2) after (“gold”) the bug fix is applied. PRs are retained only if at least one test exhibits a FAIL→PASS transition.
- Stage C: Agent-Trajectory Generation & Validation. Using proprietary LLMs (e.g., Gemini-2.5-Pro, GPT-4.1) within the OpenHands v0.32.0 framework and up to 100 agent interaction turns, candidate patches are generated and validated by automated testing within the Docker environments. Only trajectories in which all tests pass are retained.
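The Stage B retention rule can be sketched as a simple transition check over two test runs. This is a minimal illustration, not the pipeline's actual code; test names and result dictionaries are hypothetical:

```python
def classify_transitions(before: dict, after: dict) -> dict:
    """Map each test to its status transition, e.g. 'FAIL->PASS'.

    `before` holds results at the base commit (no fix applied);
    `after` holds results with the gold patch applied.
    """
    return {name: f"{before.get(name, 'MISSING')}->{status}"
            for name, status in after.items()}

def keep_instance(before: dict, after: dict) -> bool:
    # A PR is retained only if at least one test flips FAIL -> PASS.
    return "FAIL->PASS" in classify_transitions(before, after).values()

# Hypothetical "empty" and "gold" runs for one PR:
empty_run = {"test_parse": "FAIL", "test_io": "PASS"}
gold_run = {"test_parse": "PASS", "test_io": "PASS"}
print(keep_instance(empty_run, gold_run))  # True: test_parse flips FAIL->PASS
```

Requiring at least one FAIL→PASS transition filters out PRs whose tests never exercised the bug, while the PASS→PASS tests guard against regressions.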
Skywork-SWE Pipeline Statistics
| Stage | #Repos | #PRs/Instances |
|---|---|---|
| Raw GitHub metadata | 151,472 | — |
| After metadata filtering | 8,472 | — |
| After PR attribute filtering | — | 146,568 |
| After install-check (Stage A3) | — | 23,389 |
| After execution-based validation (Stage B) | — | 10,169 |
| Validated agent trajectories (Stage C) | — | 8,209 |
This pipeline is designed to minimize manual annotation, automate environment setup, and ensure that each included instance can be reproducibly tested at scale (Zeng et al., 24 Jun 2025).
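The funnel in the table implies steep attrition at each stage; the stage-to-stage retention ratios follow directly from the reported counts:

```python
# Instance counts from the Skywork-SWE curation funnel (reported above).
funnel = [
    ("PR attribute filtering", 146_568),
    ("install check", 23_389),
    ("execution-based validation", 10_169),
    ("validated agent trajectories", 8_209),
]

# Print how much of the previous stage each stage retains.
for (prev_name, prev_count), (name, count) in zip(funnel, funnel[1:]):
    print(f"{name}: {count / prev_count:.1%} of '{prev_name}'")
```

Roughly 16% of candidate PRs survive the install check, about 43% of those survive execution-based validation, and about 81% of validated instances yield a fully passing agent trajectory.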
2. Dataset Structure and Task Representation
The final Skywork-SWE dataset contains 10,169 Python SWE task instances extracted from 2,531 GitHub repositories. Each instance comprises:
- A natural-language problem description (from the original issue text)
- Zero or more “hints” from associated PR discussions
- A multi-file, multi-hunk golden patch
- A dedicated, reproducible Docker image for the complete runtime environment
- An associated unit-test harness (standardized with pytest)
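The per-instance components listed above can be modeled as a simple record. The field names below are illustrative, not the dataset's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class SWETaskInstance:
    """One Skywork-SWE task instance (field names are illustrative)."""
    repo: str                 # e.g. "owner/project"
    base_commit: str          # commit the agent starts from
    problem_statement: str    # natural-language issue text
    hints: list = field(default_factory=list)  # PR-discussion hints (may be empty)
    gold_patch: str = ""      # multi-file, multi-hunk reference diff
    docker_image: str = ""    # reproducible runtime environment
    test_cmd: str = "pytest"  # standardized unit-test harness

inst = SWETaskInstance(
    repo="example/repo",
    base_commit="abc123",
    problem_statement="TypeError raised when parsing an empty config",
)
print(inst.test_cmd)  # pytest
```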
Task instances are formatted in prompts for LLM-based agents as follows:
```
System: You are an SWE agent.
User: • Issue description …
      • Files under <repo> at commit <base_commit>.
Agent: <series of edit/diff actions>
…
Agent (final): propose patch <diff>
```
A validator applies the patch and runs pytest to verify outcomes. Only instances with validated transitions—typically FAIL→PASS for test cases—are included. This design ensures high-fidelity, iterative tasks suitable for evaluating agent-driven, multi-turn repair and synthesis (Zeng et al., 24 Jun 2025).
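The validation step described above can be sketched as a patch-apply-and-test routine. This is a generic sketch, not the project's actual validator; the `git apply` / `pytest` invocation and the injectable `run` parameter (used here for testability) are assumptions:

```python
import subprocess

def validate_patch(repo_dir: str, patch_file: str, run=subprocess.run) -> bool:
    """Apply a candidate patch and re-run the test harness (sketch only).

    The patch is applied with `git apply`, then the suite is re-run
    with pytest; the instance counts as resolved only if the tests
    now pass (exit code 0).
    """
    applied = run(["git", "apply", patch_file],
                  cwd=repo_dir, capture_output=True)
    if applied.returncode != 0:
        return False  # malformed diff: reject the trajectory outright
    tests = run(["pytest", "-x", "-q"],
                cwd=repo_dir, capture_output=True)
    return tests.returncode == 0
```

Running this inside the instance's dedicated Docker image is what makes the FAIL→PASS check reproducible across machines.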
3. Data Scaling Laws in LLM-Driven SWE
Empirical analysis of Skywork-SWE-32B’s performance (pass@1 on SWE-bench Verified) as a function of training set size reveals a log-linear trend that does not plateau at the current dataset scale (N=8,209). Two fits are presented:
- Logarithmic fit: pass@1 ≈ a · log N + b
- Power-law fit: pass@1 ≈ c · N^α
Data points illustrate the gain:
- smallest subset: pass@1 = 31.8%
- intermediate subset: pass@1 = 36.1%
- full set (N = 8,209): pass@1 = 38.0%
Qualitatively, every doubling of data yields a 3–4 percentage-point improvement in pass@1. This scaling pattern closely mirrors classical language-model scaling laws, with performance varying logarithmically or as a power of dataset size. There is no evidence of saturation at the current corpus size, indicating that additional high-quality data should yield further gains (Zeng et al., 24 Jun 2025).
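A log-linear trend of this kind can be fit in closed form. The subset sizes below are hypothetical (only the final point, the full set at 38.0%, matches a figure quoted in the text); the fit illustrates the method, not the paper's reported coefficients:

```python
import math

# Hypothetical (dataset size N, pass@1 %) points; only the last
# point (full set, 38.0%) is a figure quoted in the text.
points = [(2_000, 31.8), (4_000, 36.1), (8_209, 38.0)]

# Closed-form least squares for pass@1 ≈ a * ln(N) + b.
xs = [math.log(n) for n, _ in points]
ys = [y for _, y in points]
k = len(points)
xbar, ybar = sum(xs) / k, sum(ys) / k
a = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
     / sum((x - xbar) ** 2 for x in xs))
b = ybar - a * xbar
print(f"pass@1 ≈ {a:.2f} * ln(N) + {b:.2f}")
print(f"gain per doubling ≈ {a * math.log(2):.1f} points")
```

With these illustrative points, the slope times ln 2 lands near 3 points per doubling, consistent with the 3–4 point figure above.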
4. Model Architecture and Training Details
- Base Model: Qwen2.5-Coder-32B-Instruct (32 billion parameters)
- Fine-tuning Dataset: 8,209 validated, multi-turn agent trajectories from Skywork-SWE
- Training Configuration:
- Framework: PyTorch + TorchTune
- Hardware: 8 × NVIDIA H800 GPUs
- Optimizer: AdamW, weight_decay = 0.01
- Learning rate: cosine decay schedule
- Batch size: 32 per GPU
- Epochs: 3 (≈12 hours total)
- Max context length: 32,768 tokens
Training is fully supervised (next-action prediction), not reinforcement learning. Long-context and iterative-capability demands are met through the agentic prompt structure and supervised demonstration (Zeng et al., 24 Jun 2025).
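The cosine learning-rate schedule mentioned above follows a standard form; a generic sketch (the paper's exact peak value and warmup settings are not reproduced here, so `peak` below is hypothetical):

```python
import math

def cosine_lr(step: int, total_steps: int, peak_lr: float,
              warmup_steps: int = 0, min_lr: float = 0.0) -> float:
    """Cosine decay with optional linear warmup (generic sketch)."""
    if warmup_steps and step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear ramp to peak
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

peak = 1e-5  # hypothetical peak learning rate
print(cosine_lr(0, 1000, peak))     # starts at the peak (no warmup)
print(cosine_lr(500, 1000, peak))   # half the peak at the midpoint
print(cosine_lr(1000, 1000, peak))  # decays to min_lr at the end
```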
5. Benchmarking Results and Test-Time Scaling
Skywork-SWE-32B is evaluated on the 500-instance SWE-bench Verified benchmark, with resolve rate defined as pass@1 using a single rollout. Baseline and comparative model results are as follows:
| Model (OpenHands v0.32.0) | Params | resolve@1 | +TTS@8 |
|---|---|---|---|
| Qwen-2.5-Coder-32B-Instruct (baseline) | 32B | 6.4% | 20.6% |
| SWE-Dev-32B | 32B | 36.6% | 42.9%¹ |
| SWE-Smith-LM-32B | 32B | 40.2% | 45.1%¹ |
| Devstral (Mistral-24B) | 24B | 46.8% | 50.2%¹ |
| Skywork-SWE-32B | 32B | 38.0% | 47.0% |
¹ Approximate TTS numbers from reported Best-of-8 trends.
Key findings:
- Skywork-SWE-32B achieves 38.0% pass@1, lifting the Qwen-2.5-Coder-32B baseline from 6.4% to 38.0% without a verifier or ensembling.
- With Test-Time Scaling (TTS, Best-of-8 with critic selection), performance increases to 47.0%, +9 points over vanilla inference.
- The impact of TTS is consistent across proprietary SOTA models (e.g., Claude v3.7, 56%), indicating inference-time ensembling generalizes across agents and architectures (Zeng et al., 24 Jun 2025).
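The Best-of-8 test-time scaling described above reduces to selecting the highest-scoring candidate. A minimal sketch, where the critic is a hypothetical stand-in for the learned verifier:

```python
def best_of_n(candidates, critic):
    """Best-of-N test-time scaling: generate N candidate patches and
    keep the one the critic scores highest (sketch; the critic here
    stands in for the verifier described above)."""
    return max(candidates, key=critic)

# Hypothetical candidate patches paired with toy critic scores.
patches = [("patch_a", 0.31), ("patch_b", 0.87), ("patch_c", 0.55)]
best = best_of_n(patches, critic=lambda p: p[1])
print(best[0])  # patch_b
```

The gain from TTS thus depends entirely on critic quality: with a random critic, Best-of-N collapses back to single-rollout pass@1 in expectation.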
6. Insights and Prospective Directions
The automated pipeline established by Skywork-SWE demonstrates that curated data volume remains the principal limiting factor in LLM-driven SWE. Key observations:
- The scaling trend (logarithmic or power-law in dataset size) underscores that investing in larger, rigorously validated SWE corpora yields continued improvements.
- Agentic framework choice (OpenHands) and high-fidelity test validation exert more influence than model parameter count alone.
- Public release of the Skywork-SWE-32B checkpoint and all task-specific Docker images permits reproducible, transparent benchmarking in open research.
Future trajectories include:
- Multi-language extension: Application to languages beyond Python (e.g., Java, JavaScript) as noted with the proposed Multi-SWE-Bench.
- Online reinforcement learning: Utilizing validated test outputs as reward feedback to optimize long-horizon policies, referencing potential directions such as SkyRL.
- Automated dependency handling: Developing LLM agents to autonomously infer and install repository-specific dependencies, thereby augmenting yield over environments filtered out by unified-install constraints.
By systematizing data curation and validation while empirically characterizing data scaling effects, Skywork-SWE provides a foundation for more robust, scalable, and open LLM-based agents in real-world software engineering contexts (Zeng et al., 24 Jun 2025).