SWE-agent-LM-32B: Open-Source Code Agent
- SWE-agent-LM-32B is a 32-billion-parameter Transformer model designed for code-editing, debugging, and automated repository modifications.
- It employs large-scale supervised fine-tuning on synthetic and real-world datasets using a hybrid OEC+BC training regime to enhance multi-turn reasoning and test-time performance.
- Integrated in agent frameworks, the model leverages Dockerized environments and surrogate feedback models to enable robust, reproducible automation in software engineering tasks.
SWE-agent-LM-32B is a 32-billion-parameter, open-weight, Transformer-based code LLM, primarily fine-tuned and employed as the central reasoning engine in large-scale software engineering agent frameworks. It is among the principal open-source models benchmarked for code-editing, debugging, and automated codebase-modification workflows that engage real-world repositories and agentic toolchains.
1. Architectural Foundation and Scale
SWE-agent-LM-32B is obtained by full-parameter supervised fine-tuning of Qwen2.5-Coder-Instruct-32B, a decoder-only Transformer with approximately 32B parameters. Standard configurations include 40–96 Transformer layers, hidden dimensions in the range 5,120–12,288, and 64–96 attention heads, supporting context lengths from 20 K to 128 K tokens, with RMSNorm and rotary/relative positional encodings, depending on the exact fork (Yang et al., 30 Apr 2025, Wang et al., 9 Jun 2025, Zeng et al., 24 Jun 2025, Lauffer et al., 16 Dec 2025, Sun et al., 3 Feb 2026). No adapters or parameter-efficient modules are typically introduced; the model is a dense, fully fine-tuned variant of a strong, instruction-tuned code-LLM backbone.
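As a rough sanity check on the stated scale, a dense decoder-only Transformer's parameter count can be approximated from its layer count, hidden size, FFN width, and vocabulary. The specific values below (GQA with 8 KV heads, SwiGLU FFN of width 27,648, vocabulary ≈152 K) are assumptions in the spirit of a public Qwen2.5-32B-class configuration, not figures from this article:

```python
def approx_params(d_model, n_layers, d_ff, vocab,
                  n_heads=40, n_kv_heads=8, head_dim=128, tied_embeddings=False):
    """Rough dense-Transformer parameter count (ignores norms and biases)."""
    # Attention: Q and O projections are full-width; K and V shrink under GQA.
    attn = 2 * d_model * (n_heads * head_dim) + 2 * d_model * (n_kv_heads * head_dim)
    # SwiGLU FFN: gate, up, and down projections.
    ffn = 3 * d_model * d_ff
    # Input embedding plus (untied) output head.
    emb = vocab * d_model * (1 if tied_embeddings else 2)
    return n_layers * (attn + ffn) + emb

# Hypothetical 32B-class config.
total = approx_params(d_model=5120, n_layers=64, d_ff=27648, vocab=152064)
print(f"{total / 1e9:.1f}B")  # lands in the low 30s, consistent with "32B"
```

The estimate deliberately omits layer norms and biases, which contribute well under 0.1% of the total at this scale.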
2. Training Data Curation and Fine-Tuning Protocols
The core advances in SWE-agent-LM-32B stem from large-scale, validated, multi-turn trajectory datasets and hybrid fine-tuning regimes:
- Data Sources: SWE-smith produces 50k+ validated synthetic bug-fix tasks across 128 Python repositories using four strategies: LM-Modify and LM-Rewrite (LLM-based logic-error injection), procedural AST mutation, and PR-diff inversion (PR-Mirror), supported by procedural Docker builds for reproducible per-task environments (Yang et al., 30 Apr 2025). Additional frameworks (Skywork-SWE, SWE-Dev) utilize up to ~10k real-world GitHub issues/PRs, systematically curated and validated by automated environment setup and unit-test classification (Wang et al., 9 Jun 2025, Zeng et al., 24 Jun 2025).
- Expert Trajectories: High-quality trajectories are generated with closed- or open-weight teacher LLMs (Claude, GLM-4.6, MiniMax-M2.1), filtered strictly using unit-test outcome labels. For instance, SWE-smith reports 5,016 successful gold trajectories out of 20k attempted for SFT, while SWE-Master curates ≥60k filtered, multi-turn interaction histories (Yang et al., 30 Apr 2025, Song et al., 3 Feb 2026).
- Imitation & On-policy Fine-Tuning: Traditional behavioral cloning (BC) uses full expert rollouts; On-policy Expert Corrections (OECs), adapted from DAgger, switch from student to expert mid-trajectory to mitigate covariate shift, with the loss masked to expert-generated tokens. Supervised fine-tuning data are always rejection-sampled on the final patch's test-pass status (Lauffer et al., 16 Dec 2025).
- Hyperparameters: Most SFT pipelines adopt AdamW (lr 5×10⁻⁵), 2–5 epochs, context windows of 32 K–128 K, batch size adjusted for cluster size (16–256), with distributed or mixed-precision training (Yang et al., 30 Apr 2025, Wang et al., 9 Jun 2025, Zeng et al., 24 Jun 2025, Song et al., 3 Feb 2026).
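The procedural AST-mutation strategy above can be illustrated with Python's `ast` module; the operator-flip rule here is an illustrative choice, not SWE-smith's exact mutation set:

```python
import ast

class FlipComparisons(ast.NodeTransformer):
    """Inject a subtle logic bug by inverting comparison operators."""
    SWAP = {ast.Lt: ast.Gt, ast.Gt: ast.Lt, ast.LtE: ast.GtE,
            ast.GtE: ast.LtE, ast.Eq: ast.NotEq, ast.NotEq: ast.Eq}

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [self.SWAP[type(op)]() if type(op) in self.SWAP else op
                    for op in node.ops]
        return node

source = "def is_adult(age):\n    return age >= 18\n"
buggy = ast.unparse(FlipComparisons().visit(ast.parse(source)))
print(buggy)  # the >= has become <=
```

In a full pipeline, each mutated file would then be validated by running the repository's unit tests to confirm the injected bug actually causes failures.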
Ablations indicate: (a) data-scaling benefits are approximately log-linear up to ~60 K samples, with gains plateauing near 48 K (Song et al., 3 Feb 2026, Zeng et al., 24 Jun 2025), (b) repository diversity and realistic issue synthesis yield log-scale improvements, (c) trajectory filtering and rigorous rejection sampling are essential, and (d) OEC+BC training outperforms either alone (~13% relative improvement over pure BC on SWE-bench Verified) (Lauffer et al., 16 Dec 2025).
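The OEC loss-masking step can be sketched as follows; the `(token_id, source)` representation is a hypothetical simplification, since actual pipelines operate on tokenized chat templates with role annotations:

```python
def oec_loss_mask(trajectory):
    """Return a 0/1 loss mask keeping only expert-generated tokens.

    `trajectory` is a list of (token_id, source) pairs, where source is
    'student', 'expert', or 'env' -- a stand-in for role-tagged chat turns.
    """
    return [1 if source == "expert" else 0 for _, source in trajectory]

# Student acts first, the expert takes over mid-trajectory (DAgger-style
# switch); environment observations are never supervised.
traj = [(101, "student"), (102, "env"), (103, "expert"), (104, "expert")]
print(oec_loss_mask(traj))  # [0, 0, 1, 1]
```

Applied elementwise to the per-token cross-entropy, this mask ensures gradients flow only through the expert-corrected suffix of each rollout.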
3. Agent–Environment Interaction and Execution Feedback
SWE-agent-LM-32B operates within agentic scaffolds such as SWE-Agent, R2E-Gym, and OpenHands. Standard workflows expose repository-navigation, code-edit, and shell-tool commands (open, edit, find_file, search_dir, pytest, submit, etc.) via a structured agent-computer interface (ACI), with deterministic command dispatching, linter enforcement, and prompt truncation to constrain context (Yang et al., 2024, Yang et al., 30 Apr 2025, Sun et al., 3 Feb 2026). Key interaction modalities include:
- Physical Execution Environments: Real containerized builds per repo (Docker images, environment scripts), with complete dependency installation and precise test feedback (Yang et al., 30 Apr 2025, Wang et al., 9 Jun 2025, Zeng et al., 24 Jun 2025, Song et al., 3 Feb 2026).
- Learned Surrogate Feedback: SWE-World introduces LLM-based surrogate transition (SWT) and reward (SWR) models—fine-tuned on real execution traces—that simulate intermediate and terminal feedback (stdout, stderr, exit codes, pass/fail), allowing agents to train and perform inference without executing code/tests in Docker, greatly accelerating experimentation (Sun et al., 3 Feb 2026).
- Agent-Environment Loop: At each episode step, the agent samples an action from the current state (full workspace and past thoughts), receives (simulated or real) feedback, and accumulates a trajectory until submit or budget exhaustion. Thought, action, and feedback histories are tokenized for the LLM (Wang et al., 9 Jun 2025, Yang et al., 2024).
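The episode loop just described reduces to a simple control flow; the `policy` and `env` interfaces below are schematic stand-ins for a real ACI-backed scaffold, not an actual framework API:

```python
def run_episode(policy, env, max_steps=50):
    """Roll out one agent episode until `submit` or budget exhaustion."""
    history = []                      # accumulated (thought, action, feedback)
    state = env.reset()               # initial workspace observation
    for _ in range(max_steps):
        thought, action = policy(state, history)  # sample from the LLM
        feedback = env.step(action)   # real Docker execution or a surrogate
        history.append((thought, action, feedback))
        if action == "submit":
            break
        state = feedback              # feedback becomes the next observation
    return history
```

Swapping `env` between a containerized executor and a learned surrogate (as in SWE-World) leaves this loop unchanged, which is what makes surrogate-based training a drop-in acceleration.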
4. Evaluation Frameworks and Test-Time Scaling
Evaluation is standardized on SWE-bench Verified (500 GitHub issues with developer-written tests) and SWE-Compass (2,000 instances across 8 tasks, 8 scenarios, 10 languages) (Xu et al., 7 Nov 2025, Zeng et al., 24 Jun 2025, Yang et al., 30 Apr 2025).
- Metrics: Primary is Pass@1 (fraction of problems resolved in one agent trace). For test-time scaling (TTS), k-candidate rollouts are executed and ranked; pass@k is computed with the standard unbiased estimator, pass@k = E[1 − C(n−c, k)/C(n, k)], where n candidates are sampled per problem and c of them pass (Yang et al., 30 Apr 2025, Jain et al., 9 Apr 2025).
- Test-Time Scaling (TTS): Critical for maximizing end-task performance. Approaches include:
- Repeat sampling (Best-of-N): select the candidate with highest critic or verifier score among N rollouts (Zeng et al., 24 Jun 2025, Sun et al., 3 Feb 2026).
- Hybrid verifiers (execution-based + execution-free): R2E-Gym reports additive gains (execution-based or execution-free verifiers alone each saturate near 43%, but the hybrid reaches 51% Best@26) (Jain et al., 9 Apr 2025).
- Surrogate verifiers: SWE-World and SWE-Master use LLM reward models for parallel/variance-reduced selection (e.g., N=8, M=3 sampling yields 68.2–70.8% resolution) (Sun et al., 3 Feb 2026, Song et al., 3 Feb 2026).
- Agentic Rubrics: structured, repo-grounded checklists graded by an LLM judge, outperforming standard patch classifiers for Best@16 selection (Qwen3-32B: 40.6%) (Raghavendra et al., 7 Jan 2026).
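The pass@k metric underlying these comparisons has a standard unbiased estimator: given n sampled rollouts for a problem, of which c pass, pass@k = 1 − C(n−c, k)/C(n, k), averaged over problems. A minimal implementation:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c successes."""
    if n - c < k:      # too few failures to fill all k draws -> certain pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(8, 4, 1))  # 0.5: half the rollouts resolve the issue
```

The early return is essential: when fewer than k of the n samples fail, every size-k draw must contain a success, and the binomial ratio would otherwise be computed from an empty failure pool.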
5. Quantitative Performance and Scaling Laws
| Model / Training Regime | Benchmark | Pass@1 | Best@N (if reported) | Reference |
|---|---|---|---|---|
| SWE-agent-LM-32B (SWE-smith) | SWE-bench Verified | 40.2% | — | (Yang et al., 30 Apr 2025) |
| Skywork-SWE-32B | SWE-bench Verified | 38.0% | 47.0% (TTS N=8) | (Zeng et al., 24 Jun 2025) |
| SWE-Dev-32B | SWE-bench Verified | 36.6% | — | (Wang et al., 9 Jun 2025) |
| R2E-Gym-32B (hybrid TTS) | SWE-bench Verified | 34.4% | 51.0% (Best@26) | (Jain et al., 9 Apr 2025) |
| SWE-agent-LM-32B (OEC+BC) | SWE-bench Verified | 40.0% | — | (Lauffer et al., 16 Dec 2025) |
| SA-SWE-32B | SWE-bench Verified | 39.4% | ~57% (Pass@5) | (Cao et al., 20 Nov 2025) |
| SWE-World-32B (+SFT+RL) | SWE-bench Verified | 55.0% | 68.2% (TTS@8) | (Sun et al., 3 Feb 2026) |
| SWE-Master-32B (+RL+TTS) | SWE-bench Verified | 61.4% | 70.8% (TTS@8) | (Song et al., 3 Feb 2026) |
Scaling analyses show (i) pass@1 improves nearly log-linearly with trajectory count, (ii) larger model capacity delays task saturation and enables deeper multi-step planning, (iii) hybrid and parallel inference pipelines can yield +8–12 percentage points over purely execution-based or classification-based selection (Zeng et al., 24 Jun 2025, Song et al., 3 Feb 2026, Jain et al., 9 Apr 2025).
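A log-linear trend of the kind in (i) can be checked by regressing pass@1 on the logarithm of trajectory count; the data points below are synthetic placeholders for illustration, not numbers from the cited papers:

```python
import math

# Hypothetical (trajectory_count, pass@1) points following a log-linear trend.
data = [(1_000, 0.20), (5_000, 0.27), (20_000, 0.33), (60_000, 0.38)]

# Closed-form least-squares fit of pass@1 = a * ln(N) + b.
xs = [math.log(n) for n, _ in data]
ys = [p for _, p in data]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx
print(f"pass@1 ~ {a:.3f} * ln(N) + {b:.3f}")
```

A positive fitted slope with small residuals is the empirical signature of the log-linear scaling regime; a flattening tail (as reported near 48 K samples) would show up as systematic negative residuals at the largest N.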
6. Generalization, Practicalities, and Limitations
SWE-agent-LM-32B demonstrates modest generalization to agentic tasks beyond Python code-editing, with noticeable improvements on Terminal-Bench, BrowseComp-Plus, and WebArena over baselines without agentic training (Cao et al., 20 Nov 2025).
- Language coverage: SWE-Compass evaluations reveal considerable variance by language (e.g., Java: 30.5%, Python: 11.1%, C: 4.7%), with markedly stronger performance on JVM/CLR languages and on GUI- or deployment-focused tasks (Xu et al., 7 Nov 2025).
- Agentic weaknesses: Relatively low performance on bug-fixing, infrastructure, and systems-language scenarios; the model also trails larger proprietary models (e.g., Claude Sonnet-4) by >15 percentage points (Xu et al., 7 Nov 2025).
- Failure modes: Incomplete bug localization; repetitive or stuck action loops (up to 25% on synthetic data vs. <4% for top proprietary LLMs); high dependence on context window management for deep multi-file tasks (Yang et al., 30 Apr 2025, Lauffer et al., 16 Dec 2025, Song et al., 3 Feb 2026).
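A scaffold can guard against the repetitive-action failure mode with a simple sliding-window check; the window size and repeat threshold here are illustrative choices, not values from any cited framework:

```python
from collections import Counter

def is_stuck(actions, window=6, max_repeats=3):
    """Flag a trajectory whose recent actions repeat suspiciously often."""
    recent = actions[-window:]
    top = Counter(recent).most_common(1)
    return bool(top) and top[0][1] >= max_repeats

healthy = ["open a.py", "edit a.py", "pytest", "edit a.py", "pytest", "submit"]
stuck = ["edit a.py", "pytest", "edit a.py", "pytest", "edit a.py", "pytest"]
print(is_stuck(healthy), is_stuck(stuck))  # False True
```

On detection, a scaffold would typically truncate the episode, inject a corrective prompt, or fall back to resampling, rather than letting the loop consume the remaining step budget.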
7. Infrastructure, Reproducibility, and Impact
All major datasets, codebases, and models are released under open licenses (Apache 2.0 or similar) for SWE-smith, SWE-World, Skywork-SWE, R2E-Gym, and SWE-Dev (Yang et al., 30 Apr 2025, Zeng et al., 24 Jun 2025, Wang et al., 9 Jun 2025, Sun et al., 3 Feb 2026, Song et al., 3 Feb 2026). Infrastructure best practices include modular Dockerized environments, decoupled runner-server pipelines for RL, rigorous curation scripts, and public evaluation harnesses, enabling reproducibility and cross-benchmarking.
SWE-agent-LM-32B establishes an extensible, high-performance baseline for agentic software engineering research, directly enabling robust studies in data scaling laws, RL for long-horizon code-editing, and hybrid execution–verifier frameworks. Its open-weight nature and documented training pipelines significantly lower the entry barrier for academic and community research in large-scale LLM-powered code agents.