
SWE-agent-LM-32B: Open-Source Code Agent

Updated 6 February 2026
  • SWE-agent-LM-32B is a 32-billion-parameter Transformer model designed for code-editing, debugging, and automated repository modifications.
  • It employs large-scale supervised fine-tuning on synthetic and real-world datasets using a hybrid OEC+BC training regime to enhance multi-turn reasoning and test-time performance.
  • Integrated in agent frameworks, the model leverages Dockerized environments and surrogate feedback models to enable robust, reproducible automation in software engineering tasks.

SWE-agent-LM-32B is a 32-billion-parameter, open-weight, Transformer-based code LLM, fine-tuned and deployed as the central reasoning engine in large-scale software engineering agent frameworks. It is one of the seminal open-source models benchmarked for code-editing, debugging, and automated codebase-modification workflows that involve real-world repositories and agentic toolchains.

1. Architectural Foundation and Scale

SWE-agent-LM-32B is obtained by full-parameter supervised fine-tuning of Qwen2.5-Coder-Instruct-32B, a decoder-only Transformer with approximately 32B parameters. Standard configurations include 40–96 Transformer layers, hidden dimensions in the range 5,120–12,288, and 64–96 attention heads, supporting context lengths from 20 K to 128 K tokens, with RMS LayerNorm and rotary/relative positional encodings, depending on the exact fork (Yang et al., 30 Apr 2025, Wang et al., 9 Jun 2025, Zeng et al., 24 Jun 2025, Lauffer et al., 16 Dec 2025, Sun et al., 3 Feb 2026). No adapters or parameter-efficient modules are typically introduced; the model is a dense, full-finetune of a strong, instruction-tuned code LLM backbone.
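As a rough orientation, the backbone's scale can be sketched as a configuration object. The specific values below are illustrative placeholders within the ranges quoted above, not the published Qwen2.5-Coder-Instruct-32B hyperparameters, and the parameter estimate is a crude dense-block formula that undercounts embeddings and wide-MLP variants:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DecoderOnlyConfig:
    """Illustrative dense decoder-only Transformer configuration.

    Placeholder values within the ranges cited in the text, NOT the
    exact Qwen2.5-Coder-Instruct-32B settings.
    """
    n_layers: int = 64
    hidden_dim: int = 5_120
    n_heads: int = 64
    max_context: int = 32_768      # tokens
    norm: str = "rmsnorm"
    pos_encoding: str = "rotary"

    def approx_params(self) -> int:
        # Crude estimate: ~12 * L * d^2 covers the attention and MLP
        # weight matrices of a standard block; real 32B-class models
        # add embeddings, wider FFNs, and grouped-query attention.
        return 12 * self.n_layers * self.hidden_dim ** 2

cfg = DecoderOnlyConfig()
print(f"~{cfg.approx_params() / 1e9:.1f}B dense block parameters")  # → ~20.1B dense block parameters
```

The gap between this crude estimate and the 32B headline count is exactly the embedding and wide-MLP mass the comment flags.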

2. Training Data Curation and Fine-Tuning Protocols

The core advances in SWE-agent-LM-32B stem from large-scale, validated, multi-turn trajectory datasets and hybrid fine-tuning regimes:

  • Data Sources: SWE-smith produces 50k+ validated synthetic bug-fix tasks across 128 Python repositories using four strategies: two LLM-based logic-error injection methods (LM-Modify, LM-Rewrite), procedural AST mutation, and PR-diff inversion (PR-Mirror), supported by procedural Docker builds for reproducible per-task environments (Yang et al., 30 Apr 2025). Additional frameworks (Skywork-SWE, SWE-Dev) utilize up to ~10k real-world GitHub issues/PRs, systematically curated and validated via automated environment setup and unit-test classification (Wang et al., 9 Jun 2025, Zeng et al., 24 Jun 2025).
  • Expert Trajectories: High-quality trajectories are generated with closed- or open-weight teacher LLMs (Claude, GLM-4.6, MiniMax-M2.1), filtered strictly using unit-test outcome labels. For instance, SWE-smith reports 5,016 successful gold trajectories out of 20k attempted for SFT, while SWE-Master curates ≥60k filtered, multi-turn interaction histories (Yang et al., 30 Apr 2025, Song et al., 3 Feb 2026).
  • Imitation & On-policy Fine-Tuning: Traditional behavioral cloning (BC) uses full expert rollouts; On-policy Expert Corrections (OECs), adapted from DAgger, switch from student to expert mid-trajectory to mitigate covariate shift, with loss masked to expert-generated tokens. Supervised fine-tuning is always rejection-sampled by final patch test pass status (Lauffer et al., 16 Dec 2025).
  • Hyperparameters: Most SFT pipelines adopt AdamW (lr 5×10⁻⁵), 2–5 epochs, context windows of 32 K–128 K, batch size adjusted for cluster size (16–256), with distributed or mixed-precision training (Yang et al., 30 Apr 2025, Wang et al., 9 Jun 2025, Zeng et al., 24 Jun 2025, Song et al., 3 Feb 2026).
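The rejection-sampling step described above can be sketched as a simple filter over candidate trajectories, keeping only those whose final patch passes the task's unit tests (the field names here are hypothetical, not any paper's schema):

```python
def rejection_sample(trajectories):
    """Keep only trajectories whose final patch passed the unit tests.

    Each trajectory is a dict with a 'tests_passed' bool and a 'turns'
    list of (role, text) pairs; these field names are illustrative.
    """
    return [t for t in trajectories if t["tests_passed"]]

candidates = [
    {"id": "task-1", "tests_passed": True,  "turns": [("assistant", "edit ...")]},
    {"id": "task-2", "tests_passed": False, "turns": [("assistant", "edit ...")]},
    {"id": "task-3", "tests_passed": True,  "turns": [("assistant", "submit")]},
]
gold = rejection_sample(candidates)
print([t["id"] for t in gold])  # → ['task-1', 'task-3']
```

This mirrors the reported yields (e.g., 5,016 gold trajectories retained out of 20k attempted): most sampled rollouts are discarded by the test-outcome filter.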

Ablations indicate: (a) data scaling benefits grow near-monotonically on a log–log scale up to the ~60 K samples studied, with gains plateauing around 48 K (Song et al., 3 Feb 2026, Zeng et al., 24 Jun 2025), (b) repository diversity and realistic issue synthesis yield log-scale improvements, (c) trajectory filtering and rigorous rejection sampling are essential, and (d) OEC+BC training outperforms either regime alone (~13% relative improvement over pure BC on SWE-bench Verified) (Lauffer et al., 16 Dec 2025).
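The OEC masking rule can be illustrated at token level: supervision applies only to tokens the expert produced after the switch point, while the student's own prefix is excluded from the loss. A minimal sketch, assuming the conventional -100 ignore index used by common SFT pipelines:

```python
IGNORE_INDEX = -100  # conventional "ignore" label for cross-entropy losses

def oec_labels(token_ids, expert_start):
    """Mask loss to expert-generated tokens only (OEC-style correction).

    token_ids:    full trajectory token ids (student prefix + expert suffix)
    expert_start: index where the expert took over from the student
    """
    return [
        IGNORE_INDEX if i < expert_start else tok
        for i, tok in enumerate(token_ids)
    ]

tokens = [11, 22, 33, 44, 55]
labels = oec_labels(tokens, expert_start=3)
print(labels)  # → [-100, -100, -100, 44, 55]
```

Pure BC is the special case `expert_start=0`: every token in the expert rollout is supervised.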

3. Agent–Environment Interaction and Execution Feedback

SWE-agent-LM-32B operates within agentic scaffolds such as SWE-Agent, R2E-Gym, and OpenHands. Standard workflows expose repository–navigation, code–edit, and shell–tool commands (open, edit, find_file, search_dir, pytest, submit, etc.) via a structured ACI, with deterministic command dispatching, linter enforcement, and prompt truncation to constrain context (Yang et al., 2024, Yang et al., 30 Apr 2025, Sun et al., 3 Feb 2026). Key interaction modalities include:

  • Physical Execution Environments: Real containerized builds per repo (Docker images, environment scripts), with complete dependency installation and precise test feedback (Yang et al., 30 Apr 2025, Wang et al., 9 Jun 2025, Zeng et al., 24 Jun 2025, Song et al., 3 Feb 2026).
  • Learned Surrogate Feedback: SWE-World introduces LLM-based surrogate transition (SWT) and reward (SWR) models—fine-tuned on real execution traces—that simulate intermediate and terminal feedback (stdout, stderr, exit codes, pass/fail), allowing agents to train and perform inference without executing code/tests in Docker, greatly accelerating experimentation (Sun et al., 3 Feb 2026).
  • Agent-Environment Loop: At each episode step, the agent samples an action a_t conditioned on the current state (full workspace and past thoughts), receives (simulated or real) feedback, and accumulates a trajectory until submission or budget exhaustion. Thought, action, and feedback histories are tokenized for the LLM (Wang et al., 9 Jun 2025, Yang et al., 2024).
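The ACI-style command dispatch and the episode loop above can be sketched together. The tool handlers and the toy policy below are stubs standing in for the LLM and the Docker (or surrogate) environment, not SWE-Agent's actual implementations:

```python
# Minimal sketch of a deterministic ACI command dispatcher plus the
# agent-environment loop. Handlers and the toy policy are stubs.

def open_file(path: str) -> str:
    return f"[opened {path}]"

def search_dir(term: str) -> str:
    return f"[matches for '{term}']"

COMMANDS = {"open": open_file, "search_dir": search_dir}

def dispatch(line: str) -> str:
    """Deterministically route 'command arg' to its registered handler."""
    cmd, _, arg = line.partition(" ")
    handler = COMMANDS.get(cmd)
    return handler(arg) if handler else f"error: unknown command '{cmd}'"

def run_episode(policy, env_step, max_turns=10):
    """Act, observe, accumulate a trajectory; stop on submit or budget."""
    history = []
    for _ in range(max_turns):          # budget-exhaustion bound
        action = policy(history)
        if action == "submit":
            history.append((action, "[patch submitted]"))
            break
        history.append((action, env_step(action)))
    return history

# Toy policy: search once, then submit.
def toy_policy(history):
    return "submit" if history else "search_dir bug"

traj = run_episode(toy_policy, dispatch)
print([a for a, _ in traj])  # → ['search_dir bug', 'submit']
```

Swapping `dispatch` for a learned surrogate (as in SWE-World's SWT/SWR models) changes only `env_step`; the loop itself is unchanged.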

4. Evaluation Frameworks and Test-Time Scaling

Evaluation is standardized on SWE-bench Verified (500 GitHub issues with developer-written tests) and SWE-Compass (2,000 instances across 8 tasks, 8 scenarios, 10 languages) (Xu et al., 7 Nov 2025, Zeng et al., 24 Jun 2025, Yang et al., 30 Apr 2025).

5. Quantitative Performance and Scaling Laws

| Model / Training Regime | Benchmark | Pass@1 | Best@N (if reported) | Reference |
|---|---|---|---|---|
| SWE-agent-LM-32B (SWE-smith) | SWE-bench Verified | 40.2% | | (Yang et al., 30 Apr 2025) |
| Skywork-SWE-32B | SWE-bench Verified | 38.0% | 47.0% (TTS N=8) | (Zeng et al., 24 Jun 2025) |
| SWE-Dev-32B | SWE-bench Verified | 36.6% | | (Wang et al., 9 Jun 2025) |
| R2E-Gym-32B (hybrid TTS) | SWE-bench Verified | 34.4% | 51.0% (Best@26) | (Jain et al., 9 Apr 2025) |
| SWE-agent-LM-32B (OEC+BC) | SWE-bench Verified | 40.0% | | (Lauffer et al., 16 Dec 2025) |
| SA-SWE-32B | SWE-bench Verified | 39.4% | ~57% (Pass@5) | (Cao et al., 20 Nov 2025) |
| SWE-World-32B (+SFT+RL) | SWE-bench Verified | 55.0% | 68.2% (TTS@8) | (Sun et al., 3 Feb 2026) |
| SWE-Master-32B (+RL+TTS) | SWE-bench Verified | 61.4% | 70.8% (TTS@8) | (Song et al., 3 Feb 2026) |
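For reference, Pass@k figures like those above are typically computed with the standard unbiased estimator over n sampled rollouts of which c succeed (this is the common code-evaluation convention, not necessarily each paper's exact script):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: probability that at least one of k
    rollouts drawn without replacement from n samples, c of which
    passed, is a success. Standard code-generation eval convention.
    """
    if n - c < k:
        return 1.0  # every size-k draw must contain a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 8 rollouts per task, 2 of them passing:
print(round(pass_at_k(8, 2, 1), 3))  # → 0.25
print(round(pass_at_k(8, 2, 5), 3))  # → 0.893
```

Best@N differs from Pass@k in that a verifier or reward model selects one of the N rollouts, so it measures selection quality on top of sampling coverage.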

Scaling analyses show (i) pass@1 improves nearly log-linearly with trajectory count, (ii) larger model capacity delays task saturation and enables deeper multi-step planning, (iii) hybrid and parallel inference pipelines can yield +8–12pp over pure execution or classification (Zeng et al., 24 Jun 2025, Song et al., 3 Feb 2026, Jain et al., 9 Apr 2025).

6. Generalization, Practicalities, and Limitations

SWE-agent-LM-32B demonstrates modest generalization to agentic tasks beyond Python code-editing, with noticeable improvements in Terminal-Bench, BrowseComp-Plus, and WebArena compared to non-trained baselines (Cao et al., 20 Nov 2025).

  • Language coverage: SWE-Compass evaluations reveal considerable variance by language (e.g., Java: 30.5%, Python: 11.1%, C: 4.7%), with the model performing best on JVM/CLR languages and on GUI- or deployment-focused tasks (Xu et al., 7 Nov 2025).
  • Agentic weaknesses: Relatively low performance on bug-fixing, infrastructure, and systems-language scenarios; the model also trails larger proprietary models (e.g., Claude Sonnet-4) by >15 percentage points (Xu et al., 7 Nov 2025).
  • Failure modes: Incomplete bug localization; repetitive or stuck action loops (up to 25% on synthetic data vs. <4% for top proprietary LLMs); high dependence on context window management for deep multi-file tasks (Yang et al., 30 Apr 2025, Lauffer et al., 16 Dec 2025, Song et al., 3 Feb 2026).

7. Infrastructure, Reproducibility, and Impact

All major datasets, codebases, and models are released under open licenses (Apache 2.0 or similar) for SWE-smith, SWE-World, Skywork-SWE, R2E-Gym, and SWE-Dev (Yang et al., 30 Apr 2025, Zeng et al., 24 Jun 2025, Wang et al., 9 Jun 2025, Sun et al., 3 Feb 2026, Song et al., 3 Feb 2026). Infrastructure best practices include modular Dockerized environments, decoupled runner-server pipelines for RL, rigorous curation scripts, and public evaluation harnesses, enabling reproducibility and cross-benchmarking.

SWE-agent-LM-32B establishes an extensible, high-performance baseline for agentic software engineering research, directly enabling robust studies in data scaling laws, RL for long-horizon code-editing, and hybrid execution–verifier frameworks. Its open-weight nature and documented training pipelines significantly lower the entry barrier for academic and community research in large-scale LLM-powered code agents.
