- The paper introduces a test-time scaling framework that reuses trajectory samples to reduce compute costs by up to 17.4% while enhancing resolve rates.
- It employs selective trajectory replay with a hierarchical step selection pipeline, relying on explicit program semantics rather than reward models.
- Empirical results show improved efficiency and multilingual task performance, with ablation studies confirming the critical role of each pipeline component.
SWE-Replay: Efficient Test-Time Scaling for Software Engineering Agents
The predominant methodology for scaling LLM agents in software engineering (SWE) tasks involves intensive test-time sampling, i.e., repeatedly generating trajectories from scratch. While this "naive scaling" strategy can significantly boost downstream solution quality, the computational cost is prohibitive, particularly in agentic frameworks operating over complex software repositories. Alternatives such as SWE-Search and Satori-SWE have integrated value models (e.g., reward- or judge-based), Monte Carlo Tree Search, or self-improvement cycles to reduce cost, but these inherit fundamental weaknesses: reliance on miscalibrated quality estimates, incompatibility with open-ended tool use, and poor adaptability to modern agentic scaffolds that synthesize custom bash scripts and operate over arbitrary action spaces.
SWE-Replay Framework
SWE-Replay addresses these limitations by proposing a test-time scaling paradigm that reuses prior sampling effort via selective trajectory replay and stochastic branch generation, eliminating the reliance on any explicit value/reward model.
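The replay loop described above can be sketched as follows. Everything here is an illustrative assumption, not the paper's implementation: the `Trajectory` schema, the toy cost model (fresh rollouts cost 10 units, replayed branches 3), and the uniform-random branch-point choice in `select_replay_step` (the actual system selects branch points with program-analysis signals).

```python
import random
from dataclasses import dataclass

@dataclass
class Trajectory:
    steps: list      # agent actions taken
    cost: int        # compute units spent generating this trajectory
    resolved: bool   # did the final patch resolve the issue?

def toy_generate(task, prefix=None, rng=random):
    """Illustrative generator: a fresh rollout costs 10 units, while a
    replayed branch reuses the prefix and only pays for the new suffix."""
    prefix = prefix or []
    new_steps = [f"step{len(prefix) + i}" for i in range(3)]
    return Trajectory(steps=prefix + new_steps,
                      cost=3 if prefix else 10,
                      resolved=rng.random() < 0.5)

def select_replay_step(pool, rng=random):
    """Stochastic branch point: cut a prior trajectory at a random step.
    (A placeholder for the hierarchical step selection pipeline.)"""
    if not pool:
        return None
    traj = rng.choice(pool)
    return traj.steps[:rng.randrange(1, len(traj.steps))]

def replay_scaling(task, budget, rng=random):
    pool, solutions = [], []
    while budget > 0:
        prefix = select_replay_step(pool, rng)   # reuse prior sampling effort
        traj = toy_generate(task, prefix=prefix, rng=rng)
        pool.append(traj)
        budget -= traj.cost                      # replayed branches are cheaper
        if traj.resolved:
            solutions.append(traj)
    return pool, solutions
```

Under this toy cost model every replayed branch is cheaper than a fresh rollout, which is where the compute savings come from; the quality of the branch-point choice then determines whether the resolve rate improves as well.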
The key components of SWE-Replay are selective trajectory replay (branching from promising intermediate steps of prior trajectories rather than resampling from scratch) and stochastic branch generation, coordinated by a hierarchical step selection pipeline.
Critically, all step selection is driven by explicit program analysis (e.g., test-based patch filtering, repository file coverage, and reasoning segmentation), not LLM-predicted quality ratings, grounding SWE-Replay in observable program semantics and behavioral diversity rather than subjective model judgments.
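A minimal sketch of such a program-analysis-driven selection pipeline, assuming a hypothetical step schema (`passes_tests`, `files`, `reasoning` fields); the paper's actual heuristics and their ordering may differ:

```python
from collections import Counter

def hierarchical_select(steps):
    """Rank candidate branch-point steps in three stages (illustrative)."""
    # Stage 1: test-based patch filtering -- drop steps whose resulting
    # patch already passes the tests, since they need no further branching.
    pool = [s for s in steps if not s["passes_tests"]]

    # Stage 2: repository file coverage -- rank steps by how rarely their
    # touched files appear in the pool, promoting long-tail exploration.
    freq = Counter(f for s in pool for f in s["files"])
    def rarity(s):
        return min((freq[f] for f in s["files"]), default=0)

    # Stage 3: reasoning segmentation -- break ties toward steps with longer
    # reasoning segments, a proxy for decision points worth re-exploring.
    return sorted(pool, key=lambda s: (rarity(s), -len(s["reasoning"])))
```

Note that every signal used here (test outcomes, file frequencies, reasoning length) is directly observable from the trajectory, with no learned value or reward model in the loop.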
Empirical Results
SWE-Replay achieves significant gains in both efficiency and effectiveness, as evidenced by extensive benchmarks:
- SWE-Bench Verified: SWE-Replay delivers up to 17.4% reduction in computational cost while elevating the resolve rate by up to 3.8%. This holds robustly across diverse LLM backends (Gemini-2.5-Pro, Gemini-3-Pro, Devstral-Small-2) and both mini-SWE-agent and Live-SWE-agent scaffolds.
- SWE-Bench Pro and Multilingual: The method further demonstrates up to 22.6% improvement in resolve rates on multilingual tasks, with a cost reduction of up to 9.0%.
Ablation studies show that each component of the step selection pipeline is essential; substituting agent- or reward-model-based scoring for the SWE-Replay pipeline both raises cost and degrades performance.
Notably, repository exploration diversity, measured by how often rarely visited ("long-tail") files are accessed, increases substantially under SWE-Replay. Case analyses attribute the resolution of several critical bugs to this broader coverage, which escapes the local optima on which naive scaling fixates.
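One simple way to quantify this long-tail effect (my own illustrative metric, not necessarily the one used in the paper) is the share of file visits that fall outside the most frequently accessed files:

```python
from collections import Counter

def long_tail_share(file_visits, head=0.1):
    """Fraction of visits landing outside the top `head` fraction of
    most-visited files; higher values indicate more diverse exploration."""
    counts = Counter(file_visits).most_common()
    k = max(1, int(len(counts) * head))      # size of the "head" of hot files
    total = sum(c for _, c in counts)
    tail = sum(c for _, c in counts[k:])
    return tail / total
```

A run that repeatedly revisits the same hot files scores near 0, while a run whose visits are spread across many rarely touched files scores closer to 1.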
Theoretical Analysis
A probabilistic analysis formalizes the replay gain: whenever step selection prioritizes high-quality (or rare) regions of the search space over uniform random sampling, SWE-Replay is guaranteed in expectation to outperform naive scaling at strictly lower compute per solution. The effectiveness of the selection mechanism (in particular, the filtering and reasoning-intensity heuristics) is thus both supported empirically and underpinned by this formal lower bound.
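One way to make the shape of this guarantee concrete (the notation and independence assumptions here are mine; the paper's analysis may be more general): let $p_0$, $c_0$ denote the per-attempt success probability and compute cost of a fresh trajectory, and $p_r$, $c_r$ the corresponding quantities for a replayed branch.

```latex
% expected compute per resolved task, i.i.d. attempts with success prob. p, cost c
\mathbb{E}[\text{compute per solution}] = \frac{c}{p}
% replay outperforms naive scaling whenever
\frac{c_r}{p_r} < \frac{c_0}{p_0}
\quad\Longleftrightarrow\quad
\frac{p_r}{p_0} > \frac{c_r}{c_0}
```

Since a replayed branch skips the shared prefix, $c_r < c_0$ by construction; the step selection pipeline then only needs to keep $p_r$ from dropping proportionally more than the cost does.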
Prior approaches for efficient test-time scaling rely on value functions (SWE-Search), reward models (Satori-SWE), or LLM-as-a-Judge for sample weighting. These are demonstrably susceptible to miscalibration and are not broadly compatible with agents generating open-ended action sequences, especially tool- and script-synthesis pipelines in state-of-the-art agentic scaffolds.
SWE-Replay stands out in the literature as the first framework to provide efficient and generalizable test-time scaling tailored directly to the realities of modern agent architectures—eschewing template-based pipelines, value models, and reward function engineering.
Implications and Future Directions
SWE-Replay establishes a robust, theoretically justified procedure for accelerating and improving agentic SWE pipelines, with immediate implications for real-world deployment:
- Computational savings: The architecture reduces total compute demand, a critical factor in industrial and academic application, especially for long-horizon, multi-turn development tasks.
- Enhanced search coverage: By algorithmically inducing diversity in program exploration, the framework supports more reliable resolution of complex and less-documented issues.
- Greater generalizability: The decoupling from value or reward agents ensures applicability to novel scaffolds and unanticipated SWE problem distributions without retraining or prompt engineering.
Future work may investigate more advanced heuristics for assessing reasoning intensity, finer-grained abstractions for grouping code states, or automated adaptation of the exploration-exploitation schedule as a function of agent performance. The replay and exploitation principles introduced here could also extend to agent-based domains beyond SWE, such as robotic planning with LLM-based controllers.
Conclusion
SWE-Replay presents a computationally efficient, empirically validated, and theoretically grounded approach to test-time scaling for LLM-based software engineering agents. By leveraging replay from intermediate steps anchored by repository exploration and reasoning proxies, SWE-Replay consistently outperforms naive sampling on both efficiency and performance, establishing a new baseline for scalable agentic SWE pipelines (2601.22129).