- The paper introduces a test-time scaling framework that reuses trajectory samples to reduce compute costs by up to 17.4% while enhancing resolve rates.
- It employs selective trajectory replay with a hierarchical step selection pipeline, relying on explicit program semantics rather than reward models.
- Empirical results show improved efficiency and multilingual task performance, with ablation studies confirming the critical role of each pipeline component.
SWE-Replay: Efficient Test-Time Scaling for Software Engineering Agents
The predominant methodology for scaling LLM agents in software engineering (SWE) tasks involves intensive test-time sampling, i.e., repeatedly generating trajectories from scratch. While this "naive scaling" strategy can significantly boost downstream solution quality, the computational cost is prohibitive, particularly in agentic frameworks operating over complex software repositories. Alternatives such as SWE-Search and Satori-SWE have integrated value models (e.g., reward- or judge-based), Monte Carlo Tree Search, or self-improvement cycles to reduce cost, but these inherit fundamental weaknesses: reliance on miscalibrated quality estimates, incompatibility with open-ended tool use, and poor adaptability to modern agentic scaffolds that synthesize custom bash scripts and operate over arbitrary action spaces.
SWE-Replay Framework
SWE-Replay addresses these limitations by proposing a test-time scaling paradigm that reuses prior sampling effort via selective trajectory replay and stochastic branch generation, eliminating the reliance on any explicit value/reward model.
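The replay loop described above can be sketched as follows. Everything here is an illustrative assumption, not the paper's implementation: the `Trajectory` schema, the toy cost model (fresh rollouts cost 10 units, replayed branches 3), and the uniform-random branch-point choice in `select_replay_step` (the actual system selects branch points with program-analysis signals).

```python
import random
from dataclasses import dataclass

@dataclass
class Trajectory:
    steps: list      # agent actions taken
    cost: int        # compute units spent generating this trajectory
    resolved: bool   # did the final patch resolve the issue?

def toy_generate(task, prefix=None, rng=random):
    """Illustrative generator: a fresh rollout costs 10 units, while a
    replayed branch reuses the prefix and only pays for the new suffix."""
    prefix = prefix or []
    new_steps = [f"step{len(prefix) + i}" for i in range(3)]
    return Trajectory(steps=prefix + new_steps,
                      cost=3 if prefix else 10,
                      resolved=rng.random() < 0.5)

def select_replay_step(pool, rng=random):
    """Stochastic branch point: cut a prior trajectory at a random step.
    (A placeholder for the hierarchical step selection pipeline.)"""
    if not pool:
        return None
    traj = rng.choice(pool)
    return traj.steps[:rng.randrange(1, len(traj.steps))]

def replay_scaling(task, budget, rng=random):
    pool, solutions = [], []
    while budget > 0:
        prefix = select_replay_step(pool, rng)   # reuse prior sampling effort
        traj = toy_generate(task, prefix=prefix, rng=rng)
        pool.append(traj)
        budget -= traj.cost                      # replayed branches are cheaper
        if traj.resolved:
            solutions.append(traj)
    return pool, solutions
```

Under this toy cost model every replayed branch is cheaper than a fresh rollout, which is where the compute savings come from; the quality of the branch-point choice then determines whether the resolve rate improves as well.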
The key components of SWE-Replay are selective trajectory replay (branching from promising intermediate steps of prior trajectories rather than resampling from scratch) and stochastic branch generation, coordinated by a hierarchical step selection pipeline.
Critically, all step selection is driven by explicit program analysis (e.g., test-based patch filtering, repository file coverage, and reasoning segmentation), not LLM-predicted quality ratings, grounding SWE-Replay in observable program semantics and behavioral diversity rather than subjective model judgments.
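A minimal sketch of such a program-analysis-driven selection pipeline, assuming a hypothetical step schema (`passes_tests`, `files`, `reasoning` fields); the paper's actual heuristics and their ordering may differ:

```python
from collections import Counter

def hierarchical_select(steps):
    """Rank candidate branch-point steps in three stages (illustrative)."""
    # Stage 1: test-based patch filtering -- drop steps whose resulting
    # patch already passes the tests, since they need no further branching.
    pool = [s for s in steps if not s["passes_tests"]]

    # Stage 2: repository file coverage -- rank steps by how rarely their
    # touched files appear in the pool, promoting long-tail exploration.
    freq = Counter(f for s in pool for f in s["files"])
    def rarity(s):
        return min((freq[f] for f in s["files"]), default=0)

    # Stage 3: reasoning segmentation -- break ties toward steps with longer
    # reasoning segments, a proxy for decision points worth re-exploring.
    return sorted(pool, key=lambda s: (rarity(s), -len(s["reasoning"])))
```

Note that every signal used here (test outcomes, file frequencies, reasoning length) is directly observable from the trajectory, with no learned value or reward model in the loop.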
Empirical Results
SWE-Replay achieves significant gains in both efficiency and effectiveness, as evidenced by extensive benchmarks:
- SWE-Bench Verified: SWE-Replay delivers up to 17.4% reduction in computational cost while elevating the resolve rate by up to 3.8%. This holds robustly across diverse LLM backends (Gemini-2.5-Pro, Gemini-3-Pro, Devstral-Small-2) and both mini-SWE-agent and Live-SWE-agent scaffolds.
- SWE-Bench Pro and Multilingual: The method further demonstrates up to 22.6% improvement in resolve rates on multilingual tasks, with a cost reduction of up to 9.0%.
Ablation studies show that each component of the step selection pipeline is essential; substituting agent- or reward-model-based scoring for the SWE-Replay pipeline both raises cost and degrades performance.
Notably, repository exploration diversity, measured by how often rarely visited ("long-tail") files are accessed, increases substantially under SWE-Replay. Case analyses attribute the resolution of several critical bugs to this broader coverage, which escapes the local optima on which naive scaling fixates.
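One simple way to quantify this long-tail effect (my own illustrative metric, not necessarily the one used in the paper) is the share of file visits that fall outside the most frequently accessed files:

```python
from collections import Counter

def long_tail_share(file_visits, head=0.1):
    """Fraction of visits landing outside the top `head` fraction of
    most-visited files; higher values indicate more diverse exploration."""
    counts = Counter(file_visits).most_common()
    k = max(1, int(len(counts) * head))      # size of the "head" of hot files
    total = sum(c for _, c in counts)
    tail = sum(c for _, c in counts[k:])
    return tail / total
```

A run that repeatedly revisits the same hot files scores near 0, while a run whose visits are spread across many rarely touched files scores closer to 1.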
Theoretical Analysis
A probabilistic analysis formalizes the replay gain: whenever step selection prioritizes high-quality (or rare) regions of the search space over uniform random sampling, SWE-Replay is guaranteed in expectation to outperform naive scaling at strictly lower compute per solution. The effectiveness of the selection mechanism (in particular, the filtering and reasoning-intensity heuristics) is thus both supported empirically and underpinned by this formal lower bound.
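One way to make the shape of this guarantee concrete (the notation and independence assumptions here are mine; the paper's analysis may be more general): let $p_0$, $c_0$ denote the per-attempt success probability and compute cost of a fresh trajectory, and $p_r$, $c_r$ the corresponding quantities for a replayed branch.

```latex
% expected compute per resolved task, i.i.d. attempts with success prob. p, cost c
\mathbb{E}[\text{compute per solution}] = \frac{c}{p}
% replay outperforms naive scaling whenever
\frac{c_r}{p_r} < \frac{c_0}{p_0}
\quad\Longleftrightarrow\quad
\frac{p_r}{p_0} > \frac{c_r}{c_0}
```

Since a replayed branch skips the shared prefix, $c_r < c_0$ by construction; the step selection pipeline then only needs to keep $p_r$ from dropping proportionally more than the cost does.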
Prior approaches for efficient test-time scaling rely on value functions (SWE-Search), reward models (Satori-SWE), or LLM-as-a-Judge for sample weighting. These are demonstrably susceptible to miscalibration and are not broadly compatible with agents generating open-ended action sequences, especially tool- and script-synthesis pipelines in state-of-the-art agentic scaffolds.
SWE-Replay stands out in the literature as the first framework to provide efficient and generalizable test-time scaling tailored directly to the realities of modern agent architectures—eschewing template-based pipelines, value models, and reward function engineering.
Implications and Future Directions
SWE-Replay establishes a robust, theoretically justified procedure for accelerating and improving agentic SWE pipelines, with immediate implications for real-world deployment:
- Computational savings: The architecture reduces total compute demand, a critical factor in industrial and academic application, especially for long-horizon, multi-turn development tasks.
- Enhanced search coverage: By algorithmically inducing diversity in program exploration, the framework supports more reliable resolution of complex and less-documented issues.
- Greater generalizability: The decoupling from value or reward agents ensures applicability to novel scaffolds and unanticipated SWE problem distributions without retraining or prompt engineering.
Future work may investigate more advanced heuristics for assessing reasoning intensity, finer-grained abstractions for grouping code states, or automated adaptation of the exploration-exploitation schedule as a function of agent performance. The replay and exploitation principles introduced here could also extend to agent-based domains beyond SWE, such as robotic planning with LLM-based controllers.
Conclusion
SWE-Replay presents a computationally efficient, empirically validated, and theoretically grounded approach to test-time scaling for LLM-based software engineering agents. By leveraging replay from intermediate steps anchored by repository exploration and reasoning proxies, SWE-Replay consistently outperforms naive sampling on both efficiency and performance, establishing a new baseline for scalable agentic SWE pipelines (2601.22129).