Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving

Published 3 Apr 2025 in cs.SE, cs.AI, and cs.CL | (2504.02605v1)

Abstract: The task of issue resolving is to modify a codebase to generate a patch that addresses a given issue. However, existing benchmarks, such as SWE-bench, focus almost exclusively on Python, making them insufficient for evaluating LLMs across diverse software ecosystems. To address this, we introduce a multilingual issue-resolving benchmark, called Multi-SWE-bench, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It includes a total of 1,632 high-quality instances, which were carefully annotated from 2,456 candidates by 68 expert annotators, ensuring that the benchmark can provide an accurate and reliable evaluation. Based on Multi-SWE-bench, we evaluate a series of state-of-the-art models using three representative methods (Agentless, SWE-agent, and OpenHands) and present a comprehensive analysis with key empirical insights. In addition, we launch a Multi-SWE-RL open-source community, aimed at building large-scale reinforcement learning (RL) training datasets for issue-resolving tasks. As an initial contribution, we release a set of 4,723 well-structured instances spanning seven programming languages, laying a solid foundation for RL research in this domain. More importantly, we open-source our entire data production pipeline, along with detailed tutorials, encouraging the open-source community to continuously contribute and expand the dataset. We envision our Multi-SWE-bench and the ever-growing Multi-SWE-RL community as catalysts for advancing RL toward its full potential, bringing us one step closer to the dawn of AGI.


Summary

The paper introduces a multilingual benchmark that evaluates LLMs on issue resolving using 1,632 expertly curated instances.
It details a systematic pipeline from repository selection to manual verification, comparing methods like Agentless, SWE-agent, and OpenHands.
The study highlights key factors such as issue type, description length, and patch characteristics that influence LLM performance.


Introduction

The paper "Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving" (2504.02605) addresses the challenge of evaluating LLMs on software engineering tasks, particularly issue resolving, across multiple programming languages. Existing benchmarks such as SWE-bench focus primarily on Python, which limits how well they assess LLMs across diverse software ecosystems. Multi-SWE-bench extends coverage to seven more languages: Java, TypeScript, JavaScript, Go, Rust, C, and C++.

To build Multi-SWE-bench, the authors curated 1,632 high-quality instances from 2,456 candidates, annotated by 68 expert annotators. The benchmark is used to evaluate state-of-the-art LLMs with three representative methods (Agentless, SWE-agent, and OpenHands), with a focus on how well they generalize across languages. The paper also introduces the Multi-SWE-RL community, which aims to build large-scale reinforcement learning datasets for issue-resolving tasks.

Benchmark Construction

The construction of Multi-SWE-bench involves a systematic pipeline comprising five phases, from repository selection to manual verification.

Repository Selection: High-quality GitHub repositories were selected using criteria such as popularity, active maintenance, CI/CD support, and build viability.
Pull Request Crawling: Pull requests were kept only if they link to an issue, modify test files, and were merged into the main branch.
Environment Determination: Docker-based runtime environments were built so that tasks execute faithfully, which required dependency management and environment validation.
Pull Request Filtering: Candidates were validated semantically to confirm that the fix resolves the issue without introducing regressions.
Manual Verification: Dual annotation and cross-review were used to confirm that each instance is reliable for evaluating LLMs.

Figure 1: Construction of Multi-SWE-bench.
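
As a rough illustration of the Pull Request Filtering step, the sketch below compares per-test outcomes before and after a fix patch is applied. It keeps a candidate only if at least one test starts passing and no previously passing test stops passing. The `Status` enum, the `is_valid_fix` function, and the test names are hypothetical and are not taken from the paper's pipeline; the exact rules the authors apply may differ.

```python
from enum import Enum


class Status(Enum):
    PASS = "pass"
    FAIL = "fail"
    NONE = "none"  # test absent or not run in this configuration


def is_valid_fix(before: dict[str, Status], after: dict[str, Status]) -> bool:
    """Return True if the fix makes at least one test pass and breaks none.

    This is an assumed criterion for illustration, not the paper's exact rule.
    """
    fixed_something = False
    for name in set(before) | set(after):
        old = before.get(name, Status.NONE)
        new = after.get(name, Status.NONE)
        if old is Status.PASS and new is not Status.PASS:
            return False  # regression: a passing test no longer passes
        if old is not Status.PASS and new is Status.PASS:
            fixed_something = True  # evidence that the patch fixes behavior
    return fixed_something


# Hypothetical test outcomes without and with the fix patch applied.
before = {"test_parse": Status.FAIL, "test_format": Status.PASS}
after = {"test_parse": Status.PASS, "test_format": Status.PASS}
print(is_valid_fix(before, after))
```

Rejecting any pass-to-non-pass transition is a deliberately strict choice here; a real pipeline might also need to account for flaky tests.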

Evaluation and Results

The evaluation tested nine LLMs with three methods and analyzed the results by language, by method, and by repository characteristics.

Language-Specific Performance: Resolved rates were higher on Python than on the other languages, whose distinct paradigms and complexities posed additional challenges.
Methodological Comparison: Of the three multilingual adaptations evaluated (MagentLess, MSWE-agent, and MopenHands, derived from Agentless, SWE-agent, and OpenHands), MopenHands generally performed best, highlighting the benefits of flexible workflows.
Repository Characteristics: Higher resolved rates correlated with metrics indicating repository activity and engagement, such as stars and forks.

Figure 2: Resolved rate (%) on Multi-SWE-bench (Claude-3.5-Sonnet).

Figure 3: Relationship between resolved rate and the number of stars and forks of a repository.

Figure 4: Relationship between resolved rate and the number of issues and PRs of a repository.
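
The resolved rate shown in these figures is the percentage of instances that were resolved. A minimal sketch of computing it per language, or per any other grouping, follows. The `resolved_rate_by_group` helper and the sample records are illustrative assumptions, and the outcomes are made up rather than drawn from the paper.

```python
from collections import defaultdict


def resolved_rate_by_group(records):
    """Map each group to its resolved rate in percent.

    `records` is an iterable of (group, resolved) pairs, where `resolved`
    is True when the generated patch resolved the instance.
    """
    totals = defaultdict(int)
    solved = defaultdict(int)
    for group, resolved in records:
        totals[group] += 1
        solved[group] += int(resolved)
    return {group: 100.0 * solved[group] / totals[group] for group in totals}


# Made-up outcomes for illustration only; not results from the paper.
records = [
    ("java", True),
    ("java", False),
    ("rust", False),
    ("rust", False),
]
rates = resolved_rate_by_group(records)
for language, rate in sorted(rates.items()):
    print(f"{language}: {rate:.1f}%")
```

The same helper works for the repository-level analyses if the group key is a bucket of stars, forks, issues, or PRs instead of a language.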

Influencing Factors

The paper identifies several factors influencing the resolving performance:

Issue Type: Bug fixes were resolved more often than new features or optimizations, underscoring LLMs' limitations in handling semantically complex tasks.
Description Length: The effect of issue description length was not uniform. A longer description can help when it adds detail, but it can also indicate a more complex issue.
Fix Patch Characteristics: Longer patches and patches that touch multiple files proved more challenging for existing methods.

Figure 5: Influence of issue description length on resolved rate.

Figure 6: Influence of fix patch length on resolved rate.

Figure 7: Influence of the number of files modified by fix patches on resolved rate.
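
To make the fix patch characteristics concrete, the sketch below derives two simple measures from a git-style unified diff: the number of files modified and the number of added or removed lines. The `patch_stats` function and the sample patch are hypothetical, and how the paper measures patch length is not specified here, so treat the line count as one plausible definition.

```python
def patch_stats(patch: str) -> tuple[int, int]:
    """Return (files_modified, changed_lines) for a git-style unified diff.

    Simplification: any line starting with '+++' or '---' is treated as a
    file header, so a changed content line that itself begins with '++' or
    '--' would be missed.
    """
    files = 0
    changed = 0
    for line in patch.splitlines():
        if line.startswith("diff --git "):
            files += 1  # one header per modified file
        elif line.startswith(("+++", "---")):
            continue  # old/new file path headers, not content changes
        elif line.startswith(("+", "-")):
            changed += 1  # an added or removed content line
    return files, changed


# Hypothetical two-file patch used only to exercise the function.
SAMPLE_PATCH = """diff --git a/src/a.go b/src/a.go
--- a/src/a.go
+++ b/src/a.go
@@ -1,3 +1,3 @@
 package a
-var x = 1
+var x = 2
diff --git a/src/b.go b/src/b.go
--- a/src/b.go
+++ b/src/b.go
@@ -5,2 +5,3 @@
 func f() {}
+func g() {}
"""

files_modified, changed_lines = patch_stats(SAMPLE_PATCH)
print(files_modified, changed_lines)
```

Instances could then be bucketed by either measure and compared on resolved rate, which is the kind of analysis Figures 6 and 7 report.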

Multi-SWE-RL Community

The Multi-SWE-RL community was established to foster collaborative contributions and to expand the dataset into scalable RL environments. The initial release includes 4,723 containerized instances spanning seven programming languages, and the open-sourced data production pipeline and tutorials are meant to encourage the community to improve data quality and breadth.

Conclusion

Multi-SWE-bench and the associated initiatives provide a framework for evaluating LLMs in multilingual issue-resolving scenarios. The study emphasizes extending benchmarks to cover diverse languages and tasks, and it advocates for scalable RL environments. Future work may incorporate broader software engineering challenges, which the authors frame as a step toward AGI.
