
OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution

Published 7 May 2025 in cs.SE | (2505.04606v1)

Abstract: The GitHub issue resolution task aims to resolve issues reported in repositories automatically. With advances in LLMs, this task has gained increasing attention, and several benchmarks have been proposed to evaluate the issue resolution ability of LLMs. However, existing benchmarks have three main limitations. First, current benchmarks focus on a single programming language, limiting the evaluation of issues from repositories across different languages. Second, they usually cover a narrow range of domains, which may fail to represent the diversity of real-world issues. Third, existing benchmarks rely solely on textual information in issue descriptions, overlooking multimodal information such as images in issues. In this paper, we propose OmniGIRL, a GitHub Issue ResoLution benchmark that is multilingual, multimodal, and multi-domain. OmniGIRL includes 959 task instances, which are collected from repositories across four programming languages (i.e., Python, JavaScript, TypeScript, and Java) and eight different domains. Our evaluation shows that current LLMs show limited performance on OmniGIRL. Notably, the best-performing model, GPT-4o, resolves only 8.6% of the issues. Besides, we find that current LLMs struggle to resolve issues requiring understanding images. The best performance is achieved by Claude-3.5-Sonnet, which resolves only 10.5% of the issues with image information. Finally, we analyze the reasons behind current LLMs' failure on OmniGIRL, providing insights for future improvements.

Summary

OmniGIRL: A Comprehensive Benchmark for GitHub Issue Resolution

The paper titled "OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution" introduces a benchmark, OmniGIRL, designed to evaluate the capabilities of Large Language Models (LLMs) in resolving GitHub issues across heterogeneous environments. It aims to overcome three limitations of existing benchmarks: a focus on a single programming language, narrow domain coverage, and reliance solely on textual information.

Benchmark Design

OmniGIRL presents a notable improvement over existing frameworks by introducing a multilingual, multimodal, and multi-domain approach to GitHub issue resolution. It contains 959 task instances derived from repositories spanning four programming languages—Python, JavaScript, TypeScript, and Java—and eight different domains. Each issue within this dataset can include diverse inputs, such as text, images, and website links. This diversity aims to reflect real-world tasks more accurately than previous benchmarks, offering a broader evaluation base for LLMs.
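To make the shape of such a benchmark concrete, a task instance can be thought of as a record bundling the issue's multimodal inputs with the metadata needed for evaluation. The sketch below is a hypothetical schema; the field names are illustrative assumptions, not OmniGIRL's actual format.

```python
from dataclasses import dataclass, field

@dataclass
class TaskInstance:
    """One hypothetical OmniGIRL-style task instance.

    Field names are illustrative; the benchmark's real schema may differ.
    """
    instance_id: str          # e.g. "<org>__<repo>-<issue-number>"
    repo: str                 # GitHub repository the issue comes from
    language: str             # one of: Python, JavaScript, TypeScript, Java
    domain: str               # one of the eight covered domains
    issue_text: str           # textual issue description
    image_urls: list[str] = field(default_factory=list)  # multimodal inputs
    link_urls: list[str] = field(default_factory=list)   # referenced websites
    base_commit: str = ""     # repo state at which the issue was reported
    gold_patch: str = ""      # reference fix used to derive the test oracle

    @property
    def is_multimodal(self) -> bool:
        """True when resolving the issue requires looking at images."""
        return bool(self.image_urls)
```

Separating the inputs (`issue_text`, `image_urls`, `link_urls`) from the evaluation metadata (`base_commit`, `gold_patch`) mirrors how such benchmarks distinguish what the model sees from what the harness uses to judge the result.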

Evaluation and Results

In the evaluation, state-of-the-art LLMs were tested on OmniGIRL to assess their issue-resolution capabilities. The analysis reveals that current models struggle with the dataset's complexity, achieving low resolution rates across the board. GPT-4o, assessed with the Agentless-X approach, was the strongest performer yet resolved only 8.6% of the issues. These difficulties are exacerbated when issues involve image data: among models with visual capabilities, Claude-3.5-Sonnet achieved the best result, resolving only 10.5% of the issues that contain image information.
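The headline numbers above are resolution rates: the percentage of issues for which a model's patch makes the repository's tests pass. A minimal sketch of that aggregation (the counts in the usage note are illustrative, not taken from the paper):

```python
def resolution_rate(resolved: int, total: int) -> float:
    """Resolution rate as a percentage, rounded to one decimal place.

    An issue counts as resolved when the model's patch applies cleanly
    and the repository's tests pass afterwards.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * resolved / total, 1)
```

For example, resolving a hypothetical 82 of the 959 instances would be reported as 8.6%, which shows how narrow the margins between models are at this scale: a single additional resolved issue moves the score by about 0.1 percentage points.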

The paper further analyzes why these models fail, identifying key areas for improvement such as LLMs' tendency to misformat output during the localization stage and their limited ability to resolve issues that require modifications across multiple files.
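The localization failure mode suggests that evaluation harnesses benefit from defensive parsing of model output. The sketch below is a hypothetical parser (not OmniGIRL's actual code) that recovers file paths even when the model's formatting drifts from the requested schema:

```python
def parse_localization(output: str, repo_files: set[str]) -> list[str]:
    """Recover candidate file paths from free-form localization output.

    Hypothetical defensive parser: instead of requiring a strict output
    format, it accepts any line that mentions a path known to exist in
    the repository, deduplicating while preserving line order.
    """
    found: list[str] = []
    for line in output.splitlines():
        for path in sorted(repo_files):
            if path in line and path not in found:
                found.append(path)
    return found
```

Matching against the known file list sidesteps misformatting (stray backticks, bullets, trailing commentary) at the cost of missing paths the model abbreviates or misspells, which a strict format-based parser would reject anyway.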

Implications

The implications of this research are substantial for both theoretical and practical facets of AI application in software engineering. From a theoretical perspective, OmniGIRL provides a rigorous testbed for exploring the boundaries of LLMs in complex issue resolution tasks and drives the development of more capable, robust models. Practically, the insights derived from analyzing current model limitations can guide enhancements to tools used by software developers, particularly in environments with multimodal data.

Future Directions

Potential future directions include developing frameworks that leverage visual and website information more effectively, integrating advanced parsing strategies for multilingual contexts, and expanding the dataset with issues from additional languages and domains. The benchmark could serve as a foundational tool for developing these capabilities, pushing the frontier of intelligent code debugging and development-support systems.

In conclusion, the paper on OmniGIRL presents a valuable contribution to the field of AI for software engineering. It lays out a comprehensive benchmark that challenges current LLMs and highlights essential areas for improvement, thus serving as a catalyst for future research and development in the domain.
