Necessity and content of images in SWE-bench task instances

For the subset of SWE-bench task instances that include an image, determine what each image depicts and whether it is necessary for solving the corresponding task instance. The goal is to clarify the role of visual information in SWE-bench problem statements.

Background

SWE-bench is a widely used benchmark for autonomous software engineering systems, but it is predominantly text-based and focuses on Python repositories. A small fraction of SWE-bench task instances (5.6%, according to Yang et al., 2024) include images, yet the role and importance of those images have not been characterized.

The SWE-bench Multimodal paper (Yang et al., 2024) notes that, for this image-containing subset of SWE-bench, it is unclear what the images convey and whether they are essential to solving the task. The authors' new multimodal benchmark investigates these questions in JavaScript contexts, but the question remains unaddressed for SWE-bench itself and is therefore an explicit open issue.

References

For the 5.6% of SWE-bench task instances with an image, it is unclear what these images portray and whether they are necessary to solving the task.

Source: SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? (Yang et al., 2024, arXiv:2410.03859), Section 2.1 (Preliminaries), Limitations
