Effectiveness of Unit Tests in SWE-Gym Raw

Determine whether the unit tests accompanying instances in the SWE-Gym Raw dataset effectively evaluate the correctness of proposed solutions, that is, whether they provide reliable and sufficient validation signals even though no executable environments are available for these instances.

Background

SWE-Gym Raw comprises 66,894 Python GitHub issue instances extracted from 358 repositories using the SWE-Bench instance extraction script. While these instances include code, issue descriptions, and solutions, they lack executable environments and have not been validated for test effectiveness.

Because executable environments are essential for running tests and obtaining reliable feedback, the authors focus on a subset of 11 repositories where they manually configure dependencies and validate instances via unit tests, yielding the executable SWE-Gym dataset of 2,438 instances. The open question concerns the unvalidated unit tests in SWE-Gym Raw: whether they are themselves effective at assessing solution correctness without the accompanying environment setup.
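One way to make "effective" concrete is the fail-to-pass convention used by SWE-Bench: a test gives a useful signal when it fails before the reference solution is applied and passes afterwards. The sketch below is only an illustration under that assumption, not the authors' validation procedure. The names `ValidationSignal` and `classify_tests` are hypothetical, and the per-test outcomes are assumed to have been collected already by running the tests in some working environment.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationSignal:
    """Hypothetical summary of what an instance's tests can tell us."""

    fail_to_pass: frozenset[str]  # fail before the patch, pass after it
    pass_to_pass: frozenset[str]  # pass both before and after the patch

    @property
    def is_effective(self) -> bool:
        # Assumption: at least one fail-to-pass test is required for the
        # tests to separate a correct solution from the unpatched code.
        return len(self.fail_to_pass) > 0


def classify_tests(before: dict[str, bool], after: dict[str, bool]) -> ValidationSignal:
    """Compare per-test outcomes (test id -> passed?) before and after the patch.

    Tests that do not exist before the patch (for example, tests added
    together with the solution) are treated as failing beforehand.
    """
    fail_to_pass = frozenset(
        test for test, passed in after.items()
        if passed and not before.get(test, False)
    )
    pass_to_pass = frozenset(
        test for test, passed in after.items()
        if passed and before.get(test, False)
    )
    return ValidationSignal(fail_to_pass, pass_to_pass)


if __name__ == "__main__":
    # Made-up outcomes for a single instance, for illustration only.
    before = {"test_parse": False, "test_io": True}
    after = {"test_parse": True, "test_io": True, "test_new_case": True}
    signal = classify_tests(before, after)
    print(sorted(signal.fail_to_pass), signal.is_effective)
```

Under this criterion, an instance whose tests all pass both before and after the patch would carry no validation signal, which is exactly the property that cannot be checked for SWE-Gym Raw without running the tests.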

References

And it's unclear if the unit tests are effective in evaluating the correctness of a solution.

Training Software Engineering Agents and Verifiers with SWE-Gym (2412.21139 - Pan et al., 2024) in Section: Dataset Construction, Extract Training Instances from Repositories