Papers
Topics
Authors
Recent
Search
2000 character limit reached

An extensive empirical study of inconsistent labels in multi-version-project defect data sets

Published 27 Jan 2021 in cs.SE | (2101.11749v1)

Abstract: The label quality of defect data sets has a direct influence on the reliability of defect prediction models. In this study, for multi-version-project defect data sets, we propose an approach to automatically detecting instances with inconsistent labels (i.e. the phenomena of instances having the same source code but different labels over multiple versions of a software project) and understand their influence on the evaluation and interpretation of defect prediction models. Based on five multi-version-project defect data sets (either widely used or the most up-to-date in the literature) collected by diverse approaches, we find that: (1) most versions in the investigated defect data sets contain inconsistent labels with varying degrees; (2) the existence of inconsistent labels in a training data set may considerably change the prediction performance of a defect prediction model as well as can lead to the identification of substantially different true defective modules; and (3) the importance ranking of independent variables in a defect prediction model can be substantially shifted due to the existence of inconsistent labels. The above findings reveal that inconsistent labels in defect data sets can profoundly change the prediction ability and interpretation of a defect prediction model. Therefore, we strongly suggest that practitioners should detect and exclude inconsistent labels in defect data sets to avoid their potential negative influence on defect prediction models. What is more, it is necessary for researchers to improve existing defect label collection approaches to reduce inconsistent labels. Furthermore, there is a need to re-examine the experimental conclusions of previous studies using multi-version-project defect data sets with a high ratio of inconsistent labels.

Citations (2)

Summary

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Collections

Sign up for free to add this paper to one or more collections.