The curious case of the test set AUROC
Abstract: Whilst the size and complexity of machine-learning (ML) models have increased rapidly and substantially over the past decade, the methods for assessing their performance have not kept pace. In particular, among the many available performance metrics, the ML community persists in using (a) the area under the receiver operating characteristic curve (AUROC) on a validation and test cohort (distinct from the training data), or (b) the sensitivity and specificity on the test data at an optimal threshold determined from the validation ROC. We argue, however, that scores derived from the test ROC curve alone give only a narrow insight into how a model performs and how well it generalises.
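The two conventional practices the abstract criticises can be sketched as follows. This is a minimal illustration with made-up scores, assuming an arbitrary binary classifier and scikit-learn's `roc_auc_score`/`roc_curve`; the Youden's J statistic (TPR − FPR) is one common, but not the only, way to pick the "optimal" validation threshold.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical classifier scores for a validation and a test cohort.
y_val = np.array([0, 0, 1, 0, 1, 1, 0, 1])
s_val = np.array([0.1, 0.4, 0.35, 0.2, 0.8, 0.7, 0.3, 0.9])
y_tst = np.array([0, 1, 0, 1, 1, 0, 1, 0])
s_tst = np.array([0.2, 0.6, 0.4, 0.75, 0.55, 0.1, 0.85, 0.3])

# (a) Report the test-set AUROC as a single summary number.
auc_tst = roc_auc_score(y_tst, s_tst)

# (b) Choose a threshold from the *validation* ROC via Youden's J = TPR - FPR,
# then report sensitivity/specificity on the test set at that fixed threshold.
fpr, tpr, thr = roc_curve(y_val, s_val)
t_star = thr[np.argmax(tpr - fpr)]

pred = s_tst >= t_star
tp = np.sum(pred & (y_tst == 1))
fn = np.sum(~pred & (y_tst == 1))
tn = np.sum(~pred & (y_tst == 0))
fp = np.sum(pred & (y_tst == 0))
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
```

Note that both summaries compress the full score distributions into a few numbers: the AUROC ignores where on the curve a deployed threshold would sit, and a single validation-derived threshold can transfer poorly if the test score distribution shifts.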