Statistical significance of observed performance differences
Determine whether the performance differences observed between VulGNN and the baseline detectors (including ReVeal, RoBERTa, CodeBERT, GraphCodeBERT, GPT-2 Base, CodeGPT, PolyCoder, T5 Base, CodeT5 Small/Base, and NatGen) are statistically significant rather than due to chance. Doing so requires applying appropriate significance tests to the precision, recall, F1-score, accuracy, and false positive rate results under both DiverseVul evaluation setups: “Train on Prev+Diverse, Test on Unseen Projects” and “Train and Test on Prev+Diverse with no overlaps”.
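As an illustration of what such an analysis could look like, the sketch below compares two detectors evaluated on the same test set using two commonly used paired procedures: an exact McNemar test on the samples where the detectors disagree, and a paired bootstrap confidence interval for the difference in F1-score. The choice of these two tests is an assumption, not something the paper prescribes. The function names (mcnemar_exact, f1_score, paired_bootstrap_f1) are hypothetical, and the labels and predictions are toy values, not DiverseVul results.

```python
import math
import random


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on the discordant pairs of two classifiers."""
    # b: samples model A gets right and model B gets wrong; c: the reverse.
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree on correctness
    k = min(b, c)
    # Under H0 each discordant pair favours either model with probability 0.5.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


def f1_score(y_true, y_pred):
    """F1 for the positive (vulnerable = 1) class; 0.0 if undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def paired_bootstrap_f1(y_true, pred_a, pred_b, n_resamples=2000, seed=0):
    """Percentile 95% CI for F1(A) - F1(B), resampling test items jointly."""
    rng = random.Random(seed)
    n = len(y_true)
    observed = f1_score(y_true, pred_a) - f1_score(y_true, pred_b)
    diffs = []
    for _ in range(n_resamples):
        # Draw the same indices for both models so the comparison stays paired.
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        diffs.append(
            f1_score(yt, [pred_a[i] for i in idx])
            - f1_score(yt, [pred_b[i] for i in idx])
        )
    diffs.sort()
    lo = diffs[int(0.025 * n_resamples)]
    hi = diffs[int(0.975 * n_resamples) - 1]
    return observed, lo, hi


# Toy data (made up for illustration): 1 = vulnerable, 0 = not vulnerable.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
pred_a = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]  # hypothetical model A: all correct
pred_b = [0, 0, 0, 1, 1, 1, 1, 0, 1, 0]  # hypothetical model B: six errors

b, c, p_value = mcnemar_exact(y_true, pred_a, pred_b)
observed, ci_lo, ci_hi = paired_bootstrap_f1(y_true, pred_a, pred_b)
print(f"discordant pairs b={b}, c={c}, exact McNemar p={p_value}")
print(f"F1 difference={observed:.3f}, 95% bootstrap CI=[{ci_lo:.3f}, {ci_hi:.3f}]")
```

Both procedures need per-sample predictions from each model on the same test items, not only the aggregate scores. Because VulGNN would be compared against many baselines across several metrics and two setups, the resulting p-values would normally also need a multiple-comparison correction such as Holm's method.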
References
Finally, since we did not perform statistical tests, we cannot formally assess whether observed differences between models are significant or due to chance.
— Software Vulnerability Detection Using a Lightweight Graph Neural Network
(2603.29216 - Farmer et al., 31 Mar 2026) in Section 5.2.4, Conclusion Threats to Validity (Discussion)