Statistical significance of observed performance differences

Determine whether the performance differences observed between VulGNN and the baseline detectors (including ReVeal, RoBERTa, CodeBERT, GraphCodeBERT, GPT-2 Base, CodeGPT, PolyCoder, T5 Base, CodeT5 Small/Base, and NatGen) are statistically significant rather than due to chance. This requires applying appropriate significance tests to the precision, recall, F1-score, accuracy, and false positive rate results under both DiverseVul evaluation setups: “Train on Prev+Diverse, Test on Unseen Projects” and “Train and Test on Prev+Diverse with no overlaps”.

Background

The study evaluates VulGNN against a range of language-model and GNN baselines on DiverseVul using multiple metrics (precision, recall, F1-score, accuracy, and false positive rate) under both in-distribution and unseen-project configurations. Results are reported for each metric, but no formal statistical test is applied to the differences between models.
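One standard option for comparing the accuracy of two classifiers evaluated on the same test set is McNemar's test, which looks only at the examples on which the two models disagree. The following is a minimal sketch, not the authors' procedure: it assumes per-example predictions from both models are available (the source does not say whether they are), and the function name `mcnemar_exact` and its argument names are hypothetical.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test for two classifiers on the same examples.

    Returns (b, c, p_value), where b counts examples only model A gets right
    and c counts examples only model B gets right.
    """
    b = c = 0
    for y, a, p in zip(y_true, pred_a, pred_b):
        if a == y and p != y:
            b += 1  # A correct, B wrong
        elif a != y and p == y:
            c += 1  # A wrong, B correct
    n = b + c
    if n == 0:
        # The models never disagree, so there is no evidence of a difference.
        return b, c, 1.0
    # Under the null hypothesis each disagreement favours A or B with
    # probability 0.5, so min(b, c) follows a Binomial(n, 0.5) distribution.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data (illustrative only, not results from the paper).
labels = [1] * 10
model_a = [1] * 10            # always correct
model_b = [0] * 8 + [1] * 2   # wrong on the first eight examples
print(mcnemar_exact(labels, model_a, model_b))
```

Because VulGNN would be compared against many baselines, the resulting p-values would normally be adjusted for multiple comparisons (for example with a Holm correction) before drawing conclusions.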

The authors explicitly note the absence of significance testing, leaving unresolved whether the observed performance gaps reflect genuine model differences or sampling variability. Establishing statistical significance would strengthen the evidence supporting comparative claims in real-world vulnerability detection contexts.
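For metrics such as F1-score, precision, recall, or false positive rate, which are not simple per-example averages, a paired bootstrap over test examples is one common way to estimate how much of an observed gap could be sampling variability. The sketch below is illustrative only and is not taken from the paper: it assumes binary labels (1 = vulnerable), per-example predictions from both models on the same test set, and hypothetical names (`f1_score`, `paired_bootstrap_diff`).

```python
import random


def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1); returns 0.0 when undefined."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def paired_bootstrap_diff(y_true, pred_a, pred_b, metric=f1_score,
                          n_resamples=2000, seed=0):
    """Percentile 95% interval for metric(A) - metric(B).

    Each resample draws test examples with replacement and scores both
    models on the same draw, which keeps the comparison paired.
    """
    rng = random.Random(seed)
    n = len(y_true)
    observed = metric(y_true, pred_a) - metric(y_true, pred_b)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        yt = [y_true[i] for i in idx]
        pa = [pred_a[i] for i in idx]
        pb = [pred_b[i] for i in idx]
        diffs.append(metric(yt, pa) - metric(yt, pb))
    diffs.sort()
    lo = diffs[int(0.025 * n_resamples)]
    hi = diffs[int(0.975 * n_resamples) - 1]
    return observed, (lo, hi)


# Toy data (illustrative only, not results from the paper).
labels = [1, 0] * 50
model_a = list(labels)             # always correct
model_b = [1 - y for y in labels]  # always wrong
print(paired_bootstrap_diff(labels, model_a, model_b))
```

An interval that excludes zero suggests the gap is unlikely to be explained by test-set sampling alone. This approach does not capture variance from training randomness (for example, different random seeds), which would require repeated training runs.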

References

Finally, since we did not perform statistical tests, we cannot formally assess whether observed differences between models are significant or due to chance.

Software Vulnerability Detection Using a Lightweight Graph Neural Network (2603.29216 - Farmer et al., 31 Mar 2026) in Section 5.2.4, Conclusion Threats to Validity (Discussion)