Uncovering the Limits of Machine Learning for Automatic Vulnerability Detection
Abstract: Recent results of machine learning for automatic vulnerability detection (ML4VD) have been very promising. Given only the source code of a function $f$, ML4VD techniques can decide whether $f$ contains a security flaw with up to 70% accuracy. However, as our own experiments show, the same top-performing models are unable to distinguish between functions that contain a vulnerability and functions where the vulnerability is patched. So, how can we explain this contradiction, and how can we improve the way we evaluate ML4VD techniques to get a better picture of their actual capabilities? In this paper, we identify overfitting to unrelated features and out-of-distribution generalization as two problems that the traditional approach to evaluating ML4VD techniques does not capture. As a remedy, we propose a novel benchmarking methodology to help researchers better evaluate the true capabilities and limits of ML4VD techniques. Specifically, we propose (i) to augment the training and validation data according to our cross-validation algorithm, where a semantics-preserving transformation is applied during the augmentation of either the training set or the test set, and (ii) to augment the test set with code snippets where the vulnerabilities are patched. Using six ML4VD techniques and two datasets, we find (a) that state-of-the-art models severely overfit to unrelated features when predicting the vulnerabilities in the test data, (b) that the performance gained by data augmentation does not generalize beyond the specific augmentations applied during training, and (c) that state-of-the-art ML4VD techniques are unable to distinguish vulnerable functions from their patches.
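The proposed evaluation methodology can be sketched in code. The following is a minimal illustration, not the paper's implementation: `rename_identifiers` stands in for an arbitrary semantics-preserving transformation, and all function and variable names are hypothetical. It produces, per fold, one evaluation where only the training split is transformed and one where only the test split is transformed, and separately extends a test set with patched counterparts labeled as non-vulnerable.

```python
import random
import re


def rename_identifiers(code: str) -> str:
    """Toy semantics-preserving transformation: rename the identifier
    'buf' to 'tmp_0'. A real transformation would operate on a parsed
    representation of the code; this regex stand-in is for illustration."""
    return re.sub(r"\bbuf\b", "tmp_0", code)


def augmented_splits(samples, transform, k=2, seed=0):
    """Cross-validation-style augmentation: for each fold, emit two
    evaluations -- one where the training split is transformed and one
    where the test split is transformed. A model that relies on
    unrelated surface features should degrade on at least one of them."""
    rng = random.Random(seed)
    data = samples[:]
    rng.shuffle(data)
    folds = [data[i::k] for i in range(k)]
    evals = []
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        apply = lambda split: [(transform(code), label) for code, label in split]
        evals.append(("train-augmented", apply(train), test))
        evals.append(("test-augmented", train, apply(test)))
    return evals


def add_patched_counterparts(test_set, patched_functions):
    """Second proposal: extend the test set with patched versions of the
    vulnerable functions, labeled non-vulnerable (0)."""
    return test_set + [(code, 0) for code in patched_functions]
```

Each `(name, train, test)` triple would then be used to train and score a model; comparing "train-augmented" against "test-augmented" performance separates genuine robustness from memorization of the specific augmentation.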