
Uncovering the Limits of Machine Learning for Automatic Vulnerability Detection

Published 28 Jun 2023 in cs.CR and cs.LG (arXiv:2306.17193v2)

Abstract: Recent results of machine learning for automatic vulnerability detection (ML4VD) have been very promising. Given only the source code of a function $f$, ML4VD techniques can decide whether $f$ contains a security flaw with up to 70% accuracy. However, as evident in our own experiments, the same top-performing models are unable to distinguish between functions that contain a vulnerability and functions where the vulnerability is patched. So, how can we explain this contradiction, and how can we improve the way we evaluate ML4VD techniques to get a better picture of their actual capabilities? In this paper, we identify overfitting to unrelated features and out-of-distribution generalization as two problems that are not captured by the traditional approach to evaluating ML4VD techniques. As a remedy, we propose a novel benchmarking methodology to help researchers better evaluate the true capabilities and limits of ML4VD techniques. Specifically, we propose (i) augmenting the training and validation data according to our cross-validation algorithm, where a semantics-preserving transformation is applied during the augmentation of either the training set or the testing set, and (ii) augmenting the testing set with code snippets in which the vulnerabilities are patched. Using six ML4VD techniques and two datasets, we find (a) that state-of-the-art models severely overfit to unrelated features when predicting vulnerabilities in the testing data, (b) that the performance gained by data augmentation does not generalize beyond the specific augmentations applied during training, and (c) that state-of-the-art ML4VD techniques are unable to distinguish vulnerable functions from their patches.
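The two proposed evaluation steps can be sketched in code. The following is a minimal illustration, not the paper's implementation: `rename_identifiers` stands in for a real semantics-preserving transformation, and the function and variable names (`augmented_folds`, `add_patched_counterparts`, `transform_side`) are hypothetical. Samples are `(source_code, label)` pairs with label 1 for vulnerable.

```python
import re

def rename_identifiers(code):
    # Toy semantics-preserving transformation: prefix every identifier
    # that is not a C keyword with "v_". The paper's transformations
    # (and choices like dead-code insertion) would be more elaborate.
    keywords = {"int", "if", "else", "return", "for", "while", "void", "char"}
    def repl(m):
        name = m.group(0)
        return name if name in keywords else "v_" + name
    return re.sub(r"\b[A-Za-z_]\w*\b", repl, code)

def augmented_folds(samples, k, transform, transform_side="test"):
    """Yield (train, test) splits in which `transform` is applied to
    exactly one side, so any accuracy gain from augmentation must
    survive a shift the model was not trained on."""
    n = len(samples)
    for fold in range(k):
        train = [samples[i] for i in range(n) if i % k != fold]
        test = [samples[i] for i in range(n) if i % k == fold]
        if transform_side == "train":
            train = [(transform(c), y) for c, y in train]
        else:
            test = [(transform(c), y) for c, y in test]
        yield train, test

def add_patched_counterparts(test, patches):
    """Augment the test set with patched (non-vulnerable, label 0)
    versions of its vulnerable functions, so a detector that only
    memorises surface features is penalised."""
    out = list(test)
    for code, label in test:
        if label == 1 and code in patches:
            out.append((patches[code], 0))
    return out
```

A model evaluated this way is scored on folds where one side is transformed and on test sets that contain each vulnerable function alongside its patch; a classifier relying on unrelated surface features will score near chance on the patched pairs.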
