Vulnerability Detection with Code Language Models: How Far Are We?
Abstract: Amid rising interest in code language models (code LMs) for vulnerability detection, we study how effective these models actually are at the task. Our analysis reveals significant shortcomings in existing vulnerability datasets, including poor data quality, low label accuracy, and high duplication rates, which lead to unreliable model performance in realistic vulnerability detection scenarios. Moreover, the evaluation methods used with these datasets are not representative of real-world vulnerability detection. To address these challenges, we introduce PrimeVul, a new dataset for training and evaluating code LMs for vulnerability detection. PrimeVul incorporates a novel set of data-labeling techniques that achieve label accuracy comparable to human-verified benchmarks while significantly expanding the dataset. It also applies rigorous data de-duplication and chronological data splitting to mitigate data leakage, and introduces more realistic evaluation metrics and settings, providing a more accurate assessment of code LMs' performance under real-world conditions. Evaluating code LMs on PrimeVul reveals that existing benchmarks significantly overestimate their performance: a state-of-the-art 7B model scores 68.26% F1 on BigVul but only 3.09% F1 on PrimeVul. Attempts to improve performance through advanced training techniques and larger models such as GPT-3.5 and GPT-4 were unsuccessful, with results akin to random guessing in the most stringent settings. These findings underscore the considerable gap between current capabilities and the practical requirements for deploying code LMs in security roles, highlighting the need for further research in this domain.
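The two data-hygiene steps named in the abstract, de-duplication and chronological splitting, can be sketched as follows. This is a minimal illustration of the general idea, not PrimeVul's actual pipeline: the record fields, the whitespace-collapsing normalization, and the 50/50 date cut are all assumptions made for the example.

```python
import hashlib
from datetime import date

# Hypothetical records: (function_source, commit_date, label).
samples = [
    ("int add(int a, int b) { return a + b; }", date(2019, 5, 1), 0),
    ("int  add(int a,int b){return a+b;}", date(2020, 1, 7), 0),  # formatting-only clone
    ("void copy(char *d, char *s) { strcpy(d, s); }", date(2021, 3, 2), 1),
    ("size_t len(char *s) { return strlen(s); }", date(2022, 8, 9), 0),
]

def normalize(code: str) -> str:
    """Collapse all whitespace so formatting-only clones hash identically."""
    return "".join(code.split())

def dedup(records):
    """Keep only the first occurrence of each normalized function body."""
    seen, unique = set(), []
    for code, when, label in records:
        digest = hashlib.sha256(normalize(code).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((code, when, label))
    return unique

def chronological_split(records, train_frac=0.5):
    """Sort by commit date and cut, so every test sample is newer than
    every training sample (no temporal leakage)."""
    ordered = sorted(records, key=lambda r: r[1])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]

unique = dedup(samples)  # drops the formatting-only clone
train, test = chronological_split(unique)
```

A random split would instead let near-identical or future functions land in both halves, which is one way the abstract's reported benchmark scores can be inflated.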