Finetuning Large Language Models for Vulnerability Detection
Abstract: This paper presents the results of finetuning large language models (LLMs) for the task of detecting vulnerabilities in source code. We leverage WizardCoder, a recent improvement over the state-of-the-art code LLM StarCoder, and adapt it for vulnerability detection through further finetuning. To accelerate training, we modify WizardCoder's training procedure and investigate optimal training regimes. For an imbalanced dataset with many more negative examples than positive ones, we also explore techniques to improve classification performance. The finetuned WizardCoder model achieves improvements in ROC AUC and F1 measures on both balanced and imbalanced vulnerability datasets over a CodeBERT-like model, demonstrating the effectiveness of adapting pretrained LLMs for vulnerability detection in source code. The key contributions are finetuning the state-of-the-art code LLM WizardCoder, increasing its training speed without harming performance, optimizing the training procedure and regimes, handling class imbalance, and improving performance on difficult vulnerability detection datasets. This demonstrates the potential of transfer learning by finetuning large pretrained LLMs for specialized source code analysis tasks.
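One family of techniques for the imbalanced setting described above is loss reweighting; the reference list includes the focal loss of Lin et al., which down-weights easy examples so the many trivial negatives contribute less to the gradient. Below is a minimal sketch of binary focal loss in plain Python. The exact loss configuration and the hyperparameter values `alpha` and `gamma` are illustrative assumptions, not the paper's reported settings:

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss (Lin et al., 2017) for one example.

    p     -- predicted probability of the positive class, in (0, 1)
    y     -- true label, 0 or 1
    alpha -- class-balancing weight for the positive class
    gamma -- focusing parameter; larger values shrink the loss
             of well-classified (easy) examples
    """
    # p_t is the model's probability assigned to the true class.
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # The (1 - p_t)^gamma factor vanishes as p_t -> 1,
    # so confident correct predictions contribute little loss.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `gamma = 0` and `alpha = 1`, this reduces to standard binary cross-entropy; increasing `gamma` suppresses the contribution of the abundant easy negatives, which is why this loss is a common choice when positives are rare, as in vulnerability detection datasets.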
- CVEfixes: automated collection of vulnerabilities and their fixes from open-source software. In Proceedings of the 17th International Conference on Predictive Models and Data Analytics in Software Engineering (Athens, Greece) (PROMISE 2021). Association for Computing Machinery, New York, NY, USA, 30–39. https://doi.org/10.1145/3475960.3475985
- Deep Learning Based Vulnerability Detection: Are We There Yet? IEEE Transactions on Software Engineering 48, 9 (Sep. 2022), 3280–3296. https://doi.org/10.1109/TSE.2021.3087402
- Evaluation of ChatGPT Model for Vulnerability Detection. arXiv:2304.07232 [cs.CR]
- Data Quality for Software Vulnerability Datasets. In Proceedings of the 45th International Conference on Software Engineering (Melbourne, Victoria, Australia) (ICSE ’23). IEEE Press, 121–133. https://doi.org/10.1109/ICSE48619.2023.00022
- CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational Linguistics, Online, 1536–1547. https://doi.org/10.18653/v1/2020.findings-emnlp.139
- Michael Fu and Chakkrit Tantithamthavorn. 2022. LineVul: A Transformer-based Line-Level Vulnerability Prediction. In 2022 IEEE/ACM 19th International Conference on Mining Software Repositories (MSR). 608–620. https://doi.org/10.1145/3524842.3528452
- LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations. https://openreview.net/forum?id=nZeVKeeFYf9
- StarCoder: may the source be with you! arXiv:2305.06161 [cs.CL]
- Focal Loss for Dense Object Detection. In 2017 IEEE International Conference on Computer Vision (ICCV). 2999–3007. https://doi.org/10.1109/ICCV.2017.324
- ContraBERT: Enhancing Code Pre-Trained Models via Contrastive Learning. In Proceedings of the 45th International Conference on Software Engineering (Melbourne, Victoria, Australia) (ICSE ’23). IEEE Press, 2476–2487. https://doi.org/10.1109/ICSE48619.2023.00207
- WizardCoder: Empowering Code Large Language Models with Evol-Instruct. arXiv:2306.08568 [cs.CL]
- PEFT: State-of-the-art Parameter-Efficient Fine-Tuning methods. https://github.com/huggingface/peft.
- CodeGen2: Lessons for Training LLMs on Programming and Natural Languages. arXiv:2305.02309 [cs.LG]
- A manually-curated dataset of fixes to vulnerabilities of open-source software. In Proceedings of the 16th International Conference on Mining Software Repositories (Montreal, Quebec, Canada) (MSR ’19). IEEE Press, 383–387. https://doi.org/10.1109/MSR.2019.00064
- VCMatch: A Ranking-based Approach for Automatic Security Patches Localization for OSS Vulnerabilities. In 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). 589–600. https://doi.org/10.1109/SANER53432.2022.00076
- CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (Long Beach, CA, USA) (KDD ’23). Association for Computing Machinery, New York, NY, USA, 5673–5684. https://doi.org/10.1145/3580305.3599790