
Finetuning Large Language Models for Vulnerability Detection

Published 30 Jan 2024 in cs.CR, cs.AI, and cs.LG | arXiv:2401.17010v5

Abstract: This paper presents the results of finetuning LLMs for the task of detecting vulnerabilities in source code. We leverage WizardCoder, a recent improvement over the state-of-the-art code LLM StarCoder, and adapt it for vulnerability detection through further finetuning. To accelerate training, we modify WizardCoder's training procedure and investigate optimal training regimes. For imbalanced datasets with many more negative examples than positive ones, we also explore techniques to improve classification performance. The finetuned WizardCoder model improves ROC AUC and F1 over a CodeBERT-like model on both balanced and imbalanced vulnerability datasets, demonstrating the effectiveness of adapting pretrained LLMs for vulnerability detection in source code. The key contributions are finetuning the state-of-the-art code LLM WizardCoder, increasing its training speed without harming performance, optimizing the training procedure and regimes, handling class imbalance, and improving performance on difficult vulnerability detection datasets. This demonstrates the potential of transfer learning by finetuning large pretrained LLMs for specialized source code analysis tasks.
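The abstract highlights handling class imbalance in vulnerability datasets, where benign functions vastly outnumber vulnerable ones. One imbalance-handling technique that appears in the paper's references is focal loss (Lin et al., 2017), which down-weights easy, well-classified examples so that hard positives dominate the loss. A minimal sketch of the binary form is below; the alpha and gamma values are the commonly cited defaults, not necessarily the settings the authors used:

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for a single prediction.

    p: predicted probability of the positive (vulnerable) class
    y: ground-truth label, 1 = vulnerable, 0 = benign
    alpha, gamma: focusing hyperparameters from Lin et al. (2017)
    """
    # Probability assigned to the true class, and its class weight.
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # The (1 - p_t)^gamma factor shrinks the loss of confident,
    # correct predictions toward zero.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A confidently correct negative contributes almost nothing,
# while an uncertain positive dominates the batch loss.
easy_negative = focal_loss(0.05, 0)
hard_positive = focal_loss(0.30, 1)
```

With gamma = 0 and alpha = 0.5 this reduces (up to a constant factor) to standard cross-entropy; raising gamma progressively focuses training on the rare, hard-to-classify vulnerable examples.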

References (16)
  1. CVEfixes: automated collection of vulnerabilities and their fixes from open-source software. In Proceedings of the 17th International Conference on Predictive Models and Data Analytics in Software Engineering (Athens, Greece) (PROMISE 2021). Association for Computing Machinery, New York, NY, USA, 30–39. https://doi.org/10.1145/3475960.3475985
  2. Deep Learning Based Vulnerability Detection: Are We There Yet? IEEE Transactions on Software Engineering 48, 9 (Sep 2022), 3280–3296. https://doi.org/10.1109/TSE.2021.3087402
  3. Evaluation of ChatGPT Model for Vulnerability Detection. arXiv:2304.07232 [cs.CR]
  4. Data Quality for Software Vulnerability Datasets. In Proceedings of the 45th International Conference on Software Engineering (Melbourne, Victoria, Australia) (ICSE ’23). IEEE Press, 121–133. https://doi.org/10.1109/ICSE48619.2023.00022
  5. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational Linguistics, Online, 1536–1547. https://doi.org/10.18653/v1/2020.findings-emnlp.139
  6. Michael Fu and Chakkrit Tantithamthavorn. 2022. LineVul: A Transformer-based Line-Level Vulnerability Prediction. In 2022 IEEE/ACM 19th International Conference on Mining Software Repositories (MSR). 608–620. https://doi.org/10.1145/3524842.3528452
  7. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations. https://openreview.net/forum?id=nZeVKeeFYf9
  8. StarCoder: may the source be with you! arXiv:2305.06161 [cs.CL]
  9. Focal Loss for Dense Object Detection. In 2017 IEEE International Conference on Computer Vision (ICCV). 2999–3007. https://doi.org/10.1109/ICCV.2017.324
  10. ContraBERT: Enhancing Code Pre-Trained Models via Contrastive Learning. In Proceedings of the 45th International Conference on Software Engineering (Melbourne, Victoria, Australia) (ICSE ’23). IEEE Press, 2476–2487. https://doi.org/10.1109/ICSE48619.2023.00207
  11. WizardCoder: Empowering Code Large Language Models with Evol-Instruct. arXiv:2306.08568 [cs.CL]
  12. PEFT: State-of-the-art Parameter-Efficient Fine-Tuning methods. https://github.com/huggingface/peft.
  13. CodeGen2: Lessons for Training LLMs on Programming and Natural Languages. arXiv:2305.02309 [cs.LG]
  14. A manually-curated dataset of fixes to vulnerabilities of open-source software. In Proceedings of the 16th International Conference on Mining Software Repositories (Montreal, Quebec, Canada) (MSR ’19). IEEE Press, 383–387. https://doi.org/10.1109/MSR.2019.00064
  15. VCMatch: A Ranking-based Approach for Automatic Security Patches Localization for OSS Vulnerabilities. In 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). 589–600. https://doi.org/10.1109/SANER53432.2022.00076
  16. CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (Long Beach, CA, USA) (KDD ’23). Association for Computing Machinery, New York, NY, USA, 5673–5684. https://doi.org/10.1145/3580305.3599790
