Vulnerability Detection with Code Language Models: How Far Are We?
Abstract: Amid rising interest in code language models (code LMs) for vulnerability detection, we study how effective these models actually are at the task. Our analysis reveals significant shortcomings in existing vulnerability datasets, including poor data quality, low label accuracy, and high duplication rates, which lead to unreliable model performance in realistic vulnerability detection scenarios. Moreover, the evaluation methods used with these datasets are not representative of real-world vulnerability detection. To address these challenges, we introduce PrimeVul, a new dataset for training and evaluating code LMs for vulnerability detection. PrimeVul incorporates a novel set of data-labeling techniques that achieve label accuracy comparable to human-verified benchmarks while significantly expanding the dataset. It also applies rigorous data de-duplication and chronological data splitting to mitigate data leakage, and introduces more realistic evaluation metrics and settings, providing a more accurate assessment of code LMs' performance under real-world conditions. Evaluating code LMs on PrimeVul reveals that existing benchmarks significantly overestimate their performance: a state-of-the-art 7B model scores 68.26% F1 on BigVul but only 3.09% F1 on PrimeVul. Attempts to improve performance through advanced training techniques and larger models such as GPT-3.5 and GPT-4 were unsuccessful, with results akin to random guessing in the most stringent settings. These findings underscore the considerable gap between current capabilities and the practical requirements for deploying code LMs in security roles, highlighting the need for further research in this domain.
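The two data-hygiene steps named in the abstract, de-duplication and chronological splitting, can be sketched as follows. This is a minimal illustration of the general idea, not PrimeVul's actual pipeline: the record fields, the whitespace-collapsing normalization, and the 50/50 date cut are all assumptions made for the example.

```python
import hashlib
from datetime import date

# Hypothetical records: (function_source, commit_date, label).
samples = [
    ("int add(int a, int b) { return a + b; }", date(2019, 5, 1), 0),
    ("int  add(int a,int b){return a+b;}", date(2020, 1, 7), 0),  # formatting-only clone
    ("void copy(char *d, char *s) { strcpy(d, s); }", date(2021, 3, 2), 1),
    ("size_t len(char *s) { return strlen(s); }", date(2022, 8, 9), 0),
]

def normalize(code: str) -> str:
    """Collapse all whitespace so formatting-only clones hash identically."""
    return "".join(code.split())

def dedup(records):
    """Keep only the first occurrence of each normalized function body."""
    seen, unique = set(), []
    for code, when, label in records:
        digest = hashlib.sha256(normalize(code).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((code, when, label))
    return unique

def chronological_split(records, train_frac=0.5):
    """Sort by commit date and cut, so every test sample is newer than
    every training sample (no temporal leakage)."""
    ordered = sorted(records, key=lambda r: r[1])
    cut = int(len(ordered) * train_frac)
    return ordered[:cut], ordered[cut:]

unique = dedup(samples)  # drops the formatting-only clone
train, test = chronological_split(unique)
```

A random split would instead let near-identical or future functions land in both halves, which is one way the abstract's reported benchmark scores can be inflated.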