VulEval: Towards Repository-Level Evaluation of Software Vulnerability Detection

Published 24 Apr 2024 in cs.SE and cs.CR | (2404.15596v1)

Abstract: Deep Learning (DL)-based methods have proven to be effective for software vulnerability detection, with a potential for substantial productivity enhancements for detecting vulnerabilities. Current methods mainly focus on detecting single functions (i.e., intra-procedural vulnerabilities), ignoring the more complex inter-procedural vulnerability detection scenarios in practice. For example, developers routinely engage with program analysis to detect vulnerabilities that span multiple functions within repositories. In addition, the widely-used benchmark datasets generally contain only intra-procedural vulnerabilities, leaving the assessment of inter-procedural vulnerability detection capabilities unexplored. To mitigate the issues, we propose a repository-level evaluation system, named VulEval, aiming at evaluating the detection performance of inter- and intra-procedural vulnerabilities simultaneously. Specifically, VulEval consists of three interconnected evaluation tasks: (1) Function-Level Vulnerability Detection, aiming at detecting intra-procedural vulnerability given a code snippet; (2) Vulnerability-Related Dependency Prediction, aiming at retrieving the most relevant dependencies from call graphs for providing developers with explanations about the vulnerabilities; and (3) Repository-Level Vulnerability Detection, aiming at detecting inter-procedural vulnerabilities by combining with the dependencies identified in the second task. VulEval also consists of a large-scale dataset, with a total of 4,196 CVE entries, 232,239 functions, and corresponding 4,699 repository-level source code in C/C++ programming languages. Our analysis highlights the current progress and future directions for software vulnerability detection.

Summary

The paper introduces VulEval, a framework for evaluating vulnerability detection methods at both the function level and the repository level.
Its dataset includes 4,196 CVE entries and 347,533 function dependencies, which it uses to assess a range of detection methods.
Experiments show that fine-tuning-based methods perform best at function-level detection, while adding repository-level context improves detection, especially for models like ChatGPT.

Introduction

The paper introduces VulEval, a comprehensive evaluation framework designed specifically for assessing vulnerability detection at both intra-procedural and inter-procedural levels in software repositories. It addresses the increased incidence of software vulnerabilities and their real-world implications, such as significant financial losses and security breaches.

Background and Objectives

VulEval aims to bridge the gap between existing evaluation methods, which predominantly focus on function-level vulnerability detection, and real-world scenarios, where vulnerabilities span multiple functions, files, and even entire repositories. The paper groups vulnerability detection methods into four major types: program analysis-based, supervised learning-based, fine-tuning-based, and prompt-based techniques (Figure 1).

Figure 1: The four types of vulnerability detection methods.

Framework Architecture

Data Collection

VulEval collects a dataset of 4,196 CVE entries and 347,533 function dependencies, focusing on the C/C++ programming languages. It extracts repository-level source code and vulnerability-related dependencies using static analysis tools, so that each function can be evaluated together with the code it calls and the code that calls it (Figure 2).
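As a rough illustration of what extracting dependencies from a call graph can look like, the following sketch builds caller and callee maps from a list of call edges. It assumes that a function's dependencies are its direct callers and callees; the edge list, function names, and helper names are hypothetical and do not come from the paper's tooling.

```python
from collections import defaultdict


def build_call_graph(calls):
    """Turn (caller, callee) pairs into two lookup maps."""
    callees_of = defaultdict(set)
    callers_of = defaultdict(set)
    for caller, callee in calls:
        callees_of[caller].add(callee)
        callers_of[callee].add(caller)
    return callees_of, callers_of


def dependencies(function, callees_of, callers_of):
    """Return the direct callers and callees of `function`, sorted for stable output."""
    return {
        "callers": sorted(callers_of.get(function, ())),
        "callees": sorted(callees_of.get(function, ())),
    }


# Hypothetical call edges, in the form a static analyzer might emit them.
edges = [
    ("parse_request", "read_header"),
    ("parse_request", "copy_body"),
    ("handle_client", "parse_request"),
    ("copy_body", "memcpy"),
]

callees_of, callers_of = build_call_graph(edges)
deps = dependencies("parse_request", callees_of, callers_of)
print(deps)
```

A real pipeline would obtain the edges from a static analysis tool rather than a hand-written list, and might also follow indirect (transitive) calls.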

Evaluation Tasks

VulEval evaluates three interconnected tasks:

Function-level Vulnerability Detection: This task focuses on determining if a code snippet is vulnerable based solely on its content.
Vulnerability-Related Dependency Prediction: This task involves predicting which dependencies are relevant to the vulnerability, providing critical context for understanding the software's potential weaknesses.
Repository-level Vulnerability Detection: This task combines the target function with the dependencies identified in the second task to detect vulnerabilities that span multiple functions or files within a repository.

Figure 2: The overview of VulEval. Panels (a), (b), (c), and (d) denote the process of data collection, function-level vulnerability detection, vulnerability-related dependency prediction, and repository-level vulnerability detection, respectively.
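One way to picture how the third task builds on the second is to concatenate the target function with its top-ranked dependencies and pass the result to a detector. The sketch below is only an assumption about how such an input could be assembled; the function name, the `k` and `max_chars` parameters, and the comment markers are hypothetical and are not taken from the paper.

```python
def build_repo_level_input(target_code, ranked_dependencies, k=2, max_chars=2000):
    """Prepend up to k top-ranked dependencies to the target function.

    ranked_dependencies is a list of (name, code) pairs, best match first.
    Dependencies that would exceed the character budget are dropped.
    """
    parts = []
    used = len(target_code)
    for name, code in ranked_dependencies[:k]:
        block = f"// dependency: {name}\n{code}"
        if used + len(block) > max_chars:
            break
        parts.append(block)
        used += len(block)
    parts.append(f"// target function\n{target_code}")
    return "\n\n".join(parts)


# Hypothetical C snippets used only to exercise the helper.
target = "void copy_body(char *dst, const char *src, size_t n) { memcpy(dst, src, n); }"
ranked = [
    ("parse_request", "void parse_request(char *buf) { char body[16]; copy_body(body, buf, strlen(buf)); }"),
    ("handle_client", "void handle_client(int fd) { /* reads into buf, calls parse_request */ }"),
    ("log_event", "void log_event(const char *msg) { puts(msg); }"),
]

context = build_repo_level_input(target, ranked, k=2)
print(context)
```

Here the caller `parse_request` shows that `copy_body` may receive a length larger than the 16-byte destination buffer, which is information the target function alone does not reveal.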

Experimental Setup and Results

The experiments conducted using VulEval explore the performance of various detection methods under both random and time-split settings to simulate real-world applications.
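The difference between the two settings can be sketched as follows, assuming each sample carries a date (for example, a commit or CVE publication date). The 80/20 ratio, the field names, and the toy samples are assumptions made for illustration, not values reported in the paper.

```python
import random
from datetime import date


def random_split(samples, train_ratio=0.8, seed=0):
    """Shuffle samples, ignoring time, then cut into train and test."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def time_split(samples, train_ratio=0.8):
    """Train on the oldest samples and test on the newest ones."""
    ordered = sorted(samples, key=lambda s: s["date"])
    cut = int(len(ordered) * train_ratio)
    return ordered[:cut], ordered[cut:]


# Toy samples: one per year from 2015 to 2024.
samples = [{"id": i, "date": date(2015 + i, 1, 1)} for i in range(10)]

rand_train, rand_test = random_split(samples)
time_train, time_test = time_split(samples)

print("time-split test years:", [s["date"].year for s in time_test])
```

In the time-split setting, every test sample is newer than every training sample, so a model cannot benefit from having seen later code during training.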

Key Findings

Effectiveness of Fine-Tuning Methods: Fine-tuning-based methods generally yield the best results in detecting vulnerabilities at the function level. However, their performance drops in the time-split setting, where models trained on historical data must detect vulnerabilities that appear later.
Dependency Prediction: Lexical-based methods outperform semantic methods in identifying vulnerability-relevant dependencies, suggesting the need for more sophisticated retrieval techniques.
Repository-Level Enhancement: Introducing repository-level contexts enhances the detection capabilities, particularly for models like ChatGPT, which benefits significantly from broader contextual information.
Specific CWE Vulnerability Detection: The study reports that models like ChatGPT perform strongly on specific CWE vulnerability types, which it attributes to their capacity to process extensive contextual information (Figure 3).

Figure 3: The experimental results of several vulnerability types, including CWE-190, CWE-400, CWE-415, CWE-416, and CWE-787. The green, blue, red, and yellow circles denote the results of Devign, RATS, PDBERT, and ChatGPT, respectively.
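To make the dependency prediction finding more concrete, the sketch below ranks candidate dependencies by a lexical measure. It uses Jaccard similarity over identifier tokens as one common lexical measure; this choice, the tokenizer, and the example snippets are assumptions for illustration, since this summary does not say which lexical methods the paper evaluates.

```python
import re


def tokenize(code):
    """Extract the set of identifier-like tokens from a code string."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code))


def jaccard(a, b):
    """Jaccard similarity of two token sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_dependencies(target_code, candidates):
    """Return candidate names sorted by descending similarity to the target."""
    target_tokens = tokenize(target_code)
    scored = [
        (jaccard(target_tokens, tokenize(code)), name)
        for name, code in candidates.items()
    ]
    # Sort by score (highest first), breaking ties by name for determinism.
    return [name for score, name in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Hypothetical C snippets used only to exercise the ranking.
target = "void copy_body(char *dst, const char *src, size_t n) { memcpy(dst, src, n); }"
candidates = {
    "log_event": "void log_event(const char *msg) { puts(msg); }",
    "parse_request": "void parse_request(const char *src, size_t n) { char dst[16]; copy_body(dst, src, n); }",
}

ranking = rank_dependencies(target, candidates)
print(ranking)
```

The caller that shares identifiers with the target ranks first. A semantic method would replace `jaccard` with a similarity between learned embeddings of the two code snippets.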

Discussion

The implications of VulEval extend to the development of more contextual awareness in vulnerability detection systems, particularly at the repository level. The paper suggests future research directions, such as improving retrieval techniques for dependency prediction and leveraging LLMs for specific vulnerability types.

Conclusion

VulEval serves as a pioneering framework for a comprehensive evaluation of software vulnerability detection, emphasizing the importance of repository-level insights. Future work will continue to enhance the framework's capabilities, focusing on more effective dependency identification and integration into holistic detection strategies.