
Top Score on the Wrong Exam: On Benchmarking in Machine Learning for Vulnerability Detection

Published 23 Aug 2024 in cs.CR and cs.LG (arXiv:2408.12986v2)

Abstract: According to our survey of machine learning for vulnerability detection (ML4VD), 9 in every 10 papers published in the past five years define ML4VD as a function-level binary classification problem: Given a function, does it contain a security flaw? From our experience as security researchers, faced with deciding whether a given function makes the program vulnerable to attacks, we would often first want to understand the context in which this function is called. In this paper, we study how often this decision can really be made without further context and study both vulnerable and non-vulnerable functions in the most popular ML4VD datasets. We call a function "vulnerable" if it was involved in a patch of an actual security flaw and confirmed to cause the program's vulnerability. It is "non-vulnerable" otherwise. We find that in almost all cases this decision cannot be made without further context. Vulnerable functions are often vulnerable only because a corresponding vulnerability-inducing calling context exists while non-vulnerable functions would often be vulnerable if a corresponding context existed. But why do ML4VD techniques achieve high scores even though there is demonstrably not enough information in these samples? Spurious correlations: We find that high scores can be achieved even when only word counts are available. This shows that these datasets can be exploited to achieve high scores without actually detecting any security vulnerabilities. We conclude that the prevailing problem statement of ML4VD is ill-defined and call into question the internal validity of this growing body of work. Constructively, we call for more effective benchmarking methodologies to evaluate the true capabilities of ML4VD, propose alternative problem statements, and examine broader implications for the evaluation of machine learning and programming analysis research.

Summary

  • The paper highlights that current ML vulnerability detection methods oversimplify by ignoring essential context, leading to inaccurate benchmarks.
  • It employs rigorous empirical analysis with manual labeling and validation, revealing significant label noise across popular datasets.
  • The study advocates context-aware classification strategies to overcome spurious correlations and enhance detection reliability.

Critical Analysis of "Top Score on the Wrong Exam: On Benchmarking in Machine Learning for Vulnerability Detection"

The paper "Top Score on the Wrong Exam: On Benchmarking in Machine Learning for Vulnerability Detection" by Niklas Risse and Marcel Böhme critically examines the current evaluation methodologies in the field of machine learning for vulnerability detection (ML4VD). The authors provide a thorough assessment by focusing on the conceptual soundness and empirical validity of ML4VD techniques across widely used datasets.

Problem Analysis

The authors began their study by surveying the ML4VD literature of the past five years across the top four Software Engineering conferences. They found a near-unanimous trend: roughly nine in every ten papers cast ML4VD as a function-level binary classification problem, in which isolated functions are evaluated for the presence or absence of vulnerabilities. This formulation simplifies the problem by abstracting away critical contextual information. The survey identified BigVul, Devign, and DiverseVul as the most commonly employed benchmarks, underscoring their influence on ML4VD research directions.

Context-Dependency in Vulnerability Detection

The core argument of the paper is that function-level binary classification in ML4VD disregards the essential contextual dependencies that inherently determine security vulnerabilities. The authors conducted an empirical study to establish the prevalence of context-dependent vulnerabilities within these popular datasets.

Their findings revealed that the vulnerability labels in these datasets are notably noisy: depending on the dataset, only 38% to 64% of functions labeled as "vulnerable" actually contained a security vulnerability. Moreover, among the true positives, 100% of the vulnerabilities required additional context beyond the function itself, rendering the function-level problem statement inadequate. Further analysis showed that many functions labeled as "non-vulnerable" could harbor vulnerabilities under a suitable calling context.

Numerical Results and Methodology

The paper's methodological framework combines random sampling of functions, careful manual labeling, and validation. The authors manually investigated 300 functions, revealing significant label inaccuracies and context dependency. The manual labeling achieved substantial inter-rater agreement, with a Cohen's kappa of 0.7, reinforcing the credibility of the findings.
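Cohen's kappa corrects raw inter-rater agreement for the agreement two raters would reach by chance alone. A minimal sketch of the computation, using hypothetical rater labels rather than the paper's actual annotations:

```python
# Cohen's kappa for two raters with categorical labels.
# The rater data below is illustrative, not from the paper's study.
def cohens_kappa(rater_a, rater_b):
    """Inter-rater agreement, corrected for chance agreement."""
    n = len(rater_a)
    # Observed agreement: fraction of items both raters label identically.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement: probability the raters agree by chance,
    # given each rater's own label distribution.
    labels = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(lab) / n) * (rater_b.count(lab) / n) for lab in labels
    )
    return (observed - expected) / (1 - expected)

# Two hypothetical raters labeling ten functions: vulnerable (1) or not (0).
a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
b = [1, 1, 0, 0, 1, 0, 0, 1, 0, 1]
print(round(cohens_kappa(a, b), 2))  # -> 0.6 (observed 0.8, expected 0.5)
```

Values around 0.6 to 0.8 are conventionally read as "substantial" agreement, which is why the paper's 0.7 lends weight to its manual labels.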

One striking result of the empirical study is that for over 90% of the functions, vulnerability could not be accurately determined without considering broader context. This empirically undermines the soundness of using isolated functions for vulnerability detection, since it overlooks the code-context interactions through which real security threats manifest.

Implications for ML4VD and Future Directions

The implications of this study are manifold. The findings directly question the internal validity of a large body of ML4VD literature, suggesting that high reported accuracies are often artifacts of spurious correlations rather than genuine vulnerability detection capability. By training a Gradient Boosting Classifier on simple word counts, the authors showed that high classification performance can be achieved even when all code-structure information is discarded. This underscores the susceptibility of current benchmarks to spurious feature exploitation.
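The word-count baseline can be sketched as follows. The function snippets and labels below are toy data for illustration, not samples from BigVul, Devign, or DiverseVul, and the paper's exact experimental setup may differ:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus of function bodies; the labels are hypothetical.
functions = [
    "strcpy(buf, input); return buf;",
    "memcpy(dst, src, len); free(dst);",
    "return a + b;",
    "for (i = 0; i < n; i++) sum += v[i];",
] * 10  # replicate so the classifier has enough samples
labels = [1, 1, 0, 0] * 10

# Bag-of-words features discard all code structure:
# only per-token counts remain.
vec = CountVectorizer(token_pattern=r"\w+")
X = vec.fit_transform(functions).toarray()

clf = GradientBoostingClassifier(random_state=0)
clf.fit(X, labels)

# High accuracy despite the features carrying no semantic information
# about vulnerability -- the spurious-correlation effect in miniature.
print(clf.score(X, labels))
```

On this trivially separable toy data the classifier scores perfectly from token counts alone, mirroring (in caricature) the paper's point that benchmark scores need not reflect any actual understanding of vulnerabilities.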

For the field to progress towards meaningful evaluation and genuine vulnerability detection, the authors propose several alternatives:

  • Abstention-based Classification: Rather than forcing a binary classification, allowing ML models to abstain from making a decision on context-dependent cases could pave the way for more credible results.
  • Higher Granularities: Leveraging file-level or module-level contexts might provide a more holistic view, though this approach still necessitates empirical validation for robustness.
  • Context-Conditional Classification: Integrating a broader repository-level context during evaluation might address the core dependency issues.
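The abstention idea can be sketched with a confidence threshold on an ordinary classifier. The data, the model, and the 0.8 threshold here are illustrative assumptions, not the paper's concrete proposal:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical feature matrix and labels standing in for function samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

clf = GradientBoostingClassifier(random_state=0).fit(X, y)

def predict_or_abstain(model, samples, threshold=0.8):
    """Return 0/1 predictions, or -1 (abstain) when the model is unsure."""
    confidence = model.predict_proba(samples).max(axis=1)
    preds = model.predict(samples)
    preds[confidence < threshold] = -1  # abstain on low-confidence cases
    return preds

preds = predict_or_abstain(clf, X)
print({label: int((preds == label).sum()) for label in (-1, 0, 1)})
```

The design point is that context-dependent samples, on which no function-local decision is possible, should land in the abstain bucket rather than inflate either the true-positive or true-negative counts.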

Future Research

The paper steers future research towards context-aware methodologies that reflect the hierarchical and interconnected nature of software systems. The move towards context-generating techniques and the inclusion of comprehensive execution environments in the training and evaluation of ML models are particularly intriguing. Such advancements would mark a significant shift from traditional ML4VD paradigms, promoting techniques that can truly detect vulnerabilities in real-world settings.

Conclusion

Risse and Böhme's work delivers a critical and insightful discourse on the foundational assumptions underlying current ML4VD research. By empirically demonstrating the inadequacy of function-level binary classification for vulnerability detection, they make a compelling case for reevaluating and redefining benchmarking methodologies in this domain. The proposed directions for future research hold promise for advancing the reliability and robustness of ML4VD, ensuring that empirical evaluations genuinely reflect a model's capability to safeguard software systems against vulnerabilities.

