
CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software

Published 19 Jul 2021 in cs.SE, cs.AI, cs.CR, and cs.LG | (2107.08760v1)

Abstract: Data-driven research on the automated discovery and repair of security vulnerabilities in source code requires comprehensive datasets of real-life vulnerable code and their fixes. To assist in such research, we propose a method to automatically collect and curate a comprehensive vulnerability dataset from Common Vulnerabilities and Exposures (CVE) records in the public National Vulnerability Database (NVD). We implement our approach in a fully automated dataset collection tool and share an initial release of the resulting vulnerability dataset named CVEfixes. The CVEfixes collection tool automatically fetches all available CVE records from the NVD, gathers the vulnerable code and corresponding fixes from associated open-source repositories, and organizes the collected information in a relational database. Moreover, the dataset is enriched with meta-data such as programming language, and detailed code and security metrics at five levels of abstraction. The collection can easily be repeated to keep up-to-date with newly discovered or patched vulnerabilities. The initial release of CVEfixes spans all published CVEs up to 9 June 2021, covering 5365 CVE records for 1754 open-source projects that were addressed in a total of 5495 vulnerability fixing commits. CVEfixes supports various types of data-driven software security research, such as vulnerability prediction, vulnerability classification, vulnerability severity prediction, analysis of vulnerability-related code changes, and automated vulnerability repair.

Citations (110)

Summary

  • The paper presents CVEfixes, an automated tool that systematically collects vulnerabilities and their fixes from open-source repositories.
  • It employs data extraction, JSON processing, and commit history analysis to provide granular details, including programming language metadata and CVSS scores.
  • The dataset spans 5365 CVEs across 1754 projects, supporting applications in vulnerability prediction, classification, and automated patch analysis.

The paper, "CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software" (2107.08760), proposes an automated method for collecting software vulnerabilities and their corresponding fixes from open-source repositories. It introduces CVEfixes, a dataset built from CVE records in the National Vulnerability Database (NVD) that links each vulnerability to the code changes that fixed it, to support data-driven software security research.

Methodology and Implementation

The authors developed an automated tool that mines CVE records and extracts project-specific information from open-source repositories hosted on platforms such as GitHub, GitLab, and Bitbucket. The tool fetches and processes the NVD's JSON feeds of CVE records, then associates each record with the code changes extracted from the commit history of the corresponding repository.

Figure 1: Dataset construction workflow.
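The linking step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the NVD JSON 1.1 feed layout (`CVE_Items`, `cve.CVE_data_meta.ID`, `cve.references.reference_data`), and the function name, the URL pattern, and the sample record are all hypothetical.

```python
import re

# Matches commit links on the three hosts named above. Nested GitLab groups
# and other URL shapes are deliberately ignored in this sketch.
COMMIT_URL = re.compile(
    r"^https?://(?P<host>github\.com|gitlab\.com|bitbucket\.org)/"
    r"(?P<owner>[^/]+)/(?P<repo>[^/]+)/(?:-/)?commits?/(?P<sha>[0-9a-f]{7,40})",
    re.IGNORECASE,
)


def extract_fix_commits(feed):
    """Return (cve_id, repo_url, commit_sha) triples found in a feed dict."""
    fixes = []
    for item in feed.get("CVE_Items", []):
        cve = item["cve"]
        cve_id = cve["CVE_data_meta"]["ID"]  # assumed NVD JSON 1.1 layout
        for ref in cve.get("references", {}).get("reference_data", []):
            match = COMMIT_URL.match(ref.get("url", ""))
            if match:
                repo_url = "https://{}/{}/{}".format(
                    match.group("host").lower(),
                    match.group("owner"),
                    match.group("repo"),
                )
                fixes.append((cve_id, repo_url, match.group("sha").lower()))
    return fixes


# Placeholder record: the CVE ID, repository, and hash are made up.
sample_feed = {
    "CVE_Items": [
        {
            "cve": {
                "CVE_data_meta": {"ID": "CVE-0000-0001"},
                "references": {
                    "reference_data": [
                        {"url": "https://example.org/advisory/1"},
                        {"url": "https://github.com/example-org/example-repo/commit/0123abc4567def"},
                    ]
                },
            }
        }
    ]
}

fixes = extract_fix_commits(sample_feed)
print(fixes)
```

Only references that look like commit links are kept; advisory pages and other URLs are skipped. A real collector would then clone each repository and read the referenced commit.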

To make the dataset more usable for machine learning and security research, each record is enriched with metadata: the programming language, code metrics at multiple abstraction levels (e.g., commit, file, method), and security metrics such as CVSS scores.

Figure 2: Entity-Relationship Diagram showing the various levels of abstraction in CVEfixes and their interconnections.
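To make the relational layout concrete, here is a small sketch of how such levels could be stored and queried with SQLite. The table and column names are illustrative assumptions, not the dataset's actual schema, and the inserted rows are placeholders.

```python
import sqlite3

# Hypothetical, simplified schema: one table per level of abstraction.
SCHEMA = """
CREATE TABLE cve (cve_id TEXT PRIMARY KEY, cvss_score REAL, cwe_id TEXT);
CREATE TABLE fixes (cve_id TEXT REFERENCES cve(cve_id), commit_hash TEXT, repo_url TEXT);
CREATE TABLE commits (commit_hash TEXT, repo_url TEXT, message TEXT,
                      PRIMARY KEY (commit_hash, repo_url));
CREATE TABLE file_change (file_change_id INTEGER PRIMARY KEY, commit_hash TEXT,
                          filename TEXT, language TEXT);
CREATE TABLE method_change (method_change_id INTEGER PRIMARY KEY,
                            file_change_id INTEGER REFERENCES file_change(file_change_id),
                            name TEXT, before_change INTEGER);
"""

# Walk from a CVE down to the files its fixing commit touched.
QUERY = """
SELECT cve.cve_id, file_change.language, COUNT(*) AS changed_files
FROM cve
JOIN fixes ON fixes.cve_id = cve.cve_id
JOIN commits ON commits.commit_hash = fixes.commit_hash
            AND commits.repo_url = fixes.repo_url
JOIN file_change ON file_change.commit_hash = commits.commit_hash
GROUP BY cve.cve_id, file_change.language
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Placeholder rows, not real dataset entries.
conn.execute("INSERT INTO cve VALUES ('CVE-0000-0001', 7.5, 'CWE-000')")
conn.execute("INSERT INTO fixes VALUES ('CVE-0000-0001', 'aaaa111', 'https://example.org/repo')")
conn.execute("INSERT INTO commits VALUES ('aaaa111', 'https://example.org/repo', 'placeholder fix')")
conn.executemany(
    "INSERT INTO file_change (commit_hash, filename, language) VALUES (?, ?, ?)",
    [("aaaa111", "src/a.c", "C"), ("aaaa111", "src/b.c", "C")],
)

rows = conn.execute(QUERY).fetchall()
print(rows)
```

Keeping each level in its own table lets a single join chain answer questions that span levels, such as which languages are affected by a given CVE.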

Data Characteristics

The initial release of CVEfixes covers 5365 unique CVEs across 1754 open-source projects, addressed in 5495 vulnerability-fixing commits and organized at five abstraction levels. The dataset spans multiple programming languages, can be queried as a relational database, and includes CWE types, which together support analyses from several angles.

Applications

The paper highlights several potential applications of CVEfixes:

  • Automated Vulnerability Prediction: By providing real-world examples of vulnerable and patched code, CVEfixes can serve as training data for vulnerability prediction models.
  • Vulnerability Classification and Severity Prediction: With CWE categorization and CVSS score data, researchers can develop models to classify vulnerabilities and predict their severity.
  • Patch Analysis and Automated Repair: The availability of both the pre- and post-fix code states aids in investigating vulnerability-fixing patterns and automating the repair process.
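As a rough sketch of the first application, pre- and post-fix snapshots of a changed method can be turned into labeled training samples. The `MethodChange` type, its fields, and the toy C snippets below are hypothetical. Labeling post-fix code as non-vulnerable is itself an assumption, since a patch may be incomplete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MethodChange:
    """Hypothetical record for one snapshot of a method touched by a fix."""
    name: str
    code: str
    before_change: bool  # True = snapshot taken before the fixing commit


def to_labeled_samples(changes):
    """Label pre-fix code 1 (vulnerable) and post-fix code 0 (assumed fixed)."""
    return [(change.code, 1 if change.before_change else 0) for change in changes]


# Toy snapshots of the same method before and after a fix.
changes = [
    MethodChange("copy_name", "strcpy(buf, src);", before_change=True),
    MethodChange("copy_name", "strncpy(buf, src, sizeof(buf) - 1);", before_change=False),
]

samples = to_labeled_samples(changes)
labels = [label for _, label in samples]
print(labels)
```

The resulting (code, label) pairs could feed any text or code classifier; the same pairing, kept as before/after tuples instead of labels, would suit patch-pattern analysis.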

Statistical Insights

Exploratory analyses in the paper list the projects with the most CVEs and fixing commits. Linux stands out as the project with the largest share of the dataset's vulnerabilities, which reflects how widely it is used and how closely it is scrutinized.

Figure 3: Violin plot showing the distribution of average vulnerability severity scores for projects included in CVEfixes.
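The quantity plotted in such a figure, one average severity score per project, could be computed along these lines. The input format and the scores are made-up placeholders, and this is only a sketch of the aggregation, not the paper's analysis code.

```python
from collections import defaultdict
from statistics import mean, median


def average_severity_per_project(records):
    """Map each project to the mean CVSS score of its CVEs."""
    scores = defaultdict(list)
    for project, cvss_score in records:
        scores[project].append(cvss_score)
    return {project: mean(values) for project, values in scores.items()}


# Placeholder (project, CVSS score) pairs.
records = [
    ("project-a", 5.0),
    ("project-a", 7.0),
    ("project-b", 9.0),
]

averages = average_severity_per_project(records)
print(averages)
print(median(averages.values()))
```

The per-project averages, rather than the raw per-CVE scores, are what a violin plot of this kind would take as input.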

Limitations and Future Work

The authors acknowledge limitations such as repositories that are no longer available and incomplete patches. Planned enhancements include addressing these limitations, supporting version control systems beyond Git-based platforms, and making the update process more incremental.

Figure 4: Violin plot showing the distribution of average DMM scores for fixes to the projects included in CVEfixes.

Conclusion

The CVEfixes dataset pairs real-world vulnerable code with its fixes and enriches both with code and security metrics across a diverse set of projects. This gives researchers a shared basis for work on vulnerability detection, analysis, and automated repair.
