- The paper presents CVEfixes, an automated tool that systematically collects vulnerabilities and their fixes from open-source repositories.
- It employs data extraction, JSON processing, and commit history analysis to provide granular details, including programming language metadata and CVSS scores.
- The dataset spans 5365 CVEs across 1754 projects, supporting applications in vulnerability prediction, classification, and automated patch analysis.
CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software
The paper, "CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software" (arXiv:2107.08760), proposes an automated method for collecting software vulnerabilities and their corresponding fixes from open-source platforms. It introduces CVEfixes, a dataset built to support software security research by linking CVE records from the National Vulnerability Database (NVD) to the code changes that fixed them.
Methodology and Implementation
The authors developed an automated tool for mining CVE records, extracting relevant project-specific information from open-source repositories such as GitHub, GitLab, and Bitbucket. The tool aggregates and processes JSON feeds of CVE records and associates them with specific code changes extracted from commit histories.
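As a rough illustration of the linking step described above, the sketch below scans one CVE record's reference URLs for commit links on the three hosting platforms. It is not the authors' implementation. The record layout is assumed to follow the NVD JSON 1.1 feed format, the regular expression and the function name `extract_fix_commits` are hypothetical, and the sample record contains placeholder values only.

```python
import re

# Hypothetical pattern: matches commit URLs on GitHub, GitLab, and Bitbucket.
# GitLab subgroups and other URL variants are deliberately not handled here.
COMMIT_URL = re.compile(
    r"https?://(?P<host>github\.com|gitlab\.com|bitbucket\.org)/"
    r"(?P<owner>[^/]+)/(?P<repo>[^/]+)/(?:-/)?commits?/"
    r"(?P<sha>[0-9a-fA-F]{7,40})"
)


def extract_fix_commits(cve_item):
    """Return (cve_id, repo_url, commit_sha) tuples found in one CVE record.

    Assumes the NVD JSON 1.1 layout: cve.CVE_data_meta.ID and
    cve.references.reference_data[].url.
    """
    cve_id = cve_item["cve"]["CVE_data_meta"]["ID"]
    refs = cve_item["cve"]["references"]["reference_data"]
    found = []
    for ref in refs:
        match = COMMIT_URL.match(ref.get("url", ""))
        if match:
            repo_url = f"https://{match['host']}/{match['owner']}/{match['repo']}"
            found.append((cve_id, repo_url, match["sha"].lower()))
    return found


# Placeholder record for illustration; not a real CVE entry.
SAMPLE_ITEM = {
    "cve": {
        "CVE_data_meta": {"ID": "CVE-0000-0000"},
        "references": {
            "reference_data": [
                {"url": "https://github.com/example-owner/example-repo/commit/0123abc4567def"},
                {"url": "https://example.org/advisory/1"},
            ]
        },
    }
}

if __name__ == "__main__":
    for row in extract_fix_commits(SAMPLE_ITEM):
        print(row)
```

References that do not point at a commit (advisories, mailing-list posts) are simply skipped, so a CVE without any commit link yields an empty list.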
Figure 1: Dataset construction workflow.
To improve the usability of the dataset for ML and security research, detailed metadata is appended to each record. This includes programming languages, metrics at multiple abstraction levels (e.g., file, class, method), and security metrics like CVSS scores.
Figure 2: Entity-Relationship Diagram showing the various levels of abstraction in CVEfixes and their interconnections.
Data Characteristics
The initial release of CVEfixes includes records of 5365 unique CVEs across 1754 open-source projects, organized at five abstraction levels. The dataset is distinguished by its coverage of multiple programming languages, its support for structured queries, and its integration of CWE types, which together allow security analyses from several angles.
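To make the idea of structured queries across abstraction levels concrete, here is a minimal sketch using an in-memory SQLite database. The table and column names (`cve`, `fixes`, `file_change`, `cvss3_base_score`, `programming_language`) are simplified stand-ins chosen for illustration and may not match the released schema, and all rows are made-up placeholders.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database with an assumed, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE cve (cve_id TEXT PRIMARY KEY, cvss3_base_score REAL);
        CREATE TABLE fixes (cve_id TEXT, hash TEXT, repo_url TEXT);
        CREATE TABLE file_change (hash TEXT, filename TEXT, programming_language TEXT);
        """
    )
    # Placeholder rows; identifiers and scores are invented for the demo.
    conn.executemany(
        "INSERT INTO cve VALUES (?, ?)",
        [("CVE-A", 9.8), ("CVE-B", 5.0), ("CVE-C", 7.5)],
    )
    conn.executemany(
        "INSERT INTO fixes VALUES (?, ?, ?)",
        [
            ("CVE-A", "aaa", "https://example.org/p1"),
            ("CVE-B", "bbb", "https://example.org/p2"),
            ("CVE-C", "ccc", "https://example.org/p2"),
        ],
    )
    conn.executemany(
        "INSERT INTO file_change VALUES (?, ?, ?)",
        [
            ("aaa", "a.c", "C"),
            ("aaa", "b.c", "C"),
            ("bbb", "x.py", "Python"),
            ("ccc", "y.py", "Python"),
        ],
    )
    return conn


def files_changed_per_language(conn, min_score):
    """Count changed files per language for CVEs at or above a CVSS score."""
    return conn.execute(
        """
        SELECT fc.programming_language, COUNT(*) AS n
        FROM cve AS c
        JOIN fixes AS f ON f.cve_id = c.cve_id
        JOIN file_change AS fc ON fc.hash = f.hash
        WHERE c.cvss3_base_score >= ?
        GROUP BY fc.programming_language
        ORDER BY n DESC, fc.programming_language
        """,
        (min_score,),
    ).fetchall()


if __name__ == "__main__":
    print(files_changed_per_language(build_demo_db(), 7.0))
```

The join walks from the CVE level through the fixing commit down to the file level, which is the kind of cross-level question a relational layout makes straightforward.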
Applications
The paper highlights several potential applications of CVEfixes:
- Automated Vulnerability Prediction: By providing real-world examples of vulnerable and patched code, CVEfixes can be used to train vulnerability prediction models.
- Vulnerability Classification and Severity Prediction: With CWE categorization and CVSS score data, researchers can develop models to classify vulnerabilities and predict their severity.
- Patch Analysis and Automated Repair: The availability of both the pre- and post-fix code states aids in investigating vulnerability-fixing patterns and automating the repair process.
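The applications above all start from paired pre-fix and post-fix code. The sketch below shows one possible way to turn such a pair into labeled training samples and to list the lines a fix added. The helper names `labeled_samples` and `added_lines` and the C snippet are hypothetical, and labeling the pre-fix version as vulnerable is a simplifying assumption rather than a method taken from the paper.

```python
import difflib

# Hypothetical pre-fix and post-fix versions of the same code fragment.
BEFORE = "char buf[8];\nstrcpy(buf, src);"
AFTER = "char buf[8];\nstrncpy(buf, src, sizeof(buf) - 1);\nbuf[sizeof(buf) - 1] = 0;"


def labeled_samples(code_before, code_after):
    """Label the pre-fix code 1 (vulnerable) and the post-fix code 0."""
    return [(code_before, 1), (code_after, 0)]


def added_lines(code_before, code_after):
    """Return the lines introduced by the fix, based on a unified diff."""
    diff = list(
        difflib.unified_diff(
            code_before.splitlines(),
            code_after.splitlines(),
            lineterm="",
            n=0,  # no context lines, only the changes themselves
        )
    )
    # The first two entries are the '---' and '+++' file headers.
    return [line[1:] for line in diff[2:] if line.startswith("+")]


if __name__ == "__main__":
    for code, label in labeled_samples(BEFORE, AFTER):
        print(label, repr(code))
    print(added_lines(BEFORE, AFTER))
```

Treating every pre-fix version as a positive example is a coarse heuristic, since a fixing commit may also touch code unrelated to the vulnerability.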
Statistical Insights
Exploratory analyses in the paper rank the projects with the most CVEs and fixing commits. Linux stands out among them as the project with the largest number of CVEs, which likely reflects its broad usage and the level of scrutiny it receives.
Figure 3: Violin plot showing the distribution of average vulnerability severity scores for projects included in CVEfixes.
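Per-project statistics of this kind can be derived with a simple aggregation. The following is a minimal sketch, assuming the input is a flat list of `(repo_url, cve_id, cvss_score)` tuples. The function name `summarize_projects` and all values are hypothetical and do not reproduce figures from the paper.

```python
from collections import defaultdict
from statistics import mean


def summarize_projects(rows):
    """Return (repo_url, cve_count, mean_score) sorted by CVE count.

    rows: iterable of (repo_url, cve_id, cvss_score) tuples. A CVE that
    appears several times (e.g. several fixing commits) is counted once.
    """
    scores_by_repo = defaultdict(dict)
    for repo_url, cve_id, score in rows:
        scores_by_repo[repo_url][cve_id] = score
    summary = [
        (repo, len(by_cve), round(mean(by_cve.values()), 2))
        for repo, by_cve in scores_by_repo.items()
    ]
    # Most CVEs first; ties broken alphabetically by repository URL.
    summary.sort(key=lambda item: (-item[1], item[0]))
    return summary


# Placeholder rows for illustration only.
DEMO_ROWS = [
    ("https://example.org/a", "CVE-A", 7.5),
    ("https://example.org/a", "CVE-A", 7.5),  # second fix commit, same CVE
    ("https://example.org/a", "CVE-B", 5.5),
    ("https://example.org/b", "CVE-C", 9.8),
]

if __name__ == "__main__":
    for row in summarize_projects(DEMO_ROWS):
        print(row)
```

The per-project means produced this way are the kind of values a violin plot of average severity would be drawn from.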
Limitations and Future Work
The paper acknowledges challenges such as repositories that are no longer available and incomplete patches. Planned future work includes addressing these limitations, supporting version control systems beyond Git-based platforms, and implementing a more incremental update process.
Figure 4: Violin plot showing the distribution of average DMM scores for fixes to the projects included in CVEfixes.
Conclusion
The CVEfixes dataset is a useful resource for software security research and practice. By combining code-level and security metrics across diverse projects, it provides a foundation for further work on vulnerability detection, analysis, and repair, and it adds to the pool of security datasets that can inform automated software security tooling.