DiverseVul: C/C++ Vulnerability Dataset
- DiverseVul is a comprehensive dataset of vulnerable and non-vulnerable C/C++ functions curated from security-fix commits with detailed CWE annotations.
- It contains nearly 350K functions and applies advanced preprocessing and token normalization to ensure high quality and semantic consistency in the data.
- The dataset serves as a benchmark for deep learning models, driving advancements in software vulnerability detection through robust statistical coverage and provenance.
The DiverseVul dataset is a large-scale, real-world corpus of vulnerable and non-vulnerable C/C++ functions curated specifically for training and benchmarking deep learning models in software vulnerability detection. Sourced from vulnerability-fixing commits in open-source projects, it provides fine-grained function-level examples annotated with Common Weakness Enumeration (CWE) metadata, representing a broad array of security weakness patterns. With nearly 349,500 functions and detailed provenance for each code snippet, DiverseVul establishes a new benchmark in dataset scale, coverage, and semantic diversity for vulnerability detection research (Gonçalves et al., 10 Mar 2025, Chen et al., 2023).
1. Curation Methodology and Label Assignment
DiverseVul is constructed through systematic mining of security-issue trackers and git repositories. Primary data sources include Snyk, Bugzilla Red Hat, CVEFixes, and associated public trackers. Curation proceeds as follows:
- Project Collection: 797 open-source C/C++ projects are cloned from GitHub, comprising both high-profile and niche domains, with 295 projects not previously covered by other datasets.
- Commit Filtering: Only commits explicitly tagged or described as "fixing a security issue" are considered. Commits affecting more than ten functions undergo manual review to avoid refactorings or non-security changes.
- Function Extraction: For each fix commit, all modified functions are extracted both pre- (vulnerable) and post-fix (non-vulnerable). Functions that are extremely short or contain only comments are excluded.
- Labeling: Vulnerability labels are derived via fix-commit metadata—functions in the pre-fix snapshot are marked as "vulnerable," and their post-fix counterparts as "non-vulnerable." No automated static analysis tools are employed; instead, labeling relies on regular expression parsing, commit message heuristics, CVE/CWE mapping, and NVD lookups.
- CWE Annotation: When a CWE identifier is present in commit metadata or messages, it is directly assigned. If only CVE references are present, the NVD is used to infer CWE(s).
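The commit-filtering and labeling heuristics above can be sketched as keyword and CVE-pattern matching over commit messages. This is a toy illustration, not the authors' exact rules; the keyword list and regular expressions are assumptions for demonstration.

```python
import re

# Toy heuristic (not the curators' exact rules): flag commit messages that
# look like security fixes via keyword and CVE-identifier patterns.
SECURITY_RE = re.compile(
    r"\b(security|vulnerab\w*|overflow|use[- ]after[- ]free|CVE-\d{4}-\d{4,})\b",
    re.IGNORECASE,
)

CVE_RE = re.compile(r"CVE-\d{4}-\d{4,}")

def looks_like_security_fix(message: str) -> bool:
    """Return True if the commit message matches a security-fix pattern."""
    return SECURITY_RE.search(message) is not None

def extract_cves(message: str) -> list[str]:
    """Pull CVE identifiers out of a commit message for later NVD lookup."""
    return CVE_RE.findall(message)
```

In the real pipeline, extracted CVE identifiers are resolved against the NVD to obtain CWE annotations when none appear in the commit metadata itself.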
Deduplication is performed using byte-level MD5 hashes of function contents, ensuring identical functions are not redundantly included.
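The byte-level MD5 deduplication can be sketched as follows; the function name and list-based interface are illustrative, not the dataset's actual tooling.

```python
import hashlib

def dedup_functions(functions: list[str]) -> list[str]:
    """Keep only the first occurrence of each byte-identical function body,
    mirroring the byte-level MD5 deduplication described above."""
    seen = set()
    unique = []
    for func in functions:
        digest = hashlib.md5(func.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(func)
    return unique
```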
2. Dataset Statistics and Weakness Coverage
DiverseVul’s scale and breadth enable robust machine learning experimentation:
- Size and Class Distribution
- Vulnerable functions: 18,945
- Non-vulnerable functions: 330,492
- Total: 349,437
- Vulnerable function fraction: ≈5.4% (18,945 / 349,437)
- Number of unique fix commits: 7,514
- CWE Representation
- 150 distinct CWEs in the vulnerable set
- Top categories: buffer errors (e.g., CWE-787, CWE-119; ~25%), input validation (CWE-20, CWE-125; ~20%), with the remaining ~55% distributed among resource management and 147 additional CWEs
- Detailed distribution among the most frequent CWEs: Out-of-bounds Write (CWE-787, 38.5%), Out-of-bounds Read (CWE-125, 24.9%), Improper Restriction of Operations within the Bounds of a Memory Buffer (CWE-119, 21.7%), Improper Input Validation (CWE-20, 17.5%), Improper Handling of Exceptional Conditions (CWE-703, 16.3%), Use-After-Free (CWE-416, 13.4%), and others; shares can overlap because a function may carry multiple CWE labels.
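The headline class imbalance follows directly from the reported counts:

```python
# Class-distribution arithmetic from the reported dataset statistics.
vulnerable = 18_945
non_vulnerable = 330_492
total = vulnerable + non_vulnerable  # 349,437, matching the reported total

fraction = vulnerable / total
print(f"Vulnerable fraction: {fraction:.1%}")  # ~5.4%
```

This roughly 5% positive rate is why accuracy alone is an uninformative headline metric for models trained on this corpus.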
This extensive CWE coverage, including many low-frequency weakness types, enables investigation of rare vulnerability patterns and generalization across security smells.
3. Preprocessing, Quality Control, and Refinement
Significant preprocessing is undertaken to address issues discovered in the initial release:
- Parsing and Error Handling: 62,239 functions failing to parse due to macros or structural issues are removed.
- De-duplication and Conflicts: 7,901 duplicate or conflicting entries (including entries differing only in function name, or with ambiguous labels) are eliminated, preferring the post-fix (non-vulnerable) version when the code is identical.
- Token Normalization (SCoPE pipeline):
- Function-level parsing to an internal tree structure.
- Removal of comments and whitespace normalization.
- All user-defined identifiers (function, variable names) replaced with generic tokens (FUNC_1, VAR_1, etc.); literals replaced with placeholders.
- Regular expression transformations for stylistic standardization.
- Signature-based duplicate hashing to remove duplicates post-normalization.
- All functions with zero executable statements are pruned.
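The identifier-abstraction step can be sketched with a minimal regex-based pass. This is a toy approximation, not the SCoPE pipeline itself: real SCoPE parses the function into a tree and distinguishes function names (FUNC_n) from variables (VAR_n), whereas this sketch maps every non-keyword identifier to VAR_n and uses only a tiny keyword whitelist.

```python
import re

def abstract_identifiers(code: str) -> str:
    """Toy sketch of SCoPE-style identifier abstraction: strip comments, then
    replace each distinct non-keyword identifier with VAR_n in order of first
    appearance."""
    # Drop /* ... */ and // ... comments before tokenizing.
    code = re.sub(r"/\*.*?\*/|//[^\n]*", "", code, flags=re.DOTALL)
    keywords = {"int", "char", "return", "if", "else", "for", "while", "void"}
    mapping: dict[str, str] = {}

    def repl(match: re.Match) -> str:
        name = match.group(0)
        if name in keywords:
            return name
        if name not in mapping:
            mapping[name] = f"VAR_{len(mapping) + 1}"
        return mapping[name]

    return re.sub(r"[A-Za-z_]\w*", repl, code)
```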
The result is a "refined" DiverseVul: a set of unique, well-formed, consistently labeled functions with lower label noise than previous datasets. Spot checks indicate approximately 60% label accuracy, substantially higher than earlier datasets, some of which were as low as 25%.
4. File Structure, Metadata, and Example Transformations
DiverseVul is distributed as CSV or JSON with the following columns:
| Column | Type | Description |
|---|---|---|
| func | string | Post-processed C/C++ function source |
| target | binary (0/1) | 1 = vulnerable, 0 = non-vulnerable |
| cwe | list of integers | CWE identifiers for the function |
| project | string | GitHub repository name |
| commit_id | string | SHA of fixing commit |
| size | integer | Number of lines or tokens |
| message | string | Original commit message |
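A record-level view of the schema can be sketched as a small loader. This assumes a JSON-lines serialization (one record per line with the columns above); the actual file name and serialization may differ between releases.

```python
import json

def load_vulnerable_by_cwe(path: str, cwe_id: int) -> list[str]:
    """Stream a JSON-lines dump of the dataset and collect vulnerable
    functions (target == 1) annotated with the given CWE identifier."""
    hits = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            rec = json.loads(line)
            if rec["target"] == 1 and cwe_id in rec.get("cwe", []):
                hits.append(rec["func"])
    return hits
```

Because the `cwe` column is a list, a single function can contribute to several CWE categories, which matters when computing per-CWE statistics.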
Typical transformation via SCoPE:
Before Preprocessing

```c
/* check index */
int myCopy(char *buf, int len) {
    if (len > BUF_SIZ)
        return -1;
    strncpy(buf, input, len);
    return 0;
}
```
After Preprocessing

```c
int FUNC_1(VAR_1 VAR_2, VAR_3 VAR_4) {
    if (VAR_4 > VAR_5)
        return VAR_6;
    FUNC_2(VAR_1, VAR_7, VAR_4);
    return VAR_6;
}
```
5. Model Benchmarking and Empirical Findings
DiverseVul forms the foundation for extensive benchmarking of deep learning architectures (Gonçalves et al., 10 Mar 2025, Chen et al., 2023). Reported results distinguish between baseline and improved preprocessing, and between state-of-the-art models:
- Performance Metrics: Evaluations report accuracy, precision, recall, and F1-score; given the heavy class imbalance, F1-score is the most informative headline metric.
- Key Observed Results:
| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| NatGen | 92 % | 52 % | 43 % | 47 % |
| LLaMA 3.2 | 66 % | 65 % | 67 % | 66 % |
- Impact of Preprocessing: Preprocessing and function normalization enhance model performance, e.g., increasing F1-score from 62% to 63% in pilot runs, and enabling LLaMA 3.2 to outperform strong prior baselines by 19 percentage points F1.
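The four reported metrics derive from confusion-matrix counts in the standard way; the sketch below illustrates why accuracy can look strong while F1 stays low under the dataset's roughly 5% positive rate.

```python
def metrics(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float, float, float]:
    """Precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy
```

For example, with 50 true positives, 50 false positives, 50 false negatives, and 850 true negatives, accuracy is 90% while precision, recall, and F1 are all only 50%.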
A comprehensive evaluation in (Chen et al., 2023) covers 11 models spanning GNNs, transformer-based LLMs, and code-specific pretraining objectives. LLMs with such objectives (e.g., identifier masking, code “naturalizing”) consistently outperform models trained solely on next-token objectives or generic masked language modeling.
6. Challenges, Comparative Advantages, and Research Directions
DiverseVul sets new standards in several respects:
- Scale and Diversity: It more than doubles previous corpus sizes for C/C++ vulnerabilities, spans 150 CWEs, and covers nearly 800 projects, with substantial representation from both popular and previously-unseen codebases.
- Semantic Breadth: Inclusion of a long tail of rare CWEs, and strict before/after-fix matching for labeling, facilitates research on generalization and rare weaknesses.
- Automatic Abstraction: SCoPE abstraction eliminates superficial identifier bias, better aligning data with the semantics required for robust vulnerability detection.
- LLM Utility: The dataset’s form and scale are particularly conducive to LLM training; binary as well as multi-label classification (for CWE) can be directly benchmarked.
- Real-World Relevance: Unlike synthetic benchmarks (e.g., Juliet, SATE), DiverseVul is drawn entirely from live Git histories of real software projects, ensuring ecological validity.
However, several open challenges persist (Chen et al., 2023):
- False Positive Rates: Even the best models exhibit FPR ≈3.5%, which is operationally problematic at scale.
- Cross-project Generalization: Models generalize poorly to unseen projects, with F1 dropping from ~49% (seen) to ~9% (unseen)—a major barrier to deployment in CI pipelines.
- Label Noise and Attribution: Vulnerabilities spanning multiple functions, irrelevant changes, and remaining label noise challenge model fidelity. More robust labeling, improved commit attribution, and aggressive deduplication are ongoing needs.
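The cross-project evaluation regime above (seen vs. unseen projects) requires holding out entire repositories rather than random functions. A minimal sketch, with an assumed record format carrying the dataset's `project` column:

```python
import random

def split_by_project(records: list[dict], test_fraction: float = 0.2, seed: int = 0):
    """Hold out whole projects so the test set shares no repository with
    training, matching the 'unseen project' evaluation regime."""
    projects = sorted({r["project"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(projects)
    n_test = max(1, int(len(projects) * test_fraction))
    test_projects = set(projects[:n_test])
    train = [r for r in records if r["project"] not in test_projects]
    test = [r for r in records if r["project"] in test_projects]
    return train, test
```

A random function-level split would leak near-duplicate code between train and test, inflating the "seen" numbers; project-level splitting is what exposes the ~49% to ~9% F1 drop.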
A plausible implication is that while continued data volume growth benefits models within a fixed distribution, achieving practical generalization will require advances in both pretraining objectives and dataset curation.
7. Impact and Future Directions
DiverseVul accelerates empirical study into mechanisms of vulnerability detection, particularly within the context of LLMs. It motivates several research directions:
- Enhancement of code-specific pretraining objectives (e.g., identifier masking, data-flow prediction) for improved semantic comprehension.
- Development of methods for robust learning under label noise and prevalence imbalance.
- Rigorous evaluation protocols that stratify by project or code domain for fair benchmarking of cross-project generalization.
- Investigation of techniques for reducing false positives to achieve triage-viable deployment in industry.
DiverseVul is thus a pivotal resource for both method development and benchmarking in machine learning for software security, setting the stage for future progress in real-world vulnerability detection and automated security analysis (Gonçalves et al., 10 Mar 2025, Chen et al., 2023).