DiverseVul: C/C++ Vulnerability Dataset
- DiverseVul is a comprehensive dataset of vulnerable and non-vulnerable C/C++ functions curated from security-fix commits with detailed CWE annotations.
- It contains nearly 350K functions and applies advanced preprocessing and token normalization to ensure high quality and semantic consistency in the data.
- The dataset serves as a benchmark for deep learning models, driving advancements in software vulnerability detection through robust statistical coverage and provenance.
The DiverseVul dataset is a large-scale, real-world corpus of vulnerable and non-vulnerable C/C++ functions curated specifically for training and benchmarking deep learning models in software vulnerability detection. Sourced from vulnerability-fixing commits in open-source projects, it provides fine-grained function-level examples annotated with Common Weakness Enumeration (CWE) metadata, representing a broad array of security weakness patterns. With nearly 349,500 functions and detailed provenance for each code snippet, DiverseVul establishes a new benchmark in dataset scale, coverage, and semantic diversity for vulnerability detection research (Gonçalves et al., 10 Mar 2025, Chen et al., 2023).
1. Curation Methodology and Label Assignment
DiverseVul is constructed through systematic mining of security-issue trackers and git repositories. Primary data sources include Snyk, Bugzilla Red Hat, CVEFixes, and associated public trackers. Curation proceeds as follows:
- Project Collection: 797 open-source C/C++ projects are cloned from GitHub, comprising both high-profile and niche domains, with 295 projects not previously covered by other datasets.
- Commit Filtering: Only commits explicitly tagged or described as "fixing a security issue" are considered. Commits affecting more than ten functions undergo manual review to avoid refactorings or non-security changes.
- Function Extraction: For each fix commit, all modified functions are extracted both pre- (vulnerable) and post-fix (non-vulnerable). Functions that are extremely short or contain only comments are excluded.
- Labeling: Vulnerability labels are derived via fix-commit metadata—functions in the pre-fix snapshot are marked as "vulnerable," and their post-fix counterparts as "non-vulnerable." No automated static analysis tools are employed; instead, labeling relies on regular expression parsing, commit message heuristics, CVE/CWE mapping, and NVD lookups.
- CWE Annotation: When a CWE identifier is present in commit metadata or messages, it is directly assigned. If only CVE references are present, the NVD is used to infer CWE(s).
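The commit-filtering and labeling heuristics above can be sketched as keyword and CVE-pattern matching over commit messages. This is a toy illustration, not the authors' exact rules; the keyword list and regular expressions are assumptions for demonstration.

```python
import re

# Toy heuristic (not the curators' exact rules): flag commit messages that
# look like security fixes via keyword and CVE-identifier patterns.
SECURITY_RE = re.compile(
    r"\b(security|vulnerab\w*|overflow|use[- ]after[- ]free|CVE-\d{4}-\d{4,})\b",
    re.IGNORECASE,
)

CVE_RE = re.compile(r"CVE-\d{4}-\d{4,}")

def looks_like_security_fix(message: str) -> bool:
    """Return True if the commit message matches a security-fix pattern."""
    return SECURITY_RE.search(message) is not None

def extract_cves(message: str) -> list[str]:
    """Pull CVE identifiers out of a commit message for later NVD lookup."""
    return CVE_RE.findall(message)
```

In the real pipeline, extracted CVE identifiers are resolved against the NVD to obtain CWE annotations when none appear in the commit metadata itself.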
Deduplication is performed using byte-level MD5 hashes of function contents, ensuring identical functions are not redundantly included.
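The byte-level MD5 deduplication can be sketched as follows; the function name and list-based interface are illustrative, not the dataset's actual tooling.

```python
import hashlib

def dedup_functions(functions: list[str]) -> list[str]:
    """Keep only the first occurrence of each byte-identical function body,
    mirroring the byte-level MD5 deduplication described above."""
    seen = set()
    unique = []
    for func in functions:
        digest = hashlib.md5(func.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(func)
    return unique
```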
2. Dataset Statistics and Weakness Coverage
DiverseVul’s scale and breadth enable robust machine learning experimentation:
- Size and Class Distribution
- Vulnerable functions: 18,945
- Non-vulnerable functions: 330,492
- Total: 349,437
- Vulnerable function fraction: ≈5.4% (18,945 / 349,437)
- Number of unique fix commits: 7,514
- CWE Representation
- 150 distinct CWEs in the vulnerable set
- Top categories: buffer errors (e.g., CWE-787, CWE-119; ~25%), input validation (CWE-20, CWE-125; ~20%), with the remaining ~55% distributed among resource management and 147 additional CWEs
- Detailed distribution among the most frequent CWEs: Out-of-bounds Write (CWE-787, 38.5%), Out-of-bounds Read (CWE-125, 24.9%), Improper Restriction of Operations within the Bounds of a Memory Buffer (CWE-119, 21.7%), Improper Input Validation (CWE-20, 17.5%), Improper Handling of Exceptional Conditions (CWE-703, 16.3%), Use-After-Free (CWE-416, 13.4%), and others; shares can overlap because a function may carry multiple CWE labels.
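The headline class imbalance follows directly from the reported counts:

```python
# Class-distribution arithmetic from the reported dataset statistics.
vulnerable = 18_945
non_vulnerable = 330_492
total = vulnerable + non_vulnerable  # 349,437, matching the reported total

fraction = vulnerable / total
print(f"Vulnerable fraction: {fraction:.1%}")  # ~5.4%
```

This roughly 5% positive rate is why accuracy alone is an uninformative headline metric for models trained on this corpus.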
This extensive CWE coverage, including many low-frequency weakness types, enables investigation of rare vulnerability patterns and generalization across security smells.
3. Preprocessing, Quality Control, and Refinement
Significant preprocessing is undertaken to address issues discovered in the initial release:
- Parsing and Error Handling: 62,239 functions failing to parse due to macros or structural issues are removed.
- De-duplication and Conflicts: 7,901 duplicate or conflicting entries (including entries differing only in function name, or with ambiguous labels) are eliminated, preferring the post-fix (non-vulnerable) version when the code is identical.
- Token Normalization (SCoPE pipeline):
- Function-level parsing to an internal tree structure.
- Removal of comments and whitespace normalization.
- All user-defined identifiers (function, variable names) replaced with generic tokens (FUNC_1, VAR_1, etc.); literals replaced with placeholders.
- Regular expression transformations for stylistic standardization.
- Signature-based duplicate hashing to remove duplicates post-normalization.
- All functions with zero executable statements are pruned.
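The identifier-abstraction step can be sketched with a minimal regex-based pass. This is a toy approximation, not the SCoPE pipeline itself: real SCoPE parses the function into a tree and distinguishes function names (FUNC_n) from variables (VAR_n), whereas this sketch maps every non-keyword identifier to VAR_n and uses only a tiny keyword whitelist.

```python
import re

def abstract_identifiers(code: str) -> str:
    """Toy sketch of SCoPE-style identifier abstraction: strip comments, then
    replace each distinct non-keyword identifier with VAR_n in order of first
    appearance."""
    # Drop /* ... */ and // ... comments before tokenizing.
    code = re.sub(r"/\*.*?\*/|//[^\n]*", "", code, flags=re.DOTALL)
    keywords = {"int", "char", "return", "if", "else", "for", "while", "void"}
    mapping: dict[str, str] = {}

    def repl(match: re.Match) -> str:
        name = match.group(0)
        if name in keywords:
            return name
        if name not in mapping:
            mapping[name] = f"VAR_{len(mapping) + 1}"
        return mapping[name]

    return re.sub(r"[A-Za-z_]\w*", repl, code)
```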
The result is a "refined" DiverseVul: a set of unique, well-formed, consistently labeled functions with lower label noise than previous datasets. Spot checks indicate approximately 60% label accuracy, substantially higher than earlier datasets, some of which were as low as 25%.
4. File Structure, Metadata, and Example Transformations
DiverseVul is distributed as CSV or JSON with the following columns:
| Column | Type | Description |
|---|---|---|
| func | string | Post-processed C/C++ function source |
| target | binary (0/1) | 1 = vulnerable, 0 = non-vulnerable |
| cwe | list of integers | CWE identifiers for the function |
| project | string | GitHub repository name |
| commit_id | string | SHA of fixing commit |
| size | integer | Number of lines or tokens |
| message | string | Original commit message |
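A record-level view of the schema can be sketched as a small loader. This assumes a JSON-lines serialization (one record per line with the columns above); the actual file name and serialization may differ between releases.

```python
import json

def load_vulnerable_by_cwe(path: str, cwe_id: int) -> list[str]:
    """Stream a JSON-lines dump of the dataset and collect vulnerable
    functions (target == 1) annotated with the given CWE identifier."""
    hits = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            rec = json.loads(line)
            if rec["target"] == 1 and cwe_id in rec.get("cwe", []):
                hits.append(rec["func"])
    return hits
```

Because the `cwe` column is a list, a single function can contribute to several CWE categories, which matters when computing per-CWE statistics.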
Typical transformation via SCoPE:
Before Preprocessing

```c
/* check index */
int myCopy(char *buf, int len) {
    if (len > BUF_SIZ)
        return -1;
    strncpy(buf, input, len);
    return 0;
}
```
After Preprocessing

```c
int FUNC_1(VAR_1 VAR_2, VAR_3 VAR_4) {
    if (VAR_4 > VAR_5)
        return VAR_6;
    FUNC_2(VAR_1, VAR_7, VAR_4);
    return VAR_6;
}
```
5. Model Benchmarking and Empirical Findings
DiverseVul forms the foundation for extensive benchmarking of deep learning architectures (Gonçalves et al., 10 Mar 2025, Chen et al., 2023). Reported results distinguish between baseline and improved preprocessing, and between state-of-the-art models:
- Performance Metrics: Evaluations report accuracy, precision, recall, and F1-score; given the heavy class imbalance, F1-score is the most informative headline metric.
- Key Observed Results:
| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| NatGen | 92 % | 52 % | 43 % | 47 % |
| LLaMA 3.2 | 66 % | 65 % | 67 % | 66 % |
- Impact of Preprocessing: Preprocessing and function normalization enhance model performance, e.g., increasing F1-score from 62% to 63% in pilot runs, and enabling LLaMA 3.2 to outperform strong prior baselines by 19 percentage points F1.
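The four reported metrics derive from confusion-matrix counts in the standard way; the sketch below illustrates why accuracy can look strong while F1 stays low under the dataset's roughly 5% positive rate.

```python
def metrics(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float, float, float]:
    """Precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy
```

For example, with 50 true positives, 50 false positives, 50 false negatives, and 850 true negatives, accuracy is 90% while precision, recall, and F1 are all only 50%.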
A comprehensive evaluation in (Chen et al., 2023) covers 11 models spanning GNNs, transformer-based LLMs, and code-specific pretraining objectives. LLMs with such objectives (e.g., identifier masking, code “naturalizing”) consistently outperform models trained solely on next-token objectives or generic masked language modeling.
6. Challenges, Comparative Advantages, and Research Directions
DiverseVul sets new standards in several respects:
- Scale and Diversity: It more than doubles previous corpus sizes for C/C++ vulnerabilities, spans 150 CWEs, and covers nearly 800 projects, with substantial representation from both popular and previously-unseen codebases.
- Semantic Breadth: Inclusion of a long tail of rare CWEs, and strict before/after-fix matching for labeling, facilitates research on generalization and rare weaknesses.
- Automatic Abstraction: SCoPE abstraction eliminates superficial identifier bias, better aligning data with the semantics required for robust vulnerability detection.
- LLM Utility: The dataset’s form and scale are particularly conducive to LLM training; binary as well as multi-label classification (for CWE) can be directly benchmarked.
- Real-World Relevance: Unlike synthetic benchmarks (e.g., Juliet, SATE), DiverseVul is drawn entirely from live Git histories of real software projects, ensuring ecological validity.
However, several open challenges persist (Chen et al., 2023):
- False Positive Rates: Even the best models exhibit FPR ≈3.5%, which is operationally problematic at scale.
- Cross-project Generalization: Models generalize poorly to unseen projects, with F1 dropping from ~49% (seen) to ~9% (unseen)—a major barrier to deployment in CI pipelines.
- Label Noise and Attribution: Vulnerabilities spanning multiple functions, irrelevant changes, and remaining label noise challenge model fidelity. More robust labeling, improved commit attribution, and aggressive deduplication are ongoing needs.
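The cross-project evaluation regime above (seen vs. unseen projects) requires holding out entire repositories rather than random functions. A minimal sketch, with an assumed record format carrying the dataset's `project` column:

```python
import random

def split_by_project(records: list[dict], test_fraction: float = 0.2, seed: int = 0):
    """Hold out whole projects so the test set shares no repository with
    training, matching the 'unseen project' evaluation regime."""
    projects = sorted({r["project"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(projects)
    n_test = max(1, int(len(projects) * test_fraction))
    test_projects = set(projects[:n_test])
    train = [r for r in records if r["project"] not in test_projects]
    test = [r for r in records if r["project"] in test_projects]
    return train, test
```

A random function-level split would leak near-duplicate code between train and test, inflating the "seen" numbers; project-level splitting is what exposes the ~49% to ~9% F1 drop.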
A plausible implication is that while continued data volume growth benefits models within a fixed distribution, achieving practical generalization will require advances in both pretraining objectives and dataset curation.
7. Impact and Future Directions
DiverseVul accelerates empirical study into mechanisms of vulnerability detection, particularly within the context of LLMs. It motivates several research directions:
- Enhancement of code-specific pretraining objectives (e.g., identifier masking, data-flow prediction) for improved semantic comprehension.
- Development of methods for robust learning under label noise and prevalence imbalance.
- Rigorous evaluation protocols that stratify by project or code domain for fair benchmarking of cross-project generalization.
- Investigation of techniques for reducing false positives to achieve triage-viable deployment in industry.
DiverseVul is thus a pivotal resource for both method development and benchmarking in machine learning for software security, setting the stage for future progress in real-world vulnerability detection and automated security analysis (Gonçalves et al., 10 Mar 2025, Chen et al., 2023).