TitanVul Dataset

Updated 19 January 2026

TitanVul is a large-scale vulnerability dataset that curates 38,548 unique vulnerability-fix pairs to support automated, function-level detection.
It aggregates data from seven sources and utilizes AST-based deduplication along with a multi-agent LLM validation pipeline to ensure high fidelity, achieving a 94% true positive rate.
The dataset offers comprehensive coverage of the MITRE Top 25 CWEs and bridges the generalization gap between In-Distribution and Out-of-Distribution evaluations for robust model benchmarking.

TitanVul is a large-scale, high-quality vulnerability dataset designed for training machine learning models—particularly LLMs—for automated vulnerability detection at the function level. Constructed to address widespread issues such as label inaccuracy, duplication, and insufficient coverage of critical Common Weakness Enumeration (CWE) types, TitanVul aggregates and refines public vulnerability data using a rigorous, multi-stage pipeline incorporating advanced deduplication and multi-agent LLM validation. With 38,548 unique, validated vulnerability-fix function pairs and comprehensive coverage of the MITRE Top 25 Most Dangerous CWEs, TitanVul establishes a new standard in dataset fidelity, generalization, and practical utility for vulnerability research and model benchmarking (Li et al., 29 Jul 2025).

1. Motivation and Rationale

The development of TitanVul is motivated by persistent shortcomings in existing function-level vulnerability datasets:

Label Noise: Prior analyses reveal that 20%–71% of samples in prominent repositories are not bona fide security fixes.
Massive Duplication: Over 60% of function pairs are duplicated or self-identical, providing no genuine learning signal.
Poor Top-25 CWE Coverage: Even the largest aggregated public sets contain extreme skews—certain critical CWEs have only a few dozen examples (up to a 130:1 imbalance, e.g., CWE-20 vs. CWE-798).

These issues result in a substantial generalization gap: models attain artificially high In-Distribution (ID) scores, yet fail on real-world, Out-of-Distribution (OOD) samples by exploiting dataset artifacts rather than learning fundamental vulnerability patterns. TitanVul is explicitly constructed to close this gap by providing validated, deduplicated, and comprehensively annotated vulnerability-fix pairs suitable for robust model training and evaluation (Li et al., 29 Jul 2025).

2. Data Aggregation Pipeline

TitanVul integrates seven publicly available function-level vulnerability datasets:

BigVul
CleanVul
CVEfixes
DiverseVul
PrimeVul
SafeCoder
VulnPatchPairs

The aggregation process standardizes all sources to a unified schema—(vulnerable function, fixed function, commit message, CWE ID)—and incorporates CWE annotations updated via the National Vulnerability Database (NVD) as of December 5, 2024. The multi-stage pipeline includes:

Intra-dataset Deduplication: Removal of exact and self-identical function pairs using AST-based matching.
Merging of Cleaned Datasets: Construction of a superset after individual cleansing.
Inter-dataset Deduplication: Elimination of overlap across sources.
Multi-agent LLM Validation: Filtering out low-quality or non-security commits via a structured agent framework (see Section 3).

The pipeline reduces the original 304,726 pairs from all sources to 38,548 unique, validated examples (Li et al., 29 Jul 2025).
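The unified schema can be pictured as a small record type. The following is a minimal sketch, not the authors' code: the class name `VulnFixPair`, the helpers `normalize_cwe` and `to_unified`, and the source column names used in the usage note are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VulnFixPair:
    """One record in the unified schema (field names are illustrative)."""
    vulnerable_function: str
    fixed_function: str
    commit_message: str
    cwe_id: Optional[str]  # e.g. "CWE-787"; None when the source has no label


def normalize_cwe(raw) -> Optional[str]:
    """Map variants such as 787, "787" or "cwe-787" to "CWE-787"."""
    if raw is None:
        return None
    text = str(raw).strip().upper()
    if text.startswith("CWE-"):
        text = text[4:]
    return f"CWE-{int(text)}" if text.isdigit() else None


def to_unified(record: dict, field_map: dict) -> VulnFixPair:
    """Rename source-specific fields to the unified schema.

    field_map maps unified field names to the (hypothetical) column
    names used by one source dataset.
    """
    return VulnFixPair(
        vulnerable_function=record[field_map["vulnerable_function"]],
        fixed_function=record[field_map["fixed_function"]],
        commit_message=record.get(field_map["commit_message"], ""),
        cwe_id=normalize_cwe(record.get(field_map["cwe_id"])),
    )
```

For example, a source that stores its functions under assumed columns `func_before` and `func_after` would be converted by passing a `field_map` that points the unified names at those columns.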

3. Deduplication and Multi-Agent LLM Validation

TitanVul's annotation integrity and deduplication employ two main techniques:

AST-Based Deduplication

Pairs are normalized to Abstract Syntax Trees (ASTs).
Pairs where vulnerable and fixed ASTs are identical are removed (self-identical).
Among exact duplicates, the instance with richer metadata is preserved.

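This procedure eliminates 22,807 complete duplicates (7.48% of the original 304,726 pairs) and 181,183 self-identical pairs (reported as 64.28%, which corresponds approximately to their share of the pairs remaining after duplicate removal), narrowing the set to 100,736 pairs for further validation.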
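The deduplication rules above can be sketched in a few lines. This is an illustration under stated assumptions, not the TitanVul implementation: it uses Python's standard `ast` module on Python snippets as a stand-in for whatever parser handles the dataset's source languages, and it treats "richer metadata" as "more metadata fields", which the source does not specify.

```python
import ast


def ast_key(source: str) -> str:
    """Canonical string for a snippet; whitespace and comments do not affect it."""
    return ast.dump(ast.parse(source))


def deduplicate(pairs):
    """Drop self-identical pairs and collapse exact duplicates.

    pairs: list of dicts with 'vulnerable', 'fixed' (source strings)
    and 'metadata' (a dict). The dict layout is hypothetical.
    """
    best = {}
    for pair in pairs:
        vuln_key = ast_key(pair["vulnerable"])
        fixed_key = ast_key(pair["fixed"])
        if vuln_key == fixed_key:
            continue  # self-identical: the "fix" changes nothing in the AST
        key = (vuln_key, fixed_key)
        kept = best.get(key)
        # Among exact duplicates, keep the instance with more metadata fields.
        if kept is None or len(pair["metadata"]) > len(kept["metadata"]):
            best[key] = pair
    return list(best.values())
```

Comparing serialized ASTs rather than raw text means that pairs differing only in formatting or comments are treated as identical, which is the property that makes self-identical pairs detectable.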

Multi-Agent LLM Pipeline

A three-agent system—Auditor, Critic, Consensus—conducts large-scale, code-diff validation. For each function pair:

Auditor: Reviews code diff, message, CWE context, and produces a structured assessment.
Critic: Examines Auditor's evidence, identifies insufficiencies or errors.
Consensus: Synthesizes prior analyses and assigns a possibility score $s \in \{0,1,2,3\}$. Only pairs with $s \geq 2$ are retained.
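The control flow of the three roles can be sketched as follows. The agents here are keyword-based stubs standing in for LLM calls; the names `Candidate`, `validate`, and the `stub_*` functions are hypothetical, and the source does not describe prompts or model APIs, so none are shown.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    diff: str
    commit_message: str
    cwe_id: str


def validate(candidate, auditor, critic, consensus, threshold=2) -> bool:
    """Run Auditor -> Critic -> Consensus and keep the pair if s >= threshold."""
    assessment = auditor(candidate)
    critique = critic(candidate, assessment)
    score = consensus(candidate, assessment, critique)
    if score not in (0, 1, 2, 3):
        raise ValueError(f"score must be in {{0,1,2,3}}, got {score!r}")
    return score >= threshold


# Stub agents (assumptions for illustration only; real agents would be LLM calls).
def stub_auditor(c: Candidate) -> str:
    return "bounds check added" if "if (" in c.diff else "no security-relevant change found"


def stub_critic(c: Candidate, assessment: str) -> str:
    if "added" in assessment:
        return f"evidence consistent with {c.cwe_id}"
    return "evidence insufficient"


def stub_consensus(c: Candidate, assessment: str, critique: str) -> int:
    return 3 if "consistent" in critique else 0
```

Passing the agents in as plain callables keeps the retention rule ($s \geq 2$) separate from how each role is implemented, so the stubs can be swapped for model-backed functions without touching `validate`.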

A manual audit of 400 random samples drawn after filtering confirms a 94% true positive rate:

$\mathrm{Validity} = \frac{|\text{Audited pairs confirmed as genuine vulnerability fixes}|}{|\text{Audited pairs}|} \times 100\% = 94\%$

A semantic independence check, using UniXcoder cosine similarity between TitanVul and the BenchVul benchmark, yields all scores in [0.33, 0.38], indicating low semantic overlap and suggesting that no data leaks between the two sets (Li et al., 29 Jul 2025).
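The similarity measure behind the independence check is ordinary cosine similarity between embedding vectors. The sketch below uses plain Python lists in place of UniXcoder embeddings, and averaging over all cross-set pairs is an assumption, since the source does not say how the scores were aggregated.

```python
import math


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_cross_similarity(set_a, set_b) -> float:
    """Average similarity over every (a, b) pair drawn from the two sets."""
    scores = [cosine_similarity(a, b) for a in set_a for b in set_b]
    return sum(scores) / len(scores)
```

Values near 1 would signal near-duplicate content across the training and benchmark sets, while lower values indicate that the two sets are semantically distinct.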

4. Dataset Composition and Quantitative Properties

TitanVul comprises 38,548 vulnerability-fix pairs, each annotated with:

Function signature (e.g., int foo(char *input))
Vulnerable code snippet
Fixed code snippet
CWE label(s) (single or multi-label)
Commit metadata (optional)

Top 25 CWE Distribution

CWE	Count	% of Total
787	1,846	4.79%
20	1,734	4.50%
119	1,520	3.94%
125	1,432	3.72%
79	968	2.51%
...	...	...
798	16	0.04%
306	5	0.01%

The full distribution covers all MITRE Top 25 CWEs. The proportions reflect significant natural class imbalance, and no rebalancing is performed within TitanVul itself; rare classes can instead be supplemented with synthetic samples from the RVG pipeline (see Section 7).
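The percentage column can be recomputed from the counts and the dataset total. The short sketch below does so for the rows shown in the table (the elided rows are left out); the variable and function names are illustrative.

```python
TOTAL_PAIRS = 38_548

# Counts copied from the rows shown in the table above.
cwe_counts = {
    "CWE-787": 1846,
    "CWE-20": 1734,
    "CWE-119": 1520,
    "CWE-125": 1432,
    "CWE-79": 968,
    "CWE-798": 16,
    "CWE-306": 5,
}


def share(count: int, total: int = TOTAL_PAIRS) -> float:
    """Percentage of the whole dataset, rounded to two decimals."""
    return round(100 * count / total, 2)


shares = {cwe: share(n) for cwe, n in cwe_counts.items()}

# Ratio between the most and least frequent rows shown.
imbalance = max(cwe_counts.values()) / min(cwe_counts.values())
```

The recomputed shares agree with the table to within 0.01 percentage points, and the ratio between the largest and smallest rows shown is roughly 369:1, which illustrates the natural skew the text describes.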

5. Comparative Analysis with Prior Datasets

Table: Dataset Validity and Deduplication Rates

Dataset	Validity (%)	Duplication/Noise Characteristics	Top 25 CWE Coverage
BigVul	25	94.4% self-identical duplicates	Omits 5–10 types
VulnPatchPairs	36	High duplication	Incomplete
CVEfixes	51.7	Substantial duplicates	Incomplete
DiverseVul	60	Incomplete deduplication	Incomplete
TitanVul	94	87.4% reduction; rigorous deduplication	Complete (Top 25)

TitanVul thus surpasses previous datasets in validity, duplication reduction, and comprehensive coverage across memory, injection, and logic flaws. Its deduplication pipeline produces a corpus with significantly higher semantic integrity, suitable for robust downstream training (Li et al., 29 Jul 2025).
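The reduction figure in the table follows from the pair counts reported for the pipeline. A minimal arithmetic sketch, with illustrative names:

```python
ORIGINAL_PAIRS = 304_726      # pairs across all seven sources
COMPLETE_DUPLICATES = 22_807  # removed as exact duplicates
SELF_IDENTICAL = 181_183      # removed because vulnerable AST == fixed AST
FINAL_PAIRS = 38_548          # pairs retained after LLM validation


def reduction_rate(before: int, after: int) -> float:
    """Percentage of pairs removed between two pipeline stages."""
    return 100 * (before - after) / before


after_dedup = ORIGINAL_PAIRS - COMPLETE_DUPLICATES - SELF_IDENTICAL
overall_reduction = reduction_rate(ORIGINAL_PAIRS, FINAL_PAIRS)
```

Here `after_dedup` evaluates to 100,736, the number of pairs passed on to validation, and `overall_reduction` is about 87.35%, which the table reports as 87.4%.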

6. Model Evaluation Leveraging TitanVul

Empirical evaluation on both ID and OOD domains demonstrates TitanVul's efficacy. For the strongest model tested (Qwen2.5-Coder-1.5B):

ID accuracy on TitanVul: $\mathrm{Acc}_{\rm ID} = 0.590 \pm 0.003$
OOD accuracy (BenchVul "Real"): $\mathrm{Acc}_{\rm Real} = 0.881 \pm 0.026$
OOD accuracy (BenchVul "Synth"): $\mathrm{Acc}_{\rm Synth} = 0.785 \pm 0.007$

The resulting generalization gaps, defined as ID accuracy minus OOD accuracy, are:

$\Delta_{\rm Real} = \mathrm{Acc}_{\rm ID} - \mathrm{Acc}_{\rm Real} = 0.590 - 0.881 = -0.291, \quad \Delta_{\rm Synth} = \mathrm{Acc}_{\rm ID} - \mathrm{Acc}_{\rm Synth} = 0.590 - 0.785 = -0.195$

This trend (modest ID accuracy, higher OOD accuracy) contrasts starkly with outcomes from datasets such as BigVul (ID 0.703 → Real 0.493), and indicates that models trained on TitanVul generalize well to previously unseen, manually validated vulnerabilities (Li et al., 29 Jul 2025).
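The gap computation is simple enough to reproduce directly from the reported accuracies. In this sketch the function name and dictionary layout are illustrative; a negative gap means the model scores higher out of distribution than in distribution.

```python
def generalization_gap(acc_id: float, acc_ood: float) -> float:
    """ID accuracy minus OOD accuracy, rounded to three decimals."""
    return round(acc_id - acc_ood, 3)


# Mean accuracies as reported in the text.
titanvul = {"id": 0.590, "real": 0.881, "synth": 0.785}
bigvul = {"id": 0.703, "real": 0.493}

gap_titanvul_real = generalization_gap(titanvul["id"], titanvul["real"])
gap_titanvul_synth = generalization_gap(titanvul["id"], titanvul["synth"])
gap_bigvul_real = generalization_gap(bigvul["id"], bigvul["real"])
```

With the same definition, the BigVul figures give a positive gap of 0.210, the opposite sign from the TitanVul gaps, which is the contrast the text draws.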

7. Applications, Limitations, and Prospective Directions

TitanVul is designed for:

Training function-level vulnerability detection models (notably LLM-based classifiers)
Benchmarking generalization, especially when paired with the balanced BenchVul evaluation corpus
Supplementing rare CWEs via the Realistic Vulnerability Generation (RVG) synthetic-augmentation pipeline

Documented limitations include:

Function-Level Scope: Only single-function vulnerabilities are included; inter-procedural and cross-function cases remain out of scope.
Class Imbalance: The dataset is naturally skewed; RVG supplementation may mitigate but not entirely resolve rare-CWE scarcity.
Synthetic Bias: While RVG outputs undergo GPT-4 cross-validation, some subtle artifacts may persist.
CWE Set Coverage: Initial focus is restricted to the Top 25 CWEs, with future expansions planned for broader coverage and emerging weakness types.

These limitations suggest priorities for future work: deeper inter-procedural analysis, broader weakness coverage, and more sophisticated augmentation for underrepresented vulnerability classes.

In summary, TitanVul provides a rigorously curated, large-scale function-level dataset, combining high validity (94%), exhaustive deduplication, and full Top-25 CWE coverage. Its multi-agent LLM verification strategy and robust deduplication establish TitanVul as a foundation for training and evaluating models capable of superior, real-world generalization in automated vulnerability detection (Li et al., 29 Jul 2025).

References (1)

Out of Distribution, Out of Luck: How Well Can LLMs Trained on Vulnerability Datasets Detect Top 25 CWE Weaknesses? (2025)
