TitanVul Dataset
- TitanVul is a large-scale vulnerability dataset that curates 38,548 unique vulnerability-fix pairs to support automated, function-level detection.
- It aggregates data from seven sources and utilizes AST-based deduplication along with a multi-agent LLM validation pipeline to ensure high fidelity, achieving a 94% true positive rate.
- The dataset offers comprehensive coverage of the MITRE Top 25 CWEs and bridges the generalization gap between In-Distribution and Out-of-Distribution evaluations for robust model benchmarking.
TitanVul is a large-scale, high-quality vulnerability dataset designed for training machine learning models—particularly LLMs—for automated vulnerability detection at the function level. Constructed to address widespread issues such as label inaccuracy, duplication, and insufficient coverage of critical Common Weakness Enumeration (CWE) types, TitanVul aggregates and refines public vulnerability data using a rigorous, multi-stage pipeline incorporating advanced deduplication and multi-agent LLM validation. With 38,548 unique, validated vulnerability-fix function pairs and comprehensive coverage of the MITRE Top 25 Most Dangerous CWEs, TitanVul establishes a new standard in dataset fidelity, generalization, and practical utility for vulnerability research and model benchmarking (Li et al., 29 Jul 2025).
1. Motivation and Rationale
The development of TitanVul is motivated by persistent shortcomings in existing function-level vulnerability datasets:
- Label Noise: Prior analyses reveal that 20%–71% of samples in prominent repositories are not bona fide security fixes.
- Massive Duplication: Over 60% of function pairs are duplicated or self-identical, providing no genuine learning signal.
- Poor Top-25 CWE Coverage: Even the largest aggregated public sets contain extreme skews—certain critical CWEs have only a few dozen examples (up to a 130:1 imbalance, e.g., CWE-20 vs. CWE-798).
These issues result in a substantial generalization gap: models attain artificially high In-Distribution (ID) scores by exploiting dataset artifacts rather than learning fundamental vulnerability patterns, and then fail on real-world, Out-of-Distribution (OOD) samples. TitanVul is explicitly constructed to close this gap by providing validated, deduplicated, and comprehensively annotated vulnerability-fix pairs suitable for robust model training and evaluation (Li et al., 29 Jul 2025).
2. Data Aggregation Pipeline
TitanVul integrates seven publicly available function-level vulnerability datasets:
- BigVul
- CleanVul
- CVEfixes
- DiverseVul
- PrimeVul
- SafeCoder
- VulnPatchPairs
The aggregation process standardizes all sources to a unified schema—(vulnerable function, fixed function, commit message, CWE ID)—and incorporates CWE annotations updated via the National Vulnerability Database (NVD) as of December 5, 2024. The multi-stage pipeline includes:
- Intra-dataset Deduplication: Removal of exact and self-identical function pairs using AST-based matching.
- Merging of Cleaned Datasets: Construction of a superset after individual cleansing.
- Inter-dataset Deduplication: Elimination of overlap across sources.
- Multi-agent LLM Validation: Filtering out low-quality or non-security commits via a structured agent framework (see Section 3).
The pipeline reduces the original 304,726 pairs from all sources to 38,548 unique, validated examples (Li et al., 29 Jul 2025).
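The unified schema described above can be pictured as a small record type plus a per-source field mapping. The following is a minimal sketch, not the authors' code: the class name `VulnFixPair`, the `source` provenance field, the helper `to_unified`, and the column names in `EXAMPLE_MAP` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VulnFixPair:
    """One record in the unified schema (hypothetical type name)."""
    vulnerable_function: str
    fixed_function: str
    commit_message: str
    cwe_id: Optional[str]
    source: str  # assumed provenance field, not part of the described schema


def to_unified(record: dict, source: str, field_map: dict) -> VulnFixPair:
    """Map one raw source record onto the unified schema.

    field_map maps each unified field name to the source's column name.
    """
    return VulnFixPair(
        vulnerable_function=record[field_map["vulnerable_function"]],
        fixed_function=record[field_map["fixed_function"]],
        commit_message=record.get(field_map["commit_message"], ""),
        cwe_id=record.get(field_map["cwe_id"]),
        source=source,
    )


# Hypothetical column names for an imaginary source dataset.
EXAMPLE_MAP = {
    "vulnerable_function": "before",
    "fixed_function": "after",
    "commit_message": "msg",
    "cwe_id": "cwe",
}

raw = {
    "before": "int f(char *s) { char b[8]; strcpy(b, s); return 0; }",
    "after": "int f(char *s) { char b[8]; strncpy(b, s, 7); b[7] = 0; return 0; }",
    "msg": "Fix buffer overflow",
    "cwe": "CWE-787",
}
pair = to_unified(raw, "example_source", EXAMPLE_MAP)
```

Keeping one mapping per source means the later deduplication and validation stages only ever see a single record shape.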
3. Deduplication and Multi-Agent LLM Validation
TitanVul relies on two main techniques for deduplication and annotation integrity:
AST-Based Deduplication
- Pairs are normalized to Abstract Syntax Trees (ASTs).
- Pairs where vulnerable and fixed ASTs are identical are removed (self-identical).
- Among exact duplicates, the instance with richer metadata is preserved.
This procedure eliminates 22,807 complete duplicates (7.48% of the 304,726 input pairs) and 181,183 self-identical pairs (64.28%), narrowing the set to 100,736 pairs for further validation.
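The three rules above can be sketched in a few lines. This is an illustrative stand-in rather than the TitanVul implementation: it uses Python's built-in `ast` module on Python snippets, whereas the actual parser and normalization used for the dataset are not specified here. The dictionary keys (`vulnerable`, `fixed`) and the "count of non-empty metadata fields" richness measure are assumptions.

```python
import ast


def ast_key(code: str) -> str:
    """Canonical string for a snippet's AST; formatting and comments do not affect it."""
    return ast.dump(ast.parse(code))


def deduplicate(pairs):
    """Drop self-identical pairs and collapse exact duplicates.

    pairs: list of dicts with 'vulnerable' and 'fixed' code plus optional metadata.
    """
    kept = {}
    for pair in pairs:
        vuln_key = ast_key(pair["vulnerable"])
        fixed_key = ast_key(pair["fixed"])
        if vuln_key == fixed_key:
            continue  # self-identical: the "fix" does not change the AST
        # Assumed richness measure: number of non-empty metadata fields.
        richness = sum(
            1 for k, v in pair.items() if k not in ("vulnerable", "fixed") and v
        )
        key = (vuln_key, fixed_key)
        if key not in kept or richness > kept[key][0]:
            kept[key] = (richness, pair)
    return [pair for _, pair in kept.values()]


FIXED = "def f(a, i):\n    if i < len(a):\n        return a[i]\n"
pairs = [
    {"vulnerable": "def f(a, i):\n    return a[i]\n", "fixed": FIXED, "cwe": None},
    {"vulnerable": "def f(a, i):\n    return a[i]  # read\n", "fixed": FIXED, "cwe": "CWE-125"},
    {"vulnerable": "def g():\n    return 1\n", "fixed": "def g():\n\n    return 1  # same\n", "cwe": "CWE-20"},
]
result = deduplicate(pairs)
```

In this toy input the first two pairs differ only by a comment, so they collapse into one entry and the copy carrying a CWE label is kept; the third pair is dropped because its two sides parse to the same tree.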
Multi-Agent LLM Pipeline
A three-agent system—Auditor, Critic, Consensus—conducts large-scale, code-diff validation. For each function pair:
- Auditor: Reviews code diff, message, CWE context, and produces a structured assessment.
- Critic: Examines Auditor's evidence, identifies insufficiencies or errors.
- Consensus: Synthesizes the prior analyses and assigns a possibility score. Only pairs whose score meets the retention threshold are kept.
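The control flow of the three agents can be sketched as a simple chain of calls. This is a hedged illustration only: the real agents are LLM prompts, while the stub functions below are toy heuristics. The `Sample` type, the function names, the `>=` comparison, and the `threshold` value are hypothetical, since the actual retention threshold is not given in this text.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    diff: str
    commit_message: str
    cwe_id: str


def validate(sample, auditor, critic, consensus, threshold):
    """Run Auditor -> Critic -> Consensus and decide whether to keep the pair."""
    assessment = auditor(sample)
    critique = critic(sample, assessment)
    score = consensus(sample, assessment, critique)
    return score >= threshold, score  # comparison direction is an assumption


# Toy stand-ins for the LLM agents (not the real prompts or logic).
def stub_auditor(sample):
    relevant = "overflow" in sample.commit_message.lower()
    return {"security_relevant": relevant, "evidence": sample.diff}


def stub_critic(sample, assessment):
    objections = [] if "+" in sample.diff else ["diff adds no lines"]
    return {"objections": objections}


def stub_consensus(sample, assessment, critique):
    base = 1.0 if assessment["security_relevant"] else 0.0
    return base - 0.5 * len(critique["objections"])


fix = Sample("+ if (len > MAX) return -1;", "Fix buffer overflow", "CWE-787")
refactor = Sample("- old_log(msg);", "Refactor logging", "CWE-20")
keep_fix, score_fix = validate(fix, stub_auditor, stub_critic, stub_consensus, threshold=0.5)
keep_ref, score_ref = validate(refactor, stub_auditor, stub_critic, stub_consensus, threshold=0.5)
```

Passing the agents in as callables keeps the orchestration independent of any particular model or prompt, so each role can be swapped or tested in isolation.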
A manual audit of 400 random, post-filtered samples confirms a 94% true positive rate. A semantic independence check, using UniXcoder cosine similarity between TitanVul and the BenchVul benchmark, yields scores that all fall in [0.33, 0.38], indicating low semantic overlap and therefore no evidence of data leakage (Li et al., 29 Jul 2025).
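The similarity check can be illustrated with plain vectors. The sketch below assumes the embeddings have already been computed (for example by UniXcoder, which is not called here) and that the reported score is a mean over all cross-set pairs; how the scores were actually aggregated is not stated in this text, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_cross_similarity(train_vecs, bench_vecs):
    """Average similarity over every (training, benchmark) embedding pair."""
    scores = [cosine_similarity(t, b) for t in train_vecs for b in bench_vecs]
    return sum(scores) / len(scores)


# Toy 2-D "embeddings"; real code embeddings have hundreds of dimensions.
train = [[1.0, 0.0]]
bench = [[1.0, 0.0], [0.0, 1.0]]
overlap = mean_cross_similarity(train, bench)
```

A score near 1 would point to near-duplicate content across the two sets, while lower values indicate that the training data and the benchmark are semantically distinct.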
4. Dataset Composition and Quantitative Properties
TitanVul comprises 38,548 vulnerability-fix pairs, each annotated with:
- Function signature (e.g., `int foo(char *input)`)
- Vulnerable code snippet
- Fixed code snippet
- CWE label(s) (single or multi-label)
- Commit metadata (optional)
Top 25 CWE Distribution
| CWE | Count | % of Total |
|---|---|---|
| 787 | 1,846 | 4.79% |
| 20 | 1,734 | 4.50% |
| 119 | 1,520 | 3.94% |
| 125 | 1,432 | 3.72% |
| 79 | 968 | 2.51% |
| ... | ... | ... |
| 798 | 16 | 0.04% |
| 306 | 5 | 0.01% |
The full distribution covers all MITRE Top 25 CWEs. The proportions reflect significant natural class imbalance, with no further rebalancing performed within TitanVul; rare classes are later augmented through synthetic means as needed.
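The share column and the degree of imbalance follow directly from the counts. The short sketch below recomputes them for a few of the rows listed in the table; the variable names are illustrative, and the imbalance ratio only compares the rows shown here, not the elided ones.

```python
TOTAL_PAIRS = 38_548

# Counts copied from selected rows of the table above.
counts = {"CWE-787": 1846, "CWE-20": 1734, "CWE-798": 16, "CWE-306": 5}

# Percentage of the full dataset for each listed CWE, rounded to two decimals.
shares = {cwe: round(100 * n / TOTAL_PAIRS, 2) for cwe, n in counts.items()}

# Ratio between the most and least frequent CWE among the listed rows.
imbalance = max(counts.values()) / min(counts.values())
```

Among these rows, the most common class outnumbers the rarest by a factor of several hundred, which is why rare classes need augmentation before they can support training on their own.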
5. Comparative Analysis with Prior Datasets
Table: Dataset Validity and Deduplication Rates
| Dataset | Validity (%) | Duplication/Noise Characteristics | Top 25 CWE Coverage |
|---|---|---|---|
| BigVul | 25 | 94.4% self-identical duplicates | Omits 5–10 types |
| VulnPatchPairs | 36 | High duplication | Incomplete |
| CVEfixes | 51.7 | Substantial duplicates | Incomplete |
| DiverseVul | 60 | Incomplete deduplication | Incomplete |
| TitanVul | 94 | 87.4% reduction; rigorous deduplication | Complete (Top 25) |
TitanVul thus surpasses previous datasets in validity, duplication reduction, and comprehensive coverage across memory, injection, and logic flaws. Its deduplication pipeline produces a corpus with significantly higher semantic integrity, suitable for robust downstream training (Li et al., 29 Jul 2025).
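The reduction figure for TitanVul can be cross-checked from the pipeline counts reported in the earlier sections. The snippet below is only an arithmetic check using those published numbers; the variable names are illustrative.

```python
initial_pairs = 304_726     # pairs aggregated from all seven sources
exact_duplicates = 22_807   # removed as complete duplicates
self_identical = 181_183    # removed because vulnerable and fixed ASTs match
final_pairs = 38_548        # pairs remaining after all stages

after_exact = initial_pairs - exact_duplicates
after_self_identical = after_exact - self_identical

# Pairs removed by the remaining stages (cross-source deduplication and LLM validation).
removed_later = after_self_identical - final_pairs

overall_reduction_pct = round(100 * (initial_pairs - final_pairs) / initial_pairs, 2)
```

The intermediate remainder matches the 100,736 pairs passed on to validation, and the overall reduction comes out at about 87.35%, in line with the roughly 87.4% listed in the table.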
6. Model Evaluation Leveraging TitanVul
Empirical evaluation on both ID and OOD domains demonstrates TitanVul's efficacy. For the strongest model tested (Qwen2.5-Coder-1.5B):
- ID accuracy on TitanVul:
- OOD accuracy (BenchVul "Real"):
- OOD accuracy (BenchVul "Synth"):
This trend of modest ID but high OOD accuracy contrasts starkly with outcomes from datasets such as BigVul (ID 0.703 → Real 0.493), where accuracy drops sharply on unseen data. It indicates that models trained with TitanVul generalize to previously unseen, manually validated vulnerabilities with high reliability (Li et al., 29 Jul 2025).
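As a worked example of the comparison, the gap can be computed for the BigVul figures quoted above. The definition used here (ID accuracy minus OOD accuracy) is an assumption, since the exact formula is not given in this text, and the function name is hypothetical.

```python
def generalization_gap(id_accuracy: float, ood_accuracy: float) -> float:
    """ID accuracy minus OOD accuracy; positive values mean worse OOD performance."""
    return round(id_accuracy - ood_accuracy, 3)


# BigVul-trained model: ID 0.703, BenchVul "Real" 0.493 (figures from the text).
bigvul_gap = generalization_gap(0.703, 0.493)
```

Under this definition, a large positive gap signals that a model's ID score overstates how well it handles unseen vulnerabilities.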
7. Applications, Limitations, and Prospective Directions
TitanVul is designed for:
- Training function-level vulnerability detection models (notably LLM-based classifiers)
- Benchmarking generalization, especially when paired with the balanced BenchVul evaluation corpus
- Supplementing rare CWEs via the Realistic Vulnerability Generation (RVG) synthetic-augmentation pipeline
Documented limitations include:
- Function-Level Scope: Only single-function vulnerabilities are included; inter-procedural and cross-function cases remain out of scope.
- Class Imbalance: The dataset is naturally skewed; RVG supplementation may mitigate but not entirely resolve rare-CWE scarcity.
- Synthetic Bias: While RVG outputs undergo GPT-4 cross-validation, some subtle artifacts may persist.
- CWE Set Coverage: Initial focus is restricted to the Top 25 CWEs, with future expansions planned for broader coverage and emerging weakness types.
These limitations suggest that priority areas for future work include deeper inter-procedural analysis, broader weakness coverage, and more sophisticated augmentation to strengthen underrepresented vulnerability classes.
In summary, TitanVul provides a rigorously curated, large-scale function-level dataset, combining high validity (94%), exhaustive deduplication, and full Top-25 CWE coverage. Its multi-agent LLM verification strategy and robust deduplication establish TitanVul as a foundation for training and evaluating models capable of superior, real-world generalization in automated vulnerability detection (Li et al., 29 Jul 2025).