
TitanVul Dataset

Updated 19 January 2026
  • TitanVul is a large-scale vulnerability dataset that curates 38,548 unique vulnerability-fix pairs to support automated, function-level detection.
  • It aggregates data from seven sources and utilizes AST-based deduplication along with a multi-agent LLM validation pipeline to ensure high fidelity, achieving a 94% true positive rate.
  • The dataset offers comprehensive coverage of the MITRE Top 25 CWEs and bridges the generalization gap between In-Distribution and Out-of-Distribution evaluations for robust model benchmarking.

TitanVul is a large-scale, high-quality vulnerability dataset designed for training machine learning models—particularly LLMs—for automated vulnerability detection at the function level. Constructed to address widespread issues such as label inaccuracy, duplication, and insufficient coverage of critical Common Weakness Enumeration (CWE) types, TitanVul aggregates and refines public vulnerability data using a rigorous, multi-stage pipeline incorporating advanced deduplication and multi-agent LLM validation. With 38,548 unique, validated vulnerability-fix function pairs and comprehensive coverage of the MITRE Top 25 Most Dangerous CWEs, TitanVul establishes a new standard in dataset fidelity, generalization, and practical utility for vulnerability research and model benchmarking (Li et al., 29 Jul 2025).

1. Motivation and Rationale

The development of TitanVul is motivated by persistent shortcomings in existing function-level vulnerability datasets:

  • Label Noise: Prior analyses reveal that 20%–71% of samples in prominent repositories are not bona fide security fixes.
  • Massive Duplication: Over 60% of function pairs are duplicated or self-identical, providing no genuine learning signal.
  • Poor Top-25 CWE Coverage: Even the largest aggregated public sets contain extreme skews—certain critical CWEs have only a few dozen examples (up to a 130:1 imbalance, e.g., CWE-20 vs. CWE-798).

These issues result in a substantial generalization gap: models attain artificially high In-Distribution (ID) scores, yet fail on real-world, Out-of-Distribution (OOD) samples by exploiting dataset artifacts rather than learning fundamental vulnerability patterns. TitanVul is explicitly constructed to close this gap by providing validated, deduplicated, and comprehensively annotated vulnerability-fix pairs suitable for robust model training and evaluation (Li et al., 29 Jul 2025).

2. Data Aggregation Pipeline

TitanVul integrates seven publicly available function-level vulnerability datasets:

  • BigVul
  • CleanVul
  • CVEfixes
  • DiverseVul
  • PrimeVul
  • SafeCoder
  • VulnPatchPairs

The aggregation process standardizes all sources to a unified schema—(vulnerable function, fixed function, commit message, CWE ID)—and incorporates CWE annotations updated via the National Vulnerability Database (NVD) as of December 5, 2024. The multi-stage pipeline includes:

  1. Intra-dataset Deduplication: Removal of exact and self-identical function pairs using AST-based matching.
  2. Merging of Cleaned Datasets: Construction of a superset after individual cleansing.
  3. Inter-dataset Deduplication: Elimination of overlap across sources.
  4. Multi-agent LLM Validation: Filtering out low-quality or non-security commits via a structured agent framework (see Section 3).

The pipeline reduces the original 304,726 pairs from all sources to 38,548 unique, validated examples (Li et al., 29 Jul 2025).
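The size of this reduction can be checked with a few lines of arithmetic. The sketch below uses only counts reported in this article (the deduplication figures come from Section 3); the variable names are illustrative, and the split into "deduplication" and "later stages" is an informal grouping rather than the authors' own accounting.

```python
# Counts as reported in this article; variable names are illustrative.
RAW_PAIRS = 304_726         # pairs across all seven sources
EXACT_DUPLICATES = 22_807   # complete duplicates removed by AST matching
SELF_IDENTICAL = 181_183    # pairs whose vulnerable and fixed ASTs match
FINAL_PAIRS = 38_548        # pairs retained at the end of the pipeline

# Pairs remaining after AST-based deduplication.
after_dedup = RAW_PAIRS - EXACT_DUPLICATES - SELF_IDENTICAL

# Pairs removed by the remaining stages (cross-source overlap, LLM filtering).
removed_later = after_dedup - FINAL_PAIRS

# Fraction of the raw input that does not survive the pipeline.
overall_reduction = 1 - FINAL_PAIRS / RAW_PAIRS

print(after_dedup)
print(removed_later)
print(f"{overall_reduction:.2%}")
```

Under these figures, roughly one pair in eight from the raw sources survives to the final dataset.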

3. Deduplication and Multi-Agent LLM Validation

TitanVul relies on two main techniques to remove redundant pairs and verify labels:

AST-Based Deduplication

  • Pairs are normalized to Abstract Syntax Trees (ASTs).
  • Pairs where vulnerable and fixed ASTs are identical are removed (self-identical).
  • Among exact duplicates, the instance with richer metadata is preserved.

This procedure eliminates 22,807 complete duplicates (7.48%) and 181,183 self-identical pairs (64.28%), narrowing the set to 100,736 pairs for further validation.
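The following is a minimal sketch of the deduplication idea, not the authors' implementation. It assumes Python source and Python's standard `ast` module as a stand-in parser (the article does not say which parser TitanVul uses), and it measures "richer metadata" by the number of metadata fields, which is also an assumption. The `Pair` class and function names are hypothetical.

```python
import ast
from dataclasses import dataclass, field


@dataclass
class Pair:
    """Hypothetical container for one vulnerability-fix pair."""
    vulnerable: str
    fixed: str
    metadata: dict = field(default_factory=dict)


def ast_key(source: str) -> str:
    """Canonical AST string; whitespace and comments do not affect it."""
    return ast.dump(ast.parse(source), include_attributes=False)


def deduplicate(pairs):
    """Drop self-identical pairs and keep one copy of each exact duplicate."""
    best = {}
    for pair in pairs:
        vuln_key = ast_key(pair.vulnerable)
        fixed_key = ast_key(pair.fixed)
        if vuln_key == fixed_key:
            continue  # self-identical: the "fix" changes nothing structurally
        key = (vuln_key, fixed_key)
        kept = best.get(key)
        # Assumption: more metadata fields means richer metadata.
        if kept is None or len(pair.metadata) > len(kept.metadata):
            best[key] = pair
    return list(best.values())


pairs = [
    # Differs only by a comment, so the ASTs are identical.
    Pair("def f(x):\n    return x", "def f(x):\n    return x  # tidy"),
    Pair("def g(a):\n    return a[0]",
         "def g(a):\n    return a[0] if a else None"),
    # Same pair with different whitespace and more metadata.
    Pair("def g(a):\n    return a[0]",
         "def g(a):\n\n    return a[0] if a else None",
         {"cwe": "CWE-125"}),
]
unique = deduplicate(pairs)
```

Comparing parsed trees rather than raw text is what lets formatting-only and comment-only changes be recognized as non-fixes.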

Multi-Agent LLM Pipeline

A three-agent system—Auditor, Critic, Consensus—conducts large-scale, code-diff validation. For each function pair:

  • Auditor: Reviews code diff, message, CWE context, and produces a structured assessment.
  • Critic: Examines Auditor's evidence, identifies insufficiencies or errors.
  • Consensus: Synthesizes prior analyses and assigns a possibility score $s \in \{0,1,2,3\}$. Only pairs with $s \geq 2$ are retained.
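The control flow of the three roles above can be sketched as follows. This is an illustration under stated assumptions: each agent is modeled as a plain callable from prompt text to response text (standing in for an LLM call), the prompt wording is invented, and the Consensus agent is assumed to reply with a bare integer. None of these details are taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Sample:
    """Hypothetical input record for validation."""
    diff: str
    commit_message: str
    cwe_id: str


# Assumption: an agent maps prompt text to response text.
Agent = Callable[[str], str]

RETAIN_THRESHOLD = 2  # pairs with s >= 2 are retained


def score_sample(sample: Sample, auditor: Agent, critic: Agent,
                 consensus: Agent) -> int:
    """Run Auditor, then Critic, then Consensus; return the score s."""
    audit = auditor(
        "Assess whether this change fixes a vulnerability.\n"
        f"CWE: {sample.cwe_id}\n"
        f"Message: {sample.commit_message}\n"
        f"Diff:\n{sample.diff}"
    )
    critique = critic(f"Identify gaps or errors in this assessment:\n{audit}")
    verdict = consensus(
        f"Assessment:\n{audit}\nCritique:\n{critique}\n"
        "Reply with a single score from 0 to 3."
    )
    score = int(verdict.strip())
    if score not in (0, 1, 2, 3):
        raise ValueError(f"score out of range: {score}")
    return score


def is_retained(sample: Sample, auditor: Agent, critic: Agent,
                consensus: Agent) -> bool:
    """Apply the retention rule s >= 2."""
    return score_sample(sample, auditor, critic, consensus) >= RETAIN_THRESHOLD


# Stub agents so the sketch runs without any model behind it.
sample = Sample(diff="- strcpy(buf, s);\n+ strncpy(buf, s, n);",
                commit_message="Fix overflow", cwe_id="CWE-787")
kept = is_retained(sample,
                   auditor=lambda p: "Bounds check added.",
                   critic=lambda p: "Evidence is adequate.",
                   consensus=lambda p: "3")
```

Passing the agents in as callables keeps the retention rule testable independently of any particular model.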

A manual audit of 400 random post-filter samples confirms a 94% true positive rate:

$$\mathrm{Validity} = \frac{|\text{Pairs with } s \ge 2|}{|\text{All Pairs}|} \times 100\% = 94\%$$

A semantic independence check, using UniXcoder cosine similarity between TitanVul and the BenchVul benchmark, yields scores that all fall in [0.33, 0.38], indicating low semantic overlap and suggesting that no substantial data leakage occurs (Li et al., 29 Jul 2025).
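For readers unfamiliar with the similarity measure, the sketch below computes cosine similarity between embedding vectors in plain Python. The vectors are toy values, not UniXcoder outputs, and averaging over all cross-set pairs is an assumed aggregation; the article does not specify how the reported scores were aggregated.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_cross_similarity(set_a, set_b):
    """Average similarity over every (a, b) pair; an assumed aggregation."""
    scores = [cosine_similarity(a, b) for a in set_a for b in set_b]
    return sum(scores) / len(scores)


# Toy 2-D "embeddings" purely for illustration.
train_vectors = [[1.0, 0.0], [1.0, 1.0]]
bench_vectors = [[0.0, 1.0]]
overlap = mean_cross_similarity(train_vectors, bench_vectors)
```

Values near 1 would indicate near-duplicate content across the two sets, while lower values indicate that the sets are semantically distinct.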

4. Dataset Composition and Quantitative Properties

TitanVul comprises 38,548 vulnerability-fix pairs, each annotated with:

  • Function signature (e.g., int foo(char *input))
  • Vulnerable code snippet
  • Fixed code snippet
  • CWE label(s) (single or multi-label)
  • Commit metadata (optional)
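A record with these fields might be represented as below. The class name, field names, and validation rules are assumptions for illustration; the article does not describe the dataset's actual file format. The C snippet and its CWE label are likewise invented examples.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VulnFixPair:
    """Hypothetical in-memory form of one TitanVul-style record."""
    signature: str
    vulnerable_code: str
    fixed_code: str
    cwe_ids: tuple[str, ...]              # one or more labels
    commit_message: Optional[str] = None  # commit metadata is optional

    def __post_init__(self):
        if not self.cwe_ids:
            raise ValueError("at least one CWE label is required")
        if self.vulnerable_code == self.fixed_code:
            raise ValueError("vulnerable and fixed code must differ")


# Illustrative record; the code and label are made up for this example.
example = VulnFixPair(
    signature="int foo(char *input)",
    vulnerable_code=(
        "int foo(char *input) { char buf[16]; strcpy(buf, input); return 0; }"
    ),
    fixed_code=(
        "int foo(char *input) { char buf[16]; "
        "strncpy(buf, input, sizeof(buf) - 1); "
        "buf[sizeof(buf) - 1] = '\\0'; return 0; }"
    ),
    cwe_ids=("CWE-787",),
)
```

Storing the labels as a tuple accommodates both single-label and multi-label pairs with one field.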

Top 25 CWE Distribution

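| CWE | Count | % of Total |
|-----|-------|------------|
| 787 | 1,846 | 4.79% |
| 20  | 1,734 | 4.50% |
| 119 | 1,520 | 3.94% |
| 125 | 1,432 | 3.72% |
| 79  | 968   | 2.51% |
| ... | ...   | ...   |
| 798 | 16    | 0.04% |
| 306 | 5     | 0.01% |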

The full distribution covers all MITRE Top 25 CWEs. The proportions reflect significant natural class imbalance, and no rebalancing is performed within TitanVul itself; rare classes can instead be supplemented with synthetic samples from the Realistic Vulnerability Generation (RVG) pipeline described in Section 7.
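The degree of imbalance can be quantified directly from the counts listed above. This sketch covers only the rows shown in the table (the elided rows are not filled in), and the helper names are illustrative.

```python
TOTAL_PAIRS = 38_548

# Only the rows shown in the table; elided rows are intentionally omitted.
counts = {
    "CWE-787": 1846,
    "CWE-20": 1734,
    "CWE-119": 1520,
    "CWE-125": 1432,
    "CWE-79": 968,
    "CWE-798": 16,
    "CWE-306": 5,
}


def share(count, total=TOTAL_PAIRS):
    """Percentage of the whole dataset carried by one CWE."""
    return 100 * count / total


def imbalance_ratio(class_counts):
    """Largest class size divided by smallest class size."""
    return max(class_counts.values()) / min(class_counts.values())


top_share = share(counts["CWE-787"])
ratio = imbalance_ratio(counts)
print(f"CWE-787 share: {top_share:.2f}%")
print(f"largest-to-smallest ratio: {ratio:.1f}")
```

Among the listed rows, the most frequent class has several hundred times as many examples as the rarest, which is why augmentation of rare CWEs is treated as a separate step.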

5. Comparative Analysis with Prior Datasets

Table: Dataset Validity and Deduplication Rates

| Dataset | Validity (%) | Duplication/Noise Characteristics | Top 25 CWE Coverage |
|---------|--------------|-----------------------------------|---------------------|
| BigVul | 25 | 94.4% self-identical duplicates | Omits 5–10 types |
| VulnPatchPairs | 36 | High duplication | Incomplete |
| CVEfixes | 51.7 | Substantial duplicates | Incomplete |
| DiverseVul | 60 | Incomplete deduplication | Incomplete |
| TitanVul | 94 | 87.4% reduction; rigorous deduplication | Complete (Top 25) |

TitanVul thus surpasses previous datasets in validity, duplication reduction, and comprehensive coverage across memory, injection, and logic flaws. Its deduplication pipeline produces a corpus with significantly higher semantic integrity, suitable for robust downstream training (Li et al., 29 Jul 2025).

6. Model Evaluation Leveraging TitanVul

Empirical evaluation on both ID and OOD domains demonstrates TitanVul's efficacy. For the strongest model tested (Qwen2.5-Coder-1.5B):

  • ID accuracy on TitanVul: $\mathrm{Acc}_{\rm ID} = 0.590 \pm 0.003$
  • OOD accuracy (BenchVul "Real"): $\mathrm{Acc}_{\rm Real} = 0.881 \pm 0.026$
  • OOD accuracy (BenchVul "Synth"): $\mathrm{Acc}_{\rm Synth} = 0.785 \pm 0.007$

The generalization gap, defined here as ID accuracy minus OOD accuracy, is:

$$\Delta_{\rm Real} = 0.590 - 0.881 = -0.291, \quad \Delta_{\rm Synth} = 0.590 - 0.785 = -0.195$$

A negative gap means that OOD accuracy exceeds ID accuracy. This pattern of modest ID and high OOD scores contrasts starkly with outcomes from datasets such as BigVul (ID 0.703 → Real 0.493, a gap of +0.210), and indicates that models trained on TitanVul generalize well to previously unseen, manually validated vulnerabilities (Li et al., 29 Jul 2025).
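The gap arithmetic can be reproduced in a few lines. The accuracies are the ones quoted in this section; the function name and dictionary labels are illustrative.

```python
def generalization_gap(acc_id, acc_ood):
    """ID accuracy minus OOD accuracy; negative means OOD is higher."""
    return acc_id - acc_ood


# (ID accuracy, OOD accuracy) as quoted in this section.
results = {
    "TitanVul -> BenchVul Real": (0.590, 0.881),
    "TitanVul -> BenchVul Synth": (0.590, 0.785),
    "BigVul -> BenchVul Real": (0.703, 0.493),
}

gaps = {name: generalization_gap(acc_id, acc_ood)
        for name, (acc_id, acc_ood) in results.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap:+.3f}")
```

The sign convention matters when comparing datasets: the TitanVul rows come out negative, while the BigVul row comes out positive.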

7. Applications, Limitations, and Prospective Directions

TitanVul is designed for:

  • Training function-level vulnerability detection models (notably LLM-based classifiers)
  • Benchmarking generalization, especially when paired with the balanced BenchVul evaluation corpus
  • Supplementing rare CWEs via the Realistic Vulnerability Generation (RVG) synthetic-augmentation pipeline

Documented limitations include:

  • Function-Level Scope: Only single-function vulnerabilities are included; inter-procedural and cross-function cases remain out of scope.
  • Class Imbalance: The dataset is naturally skewed; RVG supplementation may mitigate but not entirely resolve rare-CWE scarcity.
  • Synthetic Bias: While RVG outputs undergo GPT-4 cross-validation, some subtle artifacts may persist.
  • CWE Set Coverage: Initial focus is restricted to the Top 25 CWEs, with future expansions planned for broader coverage and emerging weakness types.

These limitations point to three priorities for future work: deeper inter-procedural analysis, broader weakness coverage, and more sophisticated augmentation for underrepresented vulnerability classes.

In summary, TitanVul provides a rigorously curated, large-scale function-level dataset, combining high validity (94%), exhaustive deduplication, and full Top-25 CWE coverage. Its multi-agent LLM verification strategy and robust deduplication establish TitanVul as a foundation for training and evaluating models capable of superior, real-world generalization in automated vulnerability detection (Li et al., 29 Jul 2025).

References (1)
