BenchVul: Balanced Benchmark for Vulnerability Detection

Updated 19 January 2026
  • BenchVul is a balanced benchmark dataset that provides manually validated, real and synthesized samples across the MITRE Top 25 CWEs for robust vulnerability evaluation.
  • It employs a multi-agent LLM-driven pipeline, RVG, to generate realistic synthetic examples that ensure comprehensive CWE coverage and mitigate label noise.
  • Evaluations on BenchVul reveal significant discrepancies between in-distribution and out-of-distribution accuracy, guiding improved performance in real-world vulnerability detection.

BenchVul is a rigorously curated and balanced test dataset focused on enabling fair and robust evaluation of automated vulnerability detection models across the MITRE Top 25 Most Dangerous Common Weakness Enumerations (CWEs). Developed in response to endemic flaws in prior vulnerability benchmarks—specifically, label noise, excessive duplication, and gross inadequacies in real-world CWE coverage—BenchVul and its associated methodologies have become central to empirical research on learning-based software vulnerability analysis (Li et al., 29 Jul 2025).

1. Motivation and Historical Context

Automated vulnerability detection has progressed rapidly, yet progress has been stunted by systematic limitations in benchmarking datasets. Previous studies have documented pervasive issues in common sources such as BigVul and Devign, including label error rates spanning 20% to 71%, redundancy, and an imbalance so severe that a majority of critical CWE types are nearly absent (e.g., a 130:1 ratio between CWE-20 and CWE-798) (Li et al., 29 Jul 2025).

These deficits produce a phenomenon in which In-Distribution (ID) accuracy, measured using conventional held-out splits from the same dataset, does not correlate with Out-of-Distribution (OOD) accuracy on real-world vulnerability examples. Reliable model evaluation therefore demands a benchmark that is both manually verified and CWE-balanced.

BenchVul directly addresses this gap, aiming to eliminate confounding test-set artifacts and focusing research attention on a representative and challenging OOD evaluation scenario.

2. Construction and Content of BenchVul

BenchVul is designed to contain exactly 50 labeled vulnerable samples for each of the MITRE Top 25 CWEs (with minor exceptions for CWEs lacking sufficient real-world instances), using a mix of real and synthesized examples to achieve completeness. Where feasible, each sample is drawn from labeled, real-world software projects; where real samples are insufficient, a realistic synthesis pipeline (see Section 4) fills the deficit to achieve uniformity across categories (Li et al., 29 Jul 2025).

All samples are:

  • Manually labeled and balanced to prevent spurious correlation exploitation.
  • Self-contained at the function level, ensuring evaluation remains decoupled from build systems or external project dependencies.
  • Rigorously validated by direct inspection or multi-agent review (details below).

BenchVul is partitioned into two subsets:

  • BenchVul Real: 190 real, manually validated samples covering 21 CWEs.
  • BenchVul Synth: 860 synthesized samples, filling in underrepresented CWE categories.
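The real-first, synthesize-to-fill balancing strategy described above can be sketched as follows. This is a minimal illustration, not the authors' code: `real` and `synth` are hypothetical dicts mapping a CWE id to its available samples, and `build_balanced_testbed` is a name of our choosing; only the 50-per-CWE target comes from the paper.

```python
# Sketch of BenchVul-style per-CWE balancing: take up to `target` real
# samples per CWE, then fill any remaining deficit with synthesized ones.
from collections import defaultdict

TARGET_PER_CWE = 50  # per-category target from the paper

def build_balanced_testbed(real, synth, target=TARGET_PER_CWE):
    """real/synth: dict mapping CWE id -> list of candidate samples."""
    testbed = defaultdict(list)
    for cwe in set(real) | set(synth):
        picked = list(real.get(cwe, []))[:target]     # prefer real samples
        deficit = target - len(picked)
        if deficit > 0:                               # fill with synthetic
            picked += list(synth.get(cwe, []))[:deficit]
        testbed[cwe] = picked
    return dict(testbed)
```

With an abundant CWE (e.g., CWE-20) the real pool is truncated to 50; with a scarce one (e.g., CWE-798) the single real sample is kept and 49 synthetic ones are appended.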

3. Evaluation Protocol and Metrics

Models are evaluated on BenchVul using balanced accuracy, reporting both per-CWE and overall performance, and specifically contrasting in-distribution (ID) accuracy on held-out splits of the training dataset with OOD accuracy on BenchVul. The methodology is designed to quantify the degree of OOD generalization required for real-world deployment.

A key empirical finding is that ID accuracy is often a misleading proxy for true model utility: e.g., a model trained on BigVul achieves an ID accuracy of 0.703 but only 0.493 OOD on BenchVul; in contrast, models trained on high-quality, balanced datasets such as TitanVul attain 0.881 on BenchVul Real and 0.785 on BenchVul Synth, with further gains following augmentation with synthesized data.

Training Data    BenchVul Real   BenchVul Synth
TitanVul         0.881           0.785
TitanVul + RVG   0.932           0.888
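For concreteness, balanced accuracy is the mean of per-class recall over the vulnerable and non-vulnerable classes, which prevents a class-skewed test set from inflating scores. The sketch below (function names are illustrative, not from the paper) computes it overall and per CWE:

```python
# Balanced accuracy = mean of per-class recall; robust to class imbalance.
def balanced_accuracy(y_true, y_pred):
    classes = set(y_true)
    recalls = []
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)

def per_cwe_balanced_accuracy(samples):
    """samples: list of (cwe, true_label, predicted_label) tuples."""
    by_cwe = {}
    for cwe, yt, yp in samples:
        by_cwe.setdefault(cwe, ([], []))
        by_cwe[cwe][0].append(yt)
        by_cwe[cwe][1].append(yp)
    return {cwe: balanced_accuracy(yt, yp) for cwe, (yt, yp) in by_cwe.items()}
```

Reporting per-CWE scores in this way is what surfaces the long-tail weaknesses discussed in Section 5.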

BenchVul thus provides a discriminative context for measuring genuine CWE generalization rather than dataset-specific overfitting (Li et al., 29 Jul 2025).

4. RVG: Realistic Vulnerability Generation Pipeline

To construct a fully CWE-balanced testbed, BenchVul employs a multi-agent, LLM-orchestrated pipeline termed Realistic Vulnerability Generation (RVG). RVG synthesizes context-aware samples of underrepresented CWEs through an emulated software engineering workflow:

  1. Threat Modeler: Given a CWE, produces a realistic program context and explicit attack vector, ensuring contextual coverage and scenario diversity (FIFO-buffered deduplication).
  2. Vulnerable Implementer: Generates a commented code snippet exhibiting the CWE, with functional comments that do not preempt the vulnerability’s presence.
  3. Security Auditor: Produces a remediated (fixed) variant.
  4. Security Reviewer: Verifies CWE manifestation and correct remediation, filtering out invalid or implausible examples.
  5. Cross-model Validation: A secondary LLM (GPT-4o) independently checks vulnerable/patched pairs for consonance with the labeled CWE.

Sampling is controlled (temperature = 0.3, max_tokens ≈ 512), and up to 100 synthesized samples per CWE are generated to allow manual/algorithmic filtering down to a strict 50 per category (Li et al., 29 Jul 2025).
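The control flow of the agent loop above can be sketched as follows. This is a hedged illustration, not the authors' implementation: `generate` and `review` are hypothetical callables standing in for the LLM agents and the reviewer/cross-model check, and the dedup window size is our assumption; only the sampling settings, the 100-candidate oversampling, and the 50-per-CWE cap come from the paper.

```python
# Sketch of the RVG loop: threat modeling with FIFO-buffered deduplication,
# vulnerable/fixed pair generation, review-based filtering, and oversampling
# up to 100 candidates per CWE before capping at 50 accepted samples.
from collections import deque

SAMPLING = {"temperature": 0.3, "max_tokens": 512}  # settings from the paper

def run_rvg(cwe, generate, review, max_candidates=100, keep=50, dedup_window=20):
    recent_contexts = deque(maxlen=dedup_window)  # FIFO dedup buffer
    accepted = []
    for _ in range(max_candidates):
        context = generate("threat_modeler", cwe, **SAMPLING)
        if context in recent_contexts:            # drop duplicate scenarios
            continue
        recent_contexts.append(context)
        vuln = generate("vulnerable_implementer", context, **SAMPLING)
        fixed = generate("security_auditor", vuln, **SAMPLING)
        if review(cwe, vuln, fixed):              # reviewer + cross-model check
            accepted.append({"cwe": cwe, "vulnerable": vuln, "fixed": fixed})
        if len(accepted) >= keep:
            break
    return accepted
```

In the real pipeline each `generate` call is a distinct specialized agent and `review` combines the Security Reviewer with independent GPT-4o validation; here they are pluggable stubs to show the orchestration only.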

RVG is tailored for high realism and consistently validated by both agent consensus and independent LLM review. The pipeline produces substantial gains when its output is used for downstream model training or evaluation.

5. Role in Advancing Benchmarking Practices

BenchVul has become the reference standard for OOD benchmarking in research on vulnerability detection and localization. Its construction—distinctly balanced, contextually rich, and manually validated—enables the following:

  • Superior Generalization Testing: A model’s performance on BenchVul correlates with true effectiveness in security engineering, while ID metrics do not.
  • Per-CWE Diagnostics: Enables detailed analysis of CWE-type weaknesses in models, especially on long-tail and rare classes, where BenchVul often reveals large disparities versus prior benchmarks.
  • Facilitation of Data Generation Research: The need to populate underrepresented CWEs has driven the adoption and advancement of multi-agent and retrieval-augmented synthesis methods (see also RVG in (Li et al., 29 Jul 2025) and agentic/LLM frameworks in (Lbath et al., 28 Aug 2025)).

Evaluation on BenchVul is now standard for validating new vulnerability dataset construction techniques (VGX (Nong et al., 2023), RVG/AVIATOR (Lbath et al., 28 Aug 2025)) and for guiding the development of improved training corpora such as TitanVul.

6. Impact and Current Challenges

BenchVul’s introduction has shifted the community’s perspective from relying on potentially misleading ID metrics toward rigorous OOD evaluation.

Key empirical results include:

  • The highest OOD accuracy on real-world BenchVul samples (0.932) is achieved only after training data is augmented with RVG-generated samples; models trained without such augmentation plateau at lower scores.
  • The largest relative improvements are observed for rare or previously underrepresented CWEs (e.g., CWE-798, Hard-Coded Credentials: 0.587→0.863 when using RVG-augmented data).
  • BenchVul has illuminated the weaknesses of many high-profile vulnerability detectors, demonstrating that apparent SOTA performance is often dataset-dependent rather than generalizable (Li et al., 29 Jul 2025).

Persistent challenges include the high cost of manual validation, scalability to new CWE definitions, and difficulties handling real-world code interdependency and context beyond function scope.

7. Relationship to Synthesis and Injection Methodologies

BenchVul’s completeness and realism are partially enabled by new synthesis pipelines such as RVG and related agentic vulnerability injection frameworks (e.g., AVIATOR (Lbath et al., 28 Aug 2025)). These methods leverage retrieval-augmented generation, agent specialization, and hybrid LLM/static analysis validation to generate high-fidelity, contextually rich vulnerable code, contrasting with prior pattern-based or end-to-end Transformer approaches that yielded high rates of noise and limited CWE diversity (Nong et al., 2023, Lbath et al., 28 Aug 2025).

Recent results indicate that augmenting training with synthetic samples validated on BenchVul outperforms previous synthetic sources (e.g., SARD), establishing an effective new benchmark for both data quality and downstream detection task performance.

Summary Table: BenchVul in Context

Feature                    BenchVul                                  Prior Datasets
Manual Label Validation    Yes (Real and Synth)                      No/Partial
CWE Coverage (Top 25)      Full (50 per CWE, real + synth)           Imbalanced/Incomplete
OOD Evaluation Focus       Central                                   Rare
Synthesis Framework        LLM-based RVG, multi-level validation     Pattern/Random/None
Downstream Impact          Measurable performance gains, new SOTA    Limited OOD generalization

BenchVul’s integration of real and rigorously validated synthetic data, CWE balance, and OOD emphasis collectively advance the empirical foundations of automated software vulnerability research (Li et al., 29 Jul 2025).
