CVE-Bench: Automated Vulnerability Benchmarking
- CVE-Bench is a family of datasets and frameworks that benchmark automated vulnerability analysis, exploit generation, patch validation, and security tooling using realistic, containerized environments.
- It leverages real-world CVEs with rigorous reproducibility, integrating automated quality assurance and detailed metrics like exploitability rate and environment fidelity.
- The framework supports diverse applications including AI agent evaluation, static analysis, and dynamic patch testing, thereby advancing reproducible, metric-driven security research.
CVE-Bench denotes a family of datasets and evaluation frameworks that leverage real-world Common Vulnerabilities and Exposures (CVEs) as the basis for benchmarking automated vulnerability analysis, exploit generation, patch validation, and security tooling. These benchmarks are characterized by high-fidelity reproduction of vulnerabilities in realistic environments, inclusion of verifiable proof-of-concept exploits, and rigorous ground-truth annotation. CVE-Bench evaluations have catalyzed the empirical assessment of AI-driven agents, fuzzers, patching tools, and vulnerability detection systems.
1. Foundational Principles and Framework Variants
CVE-Bench has been instantiated through multiple efforts, each emphasizing distinct operationalizations:
- The original CVE-Bench framework for evaluating AI agent exploit capabilities in web applications (Zhu et al., 21 Mar 2025).
- CVE-Genie and its CVE-Bench dataset, which automate end-to-end reproduction of CVEs with verifiable exploits in containerized environments (Ullah et al., 1 Sep 2025).
- CVE-Factory and LiveCVEBench, a continuously updated, programmatically generated suite for holistic vulnerability repair and patch validation across diverse programming languages (Luo et al., 3 Feb 2026).
- Benchmarks targeting CVE-to-CWE classification (Aghaei et al., 2020) or fine-grained affected-version labeling (Chen et al., 4 Sep 2025).
Common to these instances is a commitment to real-world CVE coverage, strict reproducibility, and metric-driven evaluation grounded in security relevance.
2. Benchmark Construction and Data Generation Pipelines
Constructing a CVE-Bench instance involves multi-stage automation that transitions from CVE metadata to executable, evaluable tasks:
- Data Collection and Task Definition: CVE metadata is automatically ingested from sources such as the NVD and cvelist, extracting repository URLs, patches, and references to advisories and proof-of-concepts (Ullah et al., 1 Sep 2025, Luo et al., 3 Feb 2026).
- Environment Reproduction: Each CVE is reconstructed in an isolated, often containerized, environment using Docker or Docker Compose, ensuring correctness of dependency versions, OS, and application configuration (Zhu et al., 21 Mar 2025, Luo et al., 3 Feb 2026).
- Exploit Generation: Where a reference exploit is unavailable, LLM-powered agents or deterministic scripts perform patch-diff inspection, fuzzing, or random input generation to synthesize a minimal PoC that triggers the original vulnerability (Ullah et al., 1 Sep 2025, Luo et al., 3 Feb 2026).
- Verifier/Grader Integration: For each task, a grading script or verifier monitors black-box criteria (e.g., flag exfiltration, error logs, side effects), enforcing deterministic success/failure signals without manual intervention (Zhu et al., 21 Mar 2025, Luo et al., 3 Feb 2026).
- Automated Quality Assurance: Multi-agent frameworks integrate “critic” or “checker” modules, enacting iterative correction cycles when setup, exploit, or verification steps fail quality gates (Ullah et al., 1 Sep 2025, Luo et al., 3 Feb 2026).
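The data-collection and task-definition stage above can be sketched as a small parser over NVD-style JSON records. The field layout below follows the public NVD 2.0 schema, but the helper name, tag filters, and example values are illustrative assumptions, not code from any of the cited frameworks.

```python
# Sketch of the data-collection step: extract patch and advisory references
# from an NVD-style CVE record to seed a benchmark task. Field names follow
# the NVD JSON 2.0 schema; everything else is invented for illustration.

def extract_task_seed(cve_record: dict) -> dict:
    """Turn raw CVE metadata into the seed of a benchmark task."""
    refs = cve_record.get("references", [])
    patch_refs = [r["url"] for r in refs if "Patch" in r.get("tags", [])]
    advisory_refs = [r["url"] for r in refs if "Third Party Advisory" in r.get("tags", [])]
    description = next(
        (d["value"] for d in cve_record.get("descriptions", []) if d["lang"] == "en"),
        "",
    )
    return {
        "cve_id": cve_record["id"],
        "description": description,
        "patch_refs": patch_refs,
        "advisory_refs": advisory_refs,
    }

# Hypothetical NVD-2.0-style record (truncated):
record = {
    "id": "CVE-2024-0001",
    "descriptions": [{"lang": "en", "value": "SQL injection in example app."}],
    "references": [
        {"url": "https://github.com/example/app/commit/abc123", "tags": ["Patch"]},
        {"url": "https://example.org/advisory", "tags": ["Third Party Advisory"]},
    ],
}
seed = extract_task_seed(record)
```

Downstream stages (environment reproduction, exploit synthesis, grading) would consume such a seed, cloning the referenced repository at the pre-patch revision and mining the patch diff for exploit hints.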
A representative architecture table:
| Framework | Task Source | Environment | Exploit | Grading |
|---|---|---|---|---|
| CVE-Bench | 40 web-app CVEs | Docker Compose | Manual/LLM | Task-specific |
| CVE-Genie | 841 multi-lang CVEs | Linux containers | LLM | CTF verifier |
| LiveCVEBench | 190 tasks, 14 langs | Multi-container | LLM | Pytest + scripts |
3. Evaluation Methodologies and Metrics
CVE-Bench protocols define rigorous metrics for each evaluation axis:
- Exploitability Rate: For agent-driven benchmarks, the principal metric is the proportion of benchmarked CVEs successfully exploited by the tested agent or tool. For example, state-of-the-art frameworks achieve up to 13% success in zero-day web application exploitation scenarios (Zhu et al., 21 Mar 2025).
- Solution Correctness Rate: the proportion of tasks where the reference patch passes both the vulnerability-trigger and stability tests, e.g., 95% for CVE-Factory (Luo et al., 3 Feb 2026).
- Environment Fidelity: measures the concordance between the reconstructed environment and ground-truth expert reproductions, reaching 96% in validated settings (Luo et al., 3 Feb 2026).
- Reproduction Success Rate: established in CVE-Genie as the fraction of candidate CVEs for which a verified end-to-end reproduction is produced, with practical rates of 51% on recent real-world CVEs (Ullah et al., 1 Sep 2025).
- Resource Efficiency: cost and compute-time statistics are reported, e.g., a $2.01 median LLM-API cost per CVE and an 18-minute end-to-end runtime for CVE-Genie (Ullah et al., 1 Sep 2025).
- Fine-Grained Labeling and Error Analysis: Statement-level accuracy, F1, and context extraction precision are tracked in benchmarks like SecVulEval (Ahmed et al., 26 May 2025).
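The headline metrics above all reduce to simple ratios over per-task outcomes. The sketch below makes that explicit; the `TaskOutcome` record is an assumed schema for illustration, not a published format.

```python
# Aggregate per-task outcomes into the benchmark-level metrics discussed
# above. The record structure is an illustrative assumption.
from dataclasses import dataclass

@dataclass
class TaskOutcome:
    exploited: bool      # did the exploit trigger the vulnerability?
    patch_valid: bool    # did the reference patch pass trigger + stability tests?
    env_faithful: bool   # did the rebuilt environment match the expert reproduction?

def exploitability_rate(outcomes: list[TaskOutcome]) -> float:
    return sum(o.exploited for o in outcomes) / len(outcomes)

def solution_correctness_rate(outcomes: list[TaskOutcome]) -> float:
    return sum(o.patch_valid for o in outcomes) / len(outcomes)

def environment_fidelity(outcomes: list[TaskOutcome]) -> float:
    return sum(o.env_faithful for o in outcomes) / len(outcomes)

# CVE-Genie's reported reproduction success rate has the same shape:
# 428 fully reproduced CVEs out of 841 candidates.
print(round(428 / 841, 2))  # 0.51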
4. Coverage Characteristics and Task Design
CVE-Bench instantiations are distinguished by breadth of language and CWE coverage, exploitation modality, and realism:
- Language/Project Scope: LiveCVEBench spans 14 languages and 153 repositories, introducing the broadest coverage currently published for agentic security evaluations (Luo et al., 3 Feb 2026). CVE-Genie covers 841 CVEs across 267 projects and 141 CWEs (Ullah et al., 1 Sep 2025).
- Attack Surface Realism: web-application-centric CVE-Bench tasks deploy actual application stacks, with authentic web servers and databases, exercising vulnerabilities in genuine software artifacts (Zhu et al., 21 Mar 2025).
- Exploit and Patch Endpoint: Standardized exploit goals (e.g., file exfiltration, admin login) and patch validation scripts test end-to-end mitigation, not just code-level bug presence (Zhu et al., 21 Mar 2025, Luo et al., 3 Feb 2026).
- Verification Rigor: Container-level isolation, deterministic scoping (e.g., allowed endpoints), and automated grader services preclude confounding factors and guarantee reproducibility.
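A deterministic grader of the kind these bullets describe might monitor black-box criteria such as flag exfiltration or crash signatures in captured logs. The log format, flag convention, and crash heuristic below are invented for illustration and do not reproduce any framework's actual grader.

```python
# Minimal black-box grader: success iff a planted flag appears in outbound
# traffic the target should never emit, or a crash signature is logged.
# Flag format and log conventions are illustrative assumptions.
import re

FLAG_PATTERN = re.compile(r"FLAG\{[0-9a-f]{8}\}")

def grade(outbound_log: str, error_log: str) -> bool:
    """Deterministic pass/fail: flag exfiltrated, or a crash observed."""
    if FLAG_PATTERN.search(outbound_log):
        return True
    return "Segmentation fault" in error_log

assert grade("GET /?q=FLAG{deadbeef}", "") is True
assert grade("GET /index.html", "") is False
```

Because the grader inspects only externally observable artifacts, the same pass/fail signal applies whether the exploit came from a human, a fuzzer, or an LLM agent, which is what makes the benchmarks tool-agnostic.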
5. Results, Comparative Performance, and Insights
Empirical results across CVE-Bench variants reveal both capabilities and limitations:
- Exploit Automation: Agents using LLM-driven planning and external tools such as sqlmap achieved between 13% (zero-day) and 25% (one-day) exploit success rates across 40 high-severity web CVEs (Zhu et al., 21 Mar 2025).
- Reproduction at Scale: CVE-Genie demonstrated 51% success (428/841 CVEs) in fully automating exploit generation, with robust cost and time profiles. Key ablation results indicate that absence of feedback loops or modular agents sharply degrades pipeline performance (0/15 with monolithic agent) (Ullah et al., 1 Sep 2025).
- Environment and Patch Validation: LiveCVEBench, using CVE-Factory pipelines, validated 66.2% of attempted recent vulnerabilities, with Qwen3-32B improving from 5.29% to 35.79% after trajectory fine-tuning, surpassing prior single-agent baselines (Luo et al., 3 Feb 2026).
- Annotation Quality: ThreatZoom’s CVE-Bench for CVE-to-CWE mapping achieves up to 92% fine-grained classification accuracy on NVD, demonstrating parity with expert human curators (Aghaei et al., 2020).
- Benchmark Rigor: The reference for affected version identification, composed of 1,128 vulnerabilities and 59,187 version labels, constrains state-of-the-art to sub-45% exact accuracy at the vulnerability level (Chen et al., 4 Sep 2025).
6. Limitations, Gaps, and Prospects for Extension
Despite substantial advances, current CVE-Bench frameworks exhibit several limitations:
- Domain Coverage: Tasks involving GUI, mobile, and proprietary software remain out of scope; forthcoming work targets headless browsers and specialized environments (Ullah et al., 1 Sep 2025, Luo et al., 3 Feb 2026).
- Exploit Diversity: Standardized attack goals and grader logic may omit complex multi-stage, time/channel, or chained exploits (Zhu et al., 21 Mar 2025).
- Automation Gaps: Manual effort is still required when PoCs are unavailable or when environment construction fails due to obscure dependencies; ~36% of failures in LiveCVEBench stem from static- or mock-based tests (Luo et al., 3 Feb 2026).
- Dynamic Analysis and PoC Realism: Current pipelines are integrating automated dynamic replay (e.g., Burp Suite automation) to more deeply validate exploit conditions (Luo et al., 3 Feb 2026).
- Benchmark Evolution: Expansion to additional languages, greater contextual modeling (e.g., real-time repo dependency graphs), and integration of richer static/dynamic analysis outputs are ongoing (Ahmed et al., 26 May 2025, Luo et al., 3 Feb 2026).
7. Significance and Research Impact
CVE-Bench represents a paradigm shift in vulnerability benchmarking for code and security research:
- End-to-End Evaluation: Enables empirical comparison of LLM agents, fuzzers, static analyzers, and patch validators in realistic conditions, crucial for progress on automated cyber defense and autonomous red-teaming (Zhu et al., 21 Mar 2025, Ullah et al., 1 Sep 2025, Luo et al., 3 Feb 2026).
- Standardization of Security Benchmarks: Facilitates reproducible, objective, and quantitative evaluation cycles across tools and studies, catalyzing research progress and cross-pollination.
- Continuous Updating: Automated pipelines such as CVE-Factory and CVE-Genie allow CVE-Bench datasets to track the evolving security landscape, rapidly incorporating emergent threat patterns (Luo et al., 3 Feb 2026, Ullah et al., 1 Sep 2025).
- Reduction in Expert Workload: Automated pipelines compress a process formerly constrained by human expertise and laborious setup into scalable, open-source benchmark releases (Aghaei et al., 2020, Luo et al., 3 Feb 2026).
The CVE-Bench family has established foundational infrastructure for cybersecurity research, providing a technical substrate upon which the next generation of security agents and analysis systems is developed and evaluated.