From CVE Entries to Verifiable Exploits: An Automated Multi-Agent Framework for Reproducing CVEs

Published 1 Sep 2025 in cs.CR | (2509.01835v1)

Abstract: High-quality datasets of real-world vulnerabilities and their corresponding verifiable exploits are crucial resources in software security research. Yet such resources remain scarce, as their creation demands intensive manual effort and deep security expertise. In this paper, we present CVE-GENIE, an automated, LLM-based multi-agent framework designed to reproduce real-world vulnerabilities, provided in Common Vulnerabilities and Exposures (CVE) format, to enable creation of high-quality vulnerability datasets. Given a CVE entry as input, CVE-GENIE gathers the relevant resources of the CVE, automatically reconstructs the vulnerable environment, and (re)produces a verifiable exploit. Our systematic evaluation highlights the efficiency and robustness of CVE-GENIE's design and successfully reproduces approximately 51% (428 of 841) CVEs published in 2024-2025, complete with their verifiable exploits, at an average cost of $2.77 per CVE. Our pipeline offers a robust method to generate reproducible CVE benchmarks, valuable for diverse applications such as fuzzer evaluation, vulnerability patching, and assessing AI's security capabilities.

Summary

The paper introduces CVE-Genie, a multi-agent framework leveraging LLMs to automate the reproduction of software vulnerabilities.
The framework decomposes tasks into four modules—Processor, Builder, Exploiter, and CTF Verifier—to systematically create verifiable exploit datasets.
The approach reproduced 428 of 841 CVEs (about 51%), which shows its potential to support vulnerability detection and security assessment.

From CVE Entries to Verifiable Exploits: An Automated Multi-Agent Framework for Reproducing CVEs

The paper introduces "CVE-Genie", a framework that uses LLMs in a multi-agent setting to automate the reproduction of software vulnerabilities and so build high-quality vulnerability datasets. This essay explains its architecture, implementation, and potential real-world applications.

CVE-Genie Overview and Architecture

CVE-Genie reproduces the vulnerabilities described in CVE entries with a multi-agent framework that combines LLMs with automated stages. It comprises four core components: Processor, Builder, Exploiter, and CTF Verifier, each handling one step of the end-to-end reproduction pipeline. Given a CVE entry, the pipeline gathers the relevant resources, reconstructs the vulnerable environment, and produces a verifiable exploit.

Figure 1: CVE-Genie Overview.

The framework relies on LLMs for their capability in software engineering (SWE) tasks. CVE-Genie's architecture follows three principles: modular task decomposition, robustness against incomplete data, and reliability through self-critique. Together these allow reproduction even from sparse CVE data.
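The four-stage flow described above can be sketched as a sequential pipeline that stops at the first failing stage. This is only an illustration of modular task decomposition: `RunState`, `run_pipeline`, `make_stub`, and the stage names are hypothetical and are not taken from the paper's implementation, where each stage is driven by LLM agents rather than stubs.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class RunState:
    """Everything accumulated while reproducing one CVE (hypothetical shape)."""
    cve_id: str
    artifacts: dict = field(default_factory=dict)
    failed_stage: Optional[str] = None


# A stage reads and updates the state, then reports success or failure.
Stage = Callable[[RunState], bool]

# Assumed stage names, mirroring the four modules named in the text.
STAGE_NAMES = ["processor", "builder", "exploiter", "ctf_verifier"]


def run_pipeline(cve_id: str, stages: list[tuple[str, Stage]]) -> RunState:
    """Run the stages in order and stop at the first one that fails."""
    state = RunState(cve_id)
    for name, stage in stages:
        if not stage(state):
            state.failed_stage = name
            break
    return state


def make_stub(name: str, ok: bool = True) -> Stage:
    """Build a placeholder stage; a real stage would call LLM agents and tools."""
    def stage(state: RunState) -> bool:
        state.artifacts[name] = "done" if ok else "failed"
        return ok
    return stage
```

Recording the failed stage by name makes it possible to see where a reproduction attempt broke down without rerunning the whole pipeline.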

Detailed Module Functions

Processor

The Processor module gathers the raw material referenced by a CVE entry, including source code and security advisories, and turns it into a structured knowledge base. This involves:

Data Processor: Collects the vulnerable versions of the software and any specific configurations, pulling source code from public repositories.
Knowledge Builder: Transforms the gathered data into a form the subsequent modules can use, retaining the CVE details needed for exploit reproduction.
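A structured knowledge base of this kind might look like the record below. This is a minimal sketch under stated assumptions: the field names, the shape of the `raw` dictionary, and the `missing_fields` helper are all hypothetical, chosen to show how a later stage could learn which details an incomplete CVE entry failed to provide.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class KnowledgeBase:
    """Structured view of one CVE entry (hypothetical field set)."""
    cve_id: str
    description: str = ""
    repo_url: str = ""
    vulnerable_version: str = ""
    advisories: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Names of fields left empty, so later stages know what to work around."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


def build_knowledge_base(raw: dict) -> KnowledgeBase:
    """Normalize a raw record; the key names in `raw` are assumptions."""
    return KnowledgeBase(
        cve_id=raw["id"],
        description=raw.get("description", "").strip(),
        repo_url=raw.get("repo", ""),
        vulnerable_version=raw.get("affected", ""),
        advisories=[a for a in raw.get("advisories", []) if a],
    )
```

For instance, a record that carries only an identifier, a description, and an affected-version string would report `repo_url` and `advisories` as missing.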

Builder

This module reconstructs the vulnerable environment using data from the Processor. It involves:

Pre-Requisite Developer Agent: Analyzes project requirements and plans environment setup.
Setup Developer and Critic Agents: Execute setup commands and verify configurations, ensuring the vulnerable environment is operational.
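The developer and critic pairing used here, and again in the Exploiter and CTF Verifier, can be sketched as a bounded feedback loop. The function name, the round limit, and the toy developer and critic are assumptions for illustration; in the framework both roles are played by LLM agents, and the command strings below are plain data that this sketch never executes.

```python
from typing import Callable, List, Optional, Tuple

Attempt = List[str]  # e.g. a list of setup commands; never executed here


def develop_with_critic(
    develop: Callable[[Optional[str]], Attempt],
    critique: Callable[[Attempt], Optional[str]],
    max_rounds: int = 3,
) -> Tuple[Optional[Attempt], int]:
    """Alternate developer and critic until the critic accepts or rounds run out.

    Returns (accepted_attempt_or_None, rounds_used).
    """
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        attempt = develop(feedback)
        feedback = critique(attempt)  # None means "accepted"
        if feedback is None:
            return attempt, round_no
    return None, max_rounds


def toy_developer(feedback: Optional[str]) -> Attempt:
    """Stand-in for an LLM developer: revises its plan once it gets feedback."""
    if feedback is None:
        return ["pip install sqlparse"]
    return ["pip install sqlparse==0.4.4"]


def toy_critic(commands: Attempt) -> Optional[str]:
    """Stand-in for an LLM critic: insists on a pinned version."""
    if not any("==" in command for command in commands):
        return "pin the vulnerable version explicitly"
    return None
```

Bounding the number of rounds keeps a disagreement between developer and critic from consuming an unbounded budget.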

Exploiter

The Exploiter module generates and tests exploits within this configured environment:

Exploit Developer Agent: Uses structured data to recreate or generate exploits.
Exploit Critic Agent: Evaluates and critiques exploit attempts to ensure fidelity and effectiveness against CVE descriptions.

CTF Verifier

Finally, the CTF Verifier ensures the produced exploits reliably reproduce vulnerabilities:

Verifier Developer and Critic Agents: Create and validate verifiers that assess exploit success, ensuring results can be independently confirmed.

Figure 2: CVE-Genie architecture and an end-to-end example of the reproduction workflow for CVE-2024-4340, a denial of service caused by a RecursionError in sqlparse < v0.5.0.
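To make the idea of a verifier concrete, the sketch below checks for the same class of failure as the example CVE: unbounded recursion on deeply nested input. It is a self-contained toy, not sqlparse's code and not the paper's verifier. `flatten`, `build_nested`, `verify`, and the `FLAG` string are hypothetical; the only behavior relied on is that Python raises `RecursionError` once the interpreter's recursion limit is exceeded.

```python
import sys
from typing import Callable, Optional, Type

FLAG = "FLAG{reproduced}"  # hypothetical token released only on success


def flatten(node) -> list:
    """Naively recursive flatten: one Python frame per nesting level."""
    if not isinstance(node, list):
        return [node]
    out = []
    for child in node:
        out.extend(flatten(child))
    return out


def build_nested(depth: int) -> list:
    """Build ['x'] wrapped in `depth` extra lists, iteratively."""
    node = ["x"]
    for _ in range(depth):
        node = [node]
    return node


def verify(exploit: Callable[[], object],
           expected_error: Type[BaseException]) -> Optional[str]:
    """Return the flag only if the exploit triggers the expected failure."""
    try:
        exploit()
    except expected_error:
        return FLAG
    return None


# Nest slightly deeper than the interpreter's recursion limit allows.
deep_input = build_nested(sys.getrecursionlimit() + 100)
```

Calling `verify(lambda: flatten(deep_input), RecursionError)` yields the flag, while a shallow input yields `None`, so the check separates a working exploit from a harmless run.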

Implementation Considerations

CVE-Genie's implementation considers the following:

Computational Requirements and Trade-offs

Resource Efficiency: Each agent works with only the context its task needs, which keeps prompts manageable and lets the LLM specialize in that task.
Error Handling: Feedback loops between developer and critic agents allow iterative improvement, which is essential for handling complex open-source environments and incomplete advisories.

Performance Metrics

CVE-Genie reproduced 428 of 841 CVEs (about 51%) across diverse programming languages and projects, at an average cost of $2.77 per CVE. Web vulnerabilities were reproduced more often than memory-safety issues in system dependencies.
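The headline numbers can be checked with a few lines of arithmetic, using the 428 of 841 figure and the $2.77 average cost stated in the abstract. The source does not say whether that average covers every attempted CVE or only the successful ones. The sketch assumes the former, so the total and the cost per successful reproduction are estimates under that assumption rather than figures reported by the paper.

```python
def summarize(reproduced: int, attempted: int, avg_cost_usd: float) -> dict:
    """Derive the success rate and rough cost figures from the reported numbers."""
    # Assumption: avg_cost_usd is the mean over all attempted CVEs.
    total_cost = attempted * avg_cost_usd
    return {
        "success_rate_pct": round(100 * reproduced / attempted, 1),
        "total_cost_usd": round(total_cost, 2),
        "cost_per_success_usd": round(total_cost / reproduced, 2),
    }


summary = summarize(reproduced=428, attempted=841, avg_cost_usd=2.77)
# summary["success_rate_pct"]     -> 50.9
# summary["total_cost_usd"]       -> 2329.57
# summary["cost_per_success_usd"] -> 5.44
```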

Scaling and Future Improvements

Scaling considerations include better UI interaction for CVEs that involve web interfaces and support for multimodal data. Further research will address overly strict critic agents and explore cost optimization for broader usability.

Application in AI and Security

CVE-Genie has potential applications in several areas:

Vulnerability Detection: Provides datasets for training and benchmark testing ML models.
Software Security Evaluation: Facilitates rigorous testing of patching efforts and secure code generation capabilities.
AI-assisted Development: Supports penetration testing and attack detection by making it feasible to recreate complex exploit chains.

Conclusion

CVE-Genie represents a significant advance in automated CVE reproduction, using LLMs in a multi-agent framework to rapidly create high-quality, reproducible vulnerability datasets. It addresses data scarcity, which aids research on automated vulnerability assessment and prediction tools. Future adaptations will explore broader context detection and multimodal capabilities to further extend its reproduction scope and reliability.
