Realistic Vulnerability Generation
- RVG is an automated system that generates realistic vulnerability samples by mimicking genuine development, attack, and remediation workflows.
- It employs a multi-agent pipeline with stages like threat modeling, vulnerable code implementation, patch generation, and cross-model validation to ensure high contextual fidelity.
- Empirical results demonstrate enhanced out-of-distribution accuracy and injection success rates, significantly boosting machine learning-based vulnerability detection.
Realistic Vulnerability Generation (RVG) refers to automated systems and computational frameworks designed to synthesize vulnerability examples within software code in a manner that closely mimics genuine development, attack, and remediation workflows. The principal aim is to augment training and evaluation datasets for machine learning-based vulnerability analysis by providing high-quality, realistic, and contextually diverse code samples, especially for Common Weakness Enumeration (CWE) categories that are underrepresented or difficult to mine from real-world repositories. RVG frameworks employ multi-agent LLM orchestration, specialized transformer architectures, retrieval-augmented generation, and rigorous validation routines to ensure both accuracy and domain fidelity. These systems have been shown to significantly improve out-of-distribution generalization for automated vulnerability detectors (Li et al., 29 Jul 2025, Nong et al., 2023, Lbath et al., 28 Aug 2025).
1. Scientific Motivation and Objectives
RVG frameworks are designed to address the profound scarcity and skew in available vulnerability datasets. The MITRE Top 25 Most Dangerous CWEs, for instance, are often unevenly represented in large corpora, with high-risk categories (e.g., CWE-798 "Hard-Coded Credentials") sometimes having as few as 39 verified instances in a 100,000+ sample corpus (Li et al., 29 Jul 2025). Existing datasets commonly suffer from label inaccuracy (20–71%), duplication, and critical gaps that impair the ability of learning-based systems to generalize to unseen vulnerabilities. RVG pipelines generate self-contained, single-function vulnerability/fix pairs for weakly represented or absent categories, balancing per-CWE coverage and addressing the generalization gap by diversifying language, framework, and attack vector scenarios.
Key objectives include:
- Synthesizing context-aware vulnerability samples that parallel genuine development and code auditing processes.
- Maintaining high sample quality, defined as >90% correctness through combined automatic and human review.
- Closing the gap between in-distribution (ID) and out-of-distribution (OOD) performance, enabling reliable assessment and training of vulnerability detectors.
2. Technical Architectures and Workflow Components
RVG systems implement multi-stage, multi-agent workflows. Notable examples include the four-agent RVG pipeline from (Li et al., 29 Jul 2025) and the modular "AVIATOR" framework in (Lbath et al., 28 Aug 2025).
Typical pipeline stages:
- Threat Modeler: Receives CWE details, then designs a concrete attack scenario (specifying target language, relevant frameworks or libraries, business functionality, and attack vector).
- Vulnerable Implementer: Generates a plausible function implementing the desired feature with a subtle, contextually proper vulnerability matching the target CWE.
- Security Auditor: Produces a patch that remediates the injected vulnerability, ensuring operational equivalence and correct CWE alignment.
- Security Reviewer: Validates the paired vulnerable/fixed samples with a binary verdict, confirming semantic and technical alignment to the intended CWE.
- Cross-Model/Agentic Validation: Independent LLM agents (e.g., Claude-3.7-Sonnet, GPT-4o) cross-check outputs for semantic correctness and fidelity, while additional tools (Cppcheck, ESBMC) ensure structural integrity (Li et al., 29 Jul 2025, Lbath et al., 28 Aug 2025).
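The cross-model validation stage can be sketched as a simple unanimous-approval gate: a vulnerability/fix pair survives only if every independent reviewer accepts it. The reviewer callables below are hypothetical stand-ins for the actual LLM calls (e.g., Claude-3.7-Sonnet, GPT-4o), checking only superficial structure for illustration.

```python
# Sketch of cross-model validation: a sample is retained only if all
# independent reviewer models return an approving verdict.

def cross_validate(sample, reviewers):
    """Return True iff every independent reviewer approves the sample."""
    return all(review(sample) for review in reviewers)

# Stub reviewers standing in for independent LLM judges.
reviewer_a = lambda s: "vuln_code" in s and "patched_code" in s
reviewer_b = lambda s: s.get("cwe", "").startswith("CWE-")

sample = {"cwe": "CWE-79", "vuln_code": "...", "patched_code": "..."}
print(cross_validate(sample, [reviewer_a, reviewer_b]))  # True
```

Requiring unanimity rather than a majority is a conservative choice that trades yield for label accuracy, which matters given the 20-71% label-error rates cited above.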
Advanced agentic pipelines also incorporate:
- Retrieval-Augmented Generation (RAG): Incorporates examples from a knowledge base of aligned benign/vulnerable code to guide realistic code transformations (Lbath et al., 28 Aug 2025).
- Fine-tuned LLM transformation modules, leveraging Low-Rank Adaptation (LoRA) to enable efficient and category-specific injection capability.
3. Algorithmic Frameworks and Formal Characterizations
RVG pipelines are formalized via well-defined pseudocode and mathematical notation. The canonical RVG process (adapted from (Li et al., 29 Jul 2025)) is:
```
RVG(CWE_list, N_per_CWE):
    synthesized_set ← ∅
    for each CWE in CWE_list:
        count ← 0
        while count < N_per_CWE:
            context ← ThreatModeler.generate(CWE)
            vuln_code ← VulnerableImplementer.generate(context)
            patched_code ← SecurityAuditor.generate(vuln_code, context)
            verdict ← SecurityReviewer.review(vuln_code, patched_code, context)
            if verdict.isValid:
                synthesized_set.add((CWE, context, vuln_code, patched_code))
                count ← count + 1
    return synthesized_set
```
Formally, context vectors encode scenario parameters: $c = (\lambda, \sigma, \phi, \rho)$, where $\lambda$ = language, $\sigma$ = library/stack, $\phi$ = functionality, and $\rho$ = operational role.
Vulnerability injection is modeled as $v = \mathcal{I}(c)$, with fix generation $p = \mathcal{F}(v, c)$.
Validation criterion: samples are retained iff $\mathcal{R}(v, p, c) = \text{valid}$ (Li et al., 29 Jul 2025).
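The RVG loop above can be run end-to-end as a small Python sketch, with the four agents stubbed as plain functions; in the real pipeline each stub would be an LLM call with the prompts described in Section 2.

```python
# Runnable sketch of the canonical RVG loop; all agent bodies are
# illustrative stubs, not the prompts from the cited work.

def threat_modeler(cwe):
    return {"cwe": cwe, "language": "c", "stack": "libssh",
            "functionality": "login", "vector": "credential exposure"}

def vulnerable_implementer(ctx):
    return f"/* {ctx['cwe']} vulnerable {ctx['functionality']} stub */"

def security_auditor(vuln_code, ctx):
    return vuln_code.replace("vulnerable", "patched")

def security_reviewer(vuln_code, patched_code, ctx):
    return vuln_code != patched_code  # stub binary verdict

def rvg(cwe_list, n_per_cwe):
    synthesized = []
    for cwe in cwe_list:
        count = 0
        while count < n_per_cwe:
            ctx = threat_modeler(cwe)
            vuln = vulnerable_implementer(ctx)
            patch = security_auditor(vuln, ctx)
            if security_reviewer(vuln, patch, ctx):
                synthesized.append((cwe, ctx, vuln, patch))
                count += 1
    return synthesized

samples = rvg(["CWE-798", "CWE-79"], 2)
print(len(samples))  # 4
```

Note that the loop only advances `count` on a valid verdict, so rejected generations are retried rather than counted, matching the while-loop structure of the pseudocode.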
Frameworks such as VGX (Nong et al., 2023) utilize value-flow–based Transformer architectures for vulnerability context localization and apply ranked edit patterns—learned from real fix pairs and refined by human experts—to inject vulnerabilities in code with high precision (Table below).
| Component | Approach | Citation |
|---|---|---|
| Context Modeling | Value-flow Transformer | (Nong et al., 2023) |
| Vulnerability Synth | LLM multi-agent pipeline | (Li et al., 29 Jul 2025) |
| Validation | LLM+static tools + review | (Lbath et al., 28 Aug 2025) |
4. Orchestration, Model Tuning, and Agentic Techniques
Agentic orchestration in RVG systems involves discrete task-specialized LLM agents that interact via prompt engineering and shared context queues to maintain diversity and avoid duplicate scenarios. Techniques include:
- FIFO context tracking within Threat Modeler to maximize coverage of languages, libraries, and attack vectors.
- Cross-model validation employing independent LLMs on each generated vulnerability/fix pair to reduce semantic drift.
- Human-in-the-loop verification to benchmark synthetic samples against real-world ones on technical and contextual fidelity.
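The FIFO context tracking mentioned above can be illustrated with a bounded queue: the Threat Modeler rejects a freshly drawn scenario if an identical (language, stack, vector) tuple is still inside the recency window. The window size and tuple shape here are illustrative assumptions, not parameters from the cited work.

```python
from collections import deque

# Sketch of FIFO context tracking to avoid duplicate scenarios.

class ContextTracker:
    def __init__(self, window=5):
        self.recent = deque(maxlen=window)  # oldest entries age out first

    def accept(self, scenario):
        if scenario in self.recent:
            return False  # duplicate within the window: force a redraw
        self.recent.append(scenario)
        return True

tracker = ContextTracker(window=2)
print(tracker.accept(("c", "libssh", "credential exposure")))  # True
print(tracker.accept(("c", "libssh", "credential exposure")))  # False
print(tracker.accept(("js", "express", "xss")))                # True
print(tracker.accept(("go", "net/http", "path traversal")))    # True
# The first scenario has aged out of the length-2 window by now:
print(tracker.accept(("c", "libssh", "credential exposure")))  # True
```

A bounded window (rather than a global seen-set) lets rare language/stack combinations recur eventually while still maximizing short-term diversity.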
For LLM-based injection, parameter-efficient tuning is achieved with LoRA (Lbath et al., 28 Aug 2025): $W' = W + BA$, where LoRA learns low-rank modifiers $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ with $r \ll \min(d, k)$, preserving the base weights $W$ for efficient supervised fine-tuning. For continuous improvement, reinforcement learning using Group Relative Policy Optimization (GRPO) with a semantic reward (e.g., CodeBLEU) refines model outputs.
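The LoRA update can be checked numerically: only the two small factors are trained, and the resulting weight delta is rank-limited. The shapes and the common $\alpha/r$ scaling below are illustrative conventions, not values from the cited work.

```python
import numpy as np

# Numeric sketch of a LoRA update: W' = W + (alpha / r) * (B @ A),
# where only B (d x r) and A (r x k) are trained and W stays frozen.

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 64, 4, 8

W = rng.normal(size=(d, k))   # frozen base weight
B = np.zeros((d, r))          # zero-init: the adapter starts as a no-op
A = rng.normal(size=(r, k))

W_adapted = W + (alpha / r) * (B @ A)
print(np.allclose(W_adapted, W))  # True: no change before training

# After training, B is nonzero but the update stays rank-limited.
B = rng.normal(size=(d, r))
delta = (alpha / r) * (B @ A)
print(np.linalg.matrix_rank(delta) <= r)  # True
```

The rank constraint is what keeps the trainable parameter count at $r(d + k)$ instead of $dk$, making per-CWE-category adapters cheap to train and swap.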
Retrieval-augmented generation (RAG) enhances prompt realism by including nearest-neighbor vulnerable/benign code pairs from a knowledge base, indexed via dense vector embeddings, with token-level diff alignment for granular edit localization (Lbath et al., 28 Aug 2025).
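The dense-vector retrieval step can be sketched as nearest-neighbor search by cosine similarity over the knowledge-base embeddings; the random embeddings below are stand-ins for real code-embedding vectors, and the index layout is an assumption for illustration.

```python
import numpy as np

# Sketch of RAG retrieval: each stored vulnerable/benign code pair has
# a dense embedding, and the query's nearest neighbors are returned.

def cosine_top_k(query, index, k=2):
    sims = index @ query / (np.linalg.norm(index, axis=1)
                            * np.linalg.norm(query))
    return np.argsort(-sims)[:k]  # indices of the k most similar rows

rng = np.random.default_rng(1)
kb_embeddings = rng.normal(size=(100, 32))  # 100 stored code pairs
# A query embedding that is a slightly perturbed copy of entry 17:
query = kb_embeddings[17] + 0.01 * rng.normal(size=32)

neighbors = cosine_top_k(query, kb_embeddings, k=2)
print(neighbors[0])  # 17: the closest stored pair is retrieved
```

The retrieved pairs are then spliced into the generation prompt, and the cited token-level diff alignment narrows them further to the exact edited span.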
5. Dataset Construction, Evaluation Protocols, and Empirical Results
RVG-generated datasets are constructed by:
- Selecting secure code functions from large repositories.
- Assigning target vulnerabilities (weighted/uniform over CWEs).
- Running the injection-validation pipeline and retaining only samples verified both by LLMs and static/formal analysis tools.
- Ensuring ABCD criteria: Accurate labeling, Big scale, Credible source, and Diverse coverage.
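The weighted assignment of target CWEs in the second step can be sketched by making sampling weights inversely proportional to existing per-CWE counts, so underrepresented categories such as CWE-798 dominate the synthesis budget; the counts below are illustrative, not figures from the cited corpora.

```python
import random
from collections import Counter

# Sketch of target-CWE assignment weighted toward underrepresented
# categories (inverse-frequency weighting; counts are illustrative).

existing_counts = {"CWE-79": 5000, "CWE-89": 3000, "CWE-798": 39}
cwes = list(existing_counts)
weights = [1.0 / existing_counts[c] for c in cwes]

random.seed(0)
targets = random.choices(cwes, weights=weights, k=1000)
drawn = Counter(targets)
# The rarest category receives most of the synthesis budget.
print(drawn["CWE-798"] > drawn["CWE-79"])  # True
```

Uniform sampling over CWEs is the degenerate case of equal weights, matching the "weighted/uniform" options above.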
Evaluation protocols measure:
- Balanced accuracy across real-world ("BenchVul Real") and synthetic ("BenchVul Synth") subsets (Li et al., 29 Jul 2025).
- Top-1 localization, sample-level precision, recall, F1, and exploitable success rates (Nong et al., 2023).
- Injection success rates on standard benchmarks (SARD100, FormAI) (Lbath et al., 28 Aug 2025).
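Balanced accuracy, the headline metric in the first bullet, is the mean of per-class recall, which keeps the heavy benign/vulnerable class imbalance from inflating scores; a minimal implementation on toy labels:

```python
# Balanced accuracy = mean of per-class recall, robust to the class
# imbalance typical of vulnerability datasets.

def balanced_accuracy(y_true, y_pred):
    classes = set(y_true)
    recalls = []
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)

# 1 = vulnerable, 0 = benign; deliberately imbalanced toy labels.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]
print(balanced_accuracy(y_true, y_pred))  # 0.75
```

Plain accuracy on the same labels would be 7/8 = 0.875, masking that half the vulnerable samples were missed; balanced accuracy (0.5 recall on class 1, 1.0 on class 0) exposes it.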
Experimental highlights:
- Augmenting TitanVul with RVG yields a real-world OOD accuracy of 0.932 (+5.8 pp gain) and synthetic OOD accuracy of 0.888 (+13.1 pp gain) over baseline vulnerability datasets (Li et al., 29 Jul 2025).
- VGX demonstrates a precision of 59.46%, F1 of 32.87%, and 93.02% success on exact test sample matches, achieving 99.09–890.06% F1 gains over prior injection-based frameworks (Nong et al., 2023).
- Multi-agent RVG with SFT achieves 95% injection success on SARD100 and 91% on FormAI (Lbath et al., 28 Aug 2025).
6. Taxonomy of Vulnerability Classes and Representative Examples
RVG frameworks target category-specific CWEs using libraries of edit patterns or token-level transformations. Typical classes covered include buffer overflows (CWE-120), integer overflows (CWE-190), use-after-free (CWE-416), format string vulnerabilities (CWE-134), incorrect buffer size calculations (CWE-131), null pointer dereferences (CWE-476), SQL injection (CWE-89), path traversal (CWE-22), and command injection (CWE-78) (Lbath et al., 28 Aug 2025).
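For the SQL injection class (CWE-89), the kind of token-level transformation described above can be illustrated as a vulnerable/fixed pair: the injected variant builds the query by string concatenation, the remediation binds the input as a parameter. This is an illustrative pair in the spirit of the frameworks' output, not a sample from the cited datasets.

```python
import sqlite3

# Illustrative CWE-89 vulnerable/fixed pair.

def lookup_vulnerable(conn, username):
    # CWE-89: attacker-controlled input concatenated into the SQL text
    return conn.execute(
        "SELECT id FROM users WHERE name = '" + username + "'"
    ).fetchall()

def lookup_fixed(conn, username):
    # Parameterized query: the input is bound, never parsed as SQL
    return conn.execute(
        "SELECT id FROM users WHERE name = ?", (username,)
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice')")

payload = "x' OR '1'='1"
print(len(lookup_vulnerable(conn, payload)))  # 1: injection matches every row
print(len(lookup_fixed(conn, payload)))       # 0: payload treated as plain data
```

The diff between the two functions is a single edit site, which is exactly the granularity the token-level transformation libraries operate at.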
Representative examples from (Li et al., 29 Jul 2025):
- CWE-79 (Cross-Site Scripting) in Node.js/Express
Vulnerability:

```javascript
// Upload file and send filename back to the client
app.post('/upload', (req, res) => {
  const filename = req.file.originalname;
  res.send(`<img src='/files/${filename}' onerror="alert('XSS!')">`);
});
```

Remediation:

```javascript
const escapeHtml = require('escape-html');
app.post('/upload', (req, res) => {
  const filename = escapeHtml(req.file.originalname);
  res.send(`<img src='/files/${filename}'>`);
});
```

Reviewer confirms semantic alignment and correct fix.
- CWE-798 (Hard-Coded Credentials) in C
Vulnerability:

```c
void connect() {
    const char *user = "admin";
    const char *pwd  = "P@ssw0rd123";  // hard-coded!
    ssh_login(user, pwd);
}
```

Remediation:

```c
void connect() {
    char user[64], pwd[64];
    printf("Username: ");
    scanf("%63s", user);
    printf("Password: ");
    scanf("%63s", pwd);
    ssh_login(user, pwd);
}
```

Reviewer confirms vulnerability and fix validity.
Other frameworks, such as VGX (Nong et al., 2023), extend coverage to memory leak, input validation, and race condition patterns, employing pattern matching and mutation rules derived from mined vulnerability-fix pairs and human expertise.
7. Significance and Field Implications
RVG frameworks—by offering automated, context-rich, and high-fidelity vulnerability injection and dataset curation—enable robust evaluation and training of vulnerability analysis methods that are less prone to overfitting, more generalizable to unseen attack scenarios, and suitable for fine-grained benchmarking across varied CWE types. Empirical evidence demonstrates improvements in OOD accuracy, downstream detection, localization, and repair, while also facilitating the discovery of new CVEs in real codebases (Li et al., 29 Jul 2025, Nong et al., 2023, Lbath et al., 28 Aug 2025). A plausible implication is the elevation of RVG-powered datasets and agentic methodologies to become central components of future vulnerability research pipelines, particularly as security-critical software continues to evolve in complexity and attack surface breadth.