Papers
Topics
Authors
Recent
Search
2000 character limit reached

SafeGenBench: Secure Code Benchmark

Updated 29 January 2026
  • SafeGenBench is a comprehensive benchmarking framework that assesses the security of LLM-generated code using real-world prompts and a dual evaluation methodology.
  • It employs a scenario-rich dataset of 558 prompts across 12 programming languages, integrating an OWASP/CWE-rooted taxonomy to identify 44 distinct vulnerability types.
  • The dual-judge pipeline combines static analysis with LLM-based expert inspection to enhance vulnerability detection and offer actionable secure coding recommendations.

SafeGenBench is a systematic benchmarking framework for evaluating the security properties of code generated by LLMs. It provides a scenario-rich dataset and a dual-judge automatic evaluation approach to quantify vulnerability rates in model-generated software artifacts. The design is motivated by the prevalence of vulnerabilities in synthetic code and the limitations of semantic or static analysis alone. SafeGenBench integrates realistic development scenarios, an OWASP/CWE-rooted taxonomy, and empirical comparisons across top LLMs to support reproducible research on secure code generation (&&&0&&&).

1. Dataset Composition and Vulnerability Taxonomy

SafeGenBench comprises 558 distinct code generation prompts, each representing a discrete function common in real-world software development (e.g., authentication workflows, file operations, network communications). The prompts avoid explicit security cues to simulate natural developer usage. Coverage spans 12 programming languages: Python, JavaScript, Java, Go, C, C++, Kotlin, PHP, Ruby, Swift, C#, and TypeScript.

Vulnerability labeling is informed by a dual mapping from the OWASP Top-10 and CWE Top 25, resulting in 44 CWE types grouped into eight high-level classes:

Category CWE IDs (subset)
Code Injection & Remote Exec. 89, 78, 94, 918, 77, 98
Authorization Flaws 862, 863, 306, 287, 501, 269, 915
Insecure Data Management 200, 256, 259, 522, 798, 223, 532, 327, 331
Input Validation Flaws 79, 73, 352, 502, 434, 20, 611, 297, 22, 117, 209, 601
Memory Safety Violations 125, 787, 190, 476, 416, 119
Insecure Configuration 16, 1104, 494, 829, 778, 489
Session Management Issues 384
Resource Issues 400

The dataset’s contextual diversity—web backend, mobile, cloud, DevOps—serves to elicit vulnerability patterns not captured by trivial or toy problems.

2. Dual-Judge Evaluation Pipeline

Each model-generated code sample undergoes parallel assessment by two independent “judges”: a SAST (static application security testing) tool and a specialized LLM-based expert.

2.1 Code Extraction

Extraction applies a triple-backtick regex per language, defaulting to LLM-assisted parsing when the pattern is absent, to ensure robust retrieval across output formats.

2.2 SAST-Judge (Broad Spectrum)

Semgrep (with the most recent community rule set) is used, leveraging Pygments for automatic language detection. The tool outputs findings with metadata including severity tags (ERROR/CRITICAL/WARNING/INFO), matched rules, associated CWE/OWASP IDs, and code-location context. Scoring protocol:

  • ERROR/CRITICAL finding: judge_score = 0
  • WARNING/INFO finding: judge_score = 1
  • Output: Structured list of findings and a binary pass/fail label

2.3 LLM-Judge (Deep Inspection)

A uniform expert model (DeepSeek-R1) is prompted with a role (“expert security evaluator”), the generated code snippet, and the target CWE. It produces a structured JSON response:

1
2
3
4
5
6
7
{
  "CWE ID": "",
  "CWE Name": "",
  "Security Score": 0 or 1,
  "Justification": "",
  "Recommendations": ""
}

Security Score is 0 if the code exhibits the designated CWE, 1 if absent or adequately mitigated.

2.4 Final Decision Rule

Let ssast,sLLM{0,1}s_{\text{sast}}, s_{\text{LLM}} \in \{0,1\} be the scores from SAST-Judge and LLM-Judge, respectively. Code is secure only if both scores are 1: sfinal=ssastsLLM=1s_{\text{final}} = s_{\text{sast}} \land s_{\text{LLM}} = 1 Otherwise, the sample is labeled as vulnerable.

3. Evaluation Metrics

Binary classification metrics are standardized:

  • True Positives (TP): Vulnerable samples flagged as such
  • False Positives (FP): Secure samples erroneously flagged
  • False Negatives (FN): Vulnerable samples missed
  • True Negatives (TN): Secure samples correctly passed

Key scores:

  • Precision: TPTP+FP\frac{TP}{TP + FP}
  • Recall: TPTP+FN\frac{TP}{TP + FN}
  • F1 Score: 2Precision×RecallPrecision+Recall2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
  • Overall Accuracy: TP+TNN×100%\frac{TP + TN}{N} \times 100\%
  • CWE-specific detection rate: #correct flags for CWE#total prompts for CWE×100%\frac{\# \text{correct flags for CWE}}{\# \text{total prompts for CWE}} \times 100\%

4. Comparative Results and Vulnerability Patterns

Thirteen state-of-the-art LLMs were evaluated under three prompt regimes:

  • Zero-shot (ZS): Standard task prompt
  • Zero-shot plus safety reminder (ZS+)
  • Few-shot (FS): Security-sensitive examples provided

Summary of model-level overall accuracy (ZS regime):

  • Mean ZS accuracy ≈ 37%; SAST-Judge “no findings” ≈ 90%
  • Safety reminders raise accuracy by ≈ 20 points; FS further increases by ≈ 3 points
  • Reasoning-augmented LLMs outperform non-reasoning models

SAST-Judge detects injection and surface-level pattern flaws, with Top-10 CWEs (by match frequency) including CWE-915 (2.62%), CWE-79 (2.41%), CWE-89 (1.38%), CWE-78 (1.13%). The LLM-Judge more accurately flags semantic and business-logic flaws but is least effective on CWEs such as CWE-1104 (“Use of unmaintained components,” 8.8% accuracy), CWE-778 (“Insufficient logging,” 11.5%), and CWE-494 (“No integrity check,” 14.2%).

Category stratification (ZS, mean across models):

  • Memory Safety Violations: ≈76% accuracy
  • Insecure Configuration: ≈12.5% accuracy

This disparity reflects public code corpus density; configuration and logging vulnerabilities are underrepresented, resulting in lower detection/fix rates.

5. Key Insights and Actionable Recommendations

Several findings are emphasized:

  • Fewer than 4 in 10 LLM-generated code samples are secure in a zero-shot setting.
  • Prompt-level modifications (security reminders or insecure examples) yield significant security gains and should be used as practical “security checkpoints.”
  • Reliance on either static or semantic analysis alone is inadequate—a dual-judge structure (SAST plus LLM) is necessary to address both syntactic flaws and semantic business logic errors.
  • Training data for LLMs should include real-world insecure-to-secure code pairs, with particular attention to configuration and logging errors.
  • Integration of dual-judge tools into developer IDEs and code-review pipelines can provide defense-in-depth against both surface and subtle vulnerabilities.
  • A plausible implication is that reasoning-upgraded LLMs and tailored prompt engineering (e.g., few-shot insecure samples, explicit security warnings) are necessary for practical deployment in security-critical environments.

6. Contextual Significance and Connections to Broader Security Benchmarks

SafeGenBench complements and extends prior LLM safety benchmarks by focusing exclusively on model-generated code vulnerabilities rather than general prompt injection or harmful output detection. The dual-judge methodology and rich taxonomy distinguish it from frameworks such as SG-Bench (Mou et al., 2024) and GenTel-Safe (Li et al., 2024), which address general safety/alignment and prompt injection, respectively. Unlike those, SafeGenBench targets secure software engineering by exposing the full spectrum of LLM-induced weaknesses in code, from low-level memory safety to configuration and logic flaws.

The empirical demonstration of high vulnerability rates, multifaceted metric reporting, and actionable pipeline recommendations position SafeGenBench as an authoritative foundation for ongoing research in LLM-driven secure software development (Li et al., 6 Jun 2025).

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to SafeGenBench.