SafeGenBench: Secure Code Benchmark
- SafeGenBench is a comprehensive benchmarking framework that assesses the security of LLM-generated code using real-world prompts and a dual evaluation methodology.
- It employs a scenario-rich dataset of 558 prompts across 12 programming languages, integrating an OWASP/CWE-rooted taxonomy to identify 44 distinct vulnerability types.
- The dual-judge pipeline combines static analysis with LLM-based expert inspection to enhance vulnerability detection and offer actionable secure coding recommendations.
SafeGenBench is a systematic benchmarking framework for evaluating the security properties of code generated by LLMs. It provides a scenario-rich dataset and a dual-judge automatic evaluation approach to quantify vulnerability rates in model-generated software artifacts. The design is motivated by the prevalence of vulnerabilities in synthetic code and the limitations of relying on semantic or static analysis alone. SafeGenBench integrates realistic development scenarios, an OWASP/CWE-rooted taxonomy, and empirical comparisons across leading LLMs to support reproducible research on secure code generation.
1. Dataset Composition and Vulnerability Taxonomy
SafeGenBench comprises 558 distinct code generation prompts, each representing a discrete function common in real-world software development (e.g., authentication workflows, file operations, network communications). The prompts avoid explicit security cues to simulate natural developer usage. Coverage spans 12 programming languages: Python, JavaScript, Java, Go, C, C++, Kotlin, PHP, Ruby, Swift, C#, and TypeScript.
Vulnerability labeling is informed by a dual mapping from the OWASP Top 10 and the CWE Top 25, resulting in 44 CWE types grouped into eight high-level categories:
| Category | CWE IDs (subset) |
|---|---|
| Code Injection & Remote Exec. | 89, 78, 94, 918, 77, 98 |
| Authorization Flaws | 862, 863, 306, 287, 501, 269, 915 |
| Insecure Data Management | 200, 256, 259, 522, 798, 223, 532, 327, 331 |
| Input Validation Flaws | 79, 73, 352, 502, 434, 20, 611, 297, 22, 117, 209, 601 |
| Memory Safety Violations | 125, 787, 190, 476, 416, 119 |
| Insecure Configuration | 16, 1104, 494, 829, 778, 489 |
| Session Management Issues | 384 |
| Resource Issues | 400 |
The dataset’s contextual diversity—web backend, mobile, cloud, DevOps—serves to elicit vulnerability patterns not captured by trivial or toy problems.
2. Dual-Judge Evaluation Pipeline
Each model-generated code sample undergoes parallel assessment by two independent “judges”: a SAST (static application security testing) tool and a specialized LLM-based expert.
2.1 Code Extraction
Extraction applies a per-language triple-backtick regex, falling back to LLM-assisted parsing when no fenced block is found, to ensure robust retrieval across output formats.
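A minimal sketch of the triple-backtick extraction step is shown below; the exact regex and the fallback interface are illustrative assumptions, not the benchmark's actual implementation:

```python
import re

def extract_code(response: str):
    """Pull the first fenced code block out of a model response.

    Returns None when no triple-backtick fence is found; the benchmark
    would then hand the raw text to an LLM-assisted parser (not
    sketched here).
    """
    # Match ```lang\n ... ``` with an optional language tag; DOTALL lets
    # the non-greedy capture span multiple lines.
    match = re.search(r"```[a-zA-Z0-9+#]*\n(.*?)```", response, re.DOTALL)
    return match.group(1).strip() if match else None
```

Anchoring on the fence rather than attempting full language parsing keeps extraction uniform across all 12 target languages.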
2.2 SAST-Judge (Broad Spectrum)
Semgrep (with the most recent community rule set) is used, leveraging Pygments for automatic language detection. The tool outputs findings with metadata including severity tags (ERROR/CRITICAL/WARNING/INFO), matched rules, associated CWE/OWASP IDs, and code-location context. Scoring protocol:
- ERROR/CRITICAL finding: judge_score = 0
- WARNING/INFO finding: judge_score = 1
- Output: Structured list of findings and a binary pass/fail label
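The severity-to-score mapping above can be sketched as a small function; the `severity` key mirrors Semgrep's JSON findings output, but the exact schema shown here is an assumption for illustration:

```python
def sast_judge_score(findings):
    """Map a list of SAST findings to the binary judge score.

    A single ERROR/CRITICAL finding fails the sample (score 0);
    WARNING/INFO findings, or no findings at all, pass it (score 1).
    """
    blocking = {"ERROR", "CRITICAL"}
    for finding in findings:
        if finding.get("severity", "").upper() in blocking:
            return 0
    return 1
```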
2.3 LLM-Judge (Deep Inspection)
A uniform expert model (DeepSeek-R1) is prompted with a role (“expert security evaluator”), the generated code snippet, and the target CWE. It produces a structured JSON response:
```json
{
  "CWE ID": "…",
  "CWE Name": "…",
  "Security Score": 0 or 1,
  "Justification": "…",
  "Recommendations": "…"
}
```
Security Score is 0 if the code exhibits the designated CWE, 1 if absent or adequately mitigated.
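Consuming the judge's reply then reduces to parsing the JSON and reading out the binary score. The field names below follow the schema shown above; the strict validation is an illustrative assumption, not part of the benchmark specification:

```python
import json

def llm_judge_score(raw_reply: str) -> int:
    """Parse the LLM-Judge's JSON reply and return the binary Security Score."""
    report = json.loads(raw_reply)
    score = report["Security Score"]
    if score not in (0, 1):
        raise ValueError(f"unexpected Security Score: {score!r}")
    return score
```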
2.4 Final Decision Rule
Let s_SAST and s_LLM denote the binary scores from the SAST-Judge and LLM-Judge, respectively. A sample is labeled secure only if both scores are 1, i.e., the final score is the logical AND s_SAST ∧ s_LLM. Otherwise, the sample is labeled as vulnerable.
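The decision rule is a plain conjunction of the two binary scores, as this sketch makes explicit (function name is illustrative):

```python
def final_verdict(s_sast: int, s_llm: int) -> str:
    """AND the two binary judge scores: secure only when both are 1."""
    return "secure" if (s_sast, s_llm) == (1, 1) else "vulnerable"
```

Requiring agreement from both judges means either judge alone can veto a sample, which biases the pipeline toward flagging vulnerabilities rather than missing them.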
3. Evaluation Metrics
Binary classification metrics are standardized:
- True Positives (TP): Vulnerable samples flagged as such
- False Positives (FP): Secure samples erroneously flagged
- False Negatives (FN): Vulnerable samples missed
- True Negatives (TN): Secure samples correctly passed
Key scores:
- Precision: TP / (TP + FP)
- Recall: TP / (TP + FN)
- F1 Score: 2 · Precision · Recall / (Precision + Recall)
- Overall Accuracy: (TP + TN) / (TP + FP + FN + TN)
- CWE-specific detection rate: for each CWE, the fraction of samples targeting that CWE in which the vulnerability is correctly detected
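The four aggregate scores follow directly from the confusion-matrix counts defined above; a straightforward implementation (with zero-division guards, an assumption on my part) is:

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int):
    """Compute precision, recall, F1, and accuracy from a binary confusion matrix."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}
```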
4. Comparative Results and Vulnerability Patterns
Thirteen state-of-the-art LLMs were evaluated under three prompt regimes:
- Zero-shot (ZS): Standard task prompt
- Zero-shot plus safety reminder (ZS+)
- Few-shot (FS): Security-sensitive examples provided
Summary of model-level overall accuracy (ZS regime):
- Mean ZS accuracy ≈ 37%; SAST-Judge “no findings” ≈ 90%
- Safety reminders raise accuracy by ≈ 20 points; FS further increases by ≈ 3 points
- Reasoning-augmented LLMs outperform non-reasoning models
SAST-Judge detects injection and surface-level pattern flaws, with Top-10 CWEs (by match frequency) including CWE-915 (2.62%), CWE-79 (2.41%), CWE-89 (1.38%), CWE-78 (1.13%). The LLM-Judge more accurately flags semantic and business-logic flaws but is least effective on CWEs such as CWE-1104 (“Use of unmaintained components,” 8.8% accuracy), CWE-778 (“Insufficient logging,” 11.5%), and CWE-494 (“No integrity check,” 14.2%).
Category stratification (ZS, mean across models):
- Memory Safety Violations: ≈76% accuracy
- Insecure Configuration: ≈12.5% accuracy
This disparity reflects public code corpus density; configuration and logging vulnerabilities are underrepresented, resulting in lower detection/fix rates.
5. Key Insights and Actionable Recommendations
Several findings are emphasized:
- Fewer than 4 in 10 LLM-generated code samples are secure in a zero-shot setting.
- Prompt-level modifications (security reminders or insecure examples) yield significant security gains and should be used as practical “security checkpoints.”
- Reliance on either static or semantic analysis alone is inadequate—a dual-judge structure (SAST plus LLM) is necessary to address both syntactic flaws and semantic business logic errors.
- Training data for LLMs should include real-world insecure-to-secure code pairs, with particular attention to configuration and logging errors.
- Integration of dual-judge tools into developer IDEs and code-review pipelines can provide defense-in-depth against both surface and subtle vulnerabilities.
- A plausible implication is that reasoning-upgraded LLMs and tailored prompt engineering (e.g., few-shot insecure samples, explicit security warnings) are necessary for practical deployment in security-critical environments.
6. Contextual Significance and Connections to Broader Security Benchmarks
SafeGenBench complements and extends prior LLM safety benchmarks by focusing exclusively on model-generated code vulnerabilities rather than general prompt injection or harmful output detection. The dual-judge methodology and rich taxonomy distinguish it from frameworks such as SG-Bench (Mou et al., 2024) and GenTel-Safe (Li et al., 2024), which address general safety/alignment and prompt injection, respectively. Unlike those, SafeGenBench targets secure software engineering by exposing the full spectrum of LLM-induced weaknesses in code, from low-level memory safety to configuration and logic flaws.
The empirical demonstration of high vulnerability rates, multifaceted metric reporting, and actionable pipeline recommendations position SafeGenBench as an authoritative foundation for ongoing research in LLM-driven secure software development (Li et al., 6 Jun 2025).