CryptoAPI-Bench: A Java Cryptography Benchmark
- CryptoAPI-Bench is a synthetic benchmark suite that rigorously evaluates static analysis tools for detecting misuses in Java cryptographic APIs.
- It comprises 171–181 self-contained Java test cases (depending on version), covering basic patterns as well as advanced ones that exercise flow-, interprocedural-, field-, and path-sensitive analysis.
- Evaluation metrics such as precision, recall, and F₁-score compare tool effectiveness, highlighting persistent challenges such as path-sensitivity.
CryptoAPI-Bench is a synthetic benchmark suite designed to rigorously evaluate the effectiveness of static analysis tools in detecting misuses of the Java Cryptography Architecture (JCA) and related SSL/TLS APIs. It formalizes a wide spectrum of cryptographic API misuse patterns found in real-world Java applications, providing a controlled and reproducible framework for comparative studies of program analysis methodologies in this security-critical domain.
1. Design Objectives and Scope
CryptoAPI-Bench was created to address the lack of a standardized, comprehensive testbed for benchmarking cryptographic API misuse detectors in Java. The primary design goals are:
- Breadth of Rule Coverage: To systematically cover high-impact cryptographic misuses, the suite assembles test cases for vulnerabilities distilled from NIST guidelines, prior empirical studies, and industrial security advisories. The exact number of misuse categories varies by source: the CamBench critique counts 16 types (Schlichtig et al., 2022), while the main evaluation papers enumerate 12 (Firouzi et al., 2024) to 18 (Afrose et al., 2021), capturing issues such as hard-coded keys, weak cipher modes, insecure randomness, constant IVs, and SSL/TLS trust breakdowns.
- Analytic “Interest”: Beyond simple pattern matching, CryptoAPI-Bench constructs cases that target advanced static analysis dimensions: flow-sensitivity (tracking value assignment order), context- and interprocedural sensitivity (across method and class boundaries), field-sensitivity (distinguishing among object attributes), path-sensitivity (data flow conditioned on control flow), and hybrid scenarios combining several dimensions (Afrose et al., 2021).
- Practicability: To facilitate reproducible experiments, each test case is a small, self-contained Java method or class, compilable independently and requiring only the core Java classpath.
- Synthetic by Construction: All cases are handcrafted by security experts drawing on catalogued empirical misuse patterns (e.g., CryptoGuard studies), rather than extracted from real-world projects (Schlichtig et al., 2022).
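The contrast between simple pattern matching and flow-sensitive analysis can be sketched as follows. This is a minimal illustration in the spirit of the suite's cases, not code from the benchmark itself; the class and method names are invented for illustration.

```java
import javax.crypto.Cipher;
import java.security.GeneralSecurityException;

public class FlowSensitiveCase {
    // Basic misuse: the weak algorithm is visible directly at the call site,
    // so simple pattern matching suffices.
    static String basicMisuse() {
        try {
            return Cipher.getInstance("DES").getAlgorithm(); // weak cipher
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Flow-sensitive variant: the value reaching the API call depends on
    // assignment order. A flow-insensitive tool may still report "DES"
    // even though "AES/GCM/NoPadding" is what is actually used.
    static String flowSensitiveSecure() {
        String transformation = "DES";
        transformation = "AES/GCM/NoPadding"; // reassignment makes the usage secure
        try {
            return Cipher.getInstance(transformation).getAlgorithm();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(basicMisuse());
        System.out.println(flowSensitiveSecure());
    }
}
```

A sound flow-sensitive tool should flag only `basicMisuse` here; a flow-insensitive one risks a false positive on `flowSensitiveSecure`.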
2. Structure, Taxonomy, and Case Generation
The benchmark consists of 171–181 Java code units, depending on version, organized as follows:
| Subgroup | Total Cases | Description |
|---|---|---|
| Basic | 40–45 | Intra-method, flow-insensitive code snippets |
| Two-method | 21 | Interprocedural (two methods) |
| Three-method | 21 | Interprocedural (three methods) |
| Field-sensitive | 20 | Object attribute taint tracking |
| Combined | 21 | Interprocedural plus field flows |
| Path-sensitive | 20 | Data flow varies by control branch |
| Miscellaneous | 12 | Non-security constants that correct tools should leave unflagged |
| Multiple-class | 21 | Cross-file, cross-class data propagation |
- Basic cases present the entire security-relevant data flow within a single method and can be detected by pattern matching with minimal dataflow sensitivity.
- Advanced cases (131+) exercise one or more of: inter-method propagation, field-sensitivity, path-sensitivity, cross-class/value propagation, or combined effects (Schlichtig et al., 2022, Afrose et al., 2021).
- Each test is labeled with ground truth as vulnerable or secure, and annotated with the program-analysis dimensions it exercises (Afrose et al., 2021).
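A two-method (interprocedural) case can be sketched as below: the insecure constant originates in one method and reaches the sensitive API in another, so detection requires tracking data flow across the call. This sketch is illustrative only; the names are invented and do not come from the benchmark.

```java
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;

public class TwoMethodCase {
    // Source: hard-coded key material originates here (the misuse).
    static byte[] keyBytes() {
        return "abcdefgh12345678".getBytes(StandardCharsets.UTF_8); // 16 constant bytes
    }

    // Sink: the constant reaches the key constructor only across a method
    // boundary, so flagging it requires interprocedural data-flow analysis.
    static SecretKeySpec buildKey() {
        return new SecretKeySpec(keyBytes(), "AES");
    }

    public static void main(String[] args) {
        System.out.println(buildKey().getAlgorithm() + "/" + keyBytes().length);
    }
}
```

The three-method and multiple-class subgroups extend the same idea through longer call chains and across class files.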
Misuse patterns include, for example:
- Weak symmetric ciphers (DES, RC4), ECB mode without IV, hard-coded or predictable keys/salts/passwords, deprecated hash functions (MD5/SHA-1), insufficient key derivation parameters, insecure random number generation, constant IVs, and SSLContext misconfigurations (Firouzi et al., 2024, Schlichtig et al., 2022).
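The insecure-randomness pattern from this list can be illustrated concretely: a seeded `java.util.Random` produces the same "random" IV on every call, whereas `SecureRandom` draws from a cryptographically strong source. The class and method names below are invented for illustration.

```java
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.Random;

public class RandomnessCase {
    // Misuse: java.util.Random (especially with a fixed seed) yields
    // fully predictable IV/key material.
    static byte[] predictableIv() {
        byte[] iv = new byte[16];
        new Random(42L).nextBytes(iv); // deterministic: same bytes every call
        return iv;
    }

    // Secure counterpart: SecureRandom is the appropriate source for
    // cryptographic material.
    static byte[] freshIv() {
        byte[] iv = new byte[16];
        new SecureRandom().nextBytes(iv);
        return iv;
    }

    public static void main(String[] args) {
        // The "random" IV repeats exactly, demonstrating the predictability.
        System.out.println(Arrays.equals(predictableIv(), predictableIv()));
    }
}
```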
3. Domains, Coverage, and Representativeness
CryptoAPI-Bench is scoped exclusively to:
- Language: Java
- API Layer: JCA, javax.crypto, java.security, and associated SSL/TLS interfaces (X509TrustManager, HostnameVerifier).
- Analysis Style: Static analysis; the suite includes no code paths that would require dynamic or runtime taint analysis to resolve.
- Misuse Types: The exact taxonomy varies (12, 16, or 18) based on curation. Coverage is measured both as the fraction of vulnerability categories for which a tool produces at least one correct flag, and as the raw recall across all unit tests (Afrose et al., 2021).
Notably, no Android-specific APIs, reflection-heavy code, or non-cryptographic security issues are included. CryptoAPI-Bench is delivered as a flat source-code repository, with code files named by pattern and test IDs.
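The SSL/TLS side of the scope (e.g., `X509TrustManager` misuses) typically involves an all-trusting trust manager, which disables server authentication for any `SSLContext` initialized with it. A minimal sketch, with an invented class name:

```java
import javax.net.ssl.X509TrustManager;
import java.security.cert.X509Certificate;

public class TrustAllCase {
    // Misuse: empty check methods accept every certificate chain,
    // silently disabling certificate validation.
    static X509TrustManager trustAll() {
        return new X509TrustManager() {
            @Override public void checkClientTrusted(X509Certificate[] chain, String authType) { /* no-op */ }
            @Override public void checkServerTrusted(X509Certificate[] chain, String authType) { /* no-op */ }
            @Override public X509Certificate[] getAcceptedIssuers() { return new X509Certificate[0]; }
        };
    }

    public static void main(String[] args) {
        System.out.println(trustAll().getAcceptedIssuers().length);
    }
}
```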
4. Evaluation Methodology, Metrics, and Protocol
Evaluators use standard information-retrieval metrics applied to each test's ground-truth label:
- Precision: P = TP / (TP + FP)
- Recall: R = TP / (TP + FN)
- F₁-Score: F₁ = 2 · P · R / (P + R)
- False-positive Rate: FPR = FP / (FP + TN)
- Accuracy: (TP + TN) / (TP + FP + TN + FN)
where TP = insecure cases correctly flagged, FP = secure cases incorrectly flagged, TN = secure cases correctly left unflagged, and FN = insecure cases missed (Firouzi et al., 2024).
The standard evaluation protocol is:
- Run the analysis tool on the entire suite (default settings).
- Collect flags and match by file/line/pattern ID against ground truth.
- Tabulate and compute evaluation metrics (Schlichtig et al., 2022, Afrose et al., 2021).
Coverage and recall are further stratified by subgroup—basic, advanced, field-sensitive, path-sensitive, etc.—to localize analytic limitations.
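The metric computations above are straightforward; a small helper like the following (invented here, not part of the benchmark) shows how a tool run's confusion matrix is turned into the reported numbers:

```java
public class BenchMetrics {
    // Standard information-retrieval metrics over a confusion matrix
    // (tp = insecure flagged, fp = secure flagged, tn = secure unflagged,
    //  fn = insecure missed).
    static double precision(int tp, int fp) { return tp / (double) (tp + fp); }
    static double recall(int tp, int fn)    { return tp / (double) (tp + fn); }
    static double f1(int tp, int fp, int fn) {
        double p = precision(tp, fp), r = recall(tp, fn);
        return 2.0 * p * r / (p + r);
    }
    static double fpr(int fp, int tn)       { return fp / (double) (fp + tn); }
    static double accuracy(int tp, int fp, int tn, int fn) {
        return (tp + tn) / (double) (tp + fp + tn + fn);
    }

    public static void main(String[] args) {
        // Illustrative (made-up) confusion matrix for one tool run.
        int tp = 90, fp = 10, tn = 30, fn = 6;
        System.out.printf("P=%.3f R=%.3f F1=%.3f%n",
                precision(tp, fp), recall(tp, fn), f1(tp, fp, fn));
    }
}
```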
5. Results and Tool Comparisons
Published studies compare prominent static analysis tools, including CryptoGuard, CrySL (CogniCrypt), Coverity, SpotBugs (FindSecBugs), and, more recently, LLMs:
| Tool | Precision (Basic/Advanced) | Recall (Basic/Advanced) | Observed Limitations |
|---|---|---|---|
| CryptoGuard | 100% / 83–100% | 100% / 93–96% | Misses a few advanced cases due to call-chain clipping |
| CrySL | 59–71% / 56–58% | 59–71% / 56–58% | Rule strictness, path-insensitivity |
| SpotBugs | 81–92% / 0% | 93% / 0% | No inter/field/path-sensitivity |
| Coverity | 93–100% / 19–52% | 93–100% / 19% | Only intra-procedural coverage, low recall |
| ChatGPT | 78–88% / 88–96% | 78–100% / 83–100% | Improved with prompt engineering (Firouzi et al., 2024) |
- Domain-specific tools (CryptoGuard, CrySL) exhibit higher category recall on advanced flows compared to general-purpose tools but can suffer from spurious alerts (CrySL) or moderate incompleteness on orthogonal call flows (CryptoGuard) (Afrose et al., 2021).
- SpotBugs and Coverity achieve perfect or near-perfect recall only on simple, intra-method patterns.
- Across all, path-insensitivity is a consistent source of false positives and negatives, with none of the tools implementing path-sensitive data-flow beyond trivial branches (Afrose et al., 2021).
- ChatGPT (GPT-3.5-Turbo), when subjected to systematic prompt engineering, outperforms CryptoGuard on most categories, attaining average F₁-scores up to 94.6% (Firouzi et al., 2024).
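The path-insensitivity weakness noted above can be sketched with an invented example: only one branch yields an insecure value, but a path-insensitive analysis merges both paths and either flags the secure path too (false positive) or misses the insecure one (false negative).

```java
public class PathSensitiveCase {
    // Which transformation reaches the call site depends on the branch
    // taken; only the legacy path is insecure.
    static String chooseTransformation(boolean legacyMode) {
        String transformation;
        if (legacyMode) {
            transformation = "DES";               // insecure path
        } else {
            transformation = "AES/GCM/NoPadding"; // secure path
        }
        return transformation;
    }

    public static void main(String[] args) {
        System.out.println(chooseTransformation(false));
    }
}
```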
6. Transparency, Documentation, and Extensibility
- Distribution: CryptoAPI-Bench is published as a flat repository of annotated Java source files, but without a pre-registered test generation methodology or versioned metadata. There are no formal criteria for permutation or coverage selection beyond the “cover all patterns with enough analytic diversity” guidance; all extensions are manual (Schlichtig et al., 2022).
- Extending the Suite: Users can introduce new test cases by writing additional Java files, labeling them as “basic” or “advanced,” and updating an ad-hoc ground-truth mapping file. No automated harness, test generation, or metadata schema is supplied.
- Documentation Gaps: CamBench’s review highlights the absence of documentation on advanced permutation inclusion, gaps in API class coverage, and lack of community review for new cases (Schlichtig et al., 2022). These gaps motivate more systematized alternatives such as CamBench, which offers pre-registration, versioned metadata, and API-coverage analysis.
7. Limitations, Impact, and Research Directions
- Coverage Gaps: There remain missing test cases for certain “default” cryptographic behaviors (e.g., implicit ECB mode), CBC misuse scenarios, subtleties of salt/IV reuse across invocations, arithmetic-computed arguments, and comprehensive padding errors (Firouzi et al., 2024).
- Scalability: While CryptoAPI-Bench is optimized for source-level, unit-test–style analysis, it does not provide scalability insights on real codebases; the related ApacheCryptoAPI-Bench serves that need (Afrose et al., 2021).
- Research Impact: The suite is widely adopted for tool comparison, exposing key analytic limitations such as the lack of path-sensitivity and the challenge of discriminating context-dependent secure usages. It encourages the development of path-sensitive or incremental data-flow methods and motivates enriched rule languages that can distinguish secure from insecure constant usage patterns (Afrose et al., 2021, Firouzi et al., 2024).
- Tool Usage: CryptoAPI-Bench continues to serve as a de facto reference for JCA misuse detection, but its lack of methodical transparency and incomplete documentation motivate ongoing work on systematic, meta-data–rich benchmarking suites.
A plausible implication is that future benchmarks for cryptographic misuse detection should combine fine-grained synthetic test cases with real-world code snippets, formally document their construction process, and provide automated harnesses for extension and evaluation (Schlichtig et al., 2022).