
CTIBench: CTI LLM Benchmark

Updated 9 February 2026
  • CTIBench is a benchmark suite evaluating large language models on CTI tasks through four analyst-centric competencies: factual recall, taxonomy mapping, vulnerability severity prediction, and threat actor attribution.
  • It leverages authoritative CTI data sources like MITRE ATT&CK, NVD, CWE, and FIRST CVSS to create reproducible, quantifiable assessments for cybersecurity operations.
  • Evaluation results reveal strong factual recall (e.g., GPT-4) and numeric reasoning (e.g., Gemini-1.5), while exposing challenges in nuanced threat actor attribution and contextual calibration.

CTIBench is a benchmark suite designed to systematically evaluate the capabilities of LLMs on core tasks in Cyber Threat Intelligence (CTI). Addressing the lack of specialized and actionable CTI LLM benchmarks, CTIBench operationalizes four key analyst-centric competencies: factual recall, taxonomy mapping, vulnerability severity prediction, and abductive reasoning for threat actor attribution. This benchmark provides a well-defined evaluation framework, draws from authoritative CTI data sources, and yields quantitative insights about LLM performance and limitations specific to cybersecurity operations (Alam et al., 2024).

1. Design Objectives and Scope

CTIBench’s primary goals are to:

  • Assess LLM recall of CTI standards (e.g., MITRE ATT&CK, CWE, NIST, GDPR).
  • Evaluate semantic mapping from descriptive vulnerability reports to formal taxonomies.
  • Test extraction and accurate scoring of CVSS v3 Base metrics from unstructured CVE texts.
  • Measure the capability to attribute anonymized, complex threat reports to threat actors via reasoning over TTPs and infrastructure.

The benchmark explicitly focuses on publicly available, authoritative sources—such as MITRE ATT&CK, the National Vulnerability Database (NVD), CWE, and FIRST CVSS—to ensure relevance, reproducibility, and coverage of practical CTI workflows (Alam et al., 2024).

2. Task Formulation and Dataset Construction

CTIBench comprises four principal tasks, each mapping to a CTI analyst workflow:

A. CTI-MCQ (Multiple-Choice Questions)

  • Evaluates recall of knowledge relating to CTI frameworks, threat behaviors, and mitigations.
  • Items are curated from MITRE ATT&CK documentation, CWE analysis guides, standards, and public CTI quizzes.
  • Each item is a single-sentence or short-paragraph question with four candidate answers (A–D), only one of which is correct.
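
An item of this shape can be represented and scored with exact matching on the chosen letter. The field names below are illustrative, not the benchmark's actual schema:

```python
# Illustrative CTI-MCQ item layout (field names are assumptions,
# not CTIBench's published schema).
mcq_item = {
    "question": "Which MITRE ATT&CK tactic covers credential dumping?",
    "choices": {"A": "Credential Access", "B": "Discovery",
                "C": "Lateral Movement", "D": "Exfiltration"},
    "answer": "A",
}

def score_mcq(predicted: str, item: dict) -> bool:
    """Exact-match scoring on the selected letter, case-insensitive."""
    return predicted.strip().upper() == item["answer"]
```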

B. CTI-RCM (Root Cause Mapping)

  • Measures the ability to map the free-text description in a CVE to its underlying CWE identifier.
  • Source: NVD’s 2024 CVEs with validated CWE mappings; randomized sample of 1,000 unique CVEs.

C. CTI-VSP (Vulnerability Severity Prediction)

  • Requires extracting the eight CVSS v3.1 Base metrics and computing the overall base score from a CVE description.
  • Output format is a CVSS v3.1 vector string: CVSS:3.1/AV:&lt;AV&gt;/AC:&lt;AC&gt;/PR:&lt;PR&gt;/UI:&lt;UI&gt;/S:&lt;S&gt;/C:&lt;C&gt;/I:&lt;I&gt;/A:&lt;A&gt;
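
Given a vector in this form, the v3.1 base score is fully determined by FIRST's published equations. The sketch below implements the standard formula (metric weights and Roundup from the FIRST CVSS v3.1 specification); it is not CTIBench's own code:

```python
# CVSS v3.1 base score from a vector string, per the FIRST specification.
W = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.20},
    "AC": {"L": 0.77, "H": 0.44},
    "PR": {  # Privileges Required weights depend on Scope
        "U": {"N": 0.85, "L": 0.62, "H": 0.27},
        "C": {"N": 0.85, "L": 0.68, "H": 0.50},
    },
    "UI": {"N": 0.85, "R": 0.62},
    "CIA": {"H": 0.56, "L": 0.22, "N": 0.0},
}

def roundup(x: float) -> float:
    """CVSS v3.1 Roundup: smallest value with one decimal >= x (Appendix A)."""
    i = round(x * 100000)
    return i / 100000 if i % 10000 == 0 else (i // 10000 + 1) / 10.0

def cvss31_base_score(vector: str) -> float:
    m = dict(part.split(":") for part in vector.split("/")[1:])
    changed = m["S"] == "C"
    iss = 1 - (1 - W["CIA"][m["C"]]) * (1 - W["CIA"][m["I"]]) * (1 - W["CIA"][m["A"]])
    impact = (7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
              if changed else 6.42 * iss)
    expl = (8.22 * W["AV"][m["AV"]] * W["AC"][m["AC"]]
            * W["PR"]["C" if changed else "U"][m["PR"]] * W["UI"][m["UI"]])
    if impact <= 0:
        return 0.0
    return roundup(min((impact + expl) * (1.08 if changed else 1.0), 10))

print(cvss31_base_score("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"))  # → 9.8
```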

D. CTI-TAA (Threat Actor Attribution)

  • Requires attributing a sanitized, multi-paragraph threat report (with withheld actor/campaign names) to a known adversary or group.
  • Ground truth and plausible alias mappings are constructed via Malpedia, MITRE, and alias-graph search.

| Task | Input Type | Output | Dataset Size | Primary Data Source |
|---|---|---|---|---|
| CTI-MCQ | Framework QA (MCQ) | Single choice (A–D) | 2,500 | MITRE, CWE, quizzes |
| CTI-RCM | CVE description (text) | CWE-ID | 1,000 | NVD 2024 |
| CTI-VSP | CVE description (text) | CVSS 3.1 vector / score | 1,000 | NVD 2024 |
| CTI-TAA | Sanitized threat report (text) | Threat actor + rationale | 50 | Malpedia, vendor reports |

The datasets are carefully preprocessed: removing boilerplate, filtering for unique and non-missing entries, and, for CTI-TAA, redacting explicit actor mentions. All ground truths are validated by domain experts for correctness (Alam et al., 2024).
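
A minimal sketch of such preprocessing, with hypothetical helper names and no claim to match the authors' actual pipeline:

```python
import re

def redact_actor_names(text: str, actor_aliases: set[str]) -> str:
    """Redact explicit actor/campaign names for CTI-TAA (illustrative only;
    CTIBench's actual sanitization pipeline is not reproduced here)."""
    # Longest aliases first so substrings of multi-word names are handled.
    for alias in sorted(actor_aliases, key=len, reverse=True):
        text = re.sub(re.escape(alias), "[REDACTED]", text, flags=re.IGNORECASE)
    return text

def dedupe_nonmissing(records: list[dict], key: str) -> list[dict]:
    """Keep the first occurrence per key; drop entries with missing fields."""
    seen, out = set(), []
    for r in records:
        k = r.get(key)
        if k and k not in seen:
            seen.add(k)
            out.append(r)
    return out
```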

3. Evaluation Metrics and Protocol

CTIBench adopts metrics tailored to each task:

  • Classification tasks (MCQ, RCM):

$$\mathrm{Accuracy} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}(\hat{y}_i = y_i)$$

  • Severity prediction (VSP):

$$\mathrm{MAD} = \frac{1}{N}\sum_{i=1}^{N} \left| s_i^{\mathrm{pred}} - s_i^{\mathrm{true}} \right|$$

where $s_i$ is the numeric CVSS score.

  • Attribution (TAA):

$$\mathrm{CorrectAcc} = \frac{\#\,\mathrm{correct}}{N} \qquad \mathrm{PlausibleAcc} = \frac{\#\,(\mathrm{correct} + \mathrm{plausible})}{N}$$

Alias mapping is resolved via alias-graph search.

  • For multi-class settings and information extraction: standard definitions of precision, recall, and $F_1$ score are included for future extensions.
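
Taken together, these metrics reduce to a few lines of code. The alias-resolution helper below assumes a simple undirected alias graph given as an edge list of name pairs, which is one plausible reading of the alias-graph search, not the benchmark's actual implementation:

```python
from collections import deque

def accuracy(preds: list, golds: list) -> float:
    """Exact-match accuracy for CTI-MCQ and CTI-RCM."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def mad(pred_scores: list[float], true_scores: list[float]) -> float:
    """Mean absolute deviation of CVSS scores for CTI-VSP."""
    return sum(abs(p - t) for p, t in zip(pred_scores, true_scores)) / len(true_scores)

def same_actor(a: str, b: str, alias_edges: list[tuple[str, str]]) -> bool:
    """BFS over an undirected alias graph (assumed structure)."""
    graph: dict[str, set[str]] = {}
    for u, v in alias_edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    seen, queue = {a}, deque([a])
    while queue:
        node = queue.popleft()
        if node == b:
            return True
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def taa_accuracies(preds, golds, alias_edges):
    """CorrectAcc (strict match) and PlausibleAcc (alias-aware) for CTI-TAA."""
    correct = sum(p == g for p, g in zip(preds, golds))
    plausible = sum(same_actor(p, g, alias_edges) for p, g in zip(preds, golds))
    return correct / len(golds), plausible / len(golds)
```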

The evaluation protocol enforces zero-shot settings using fixed instruction prompts, with no task weighting—reflecting real analyst stepwise workflows. Responses are post-processed to extract and score only the final answer (Alam et al., 2024).

4. Experimental Setup and Model Assessment

CTIBench evaluated major LLM families, spanning proprietary models (GPT-4, GPT-3.5, Gemini-1.5) and open-source models (Llama3-70B, Llama3-8B). All commercial models were deployed via APIs; Llama models ran on 8×A100 GPUs.

Performance summary:

| Model | CTI-MCQ Acc | CTI-RCM Acc | CTI-VSP MAD | TAA CorrAcc | TAA PlausAcc |
|---|---|---|---|---|---|
| GPT-4 | 71.0 | 72.0 | 1.31 | 52% | 86% |
| GPT-3.5 | 54.1 | 67.2 | 1.57 | 44% | 62% |
| Gemini-1.5 | 65.4 | 66.6 | 1.09 | 38% | 74% |
| Llama3-70B | 65.7 | 65.9 | 1.83 | 52% | 80% |
| Llama3-8B | 61.3 | 44.7 | 1.91 | 28% | 36% |

Key observations:

  • GPT-4 leads in MCQ and RCM accuracy, indicating effective CTI knowledge recall and taxonomy mapping.
  • Gemini-1.5 achieves lowest mean absolute deviation (MAD) on severity prediction, reflecting superior numeric reasoning.
  • TAA accuracy is generally low, with proprietary models outperforming open-source. Plausible actor attribution—allowing for alias matching—significantly boosts counts over strict accuracy.
  • All models tend to overestimate severity in VSP, and description length affects error rates: both overly terse and overly verbose CVE descriptions degrade performance.
  • Scale matters: smaller models (Llama3-8B) perform poorly on mapping and reasoning tasks.

5. Analysis of Model Strengths, Weaknesses, and Failure Modes

Strengths:

  • High accuracy in CTI-MCQ and RCM for GPT-4, showing robust recall of standards, frameworks, and taxonomies.
  • Numeric reasoning in VSP is best in Gemini-1.5, with lowest MAD and consistent score derivation from CVE texts.

Limitations:

  • Overestimation bias persists across all LLMs in CVSS scoring, especially on complex or lengthy descriptions.
  • TAA (attribution) remains challenging: even the best models achieve only 52% accuracy on strict matches, though plausible (alias-aware) accuracy reaches ~86%.
  • MCQ errors cluster around mitigation- and tooling-related questions, indicating weaknesses in nuanced, contextual recall.
  • For RCM and VSP, performance is optimal at intermediate description lengths; short and verbose inputs reduce mapping fidelity.

These results indicate that while LLMs can effectively recall static knowledge and perform direct mapping with sufficient context, nuanced reasoning and calibration remain substantive challenges (Alam et al., 2024).

6. Methodological Insights and Future Directions

The CTIBench evaluation yielded several actionable research questions:

  • Calibrating Severity: The systematic overestimation in VSP implies potential for improved calibration using few-shot or retrieval-augmented inference with numeric examples.
  • Domain Adaptation: Fine-tuning LLMs on domain-specific corpora (e.g., historical CVEs, threat reports) is likely to reduce hallucinations and improve semantic mapping.
  • Task Expansion: Planned extensions of CTIBench include additional information-extraction tasks (IOC extraction, TTP classification), dynamic summarization of threat reports, and evaluation across multiple languages.
  • Human-in-the-Loop: Measuring LLM convergence and correction rate with minimal analyst feedback may offer a more pragmatic assessment of utility in live operations.

The benchmark is positioned as a flexible platform for future research, dataset enrichment, and methodology development for LLMs in security-specific roles (Alam et al., 2024).

7. Position in the CTI Benchmark Landscape

CTIBench represents the first focused, multi-task CTI LLM benchmark addressing applied analyst workflows. Subsequent benchmarks—such as AthenaBench and CTIArena—have extended these principles:

  • AthenaBench: Augments CTIBench with dynamic data sourcing, robust deduplication, multi-label evaluation (e.g., mitigation strategies), and unified scoring (Alam et al., 3 Nov 2025).
  • CTIArena: Expands the scope to nine tasks with structured, unstructured, and hybrid categories, as well as novel retrieval-augmented generation protocols that further stress cross-source and evidence-grounded reasoning (Cheng et al., 13 Oct 2025).

Despite advancements in later benchmarks, CTIBench remains foundational in establishing a replicable and interpretable framework for rigorous CTI LLM evaluation, with direct applicability and relevance to security operations and model development. All code, datasets, and evaluation pipelines are maintained to enable reproducible research (Alam et al., 2024).
