Synthetic APT Dataset
- Synthetic APT datasets are engineered data collections that emulate advanced persistent threat campaigns with detailed multi-stage simulations.
- They enable systematic benchmarking of detection systems by blending benign and malicious events based on frameworks like MITRE ATT&CK.
- Advanced generation methodologies include template-based synthesis, replay-based telemetry, and real-world testbed emulation for precise labeling and stage mapping.
A synthetic APT dataset is an engineered data collection that emulates advanced persistent threat (APT) campaigns for the purpose of developing, evaluating, and benchmarking intrusion detection, behavioral analysis, and cyber defense techniques. Such datasets are essential in domains where the collection of real-world, comprehensive, and well-labeled APT data is infeasible due to operational security constraints, rarity of labeled APT events, or privacy issues. Synthetic APT datasets seek to reflect the complexity and multi-stage structure of actual cyber intrusion campaigns, often following canonical models such as the MITRE ATT&CK or Mandiant kill-chain frameworks. Advanced frameworks allow for fine-grained labeling, coverage of multiple environments (enterprise, IIoT, host-based), inclusion of provenance, and the integration of both benign and attack activity streams.
1. Motivations and Rationale for Synthetic APT Data
The main motivation for synthetic APT datasets is the persistent lack of large-scale, realistic, and exhaustively labeled ground truth necessary for evaluating detection systems and ML models. Real APT incidents are rare, often underreported, and available data typically lacks fine annotation of attack phases or step-level ground truth. Moreover, the stealthy, multi-stage, and heterogeneous manifestation of APTs across different infrastructures increases the need for datasets that can model variability, stealth, and noise at scale. Synthetic data provides:
- Systematic control over scenario complexity, diversity, and label granularity.
- Explicit embedding of multi-stage campaign semantics (e.g., entry, C2, privilege escalation, discovery, exfiltration).
- The ability to incorporate new attack techniques or to simulate unseen attacks for transfer learning and generalization testing.
- Publicly reproducible benchmarks for comparison of detection, attribution, and response algorithms (Tijjani et al., 10 Jan 2026, Huang et al., 2024, Ghiasvand et al., 2024, Mamun et al., 2021).
2. Data Generation Methodologies and Architectures
Synthetic APT dataset generation typically follows one of several methodologies, depending on the target environment and use case.
- Template-Based Campaign Synthesis: SAGA (Huang et al., 2024) and S-DAPT-2026 (Tijjani et al., 10 Jan 2026) generate attack and benign event streams by defining templates corresponding to each APT lifecycle stage and attack technique. Attack patterns are abstracted into reusable templates, parameterized by descriptors for system entities, then instantiated using randomization or user-controlled configurations.
- Replay-Based Telemetry Synthesis: The DARPA OpTC dataset (Mamun et al., 2021) uses a workload simulator to overlay scripted red-team “APT” actions atop background benign activities, with process trees and event graphs constructed post hoc.
- Emulated IIoT Testbed Execution: CICAPT-IIoT (Ghiasvand et al., 2024) builds a real-world instrumented IIoT environment, executes low-and-slow APT campaigns using automation frameworks (e.g., MITRE Caldera), and logs network, audit, and provenance data.
- Multi-Source Blending: Some datasets allow blending of measured benign logs and algorithmically generated malicious logs, or overlaying attacks on top of background traces for increased realism (Huang et al., 2024, Mamun et al., 2021).
Key attributes incorporated include alert/event type, layer-specific identifiers (IP, port, host, PID), time stamps, mapping to campaign phases or ATT&CK techniques, and often a full provenance or task structure.
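The template-based approach can be illustrated with a minimal sketch. The class and field names below are hypothetical (not taken from SAGA or S-DAPT-2026); the sketch only shows the general idea of abstracting a technique into a reusable, parameterized template that is instantiated with concrete system entities:

```python
from dataclasses import dataclass

# Hypothetical sketch of template-based campaign synthesis: an attack
# technique is abstracted into a reusable template whose entity slots
# (process, file, host) are filled in at instantiation time.

@dataclass
class AttackTemplate:
    stage: str            # APT lifecycle stage, e.g. "privilege_escalation"
    technique_id: str     # MITRE ATT&CK technique ID, e.g. "T1055"
    event_pattern: list   # (subject_slot, operation, object_slot) triples

    def instantiate(self, bindings):
        """Fill the template's entity slots with concrete system entities."""
        return [
            {"subject": bindings[s], "operation": op, "object": bindings[o],
             "APT_stage": self.stage, "Technique_ID": self.technique_id}
            for (s, op, o) in self.event_pattern
        ]

# Illustrative privilege-escalation template (T1055, Process Injection).
escalation = AttackTemplate(
    stage="privilege_escalation",
    technique_id="T1055",
    event_pattern=[("proc", "ProcessInject", "target"),
                   ("target", "TokenSteal", "system")],
)

events = escalation.instantiate(
    {"proc": "evil.exe", "target": "svchost.exe", "system": "SYSTEM"}
)
```

Randomization or user configuration would enter through the `bindings` (entity selection) and through sampling which templates to chain per lifecycle stage.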
3. Dataset Schema, Staging, and Labeling
Synthetic APT datasets are rigorously structured to support algorithmic analysis and benchmarking:
- Event Schema: Typical entries encode ⟨subject, operation, object⟩ along with auxiliary process/network attributes. SAGA (Huang et al., 2024) enriches each event with APT stage, Technique ID, Ability ID, manipulated entities, and threat actor.
- Stage Mapping and Labeling: Each atomic event or alert is mapped to an explicit campaign step. For example, S-DAPT-2026 (Tijjani et al., 10 Jan 2026) defines five canonical stages—A (Entry), B (C2), C (Privilege Escalation), D (Discovery), E (Exfiltration)—and maps each alert type to its stage via a surjective function.
- Labeling Conventions: Labels use schemes such as BIO2 (begin/inside/out of a labeled region) to annotate malicious spans and their associated techniques or stages (Huang et al., 2024).
- Dataset Statistics: Datasets provide not only event counts, malicious/benign ratios, and scenario occurrences but also campaign durations, scenario indices, and coverage of alert types or attack techniques.
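The surjective alert-type-to-stage mapping described above can be sketched as a lookup table in the style of S-DAPT-2026's five canonical stages; the alert-type names here are illustrative, not taken from the dataset:

```python
# Illustrative surjective mapping from alert types onto the five canonical
# stages A (Entry), B (C2), C (Privilege Escalation), D (Discovery),
# E (Exfiltration). Alert-type names are assumptions for this sketch.
STAGE_MAP = {
    "phishing_attachment": "A",
    "exploit_public_app":  "A",
    "beaconing":           "B",
    "dns_tunnel":          "B",
    "token_manipulation":  "C",
    "network_scan":        "D",
    "account_enumeration": "D",
    "large_outbound_xfer": "E",
}

def stage_of(alert_type):
    """Map an alert type to its campaign stage (total on known alert types)."""
    return STAGE_MAP[alert_type]

# Surjectivity check: every stage A-E is hit by at least one alert type.
assert set(STAGE_MAP.values()) == {"A", "B", "C", "D", "E"}
```

Surjectivity matters because every canonical stage must be reachable from at least one alert type, otherwise full scenarios could never be assembled.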
An example schema from SAGA (Huang et al., 2024):
| Field | Type | Notes |
|---|---|---|
| subject | string | Process or principal |
| operation | string | Action (e.g., FileCreate) |
| object | string | Target entity |
| APT_stage | enum | One of 7 lifecycle phases |
| Technique_ID | string | MITRE ATT&CK Technique ID |
| Ability_ID | string | Caldera ability |
| Label | BIO2 | B-, I-, or O-label |
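The BIO2 labeling convention from the schema can be sketched as follows; the event format is an assumption for illustration, with `Technique_ID` set to `None` for benign events:

```python
# Illustrative BIO2 tagging of an event stream: the first event of each
# contiguous malicious span gets "B-<technique>", subsequent events of the
# same span get "I-<technique>", and benign events get "O".
def bio2_labels(events):
    labels, prev_tech = [], None
    for ev in events:
        tech = ev.get("Technique_ID")   # None marks a benign event
        if tech is None:
            labels.append("O")
        elif tech == prev_tech:
            labels.append(f"I-{tech}")
        else:
            labels.append(f"B-{tech}")
        prev_tech = tech
    return labels

stream = [{"Technique_ID": None},
          {"Technique_ID": "T1059"},
          {"Technique_ID": "T1059"},
          {"Technique_ID": None}]
# bio2_labels(stream) -> ["O", "B-T1059", "I-T1059", "O"]
```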
4. Alert Correlation, Scenario Synthesis, and Clustering Frameworks
Modern synthetic APT datasets incorporate explicit frameworks for composing, segmenting, and validating multi-stage APT scenarios:
- Correlation Frameworks: S-DAPT-2026 (Tijjani et al., 10 Jan 2026) introduces a mutual-kNN-based correlation module operating over cosine similarity of embedded alert features. Clusters are formed if candidate alerts represent distinct, stage-ordered steps within a bounded time window, with strict rules to avoid degenerate or redundant groupings.
- Scenario Synthesis: Algorithms assemble full or partial APT scenarios by sampling one or more alerts per required stage, following a prescribed or randomly drawn template. The presence of correlated alerts in stage order is used to characterize the scenario as full (A∧B∧C∧D∧E) or partial.
- Campaign Attribution: Some frameworks, such as the SAGA SFM module, support campaign identification by subgraph matching, returning ranked campaign hypotheses per observed event sequence (Huang et al., 2024).
These mechanisms enable fine-grained benchmarking not only of alert-level detection but also of scenario-level awareness and prediction.
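The mutual-kNN idea behind the correlation module can be sketched as follows. This is a minimal illustration of the general technique (two alerts are linked only if each is among the other's k nearest neighbors under cosine similarity), not a reimplementation of the S-DAPT-2026 module; the feature vectors are toy values:

```python
import math

# Minimal mutual-kNN sketch over cosine similarity of alert embeddings.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def knn(i, vecs, k):
    """Indices of the k most cosine-similar vectors to vecs[i]."""
    sims = sorted(((cosine(vecs[i], vecs[j]), j)
                   for j in range(len(vecs)) if j != i), reverse=True)
    return {j for _, j in sims[:k]}

def mutual_knn_edges(vecs, k=1):
    """Undirected edges (i, j) where i and j are each other's k-NNs."""
    nbrs = [knn(i, vecs, k) for i in range(len(vecs))]
    return {(i, j) for i in range(len(vecs)) for j in nbrs[i]
            if i < j and i in nbrs[j]}

# Toy alert embeddings: the first two are near-duplicates, the third is not.
alerts = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
edges = mutual_knn_edges(alerts, k=1)   # -> {(0, 1)}
```

The stage-ordering and time-window constraints described above would then be applied as filters on these candidate edges before clusters are accepted.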
5. Statistical Properties, Benchmarks, and Performance Metrics
Synthetic APT datasets are accompanied by extensive statistical characterization and serve as standard benchmarks for the research community.
- Dataset Sizes and Composition:
  - S-DAPT-2026 (Tijjani et al., 10 Jan 2026): 120 000 total alerts, 48 000 APT (40%), 8 100 full scenarios, spanning 6 months.
  - SAGA (Huang et al., 2024): logs per known campaign range from ∼900 k (15 min) to 14 M (1 day) events, with 14–1 133 malicious events per campaign.
  - CICAPT-IIoT (Ghiasvand et al., 2024): phase-2 mixed data includes 52 954 benign + 330 attack provenance nodes, and 9.5 M benign + 1 k attack network packets.
  - OpTC (Mamun et al., 2021): millions of host telemetry events per user over several days, including tens of thousands of task sequences.
- Label and Scenario Distributions: Datasets characterize the proportion of alerts by stage, scenario index, campaign duration, and coverage per technique.
- Benchmarks: Detection utility is quantified by metrics such as accuracy, precision, recall, F1, and AUC for both event-level and scenario-level detection. For instance, DeepTaskAPT achieves accuracy >0.98 and recall up to 0.88 on OpTC, while SAGA reports near-perfect F1 (≥0.96) for graph-based detectors such as Unicorn (Huang et al., 2024, Mamun et al., 2021).
- Scenario-Level Evaluation: Metrics like correlation index (number of preserved adjacent-stage transitions) and scenario labels (two-step, three-step, etc.) support evaluation of multi-stage detection strategies.
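A metric in the spirit of the correlation index can be sketched as a count of preserved adjacent-stage transitions; the exact definition used by the datasets may differ, so this is an illustrative reading:

```python
# Illustrative correlation index: count adjacent canonical-stage transitions
# (A->B, B->C, C->D, D->E) preserved in order in a detected stage sequence.
CANON = "ABCDE"

def correlation_index(stages):
    """Number of adjacent canonical-stage transitions preserved in order."""
    return sum(1 for x, y in zip(stages, stages[1:])
               if x in CANON and y in CANON
               and CANON.index(y) == CANON.index(x) + 1)

# A fully ordered scenario preserves all four transitions:
correlation_index(list("ABCDE"))   # -> 4
# A partially reordered detection preserves fewer:
correlation_index(list("ABDCE"))   # -> 1
```

Under this reading, a "two-step" or "three-step" scenario label corresponds to how many consecutive stages a detector reconstructed.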
6. Extensibility, Adaptation, and Environment Coverage
Synthetic APT dataset frameworks are designed for extensibility and adaptation across environments and scenarios:
- Alert/Technique Integration: New alert types or ATT&CK techniques can be incorporated by extending template sets, updating mapping and grammar rules, and adjusting correlation and clustering parameters (Tijjani et al., 10 Jan 2026, Huang et al., 2024).
- Custom Campaigns and Durations: SAGA (Huang et al., 2024) allows arbitrary campaign duration and compositional mixing of known, randomly generated, or composite APT scenarios, supporting diverse adversarial models.
- Environment Specificity: Datasets are generated for campus/enterprise networks (S-DAPT-2026), IIoT testbeds (CICAPT-IIoT), and host-based Windows environments (OpTC, SAGA), enabling cross-domain validation and transfer learning for detection models.
- Provenance and Cross-Modality: Datasets such as CICAPT-IIoT (Ghiasvand et al., 2024) fuse provenance, network, and audit logs, supporting advanced analytics such as graph embedding and temporal modeling in addition to conventional sequence classification.
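The extensibility pattern described above can be sketched as a mutable registry in which adding a new technique means registering its alert type, stage mapping, and event template together. The class and entity names are illustrative assumptions, not the actual APIs of these frameworks:

```python
# Hedged sketch of extensibility via a template/mapping registry: one call
# registers a new alert type, its canonical stage, and its event template.
class DatasetConfig:
    def __init__(self):
        self.stage_map = {}   # alert type -> canonical stage (A-E)
        self.templates = {}   # technique ID -> event template

    def register_technique(self, alert_type, stage, technique_id, template):
        self.stage_map[alert_type] = stage
        self.templates[technique_id] = template

cfg = DatasetConfig()
cfg.register_technique(
    alert_type="kerberoasting",
    stage="C",                    # dataset-level stage assignment, illustrative
    technique_id="T1558.003",     # ATT&CK sub-technique: Kerberoasting
    template=[("proc", "TGSRequest", "krbtgt")],
)
```

Correlation and clustering parameters (e.g., neighborhood size, time windows) would then be re-tuned so that scenarios containing the new alert type are still grouped correctly.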
7. Use Cases, Limitations, and Future Directions
Synthetic APT datasets serve as reference benchmarks for:
- Training, cross-validation, and comparative assessment of ML models (LSTM, GNN, Random Forest, subgraph matching) for APT detection, stage attribution, and campaign forecasting.
- Transfer learning and generalization testing, e.g., training on SAGA and validating on external red-team campaigns (Huang et al., 2024).
- Evaluation of alert correlation, scenario reduction, and anomaly detection strategies under controlled label and scenario complexity (Tijjani et al., 10 Jan 2026).
Recognized limitations include the risk of simulation/modeling bias, the challenge of matching real attacker variability, and, for fully synthetic traces, the absence of genuine adversary adaptive tactics (as in OpTC (Mamun et al., 2021)). A plausible implication is that future research may combine synthetic datasets with real-world traces, employ adversarial training, or further refine scenario composition grammars to narrow the representational gap.
Synthetic APT datasets, by providing systematic, well-annotated, multistage emulations of advanced threats, remain foundational to rigorous and reproducible security analytics research (Tijjani et al., 10 Jan 2026, Huang et al., 2024, Ghiasvand et al., 2024, Mamun et al., 2021).