Synthetic APT Dataset
- Synthetic APT datasets are engineered data collections that emulate advanced persistent threat campaigns with detailed multi-stage simulations.
- They enable systematic benchmarking of detection systems by blending benign and malicious events based on frameworks like MITRE ATT&CK.
- Advanced generation methodologies include template-based synthesis, replay-based telemetry, and real-world testbed emulation for precise labeling and stage mapping.
A synthetic APT dataset is an engineered data collection that emulates advanced persistent threat (APT) campaigns for the purpose of developing, evaluating, and benchmarking intrusion detection, behavioral analysis, and cyber defense techniques. Such datasets are essential in domains where the collection of real-world, comprehensive, and well-labeled APT data is infeasible due to operational security constraints, rarity of labeled APT events, or privacy issues. Synthetic APT datasets seek to reflect the complexity and multi-stage structure of actual cyber intrusion campaigns, often following canonical models such as the MITRE ATT&CK or Mandiant kill-chain frameworks. Advanced frameworks allow for fine-grained labeling, coverage of multiple environments (enterprise, IIoT, host-based), inclusion of provenance, and the integration of both benign and attack activity streams.
1. Motivations and Rationale for Synthetic APT Data
The main motivation for synthetic APT datasets is the persistent lack of large-scale, realistic, and exhaustively labeled ground truth necessary for evaluating detection systems and ML models. Real APT incidents are rare, often underreported, and available data typically lacks fine annotation of attack phases or step-level ground truth. Moreover, the stealthy, multi-stage, and heterogeneous manifestation of APTs across different infrastructures increases the need for datasets that can model variability, stealth, and noise at scale. Synthetic data provides:
- Systematic control over scenario complexity, diversity, and label granularity.
- Explicit embedding of multi-stage campaign semantics (e.g., entry, C2, privilege escalation, discovery, exfiltration).
- The ability to incorporate new attack techniques or to simulate unseen attacks for transfer learning and generalization testing.
- Publicly reproducible benchmarks for comparison of detection, attribution, and response algorithms (Tijjani et al., 10 Jan 2026, Huang et al., 2024, Ghiasvand et al., 2024, Mamun et al., 2021).
2. Data Generation Methodologies and Architectures
Synthetic APT dataset generation typically follows one of several methodologies, depending on the target environment and use case.
- Template-Based Campaign Synthesis: SAGA (Huang et al., 2024) and S-DAPT-2026 (Tijjani et al., 10 Jan 2026) generate attack and benign event streams by defining templates corresponding to each APT lifecycle stage and attack technique. Attack patterns are abstracted into reusable templates, parameterized by descriptors for system entities, then instantiated using randomization or user-controlled configurations.
- Replay-Based Telemetry Synthesis: The DARPA OpTC dataset (Mamun et al., 2021) uses a workload simulator to overlay scripted red-team “APT” actions atop background benign activities, with process trees and event graphs constructed post hoc.
- Emulated IIoT Testbed Execution: CICAPT-IIoT (Ghiasvand et al., 2024) builds a real-world instrumented IIoT environment, executes low-and-slow APT campaigns using automation frameworks (e.g., MITRE Caldera), and logs network, audit, and provenance data.
- Multi-Source Blending: Some datasets allow blending of measured benign logs and algorithmically generated malicious logs, or overlaying attacks on top of background traces for increased realism (Huang et al., 2024, Mamun et al., 2021).
Key attributes incorporated include alert/event type, layer-specific identifiers (IP, port, host, PID), time stamps, mapping to campaign phases or ATT&CK techniques, and often a full provenance or task structure.
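The template-based approach can be illustrated with a minimal sketch. The class and field names below are hypothetical (not taken from SAGA or S-DAPT-2026); the sketch only shows the general idea of abstracting a technique into a reusable, parameterized template that is instantiated with concrete system entities:

```python
from dataclasses import dataclass

# Hypothetical sketch of template-based campaign synthesis: an attack
# technique is abstracted into a reusable template whose entity slots
# (process, file, host) are filled in at instantiation time.

@dataclass
class AttackTemplate:
    stage: str            # APT lifecycle stage, e.g. "privilege_escalation"
    technique_id: str     # MITRE ATT&CK technique ID, e.g. "T1055"
    event_pattern: list   # (subject_slot, operation, object_slot) triples

    def instantiate(self, bindings):
        """Fill the template's entity slots with concrete system entities."""
        return [
            {"subject": bindings[s], "operation": op, "object": bindings[o],
             "APT_stage": self.stage, "Technique_ID": self.technique_id}
            for (s, op, o) in self.event_pattern
        ]

# Illustrative privilege-escalation template (T1055, Process Injection).
escalation = AttackTemplate(
    stage="privilege_escalation",
    technique_id="T1055",
    event_pattern=[("proc", "ProcessInject", "target"),
                   ("target", "TokenSteal", "system")],
)

events = escalation.instantiate(
    {"proc": "evil.exe", "target": "svchost.exe", "system": "SYSTEM"}
)
```

Randomization or user configuration would enter through the `bindings` (entity selection) and through sampling which templates to chain per lifecycle stage.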
3. Dataset Schema, Staging, and Labeling
Synthetic APT datasets are rigorously structured to support algorithmic analysis and benchmarking:
- Event Schema: Typical entries encode ⟨subject, operation, object⟩ along with auxiliary process/network attributes. SAGA (Huang et al., 2024) enriches each event with APT stage, Technique ID, Ability ID, manipulated entities, and threat actor.
- Stage Mapping and Labeling: Each atomic event or alert is mapped to an explicit campaign step. For example, S-DAPT-2026 (Tijjani et al., 10 Jan 2026) defines five canonical stages—A (Entry), B (C2), C (Privilege Escalation), D (Discovery), E (Exfiltration)—and maps each alert type to its stage via a surjective function.
- Labeling Conventions: Labels use schemes such as BIO2 (begin/inside/out of a labeled region) to annotate malicious spans and their associated techniques or stages (Huang et al., 2024).
- Dataset Statistics: Datasets provide not only event counts, malicious/benign ratios, and scenario occurrences but also campaign durations, scenario indices, and coverage of alert types or attack techniques.
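The surjective alert-type-to-stage mapping described above can be sketched as a lookup table in the style of S-DAPT-2026's five canonical stages; the alert-type names here are illustrative, not taken from the dataset:

```python
# Illustrative surjective mapping from alert types onto the five canonical
# stages A (Entry), B (C2), C (Privilege Escalation), D (Discovery),
# E (Exfiltration). Alert-type names are assumptions for this sketch.
STAGE_MAP = {
    "phishing_attachment": "A",
    "exploit_public_app":  "A",
    "beaconing":           "B",
    "dns_tunnel":          "B",
    "token_manipulation":  "C",
    "network_scan":        "D",
    "account_enumeration": "D",
    "large_outbound_xfer": "E",
}

def stage_of(alert_type):
    """Map an alert type to its campaign stage (total on known alert types)."""
    return STAGE_MAP[alert_type]

# Surjectivity check: every stage A-E is hit by at least one alert type.
assert set(STAGE_MAP.values()) == {"A", "B", "C", "D", "E"}
```

Surjectivity matters because every canonical stage must be reachable from at least one alert type, otherwise full scenarios could never be assembled.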
An example schema from SAGA (Huang et al., 2024):
| Field | Type | Notes |
|---|---|---|
| subject | string | Process or principal |
| operation | string | Action (e.g., FileCreate) |
| object | string | Target entity |
| APT_stage | enum | One of 7 lifecycle phases |
| Technique_ID | string | MITRE ATT&CK Technique ID |
| Ability_ID | string | Caldera ability |
| Label | BIO2 | B-, I-, or O-label |
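The BIO2 labeling convention from the schema can be sketched as follows; the event format is an assumption for illustration, with `Technique_ID` set to `None` for benign events:

```python
# Illustrative BIO2 tagging of an event stream: the first event of each
# contiguous malicious span gets "B-<technique>", subsequent events of the
# same span get "I-<technique>", and benign events get "O".
def bio2_labels(events):
    labels, prev_tech = [], None
    for ev in events:
        tech = ev.get("Technique_ID")   # None marks a benign event
        if tech is None:
            labels.append("O")
        elif tech == prev_tech:
            labels.append(f"I-{tech}")
        else:
            labels.append(f"B-{tech}")
        prev_tech = tech
    return labels

stream = [{"Technique_ID": None},
          {"Technique_ID": "T1059"},
          {"Technique_ID": "T1059"},
          {"Technique_ID": None}]
# bio2_labels(stream) -> ["O", "B-T1059", "I-T1059", "O"]
```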
4. Alert Correlation, Scenario Synthesis, and Clustering Frameworks
Modern synthetic APT datasets incorporate explicit frameworks for composing, segmenting, and validating multi-stage APT scenarios:
- Correlation Frameworks: S-DAPT-2026 (Tijjani et al., 10 Jan 2026) introduces a mutual-kNN-based correlation module operating over cosine similarity of embedded alert features. Clusters are formed if candidate alerts represent distinct, stage-ordered steps within a bounded time window, with strict rules to avoid degenerate or redundant groupings.
- Scenario Synthesis: Algorithms assemble full or partial APT scenarios by sampling one or more alerts per required stage, following a prescribed or randomly drawn template. The presence of correlated alerts in stage order is used to characterize the scenario as full (A∧B∧C∧D∧E) or partial.
- Campaign Attribution: Some frameworks, such as the SAGA SFM module, support campaign identification by subgraph matching, returning ranked campaign hypotheses per observed event sequence (Huang et al., 2024).
These mechanisms enable fine-grained benchmarking not only of alert-level detection but also of scenario-level awareness and prediction.
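The mutual-kNN idea behind the correlation module can be sketched as follows. This is a minimal illustration of the general technique (two alerts are linked only if each is among the other's k nearest neighbors under cosine similarity), not a reimplementation of the S-DAPT-2026 module; the feature vectors are toy values:

```python
import math

# Minimal mutual-kNN sketch over cosine similarity of alert embeddings.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def knn(i, vecs, k):
    """Indices of the k most cosine-similar vectors to vecs[i]."""
    sims = sorted(((cosine(vecs[i], vecs[j]), j)
                   for j in range(len(vecs)) if j != i), reverse=True)
    return {j for _, j in sims[:k]}

def mutual_knn_edges(vecs, k=1):
    """Undirected edges (i, j) where i and j are each other's k-NNs."""
    nbrs = [knn(i, vecs, k) for i in range(len(vecs))]
    return {(i, j) for i in range(len(vecs)) for j in nbrs[i]
            if i < j and i in nbrs[j]}

# Toy alert embeddings: the first two are near-duplicates, the third is not.
alerts = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
edges = mutual_knn_edges(alerts, k=1)   # -> {(0, 1)}
```

The stage-ordering and time-window constraints described above would then be applied as filters on these candidate edges before clusters are accepted.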
5. Statistical Properties, Benchmarks, and Performance Metrics
Synthetic APT datasets are accompanied by extensive statistical characterization and serve as standard benchmarks for the research community.
- Dataset Sizes and Composition:
  - S-DAPT-2026 (Tijjani et al., 10 Jan 2026): 120 000 total alerts, 48 000 APT (40%), 8 100 full scenarios, spanning 6 months.
  - SAGA (Huang et al., 2024): logs per known campaign range from ∼900 k (15 min) to 14 M (1 day) events, with 14–1 133 malicious events per campaign.
  - CICAPT-IIoT (Ghiasvand et al., 2024): phase-2 mixed data includes 52 954 benign + 330 attack provenance nodes, and 9.5 M benign + 1 k attack network packets.
  - OpTC (Mamun et al., 2021): millions of host telemetry events per user over several days, including tens of thousands of task sequences.
- Label and Scenario Distributions: Datasets characterize the proportion of alerts by stage, scenario index, campaign duration, and coverage per technique.
- Benchmarks: Detection utility is quantified by metrics such as accuracy, precision, recall, F1, and AUC for both event-level and scenario-level detection. For instance, DeepTaskAPT achieves accuracy >0.98 and recall up to 0.88 on OpTC, while SAGA reports near-perfect F1 (≥0.96) for graph-based detectors such as Unicorn (Huang et al., 2024, Mamun et al., 2021).
- Scenario-Level Evaluation: Metrics like correlation index (number of preserved adjacent-stage transitions) and scenario labels (two-step, three-step, etc.) support evaluation of multi-stage detection strategies.
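A metric in the spirit of the correlation index can be sketched as a count of preserved adjacent-stage transitions; the exact definition used by the datasets may differ, so this is an illustrative reading:

```python
# Illustrative correlation index: count adjacent canonical-stage transitions
# (A->B, B->C, C->D, D->E) preserved in order in a detected stage sequence.
CANON = "ABCDE"

def correlation_index(stages):
    """Number of adjacent canonical-stage transitions preserved in order."""
    return sum(1 for x, y in zip(stages, stages[1:])
               if x in CANON and y in CANON
               and CANON.index(y) == CANON.index(x) + 1)

# A fully ordered scenario preserves all four transitions:
correlation_index(list("ABCDE"))   # -> 4
# A partially reordered detection preserves fewer:
correlation_index(list("ABDCE"))   # -> 1
```

Under this reading, a "two-step" or "three-step" scenario label corresponds to how many consecutive stages a detector reconstructed.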
6. Extensibility, Adaptation, and Environment Coverage
Synthetic APT dataset frameworks are designed for extensibility and adaptation across environments and scenarios:
- Alert/Technique Integration: New alert types or ATT&CK techniques can be incorporated by extending template sets, updating mapping and grammar rules, and adjusting correlation and clustering parameters (Tijjani et al., 10 Jan 2026, Huang et al., 2024).
- Custom Campaigns and Durations: SAGA (Huang et al., 2024) allows arbitrary campaign duration and compositional mixing of known, randomly generated, or composite APT scenarios, supporting diverse adversarial models.
- Environment Specificity: Datasets are generated for campus/enterprise networks (S-DAPT-2026), IIoT testbeds (CICAPT-IIoT), and host-based Windows environments (OpTC, SAGA), enabling cross-domain validation and transfer learning for detection models.
- Provenance and Cross-Modality: Datasets such as CICAPT-IIoT (Ghiasvand et al., 2024) fuse provenance, network, and audit logs, supporting advanced analytics such as graph embedding and temporal modeling in addition to conventional sequence classification.
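The extensibility pattern described above can be sketched as a mutable registry in which adding a new technique means registering its alert type, stage mapping, and event template together. The class and entity names are illustrative assumptions, not the actual APIs of these frameworks:

```python
# Hedged sketch of extensibility via a template/mapping registry: one call
# registers a new alert type, its canonical stage, and its event template.
class DatasetConfig:
    def __init__(self):
        self.stage_map = {}   # alert type -> canonical stage (A-E)
        self.templates = {}   # technique ID -> event template

    def register_technique(self, alert_type, stage, technique_id, template):
        self.stage_map[alert_type] = stage
        self.templates[technique_id] = template

cfg = DatasetConfig()
cfg.register_technique(
    alert_type="kerberoasting",
    stage="C",                    # dataset-level stage assignment, illustrative
    technique_id="T1558.003",     # ATT&CK sub-technique: Kerberoasting
    template=[("proc", "TGSRequest", "krbtgt")],
)
```

Correlation and clustering parameters (e.g., neighborhood size, time windows) would then be re-tuned so that scenarios containing the new alert type are still grouped correctly.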
7. Use Cases, Limitations, and Future Directions
Synthetic APT datasets serve as reference benchmarks for:
- Training, cross-validation, and comparative assessment of ML models (LSTM, GNN, Random Forest, subgraph matching) for APT detection, stage attribution, and campaign forecasting.
- Transfer learning and generalization testing, e.g., training on SAGA and validating on external red-team campaigns (Huang et al., 2024).
- Evaluation of alert correlation, scenario reduction, and anomaly detection strategies under controlled label and scenario complexity (Tijjani et al., 10 Jan 2026).
Recognized limitations include the risk of simulation/modeling bias, the challenge of matching real attacker variability, and, for fully synthetic traces, the absence of genuine adversary adaptive tactics (as in OpTC (Mamun et al., 2021)). A plausible implication is that future research may combine synthetic datasets with real-world traces, employ adversarial training, or further refine scenario composition grammars to narrow the representational gap.
Synthetic APT datasets, by providing systematic, well-annotated, multistage emulations of advanced threats, remain foundational to rigorous and reproducible security analytics research (Tijjani et al., 10 Jan 2026, Huang et al., 2024, Ghiasvand et al., 2024, Mamun et al., 2021).