CICIoT2023 Dataset
- CICIoT2023 is a real-time IoT network traffic dataset capturing benign activity and 33 distinct attack types from 105 devices.
- The dataset includes roughly 46.6 million labeled flows with detailed flow and packet-level features, supporting robust ML and IDS evaluations.
- Its comprehensive attack taxonomy and preprocessing pipelines enable effective benchmarking for traditional, deep, federated, and LLM-based cybersecurity methods.
CICIoT2023 is a large-scale, real-time network traffic dataset created by the Canadian Institute for Cybersecurity to support the development and benchmarking of intrusion detection systems (IDS) for Internet-of-Things (IoT) environments. Comprising traffic from 105 heterogeneous devices, including smart cameras, sensors, microcontrollers, and Zigbee/Z-Wave endpoints, the dataset captures both benign activity and a broad spectrum of realistic attack scenarios. Its extensive size, coverage of 33 unique attack types plus benign traffic, and granular flow- and packet-level statistics position CICIoT2023 as a core resource for research on machine learning, federated learning, and LLM-driven cybersecurity in IoT contexts.
1. Dataset Composition and Scope
CICIoT2023 consists of packet-capture (PCAP) and processed CSV files representing 169 traffic traces collected from 67 actively communicating IoT devices and 38 Zigbee/Z-Wave endpoints, all deployed in a controlled testbed across five home automation hubs. The dataset encapsulates real-world threat emulation: all malicious events reflect attacks initiated by compromised devices targeting other IoT systems within the network, ensuring that attack signatures align closely with observed adversarial behaviors in practical deployments (Gueriani et al., 2024, Gad et al., 20 Nov 2025, Diaf et al., 3 Jan 2025, Sudasinghe et al., 21 Jan 2026).
The primary data modalities are:
- Flows: Each bidirectional network flow (e.g., TCP/UDP sessions) is described by 45–46 numeric features per record, depending on downstream preprocessing or toolchain.
- Packets: In some studies, individual packet records are extracted (up to 71 features) using Tranalyzer and similar tools.
- Labels: Every sample is annotated with its corresponding class: "Benign" or one of 33 attack types, hierarchically grouped into seven broad categories.
The full dataset spans over 46.6 million labeled flow samples (Narayan et al., 2023), with some works extracting targeted, balanced subsets (on the order of 1–2 million samples) to accommodate hardware and algorithmic constraints (Gueriani et al., 2024).
2. Attack Taxonomy and Class Distribution
CICIoT2023’s attack taxonomy is hierarchical, encompassing seven top-level categories, within which a total of 33 distinct attack variants are enumerated:
| High-Level Category | Example Attack Types |
|---|---|
| DDoS | ACK fragmentation, UDP flood, SlowLoris, ICMP flood, RSTFIN flood, HTTP flood, SYN flood, etc. |
| Brute-Force | Dictionary brute-force login attempts |
| Spoofing | ARP spoofing, DNS spoofing |
| DoS | TCP flood, HTTP flood, SYN flood, UDP flood (single-source DoS) |
| Reconnaissance | Ping sweep, OS scan, Vulnerability scan, Port scan, Host discovery |
| Web-Based | SQL injection, Command injection, Backdoor malware, File-upload, Cross-site scripting, Hijacking |
| Mirai-Family | GREIP flood, GREeth flood, UDPPlain variants exploiting Mirai botnet code |
Class proportions are highly imbalanced in the raw dataset. For example, DDoS-ICMP_Flood is the most frequent attack, while Uploading_Attack is the rarest, with a class-count ratio approaching 5,751:1 in extreme cases (Narayan et al., 2023). This pronounced imbalance motivates explicit sub-sampling and balanced training set construction in most contemporary experiments (Gad et al., 20 Nov 2025, Sudasinghe et al., 21 Jan 2026).
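The class-count ratio quoted above is simply the size of the largest class divided by the size of the smallest. A minimal sketch of that arithmetic follows; the function name `imbalance_ratio` and the toy counts are illustrative assumptions, not the real CICIoT2023 distribution.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return (majority_label, minority_label, majority_count / minority_count)."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels is empty")
    # most_common() orders classes by descending sample count.
    ranked = counts.most_common()
    (major, n_major), (minor, n_minor) = ranked[0], ranked[-1]
    return major, minor, n_major / n_minor


# Toy label column: the counts are made up for illustration only.
toy_labels = (
    ["DDoS-ICMP_Flood"] * 6000
    + ["Benign"] * 900
    + ["Uploading_Attack"] * 3
)

major, minor, ratio = imbalance_ratio(toy_labels)
print(f"{major} vs {minor}: {ratio:.0f}:1")
```

Running the same function over the label column of the full dataset would give the ratio reported in the literature, provided the label strings match those used in the CSV files.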
3. Feature Representation and Raw Data Structure
CICIoT2023 provides high-dimensional feature vectors for each sample, with feature cardinality depending on record type and post-processing pipeline:
- Flow-Level Features (45–46 per record): Includes flow duration, header length, byte and packet counts, TCP flag frequencies, protocol indicators (HTTP, DNS, etc.), source/destination rates, and general network statistics. Full feature tables are partially supplied in some papers; specific definitions and formulas for every feature are not always made public (Gueriani et al., 2024, Narayan et al., 2023, Gad et al., 20 Nov 2025).
- Packet-Level Features (up to 71 raw, reduced to 23–28 post-selection): Cover L2–L4 header fields (MAC, IP, port numbers, flags), packet/flow timing, size distributions, and derived statistical summaries. Feature selection is generally automated through variance and correlation filtering (e.g., removing features with variance < 25% or with |ρ| above a cutoff of 0.98 or 0.90, depending on the study), yielding the final feature vector per sample (typically m = 23–28 for LLM-driven experiments) (Diaf et al., 3 Jan 2025, Sudasinghe et al., 21 Jan 2026, Diaf et al., 2024).
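The variance and correlation filtering described for packet-level features can be sketched as a two-step procedure: drop near-constant columns, then greedily drop any column that is too strongly correlated with one already kept. This is a generic illustration, not the cited authors' pipeline; the function names, the toy feature names, and the threshold values passed in are assumptions.

```python
import math
from statistics import fmean, pvariance


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def select_features(columns, min_variance, max_abs_corr):
    """columns maps feature name -> list of floats; returns kept names in input order."""
    # Step 1: discard columns whose population variance is below the threshold.
    candidates = [name for name, values in columns.items()
                  if pvariance(values) >= min_variance]
    # Step 2: keep a column only if it is not too correlated with any kept column.
    selected = []
    for name in candidates:
        if all(abs(pearson(columns[name], columns[kept])) <= max_abs_corr
               for kept in selected):
            selected.append(name)
    return selected


# Hypothetical toy features, not the dataset's real column names.
toy = {
    "flow_duration": [1.0, 2.0, 3.0, 4.0],
    "duration_ms": [1000.0, 2000.0, 3000.0, 4000.0],  # rescaled duplicate
    "constant_flag": [1.0, 1.0, 1.0, 1.0],            # zero variance
    "syn_count": [3.0, 1.0, 4.0, 1.0],
}
print(select_features(toy, min_variance=0.01, max_abs_corr=0.98))
```

Because raw variance depends on each feature's scale, a variance threshold is usually only meaningful after the columns have been brought to a common range.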
Distinct works define additional engineered features (for example, average packet inter-arrival time, flow rate, or header ratio summaries) to enhance discriminative power for machine learning models.
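As an illustration of such engineered features, the sketch below computes three of them for a single flow. The exact definitions used in individual papers are not given here, so the formulas are assumptions: mean inter-arrival time as the average gap between consecutive packets, flow rate as packets per second over the flow's duration, and header ratio as header bytes over total bytes.

```python
def engineered_features(timestamps, header_bytes, total_bytes):
    """Derive simple per-flow features.

    timestamps: packet arrival times in seconds, sorted ascending.
    header_bytes / total_bytes: byte totals for the whole flow.
    """
    if len(timestamps) < 2:
        raise ValueError("need at least two packets to compute inter-arrival times")
    duration = timestamps[-1] - timestamps[0]
    # Gaps between consecutive packets.
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    return {
        "mean_iat": sum(gaps) / len(gaps),
        "flow_rate": len(timestamps) / duration if duration > 0 else 0.0,
        "header_ratio": header_bytes / total_bytes if total_bytes else 0.0,
    }


# Toy flow: four packets over two seconds (illustrative values).
features = engineered_features([0.0, 0.5, 1.0, 2.0], header_bytes=160, total_bytes=800)
print(features)
```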
4. Preprocessing, Data Cleaning, and Partitioning Strategies
CICIoT2023 experiments employ varied data preparation pipelines, matching the specific requirements of traditional ML, deep neural, federated, and LLM-based frameworks:
- Feature selection: Some studies retain all features; others prune by variance/correlation thresholds (Diaf et al., 2024, Diaf et al., 3 Jan 2025, Sudasinghe et al., 21 Jan 2026).
- Normalization: Common approaches include min–max or z-score (standardization to zero mean/unit variance), particularly for flow-based ML models (Narayan et al., 2023, ElSayed et al., 2024). LLM pipelines often use raw or rounded feature values within text-based prompt templates (Sudasinghe et al., 21 Jan 2026).
- Encoding: Categorical features (e.g., protocol type) are typically ordinally encoded or used as binary indicators (HTTP = 1/0). Label encoding is generally flat (one-hot or integer-valued).
- Class balancing: To overcome extreme class imbalance, the majority of studies resort to random undersampling (to a fixed sample count per class), random oversampling (ROS), or balanced random forest bootstrapping (BRFC) (Narayan et al., 2023, Gad et al., 20 Nov 2025). For LLM fine-tuning, per-class balancing is standard (e.g., exactly 500–1,000 samples per class for seen/unseen splits) (Sudasinghe et al., 21 Jan 2026).
- Splitting: Partitioning strategies include random train/validation/test splits (e.g., 80/20, 75/25, 70/15/15), hold-out sets, k-fold cross-validation (rare), and specialized federated or leave-class-out splits (for zero-shot evaluation) (Gueriani et al., 2024, Gad et al., 20 Nov 2025, Sudasinghe et al., 21 Jan 2026).
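The balancing and splitting steps above can be sketched with the standard library alone: random undersampling to a fixed count per class, followed by a stratified 80/20 split. This is one plausible reading of those steps, not a specific paper's pipeline; the helper names, the `(id, label)` row layout, and the toy class sizes are assumptions.

```python
import random
from collections import defaultdict


def group_by_label(rows, label_of):
    """Bucket rows by their class label."""
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)
    return groups


def undersample(rows, label_of, per_class, seed=0):
    """Keep at most per_class randomly chosen rows from every class."""
    rng = random.Random(seed)
    balanced = []
    for label, group in sorted(group_by_label(rows, label_of).items()):
        balanced.extend(rng.sample(group, min(per_class, len(group))))
    return balanced


def stratified_split(rows, label_of, test_fraction=0.2, seed=0):
    """Split each class separately so train and test keep the same class mix."""
    rng = random.Random(seed)
    train, test = [], []
    for label, group in sorted(group_by_label(rows, label_of).items()):
        group = group[:]          # copy before shuffling in place
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


def label_of(row):
    return row[1]


# Toy rows of the form (sample_id, label); class sizes are illustrative.
toy_rows = (
    [(i, "DDoS-ICMP_Flood") for i in range(1000)]
    + [(i, "Benign") for i in range(200)]
    + [(i, "Uploading_Attack") for i in range(50)]
)

balanced = undersample(toy_rows, label_of, per_class=50, seed=1)
train, test = stratified_split(balanced, label_of, test_fraction=0.2, seed=1)
print(len(balanced), len(train), len(test))
```

Undersampling before splitting keeps the example short; splitting first and balancing only the training portion is an equally common choice when the test set should preserve the raw class distribution.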
Augmentation pipelines for neural sequence models can further include subflow generation, jitter injection, noise perturbations, and artificial packet insertions (Panopoulos et al., 22 Jun 2025).
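Three of these augmentations can be sketched on a packet sequence as follows. This is a generic interpretation of subflow generation, jitter injection, and noise perturbation, not the cited authors' implementation; the function names, parameter names, and toy values are assumptions.

```python
import random


def subflows(packets, length, stride):
    """Slide a fixed-length window over a packet sequence to create subflows."""
    return [packets[i:i + length]
            for i in range(0, len(packets) - length + 1, stride)]


def jitter_timestamps(timestamps, max_jitter, rng):
    """Shift each arrival time by uniform noise, then re-sort to keep time order."""
    return sorted(t + rng.uniform(-max_jitter, max_jitter) for t in timestamps)


def perturb_sizes(sizes, sigma, rng, min_size=1):
    """Add Gaussian noise to packet sizes, clamped so no size drops below min_size."""
    return [max(min_size, round(s + rng.gauss(0.0, sigma))) for s in sizes]


rng = random.Random(42)  # fixed seed so the augmentation is reproducible

toy_times = [0.0, 1.0, 2.0, 3.0]   # seconds, illustrative
toy_sizes = [60, 1500, 40, 576]    # bytes, illustrative

print(subflows(list(range(5)), length=3, stride=1))
print(jitter_timestamps(toy_times, max_jitter=0.1, rng=rng))
print(perturb_sizes(toy_sizes, sigma=5.0, rng=rng))
```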
5. Use Cases and Experimental Protocols
CICIoT2023 serves as the foundational benchmark for a spectrum of IDS and IoT security research efforts, including:
- Traditional ML/Random Forest: Baseline IDSs using feature engineering, normalization, and class balancing, with reported per-class F₁-score gains from stricter balancing and feature subset selection (Narayan et al., 2023).
- Deep Learning (CNN, LSTM, Autoencoder, Transformer): End-to-end neural architectures exploit the high dimensionality and temporal structure of the dataset, using CNN-LSTM hybrids, Transformer encodings, and temporally aware augmentation to improve early detection and sequence modelling (Gueriani et al., 2024, Panopoulos et al., 22 Jun 2025, ElSayed et al., 2024).
- Federated Learning: Experiments on federated model training explore the effects of IID versus non-IID dataset partitioning, category-wise client allocation, and the impacts of statistical heterogeneity and balancing on model convergence and attack detection (Gad et al., 20 Nov 2025).
- LLM-based Architectures: Recent studies integrate LLMs (GPT, BERT, BART, LLaMA) for both traffic prediction (structured-to-text conversion, prompt-based classification) and zero-shot threat detection (retrieval-augmented generation, leave-class-out evaluation), leveraging text-formatted features and transformer pipelines (Diaf et al., 2024, Diaf et al., 3 Jan 2025, Sudasinghe et al., 21 Jan 2026).
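The difference between IID and category-wise (non-IID) client allocation in the federated learning setting can be sketched as below. This is a simplified illustration rather than the cited study's partitioning scheme; the function names, the round-robin assignment of categories to clients, and the toy samples are assumptions.

```python
import random
from collections import defaultdict


def iid_partition(samples, n_clients, seed=0):
    """Shuffle all samples and deal them out evenly, so every client sees every class."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    return [shuffled[i::n_clients] for i in range(n_clients)]


def category_partition(samples, category_of, n_clients):
    """Non-IID: each attack category is assigned whole to one client (round-robin)."""
    by_category = defaultdict(list)
    for sample in samples:
        by_category[category_of(sample)].append(sample)
    clients = [[] for _ in range(n_clients)]
    for index, category in enumerate(sorted(by_category)):
        clients[index % n_clients].extend(by_category[category])
    return clients


def category_of(sample):
    return sample[1]


# Toy samples of the form (sample_id, category); ten per category.
toy_samples = [(i, cat)
               for cat in ("DDoS", "DoS", "Reconnaissance", "Spoofing")
               for i in range(10)]

iid_clients = iid_partition(toy_samples, n_clients=2, seed=7)
non_iid_clients = category_partition(toy_samples, category_of, n_clients=2)

for name, clients in (("IID", iid_clients), ("non-IID", non_iid_clients)):
    print(name, [sorted({category_of(s) for s in c}) for c in clients])
```

Under the category-wise scheme no two clients share an attack category, which is the kind of statistical heterogeneity that slows federated convergence.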
Data splits, balancing, and augmentation are carefully adapted in each experimental context to support model robustness, mitigate skew, and enable both seen and unseen attack detection.
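A leave-class-out split for unseen-attack evaluation, together with a simple structured-to-text conversion of a flow record, might look like the sketch below. The prompt format, helper names, feature names, and toy rows are hypothetical; published LLM pipelines use their own templates.

```python
def leave_class_out(rows, label_of, held_out):
    """Separate rows into seen classes (for training) and held-out classes (zero-shot test)."""
    held = set(held_out)
    seen = [row for row in rows if label_of(row) not in held]
    unseen = [row for row in rows if label_of(row) in held]
    return seen, unseen


def to_text(features, digits=2):
    """Serialize a feature dict as 'name=value' pairs with rounded values."""
    return ", ".join(f"{name}={round(value, digits)}"
                     for name, value in features.items())


def label_of(row):
    return row["label"]


# Toy records; feature names and values are illustrative.
toy_rows = [
    {"label": "Benign", "features": {"flow_duration": 1.23456, "syn_count": 3}},
    {"label": "DDoS-ICMP_Flood", "features": {"flow_duration": 0.01234, "syn_count": 0}},
    {"label": "Uploading_Attack", "features": {"flow_duration": 7.5, "syn_count": 12}},
]

seen, unseen = leave_class_out(toy_rows, label_of, held_out=["Uploading_Attack"])
print([label_of(r) for r in seen], [label_of(r) for r in unseen])
print(to_text(toy_rows[0]["features"]))
```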
6. Statistical Properties and Limitations
While the full dataset encompasses approximately 47 million labeled flows (Gueriani et al., 2024, Narayan et al., 2023), most published experiments operate on curated subsets due to computational or storage constraints—typically extracting up to 1–2 million records or balanced mini-corpora (e.g., 500–1,000 samples per class) (Sudasinghe et al., 21 Jan 2026, Diaf et al., 3 Jan 2025). The dataset’s raw form is characterized by:
- Severe class imbalance: Sample counts per attack can vary by factors exceeding 10³–10⁴, underscoring the importance of strict balancing and evaluation via class-weighted metrics (e.g., per-class F₁ for “unsaturated” classes) (Narayan et al., 2023).
- Feature distribution: Means, variances, and detailed distribution tables are infrequently published; most works report only that feature set statistics are consistent across train/validation/test splits after balancing and partitioning (Sudasinghe et al., 21 Jan 2026).
- Protocol diversity and attack/benign mixture: Abundant cross-protocol flows (TCP, UDP, ICMP, HTTP(S), DNS) and realistic benign device chatter co-occur, challenging overfitting and improving ecological validity.
Known limitations include lack of explicit public formulas for all features, limited per-class sample count disclosure, hardware-constrained subset selection, and—where applicable—absence of formal cross-validation in evaluation protocols (Gueriani et al., 2024, Gad et al., 20 Nov 2025).
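To see why per-class metrics matter under this degree of imbalance, the sketch below computes accuracy, per-class F₁, and macro-averaged F₁ for a toy classifier that always predicts the majority class. The function name and the toy predictions are illustrative assumptions; the F₁ formula is the standard 2·TP / (2·TP + FP + FN).

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} using F1 = 2*TP / (2*TP + FP + FN) for each class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


# Toy evaluation set: 98 majority-class samples, 2 minority-class samples.
y_true = ["DDoS-ICMP_Flood"] * 98 + ["Uploading_Attack"] * 2
y_pred = ["DDoS-ICMP_Flood"] * 100  # degenerate majority-class predictor

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
scores = per_class_f1(y_true, y_pred)
macro_f1 = sum(scores.values()) / len(scores)

print(f"accuracy={accuracy:.2f}")
print(scores)
print(f"macro F1={macro_f1:.3f}")
```

The predictor scores high accuracy while its F₁ on the rare class is zero, which is the failure mode that per-class reporting is meant to expose.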
7. Impact and Applications in Current Research
CICIoT2023 has established itself as a core reference for evaluating novel IDS models (random forest, neural, hybrid, federated), proactive cyberthreat prediction using LLMs, and zero-shot detection via retrieval-augmented generative models. Performance benchmarks on the dataset include:
- CNN-LSTM hybrid IDS: 98.42% accuracy, F₁-score 98.57% (binary malicious/benign split) (Gueriani et al., 2024)
- Random Forest with balanced classes: per-class F₁ improvement up to 7.9% on hardest classes (Narayan et al., 2023)
- LLM-driven models (BART, LLaMA with RAG): 98%+ overall accuracy for known attacks; ~43% accuracy for zero-shot (unseen class) detection (Diaf et al., 3 Jan 2025, Sudasinghe et al., 21 Jan 2026)
The combination of scale, heterogeneity, and ground-truth labeling supports not only robust performance evaluation but also ablation studies on class imbalance, statistical heterogeneity, and cross-protocol generalization. Its continued adoption by the research community has accelerated the development of scalable, adaptable IDS and zero-trust security architectures in IoT contexts (ElSayed et al., 2024, Diaf et al., 2024).
References
- (Gueriani et al., 2024) Enhancing IoT Security with CNN and LSTM-Based Intrusion Detection Systems
- (Gad et al., 20 Nov 2025) A Robust Federated Learning Approach for Combating Attacks Against IoT Systems Under non-IID Challenges
- (Narayan et al., 2023) IIDS: Design of Intelligent Intrusion Detection System for Internet-of-Things Applications
- (ElSayed et al., 2024) A Novel Zero-Trust Machine Learning Green Architecture for Healthcare IoT Cybersecurity: Review, Analysis, and Implementation
- (Sudasinghe et al., 21 Jan 2026) Lightweight LLMs for Network Attack Detection in IoT Networks
- (Diaf et al., 2024) Beyond Detection: Leveraging LLMs for Cyber Attack Prediction in IoT Networks
- (Diaf et al., 3 Jan 2025) BARTPredict: Empowering IoT Security with LLM-Driven Cyber Threat Prediction
- (Panopoulos et al., 22 Jun 2025) Dynamic Temporal Positional Encodings for Early Intrusion Detection in IoT