CIC-IDS2017: Benchmark for Intrusion Detection
- CIC-IDS2017 is a comprehensive intrusion detection dataset featuring realistic network traffic and diverse attack scenarios.
- It employs an extensive feature extraction pipeline with detailed flow metrics, temporal patterns, and packet statistics for model evaluation.
- The dataset is applied in both flow-based supervised ML and hybrid deep anomaly detection pipelines, achieving high detection accuracies.
The CIC-IDS2017 dataset is a benchmark intrusion detection dataset developed for evaluating machine learning and deep learning approaches to network-based anomaly and misuse detection. Produced by the Canadian Institute for Cybersecurity (CIC) at the University of New Brunswick, CIC-IDS2017 is distinguished by its breadth of attack scenarios, comprehensive feature extraction pipeline, and continued use in empirical studies of both flow-based and behavior-level network anomaly detection frameworks.
1. Dataset Construction and Attack Scenarios
CIC-IDS2017 was generated during a five-day network capture (Monday through Friday) in a controlled testbed comprising a dedicated attack network and a victim network of heterogeneous Windows, Linux, and macOS hosts. Realistic background (benign) traffic was synthesized using the “B-Profile” system, which models per-host user activity over common protocols (HTTP, HTTPS, FTP, SSH, and email) to mimic enterprise environments. Contemporary attack tools (e.g., Patator, slowloris, Hulk) were deployed to orchestrate a wide range of attack vectors. The dataset encapsulates the following scenarios:
- Brute-force FTP and SSH attacks (“FTP-Patator”, “SSH-Patator”)
- Multiple DoS variants (Hulk, GoldenEye, slowloris, slowhttptest) and Distributed Denial-of-Service (DDoS)
- Port Scanning
- Web attacks (brute force, SQL injection, cross-site scripting (XSS))
- Botnet activity and Infiltration
- Heartbleed exploit
All network traffic was captured in PCAP format and subsequently converted into flow records by CICFlowMeter-V3 (Talukder et al., 2024).
2. Feature Extraction, Structure, and Preprocessing
CICFlowMeter-V3 produces 80 numeric flow features for each network connection. These features are subdivided into:
- Basic flow identifiers: Flow ID, addresses/ports, protocol, timestamps
- Packet and byte statistics: Counts and distributions for both forward and backward streams
- Packet length and header metrics: Moments of packet sizes, header lengths, control flag counts
- Temporal metrics: Flow duration, inter-arrival times (means, minimums, maximums, standard deviations), throughput metrics (bytes/s, packets/s).
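As an illustration of the throughput metrics, the bytes/s and packets/s rates can be recomputed from the raw flow counters. This is a minimal sketch assuming the flow duration is reported in microseconds (CICFlowMeter's granularity); the function name is hypothetical.

```python
def flow_rates(duration_us: int, total_bytes: int, total_packets: int):
    """Recompute CICFlowMeter-style throughput metrics from raw counters.

    Assumes `duration_us` is the flow duration in microseconds.
    """
    duration_s = duration_us / 1e6
    if duration_s == 0:  # zero-duration (single-packet) flows occur in the data
        return 0.0, 0.0
    return total_bytes / duration_s, total_packets / duration_s

# A 1-second flow carrying 1500 bytes across 10 packets:
bps, pps = flow_rates(1_000_000, 1500, 10)
```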
The raw dataset exhibits severe class imbalance, a significant obstacle to high-fidelity modeling. For instance, the Heartbleed class has only 11 flow instances compared to 2,273,097 benign flows, an imbalance ratio (IR) exceeding 2 × 10⁵. No feature engineering beyond extraction and labeling was reported for the dataset itself; subsequent research commonly applies dimensionality reduction or new embeddings for downstream tasks (Talukder et al., 2024).
Preprocessing steps found in ML benchmarking pipelines include:
- Removal of rows with missing or infinite values.
- Deduplication and column name sanitization.
- Integer and float downcasting for memory efficiency.
- Z-score standardization of numeric features.
- Label encoding for class labels.
- Random oversampling (RO) of minority classes, achieving class balance before reduction or model training.
- Optional stacking of meta-features from clustering.
- Dimensionality reduction via PCA, typically down to 10 principal components (Talukder et al., 2024).
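The preprocessing steps above can be sketched end to end. This is a minimal illustration on synthetic stand-in data (the real CSVs have ~80 columns), and it substitutes a plain-NumPy oversampler for imbalanced-learn's `RandomOverSampler`; column names and sizes are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder, StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for a CIC-IDS2017 flow table.
df = pd.DataFrame({
    "Flow Duration": rng.integers(1, 10**6, 300).astype(float),
    "Fwd Packets/s": rng.random(300) * 1e4,
    "Bwd Packets/s": rng.random(300) * 1e4,
    "Label": ["BENIGN"] * 280 + ["Heartbleed"] * 20,
})
df.loc[0, "Fwd Packets/s"] = np.inf   # simulate bad rows found in the CSVs
df.loc[1, "Flow Duration"] = np.nan

# 1. Drop rows with missing/infinite values, then deduplicate.
df = df.replace([np.inf, -np.inf], np.nan).dropna().drop_duplicates()

# 2. Downcast numerics for memory efficiency.
num_cols = df.columns.drop("Label")
df[num_cols] = df[num_cols].apply(pd.to_numeric, downcast="float")

# 3. Z-score standardization and label encoding.
X = StandardScaler().fit_transform(df[num_cols])
y = LabelEncoder().fit_transform(df["Label"])

# 4. Random oversampling: resample every class up to the majority count.
counts = np.bincount(y)
idx = np.concatenate([
    rng.choice(np.where(y == c)[0], size=counts.max(), replace=True)
    for c in range(len(counts))
])
X_bal, y_bal = X[idx], y[idx]

# 5. PCA (10 components in the paper; capped by the synthetic feature count).
X_red = PCA(n_components=min(10, X_bal.shape[1])).fit_transform(X_bal)
```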
3. Dataset Size, Label Distribution, and Imbalance
The dataset comprises 2,830,743 flow records spanning 13 distinct classes. A tabular summary:
| Class | Count | Percentage |
|---|---|---|
| BENIGN | 2,273,097 | 80.30% |
| DoS Hulk | 231,073 | 8.16% |
| PortScan | 158,930 | 5.61% |
| DDoS | 128,027 | 4.52% |
| DoS GoldenEye | 10,293 | 0.36% |
| FTP-Patator | 7,938 | 0.28% |
| SSH-Patator | 5,897 | 0.21% |
| DoS slowloris | 5,796 | 0.20% |
| DoS slowhttptest | 5,499 | 0.19% |
| Web Attack | 2,180 | 0.08% |
| Bot | 1,966 | 0.07% |
| Infiltration | 36 | <0.01% |
| Heartbleed | 11 | <0.01% |
This pronounced class imbalance poses challenges for classifier training and evaluation, and is commonly addressed through synthetic minority oversampling or stratified cross-validation (Talukder et al., 2024). In downstream studies, highly imbalanced classes are sometimes merged or, in rigorous benchmarks, oversampled to match majority class counts.
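The imbalance statistics can be reproduced directly from the class counts in the table above; a minimal sketch:

```python
# Flow counts per class, transcribed from the table above.
counts = {
    "BENIGN": 2_273_097, "DoS Hulk": 231_073, "PortScan": 158_930,
    "DDoS": 128_027, "DoS GoldenEye": 10_293, "FTP-Patator": 7_938,
    "SSH-Patator": 5_897, "DoS slowloris": 5_796, "DoS slowhttptest": 5_499,
    "Web Attack": 2_180, "Bot": 1_966, "Infiltration": 36, "Heartbleed": 11,
}

total = sum(counts.values())                       # 2,830,743 flows
majority, minority = max(counts.values()), min(counts.values())
imbalance_ratio = majority / minority              # BENIGN vs. Heartbleed
benign_share = counts["BENIGN"] / total            # ~80.3% benign
```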
4. Application in Intrusion Detection Pipelines
CIC-IDS2017 serves as a foundational evaluation dataset for both supervised and unsupervised network anomaly detection methods. Representative pipelines include:
- Flow-based supervised ML: Talukder et al. preprocess flows as described, apply random oversampling for class balancing, then reduce features via PCA before 10-fold cross-validation with classifiers such as decision trees (DT), random forests (RF), and extra trees (ET). This protocol yields flow-level detection accuracies up to 99.99% on balanced data (Talukder et al., 2024).
- Unsupervised multi-flow and behavior-level detection: BLADE, an unsupervised anomaly detection system, first corrects the original PCAP labeling using published errata. It groups TCP flows into multi-flow samples using a non-overlapping window (W=50), then extracts sequences of packet sizes, inter-arrival times, and TCP flags (each truncated/padded to length 50). Features are embedded using a Bidirectional GRU autoencoder and clustered into pseudo-operation classes via HDBSCAN. Anomaly scores calibrated on training losses are computed, aggregated with LogSumExp, and used alongside pseudo-operation labels and timestamps as input to a one-class SVM for final decision-making. Using this pipeline, BLADE achieves an F1 score of 0.9801 (DoS F1=0.9782, Botnet F1=0.9943, PortScan F1=0.9995, Web Attack F1=0.9472, Brute Force F1=0.9814) (Dong et al., 7 Nov 2025).
- Hybrid deep anomaly detection and RL: In the context of AI-driven firewall optimization, each flow is represented by a 78-dimensional, real-valued feature vector. LSTM-based temporal encoding, followed by 1D CNN and sigmoid classification, yields anomaly scores used both as detection outputs and for reward shaping in a deep reinforcement learning (DRL) agent. No explicit normalization or split strategy is provided; streaming and prioritized experience replay are used to address class imbalance by focusing on rare anomaly events (Ahmad, 21 May 2025).
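The LogSumExp aggregation used in the BLADE pipeline above can be sketched in a few lines; the per-flow scores here are illustrative, and the stable shift-by-maximum formulation mirrors `scipy.special.logsumexp`.

```python
import math

def logsumexp(scores):
    """Numerically stable LogSumExp: log(sum(exp(s))) over per-flow scores.

    Shifting by the maximum avoids overflow; the aggregate is dominated by
    (but strictly greater than) the single most anomalous flow, making it a
    soft-maximum over the multi-flow sample.
    """
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

# Aggregate illustrative per-flow anomaly scores for one multi-flow sample:
sample_scores = [0.1, 0.2, 3.5, 0.05]
aggregate = logsumexp(sample_scores)
```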
5. Train/Test Splits, Validation Strategies, and Evaluation Metrics
Data partitioning varies by study:
- Standard ML pipelines employ stratified 10-fold cross-validation (90% train, 10% test per fold; no separate validation set) (Talukder et al., 2024).
- BLADE reserves 70% of benign multi-flow samples for training and 30% for testing, with all malicious multi-flow samples held out exclusively for test. No cross-validation or hyperparameter sweeps beyond architecture defaults are reported (Dong et al., 7 Nov 2025).
- RL-based pipelines do not utilize fixed splits, instead relying on streaming data to model online/continual learning (Ahmad, 21 May 2025).
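The BLADE-style partitioning described above (train on benign samples only; test on the held-out benign portion plus all malicious samples) can be sketched with scikit-learn; the array contents and sizes are synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_benign = rng.normal(size=(1000, 8))    # synthetic benign multi-flow samples
X_malicious = rng.normal(size=(120, 8))  # synthetic malicious samples

# 70% of benign samples for training; malicious samples never enter training.
X_train, X_benign_test = train_test_split(X_benign, test_size=0.3,
                                          random_state=0)

X_test = np.vstack([X_benign_test, X_malicious])
y_test = np.concatenate([np.zeros(len(X_benign_test)),
                         np.ones(len(X_malicious))])
```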
Evaluation metrics include standard precision, recall, and F1-score for each attack class and macro-averaged metrics for overall performance (Dong et al., 7 Nov 2025). Supervised pipelines report accuracy per cross-validation fold (Talukder et al., 2024).
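Per-class and macro-averaged metrics of this kind can be computed with scikit-learn's `precision_recall_fscore_support`; the labels below are illustrative (0 = benign, 1 = attack).

```python
from sklearn.metrics import precision_recall_fscore_support

# Illustrative flow-level ground truth and predictions.
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]

# Per-class precision, recall, F1, and support...
per_class = precision_recall_fscore_support(y_true, y_pred, average=None,
                                            zero_division=0)
# ...and macro averages, weighting each class equally regardless of support,
# which is the appropriate choice under heavy class imbalance.
macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
```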
6. Known Issues, Dataset Corrections, and Practical Usage Notes
Multiple sources note labeling errors, packet misordering, and flow-level duplication in the original CIC-IDS2017 PCAPs. These are typically remediated by post-hoc corrections published in the literature (e.g., the cicidsfix correction scripts and the error analysis of Liu et al., 2022). The corrected flows improve training stability and ensure that evaluation reflects truly benign versus anomalous behavior (Dong et al., 7 Nov 2025). Users are advised to apply these fixes before benchmarking.
In practical workflows, all preprocessing steps are codified in research pipelines, for example as Python pseudocode using pandas, scikit-learn, and imbalanced-learn. The exact sequence (dropping nulls/infs, data-type downcasting, scaling, encoding, oversampling, PCA) renders the dataset compatible with contemporary ML frameworks (Talukder et al., 2024).
7. Context, Impact, and Limitations
CIC-IDS2017 is widely adopted for benchmarking in intrusion detection, network anomaly detection, and behavioral analysis in cybersecurity. Its exhaustive attack coverage, real-world-like traffic, and comprehensive flow features facilitate robust empirical validation. However, extreme class imbalance and historical label inconsistency necessitate rigorous preprocessing. A plausible implication is that results derived from the dataset, especially on rare attack types, are sensitive to the balancing and correction methodologies employed. Furthermore, only flow-level labels are canonical; behavior-level ground truth must be synthesized by aggregation.
Continued citations in contemporary work underscore its relevance in evaluating both traditional ML classifiers and deep, streaming, or unsupervised models, spanning SDN security, dynamic firewall optimization, and multi-flow detection paradigms (Ahmad, 21 May 2025, Talukder et al., 2024, Dong et al., 7 Nov 2025).