
Carbon Aware Cybersecurity Traffic Dataset

Updated 7 January 2026
  • Carbon Aware Cybersecurity Traffic Dataset is a publicly available collection of network flow records enriched with real-time energy and carbon metrics for eco-aware anomaly detection research.
  • It contains 2,300 flow-level records with stratified train/test splits and engineered features, balanced via SMOTE to support robust machine learning experiments.
  • The dataset integrates CodeCarbon instrumentation to measure energy use, CO₂ emissions, and operational costs, facilitating eco-efficiency benchmarking of cybersecurity models.

The Carbon Aware Cybersecurity Traffic Dataset is a publicly available collection of flow-level network observations specifically designed for empirical research at the intersection of machine learning-based anomaly detection and sustainability, with explicit annotation of real-time energy and carbon metrics. Developed to support eco-aware intrusion detection, the dataset enables benchmarking of cybersecurity algorithms under both performance and environmental cost constraints, reflecting emergent green computing and federal energy-efficiency initiatives in the US (Aashish et al., 31 Dec 2025).

1. Dataset Composition and Feature Taxonomy

The dataset comprises 2,300 flow-level records, each corresponding to a single network flow observation. Flows are labeled into two classes: “Normal” (status = 0) and “Anomalous” (status = 1). Following a stratified 80/20 train/test split, class imbalance in the training portion is remedied via the Synthetic Minority Over-sampling Technique (SMOTE), resulting in balanced classes (50%/50%) for algorithm training.

Each observation contains a structured multi-domain feature set:

| Feature Type | Features (examples) | Units / Description |
|---|---|---|
| Network-Traffic | packet_count, byte_count, flow_duration, protocol_type, src_port, dst_port, avg_pkt_size, payload_entropy, connection_state | counts, bytes, s, encoded |
| System-Utilization | cpu_usage, memory_usage, disk_io, network_io, vm_count | %, MB/s, count |
| Sustainability | power_consumption_watts, carbon_emission_gCO2eq, energy_cost_usd, pue | W, gCO₂eq, USD, dimensionless |

All data is provided in CSV format (one row per flow; columns as above), facilitating direct loading into research workflows (e.g., via pandas).
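As a concrete illustration, the CSV can be loaded directly with pandas. The two sample rows and the column subset below are hypothetical; the actual file name and full schema come from the dataset distribution.

```python
import io

import pandas as pd

# Hypothetical sample mirroring a subset of the documented schema.
# In practice this would be pd.read_csv("<dataset>.csv") on the released file.
csv_text = """packet_count,byte_count,flow_duration,cpu_usage,power_consumption_watts,carbon_emission_gCO2eq,energy_cost_usd,status
120,45000,2.5,37.2,85.0,0.012,0.0004,0
9800,2100000,14.1,91.5,142.3,0.055,0.0018,1
"""

df = pd.read_csv(io.StringIO(csv_text))  # one row per flow

X = df.drop(columns=["status"])  # feature matrix
y = df["status"]                 # 0 = Normal, 1 = Anomalous

print(df.shape)  # (2, 8)
```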

Dataset preprocessing involves integrity checks (no missing data), label encoding for categorical variables, engineered features (bytes_per_packet, payload_entropy_x_size, resource_util_sum, power_per_vm), standard scaling, stratified splitting, SMOTE oversampling (training set), and, optionally, Principal Component Analysis (PCA) retaining at least 90% explained variance.
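The preprocessing steps above can be sketched with scikit-learn on a synthetic stand-in for the feature matrix. The class sizes here are assumed for illustration, and the SMOTE step is replaced by a minimal interpolation stand-in; the paper's pipeline would use imbalanced-learn's SMOTE.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the 2,300-flow feature matrix with imbalanced labels
# (class sizes assumed for illustration, not taken from the paper).
X = rng.normal(size=(2300, 20))
y = np.array([0] * 1900 + [1] * 400)

# Stratified 80/20 train/test split, as described.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Standard scaling fitted on the training portion only.
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

# SMOTE-style balancing of the training set: synthesize minority samples by
# interpolating between random minority-class pairs (stand-in for SMOTE).
minority = X_tr[y_tr == 1]
need = int((y_tr == 0).sum() - (y_tr == 1).sum())
i = rng.integers(0, len(minority), need)
j = rng.integers(0, len(minority), need)
lam = rng.random((need, 1))
synthetic = minority[i] + lam * (minority[j] - minority[i])
X_bal = np.vstack([X_tr, synthetic])
y_bal = np.concatenate([y_tr, np.ones(need, dtype=int)])

# Optional PCA retaining at least 90% explained variance.
pca = PCA(n_components=0.90).fit(X_bal)
X_red = pca.transform(X_bal)

print(np.bincount(y_bal))  # [1520 1520]
```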

2. Carbon, Energy, and Cost Attribution

Energy and carbon accountability within the dataset is achieved via real-time logging in a controlled Google Colab environment, with the entire experimental pipeline instrumented using the CodeCarbon toolkit (Fischer & Lamarr Institute, 2025). CodeCarbon captures:

  • CPU and GPU utilization, memory load (real-time sampling)
  • Translation of hardware utilization into energy consumption (kWh) via device-specific power coefficients
  • Multiplication by a standardized US grid average carbon intensity (I_CO₂ ≈ 417 g CO₂eq/kWh)
  • Estimation of monetary energy cost via the prevailing US electricity rate (C_$/kWh ≈ $0.13/kWh)

Each model training/inference run produces a structured .csv output with training time, inference time, energy (kWh), CO₂ emissions (g), and energy cost (USD), enabling transparent evaluation of both operational and environmental efficiency.

3. Eco-Efficiency Index

The Eco-Efficiency Index (EEI) is introduced as the primary metric for quantifying the trade-off between anomaly-detection efficacy and energy consumption. It is defined as:

EEI = F1-score / (Energy Consumption (kWh) + ε)

where F1-score is the harmonic mean of precision and recall, energy consumption is as measured during training or inference, and ε is a small constant (10⁻⁸) to avoid division by zero. A higher EEI denotes greater anomaly-detection effectiveness per unit of energy expended. This metric allows rigorous comparison of disparate detection architectures irrespective of absolute resource scale.

4. Model Benchmarking and Principal Findings

The dataset has been used to benchmark multiple canonical detection algorithms: Logistic Regression, Random Forest, Support Vector Machine, Isolation Forest, and XGBoost. All models are evaluated across conventional detection metrics and sustainability dimensions.

Notable empirical insights include:

  • The training phase dominates energy and carbon output compared to inference.
  • Optimized Random Forest and lightweight Logistic Regression models achieve the highest eco-efficiency, reducing energy consumption by over 40% relative to XGBoost while preserving competitive F1 performance.
  • PCA-driven dimensionality reduction (from roughly 20 features to 8, retaining at least 90% variance) further decreases computational load. For Random Forest, accuracy improves from 0.739 to 0.769, F1 rises to 0.753, CO₂ emissions drop by an order of magnitude (approximately 0.004 g vs. 0.055 g), and recall loss is negligible (±1–2%).
  • Representative energy and CO₂ records (per training run): Logistic Regression (≈2.7×10⁻¹¹ kWh, 1×10⁻⁴ g CO₂eq), Isolation Forest (1.3×10⁻⁹ kWh, 4.9×10⁻³ g), XGBoost (2.5×10⁻³ g CO₂eq, F1 ≈ 0.74), Random Forest (highest CO₂ ≈ 0.055 g, F1 ≈ 0.739).

A plausible implication is that model selection and pipeline optimization (feature selection, PCA) can yield substantial energy and carbon reductions while preserving detection coverage (Aashish et al., 31 Dec 2025).

5. Carbon and Energy Computational Methodology

The dataset’s sustainability metrics are grounded in the following protocols:

  • Real-time power estimation: CodeCarbon continuously samples CPU/GPU/memory utilization, mapping each to corresponding power coefficients (P_cpu, P_gpu, etc.). Total energy is integrated as E_train ≈ P_avg × T_train [kWh].
  • CO₂ emissions attribution: using the formula CO₂_g = E_kWh × I_CO₂, with I_CO₂ set from US national grid averages.
  • Monetary energy cost: modeled as energy_cost_usd = E_kWh × C_$/kWh, with C_$/kWh reflecting the contemporary US rate.
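The three attribution formulas can be sketched numerically. The power draw and training time below are assumed inputs for illustration, not measurements from the dataset.

```python
# Illustrative attribution calculation; P_avg and T_train are assumed values.
AVG_POWER_W = 65.0        # P_avg: assumed average draw during training, watts
TRAIN_TIME_S = 120.0      # T_train: assumed training duration, seconds
GRID_INTENSITY = 417.0    # I_CO2: US grid average, g CO2eq per kWh
RATE_USD_PER_KWH = 0.13   # C_$/kWh: US electricity rate

# E_train ≈ P_avg × T_train, converted from watt-seconds to kWh.
energy_kwh = AVG_POWER_W * TRAIN_TIME_S / 3.6e6

co2_g = energy_kwh * GRID_INTENSITY        # CO2_g = E_kWh × I_CO2
cost_usd = energy_kwh * RATE_USD_PER_KWH   # energy_cost_usd = E_kWh × C_$/kWh

print(f"{energy_kwh:.6f} kWh, {co2_g:.4f} g CO2eq, ${cost_usd:.6f}")
```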

All environmental metrics are reported alongside conventional observational features, supporting multi-objective evaluation.
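A minimal sketch of EEI-based model comparison, using hypothetical (F1, energy) pairs rather than the paper's measured values:

```python
# EEI = F1 / (E_kWh + eps), as defined in Section 3.
EPS = 1e-8

def eco_efficiency_index(f1: float, energy_kwh: float) -> float:
    return f1 / (energy_kwh + EPS)

# Hypothetical runs: (F1-score, energy in kWh).
runs = {
    "logreg_small": (0.74, 1.0e-6),
    "xgboost_full": (0.75, 5.0e-5),
}

# Rank models by eco-efficiency, best first.
ranked = sorted(runs, key=lambda k: eco_efficiency_index(*runs[k]), reverse=True)
print(ranked[0])  # logreg_small: the lighter model wins despite slightly lower F1
```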

6. Applications and Methodological Recommendations

The Carbon Aware Cybersecurity Traffic Dataset supports several research directions:

  • Benchmarking novel anomaly-detection frameworks (including deep learning or graph-based IDSs) in terms of both predictive and environmental efficiency.
  • Experimentation with adaptive detection systems that modulate algorithm selection based on real-time grid carbon intensity.
  • Extension to resource-constrained edge/IoT domains (e.g., Raspberry Pi), especially where CPU/GPU mix and Power Usage Effectiveness (PUE) fluctuate.
  • Implementation of multi-objective model optimization using the Eco-Efficiency Index and/or Pareto fronts to jointly maximize detection performance and minimize carbon footprint.
  • Systematic reporting of both traditional detection metrics and sustainability metrics (energy_kWh, CO₂_g) is recommended; life-cycle assessment (LCA) can be considered for incorporating hardware manufacturing and disposal overheads beyond runtime emissions.
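A minimal Pareto-front filter over the two objectives (maximize F1, minimize CO₂) can be sketched as follows; the candidate scores loosely echo the reported figures but are illustrative, not taken verbatim from the paper.

```python
# Candidate models as name -> (F1 to maximize, CO2 in grams to minimize).
# Values are illustrative.
candidates = {
    "logreg": (0.74, 0.0001),
    "rf_pca": (0.753, 0.004),
    "rf_full": (0.739, 0.055),
    "xgboost": (0.74, 0.0025),
}

def dominates(a, b):
    """a dominates b: at least as good on both axes, strictly better on one."""
    f1_a, co2_a = a
    f1_b, co2_b = b
    return (f1_a >= f1_b and co2_a <= co2_b) and (f1_a > f1_b or co2_a < co2_b)

# Keep only candidates not dominated by any other candidate.
pareto = [
    name for name, score in candidates.items()
    if not any(
        dominates(other, score)
        for o_name, other in candidates.items()
        if o_name != name
    )
]
# rf_full is dominated by rf_pca (better on both axes); xgboost is dominated
# by logreg (equal F1, lower CO2).
print(sorted(pareto))  # ['logreg', 'rf_pca']
```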

By incorporating this dataset within experimental workflows, with consistent CodeCarbon-based tracking and EEI-based assessment, research can advance toward a reproducible, carbon-accountable cybersecurity paradigm (Aashish et al., 31 Dec 2025).

7. Access and Reproducibility

The dataset is described as “publicly available,” with distribution typically colocated with publication supplementary materials or the authors’ GitHub. The data is in standard CSV format and compatible with mainstream analytical workflows via tools such as pandas.read_csv. The controlled Google Colab environment and CodeCarbon instrumentation support reproducibility, enabling direct performance-to-carbon trade-off quantification under shared experimental assumptions (Aashish et al., 31 Dec 2025).
