The Carbon Aware Cybersecurity Traffic Dataset is a publicly available collection of network flow records enriched with real-time energy and carbon metrics for eco-aware anomaly-detection research.
It contains 2,300 flow-level records with stratified train/test splits and engineered features, balanced via SMOTE to support robust machine learning experiments.
The dataset integrates CodeCarbon instrumentation to measure energy use, CO₂ emissions, and operational costs, facilitating eco-efficiency benchmarking of cybersecurity models.
The Carbon Aware Cybersecurity Traffic Dataset is a publicly available collection of flow-level network observations specifically designed for empirical research at the intersection of machine learning-based anomaly detection and sustainability, with explicit annotation of real-time energy and carbon metrics. Developed to support eco-aware intrusion detection, the dataset enables benchmarking of cybersecurity algorithms under both performance and environmental cost constraints, reflecting emergent green computing and federal energy-efficiency initiatives in the US (Aashish et al., 31 Dec 2025).
1. Dataset Composition and Feature Taxonomy
The dataset comprises 2,300 flow-level records, each corresponding to a single network flow observation. Flows are labeled into two classes: “Normal” (status = 0) and “Anomalous” (status = 1). Following a stratified 80/20 train/test split, class imbalance in the training portion is remedied via the Synthetic Minority Over-sampling Technique (SMOTE), resulting in balanced classes (50%/50%) for algorithm training.
Each observation contains a structured multi-domain feature set:
power_consumption_watts: watts (W)
carbon_emission_gCO2eq: grams CO₂-equivalent (gCO₂eq)
energy_cost_usd: US dollars (USD)
pue: Power Usage Effectiveness (dimensionless)
All data is provided in CSV format (one row per flow; columns as above), facilitating direct loading into research workflows (e.g., via pandas).
Dataset preprocessing involves integrity checks (no missing data), label encoding for categorical variables, engineered features (bytes_per_packet, payload_entropy_x_size, resource_util_sum, power_per_vm), standard scaling, stratified splitting, SMOTE oversampling (training set), and, optionally, Principal Component Analysis (PCA) retaining at least 90% explained variance.
2. Carbon, Energy, and Cost Attribution
Energy and carbon accountability within the dataset is achieved via real-time logging in a controlled Google Colab environment, with the entire experimental pipeline instrumented using the CodeCarbon toolkit (Fischer & Lamarr Institute, 2025). CodeCarbon captures:
CPU and GPU utilization, memory load (real-time sampling)
Translation of hardware utilization into energy consumption (kWh) via device-specific power coefficients
Multiplication by a standardized US grid-average carbon intensity ($I_{\mathrm{CO2}} \approx 417$ g CO₂eq/kWh)
Estimation of monetary energy cost via the prevailing US electricity rate ($C_{\$/\text{kWh}} \approx \$0.13$/kWh)

Each model training/inference run results in a structured .csv output with training time, inference time, energy (kWh), CO₂ emissions (g), and energy cost (USD), enabling transparent evaluation of both operational and environmental efficiency.

3. Eco-Efficiency Index

The Eco-Efficiency Index (EEI) is introduced as the primary metric for quantifying the trade-off between anomaly-detection efficacy and energy consumption. It is defined as:

$$\mathrm{EEI} = \frac{\text{F1-score}}{\text{Energy Consumption (kWh)} + \varepsilon}$$

where the F1-score is the harmonic mean of precision and recall, energy consumption is as measured during training or inference, and $\varepsilon$ is a small constant ($10^{-8}$) to avoid division by zero. A higher EEI denotes greater anomaly-detection effectiveness per unit of energy expended, allowing rigorous comparison of disparate detection architectures irrespective of absolute resource scale.

4. Model Benchmarking and Principal Findings

The dataset has been used to benchmark multiple canonical detection algorithms: Logistic Regression, Random Forest, Support Vector Machine, Isolation Forest, and XGBoost. All models are evaluated across both conventional detection metrics and sustainability dimensions.

Notable empirical insights include:

The training phase dominates energy and carbon output compared to inference.
Optimized Random Forest and lightweight Logistic Regression models achieve the highest eco-efficiency, reducing energy consumption by over 40%.
PCA-driven dimensionality reduction (20 features reduced to 8, ≥90% explained variance) benefits Random Forest: accuracy improves from 0.739 to 0.769, F1 rises to 0.753, CO₂ emissions drop by an order of magnitude (approximately 0.004 g vs. 0.055 g), with negligible recall loss (±1–2%).
Representative energy and CO₂ records (per training run): Logistic Regression ($\approx 2.7\times10^{-11}$ kWh, $1\times10^{-4}$ g CO₂eq), Isolation Forest ($1.3\times10^{-9}$ kWh, $4.9\times10^{-3}$ g), XGBoost ($2.5\times10^{-3}$ g CO₂eq, F1 ≈ 0.74), Random Forest (highest CO₂, ≈ 0.055 g, F1 ≈ 0.739).

A plausible implication is that model selection and pipeline optimization (feature selection, PCA) can yield substantial energy and carbon reductions while preserving detection coverage (Aashish et al., 31 Dec 2025).

5. Carbon and Energy Computational Methodology

The dataset’s sustainability metrics are grounded in the following protocols:

Real-time power estimation: CodeCarbon continuously samples CPU/GPU/memory utilization, mapping each to corresponding power coefficients ($P_{\mathrm{cpu}}$, $P_{\mathrm{gpu}}$, etc.). Total energy is integrated as $E_{\mathrm{train}} \approx P_{\mathrm{avg}} \times T_{\mathrm{train}}$ [kWh].

CO₂ emissions attribution: computed via

$$\mathrm{CO2}_{\mathrm{g}} = E_{\text{kWh}} \times I_{\mathrm{CO2}}$$

with $I_{\mathrm{CO2}}$ set from the US national grid average.

Monetary energy cost: modeled as

$$\text{energy\_cost\_usd} = E_{\text{kWh}} \times C_{\$/\text{kWh}}$$

with $C_{\$/\text{kWh}}$ reflecting the contemporary US rate.
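These attribution formulas and the EEI reduce to a few lines of arithmetic. The sketch below hardcodes the paper's stated US averages (417 g CO₂eq/kWh, $0.13/kWh); the final example input is derived from the Random Forest figures purely for illustration:

```python
# Carbon, cost, and eco-efficiency arithmetic as defined above.
GRID_INTENSITY_G_PER_KWH = 417.0   # US grid-average carbon intensity
RATE_USD_PER_KWH = 0.13            # prevailing US electricity rate
EPS = 1e-8                         # epsilon guarding against division by zero

def co2_grams(energy_kwh: float) -> float:
    """CO2_g = E_kWh * I_CO2"""
    return energy_kwh * GRID_INTENSITY_G_PER_KWH

def energy_cost_usd(energy_kwh: float) -> float:
    """energy_cost_usd = E_kWh * C_$/kWh"""
    return energy_kwh * RATE_USD_PER_KWH

def eco_efficiency_index(f1: float, energy_kwh: float) -> float:
    """EEI = F1 / (E_kWh + eps)"""
    return f1 / (energy_kwh + EPS)

# Illustration: a run emitting ~0.055 g at F1 ~ 0.739 implies
# ~0.055 / 417 kWh of measured energy.
e = 0.055 / GRID_INTENSITY_G_PER_KWH
print(co2_grams(e), energy_cost_usd(e), eco_efficiency_index(0.739, e))
```

Because energy totals per run are tiny, the epsilon term matters mostly for degenerate near-zero measurements; for realistic runs it is negligible relative to `energy_kwh`.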
All environmental metrics are reported alongside conventional observational features, supporting multi-objective evaluation.
6. Applications and Methodological Recommendations
The Carbon Aware Cybersecurity Traffic Dataset supports several research directions:
Benchmarking novel anomaly-detection frameworks (including deep learning or graph-based IDSs) in terms of both predictive and environmental efficiency.
Experimentation with adaptive detection systems that modulate algorithm selection based on real-time grid carbon intensity.
Extension to resource-constrained edge/IoT domains (e.g., Raspberry Pi), especially where CPU/GPU mix and Power Usage Effectiveness (PUE) fluctuate.
Implementation of multi-objective model optimization using the Eco-Efficiency Index and/or Pareto fronts to jointly maximize detection performance and minimize carbon footprint.
Systematic reporting of both traditional detection metrics and sustainability metrics (energy_kWh, CO₂_g) is recommended; life-cycle assessment (LCA) can be considered for incorporating hardware manufacturing and disposal overheads beyond runtime emissions.
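As one way to realize the multi-objective selection recommended above, a Pareto front over (F1, CO₂) pairs can be computed directly; the candidate scores below are invented for illustration:

```python
# Pareto front over (maximize F1, minimize CO2 emissions).
# A model is dominated if another candidate is at least as good on both
# objectives and strictly better on at least one.
def pareto_front(models):
    front = []
    for name, f1, co2 in models:
        dominated = any(
            (f2 >= f1 and c2 <= co2) and (f2 > f1 or c2 < co2)
            for _, f2, c2 in models)
        if not dominated:
            front.append(name)
    return front

# Hypothetical (name, F1, CO2_g) summaries, one per trained model
candidates = [
    ("logreg", 0.70, 0.0001),
    ("rf",     0.739, 0.055),
    ("rf_pca", 0.753, 0.004),
    ("svm",    0.71, 0.010),
]
print(pareto_front(candidates))  # non-dominated trade-off candidates
```

Here `rf` and `svm` are dominated by `rf_pca` (higher F1, lower CO₂), so only the genuine trade-off points survive on the front.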
By incorporating this dataset within experimental workflows, with consistent CodeCarbon-based tracking and EEI-based assessment, research can advance toward a reproducible, carbon-accountable cybersecurity paradigm (Aashish et al., 31 Dec 2025).
7. Access and Reproducibility
The dataset is described as “publicly available,” with distribution typically colocated with publication supplementary materials or the authors’ GitHub. The data is in standard CSV format, compatible with mainstream analytical workflows via tools such as pandas.read_csv. The controlled Google Colab environment and CodeCarbon instrumentation support reproducibility, enabling direct quantification of performance-to-carbon trade-offs under shared experimental assumptions (Aashish et al., 31 Dec 2025).
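Assuming the documented column layout, loading the per-flow CSV is a one-liner with pandas; the two data rows below are fabricated purely to make the snippet self-contained:

```python
# Load a flow-level CSV with the documented energy/carbon columns.
import io
import pandas as pd

# Stand-in for the published file; values are invented for illustration.
csv_text = """power_consumption_watts,carbon_emission_gCO2eq,energy_cost_usd,pue,status
12.5,0.0021,0.00004,1.4,0
33.0,0.0110,0.00021,1.4,1
"""
df = pd.read_csv(io.StringIO(csv_text))  # replace StringIO with the real path
print(df.shape, list(df.columns))
assert set(df["status"]) <= {0, 1}       # binary labels as documented
```

For the published file, `io.StringIO(csv_text)` would simply be replaced by the CSV's path.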