High-Res Honeynet Dataset (MURHCAD)

Updated 16 January 2026
  • The dataset is a comprehensive, high-resolution repository capturing over 132,000 cyberattack events with per-second timestamps across multiple global cloud regions.
  • It provides enriched metadata, including geolocation, protocol, ASN details, and derived temporal features, facilitating advanced statistical and ML analyses.
  • The dataset supports practical research workflows in anomaly detection, protocol misuse studies, and defensive policy simulations with reproducible, scalable tools.

A high-resolution honeynet dataset is a comprehensive, time-granular, and context-rich collection of cyberattack event records captured by instrumented decoy systems (“honeypots”) designed to mimic vulnerable services and attract real-world adversaries. The Multi-Regional Cloud Honeynet Dataset (MURHCAD) exemplifies this approach, offering a global, multi-platform resource with detailed metadata, enabling reproducible and scalable analysis of cyber threat behavior across temporal, spatial, and protocol dimensions (Feito-Casares et al., 9 Jan 2026).

1. Dataset Structure and Deployment

MURHCAD was assembled over a continuous 72-hour window (June 9–11, 2025), capturing 132,425 discrete attack events from three honeypot platforms—Cowrie (Telnet/SSH), Dionaea (SMB and related services), and SentryPeer (SIP flood detection)—across four geographically distributed Microsoft Azure regions (Central India, Central US, Spain Central, South Africa North). Each of the four virtual machines hosted all three honeypot types, yielding 12 sensor instances. Spatial diversity was established by explicit VM placement on disparate cloud regions and event-level annotation with destination latitude/longitude and VM identifiers.

2. Temporal Resolution and Event Annotation

Event timestamps are recorded at 1-second granularity over the interval $T = 72$ h (2025-06-09 00:00:00 UTC to 2025-06-11 23:59:59 UTC), supporting precise computation of inter-arrival times ($\Delta t_i = t_{i+1} - t_i$), hourly rates $H(t)$, and diurnal trends. The overall event rate is $\lambda \approx 1843$ events/hour. Derived features include the hour-of-day, day, and weekday, facilitating analyses of temporal periodicity and attack rush-hour detection (peaks at 07:00 and 23:00 UTC daily). Maintenance-induced measurement gaps, observed as $H(t) = 0$, are present and correspond to scheduled VM restarts.
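The quantities defined above can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than the dataset's own tooling: the function names and the toy events are assumptions, and timestamps are assumed to be already parsed into timezone-aware `datetime` objects.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

def inter_arrival_seconds(timestamps):
    """Return delta_t_i = t_{i+1} - t_i in seconds for time-sorted events."""
    ordered = sorted(timestamps)
    return [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]

def hourly_counts(timestamps, start, hours):
    """Return H(t) for `hours` consecutive one-hour bins beginning at `start`.

    Empty bins are kept as explicit zeros so measurement gaps stay visible.
    Events outside the window are ignored.
    """
    bins = Counter(int((ts - start).total_seconds() // 3600) for ts in timestamps)
    return [bins.get(h, 0) for h in range(hours)]

# Toy events for illustration only (not taken from the dataset).
start = datetime(2025, 6, 9, 0, 0, 0, tzinfo=timezone.utc)
events = [start + timedelta(seconds=s) for s in (0, 1, 1, 5, 7200, 7205)]

deltas = inter_arrival_seconds(events)      # [1.0, 0.0, 4.0, 7195.0, 5.0]
rates = hourly_counts(events, start, 3)     # [4, 0, 2]
```

In the toy output the middle bin has $H(t) = 0$, which is how a maintenance gap would appear in the hourly series.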

3. Metadata Enrichment and Data Schema

Each event record consists of both canonical and derived fields:

  • Core fields: UTC timestamp, attackType (honeypot platform), protocol (standardized label), srcIp/dstIp, srcPort/dstPort, srcASN/srcOrg, srcCountryName/dstCountryName, srcLat/srcLon, dstLat/dstLon, dstHostname, and dstIpInternal.
  • Derived fields: temporal bucket (hour, day, weekday), anomaly flag ($flag_i = 1$ iff $|H(t) - \mu_H| > 2\sigma_H$ or $srcIp$ in top 1%), standardized protocol mapping, and entropy-based metrics (e.g., $H_{src} = -\sum_k p_k \log_2 p_k$ for source IP distribution).
  • Format and access: The raw data is distributed as JSON batches; the preprocessed, analysis-ready version is provided as CSV (HoneyNetEvents_Clean.csv) and, optionally, as Parquet for scalable analytics. Schemas adhere to Avro/Parquet conventions.
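The two derived quantities in the list above, the anomaly flag and the source-IP entropy, can be reproduced roughly as follows. This is a sketch under stated assumptions, not the authors' preprocessing code: "top 1%" is read as the top 1% of source IPs ranked by event count (at least one IP), $\mu_H$ and $\sigma_H$ are computed over observed hour buckets only using the population standard deviation, and all function names are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, pstdev

def source_entropy(src_ips):
    """Shannon entropy in bits of the empirical source-IP distribution."""
    counts = Counter(src_ips)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def anomaly_flags(event_hours, src_ips, top_fraction=0.01):
    """Return flag_i per event: 1 if |H(t) - mu_H| > 2 * sigma_H for the
    event's hour bucket, or if its source IP is in the top fraction by volume."""
    hourly = Counter(event_hours)                      # H(t) per observed bucket
    mu, sigma = mean(hourly.values()), pstdev(hourly.values())
    per_ip = Counter(src_ips)
    k = max(1, math.ceil(top_fraction * len(per_ip)))  # assumption: at least one IP
    top_ips = {ip for ip, _ in per_ip.most_common(k)}
    return [
        int(abs(hourly[h] - mu) > 2 * sigma or ip in top_ips)
        for h, ip in zip(event_hours, src_ips)
    ]

# Toy data for illustration only.
toy_hours = [0, 0, 1, 1, 2, 2]
toy_ips = ["a", "a", "a", "b", "c", "d"]
flags = anomaly_flags(toy_hours, toy_ips)              # [1, 1, 1, 0, 0, 0]
```

Here every hour bucket holds the same number of events, so only the dominant source `"a"` triggers the flag.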

Data ingestion and loading are illustrated with Python (pandas) and R (readr/tidyverse) code snippets, supporting out-of-the-box integration into standard data science workflows.
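As a rough, dependency-free counterpart to such loading snippets, the sketch below parses CSV rows and derives the hour, day, and weekday buckets. The column name `timestamp`, the ISO 8601 timestamp format, and the helper name `load_events` are assumptions made for illustration; the actual header of HoneyNetEvents_Clean.csv should be checked before use.

```python
import csv
import io
from datetime import datetime

def load_events(handle, time_field="timestamp"):
    """Parse CSV rows into dicts and add hour/day/weekday buckets."""
    events = []
    for row in csv.DictReader(handle):
        ts = datetime.fromisoformat(row[time_field])  # assumes ISO 8601 strings
        row["hour"] = ts.hour
        row["day"] = ts.date().isoformat()
        row["weekday"] = ts.weekday()                 # Monday == 0
        row["dstPort"] = int(row["dstPort"])
        events.append(row)
    return events

# In-memory sample standing in for the real file (values are made up).
sample = io.StringIO(
    "timestamp,protocol,srcIp,dstPort\n"
    "2025-06-09T07:15:02+00:00,SIP,203.0.113.7,5060\n"
    "2025-06-10T23:59:59+00:00,Telnet,198.51.100.4,23\n"
)
events = load_events(sample)
```

For the real file, the same function could be called as `load_events(open("HoneyNetEvents_Clean.csv", newline=""))`, again assuming the column names match.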

4. Statistical Characteristics and Attack Patterns

MURHCAD contains:

  • $N = 132\,425$ total events, $2\,438$ unique source IPs, and 13 recognized protocol labels.
  • Protocol prevalence is highly skewed: SIP (41.6%, $55,060$ events, primarily via SentryPeer), Telnet (21.9%, $\sim 29,000$ events, Cowrie), SMB (27.2%, $\sim 36,000$ events, Dionaea). Minor protocols (HTTPD, MySQLD) together constitute <10%.
  • The top 1% of source IPs (~24 addresses) account for 15% of event volume.
  • Source IPs originate from 95 countries, with spatial “hotspots” in the United States, Western Europe, and Southeast Asia.
  • Ports: srcPort mean = 48,604 (σ = 15,328), dstPort mean = 2,693 (σ = 3,933), with the 75th percentile of dstPort at 5060 (the standard SIP port).
  • Temporal metrics: mean event hour $\mu_h = 11.41$ UTC, $\sigma_h = 7.19$ h, IQR = [6, 18].
  • Skewness, entropy, and per-hour attack rates ($r(t) = H(t)/1\,\text{h}$) are explicitly defined and provided for advanced feature engineering.
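Summary statistics of the kind listed above (mean event hour, standard deviation, IQR, skewness) can be computed with Python's `statistics` module. This is an illustrative sketch only: the population standard deviation, the "inclusive" quantile method, and the moment-based skewness formula are assumptions, since the exact estimators used for the dataset are not stated here.

```python
from statistics import mean, pstdev, quantiles

def hour_summary(hours):
    """Mean, standard deviation, IQR bounds, and skewness of hour-of-day values."""
    mu = mean(hours)
    sigma = pstdev(hours)                      # population std (an assumption)
    q1, _, q3 = quantiles(hours, n=4, method="inclusive")
    # Moment-based skewness: E[(h - mu)^3] / sigma^3.
    skew = (
        sum((h - mu) ** 3 for h in hours) / (len(hours) * sigma ** 3)
        if sigma else 0.0
    )
    return {"mean": mu, "std": sigma, "iqr": (q1, q3), "skewness": skew}

# Toy hour-of-day values for illustration only.
toy_hours = [0, 6, 6, 12, 12, 18, 18, 23]
summary = hour_summary(toy_hours)
# summary["mean"] is 11.875 and summary["iqr"] is (6.0, 18.0)
```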

Notable biases are observed: SentryPeer collects SIP floods in North America and Southeast Asia, Cowrie attracts Telnet/SSH scans from Western Europe and the US, while Dionaea records SMB exploits tightly focused on European nodes.

5. Research Applications and Workflows

MURHCAD is engineered as a resource for:

  • Anomaly detection: With metadata-rich and temporally resolved data, IsolationForest or similar algorithms can be applied to (hour, dstPort, protocol) feature matrices.
  • Protocol misuse studies: Time-series clustering by protocol enables the examination of misuse and attack campaigns across regions and honeypot types.
  • Threat intelligence: ASN and organizational enrichment facilitate high-volume ASN identification, country/region mapping, and behavioral profiling.
  • Defensive policy simulation: Empirical experiments with firewall rules (e.g., blocking the top 1% of source IPs) allow direct measurement of the resulting change in the attack rate $\lambda$.
  • Visualization: Built-in code snippets enable plotting of diurnal attack patterns and interactive geospatial mapping of source locations.
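The defensive policy simulation mentioned in the list above can be approximated by a static replay: drop all events from the highest-volume source IPs and compare the event rate before and after. The sketch below makes that explicit. The function name is hypothetical, the toy data is made up, and the replay assumes attackers do not adapt to the block, which a real deployment would not guarantee.

```python
import math
from collections import Counter

def simulate_blocklist(src_ips, window_hours, top_fraction=0.01):
    """Return (rate_before, rate_after, blocked) for a top-fraction IP blocklist.

    Rates are events per hour over a window of `window_hours` hours.
    """
    per_ip = Counter(src_ips)
    k = max(1, math.ceil(top_fraction * len(per_ip)))  # assumption: at least one IP
    blocked = {ip for ip, _ in per_ip.most_common(k)}
    remaining = sum(1 for ip in src_ips if ip not in blocked)
    return len(src_ips) / window_hours, remaining / window_hours, blocked

# Toy event stream: one heavy hitter and three light sources.
toy_ips = ["a"] * 6 + ["b"] * 2 + ["c", "d"]
before, after, blocked = simulate_blocklist(toy_ips, window_hours=2)
# before == 5.0, after == 2.0, blocked == {"a"}
```

Applied to the real data, the relative drop `(before - after) / before` would correspond to the share of event volume attributable to the blocked addresses.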

For data loading, feature extraction, and exploratory analysis, researchers are provided with Jupyter notebooks and infrastructure-as-code templates to ensure reproducibility.

6. Comparative Context and Best Practices

A comparison to representative honeynet datasets is summarized:

| Dataset | Temporal Resolution | Metadata Richness | Geographic Scope |
| --- | --- | --- | --- |
| Hornet 40/65 Niner | ≤5 min | Flow-record only, limited fields | Multi-region, cloud |
| CTU Honeynet | PCAP, detailed | Single-region, short duration | Single site |
| MURHCAD | 1 second | ASN, geolocation, protocol, host | 4 Azure regions |

MURHCAD distinguishes itself by combining high temporal granularity, rich event annotation, multi-region deployment, and focused protocol diversity. Researchers are advised to:

  • Treat scheduled maintenance windows as missing data or model $\lambda(t) = 0$ for those intervals.
  • Counteract honeypot bias by aggregating across all sensors/platforms for balanced protocol/region representation.
  • Normalize heavy-tailed distributions (such as source IP frequencies or event sizes) via log-transform for robust machine learning.
  • Extend MURHCAD through longer deployments, inclusion of additional honeypot types (e.g., HTTP, DNS), or integration with “real” network flows for hybrid supervised tasks.
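Two of the recommendations above, handling maintenance gaps and log-transforming heavy-tailed counts, are simple enough to sketch directly. The helpers below are illustrative assumptions rather than part of the dataset's tooling. In particular, treating every zero-count hour as a maintenance gap is a simplification, since a genuinely quiet hour would also be dropped.

```python
import math

def mean_rate(hourly_counts, gaps_as_missing=True):
    """Mean events/hour, either skipping zero-count hours or keeping them as 0."""
    if gaps_as_missing:
        values = [h for h in hourly_counts if h > 0]
    else:
        values = list(hourly_counts)
    return sum(values) / len(values) if values else 0.0

def log_scale(counts):
    """Compress heavy-tailed counts with log(1 + x); zero stays zero."""
    return [math.log1p(c) for c in counts]

# Toy hourly series with one gap hour (illustration only).
toy_hourly = [4, 0, 2]
rate_missing = mean_rate(toy_hourly)                         # 3.0
rate_zero = mean_rate(toy_hourly, gaps_as_missing=False)     # 2.0
scaled = log_scale([0, 9, 999])
```

The gap between the two rates shows how much the choice of gap handling can shift an estimate of $\lambda$, so it should be stated alongside any reported rate.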

7. Broader Significance and Future Developments

The high-resolution, multi-regional honeynet dataset model, as instantiated by MURHCAD, provides a critical benchmark for anomaly detection, threat intelligence, and policy development in modern cloud and distributed network environments. Its combination of per-second timestamps, annotated geolocation, and synchronized multiplatform logging enables both granular and strategic analyses of cyberattack trends globally. Such datasets are expected to underpin reproducibility and comparability in cyberthreat research moving forward, especially when integrated with open-source preprocessing code and standardized schemas (Feito-Casares et al., 9 Jan 2026).

This suggests that future datasets in the domain should aim for increased duration, greater protocol/service representation, and systematic enrichment to continue supporting advanced machine learning, time-series modeling, and robust empirical defense evaluation.
