High-Res Honeynet Dataset (MURHCAD)
- The dataset is a comprehensive, high-resolution repository capturing over 132,000 cyberattack events with per-second timestamps across multiple global cloud regions.
- It provides enriched metadata, including geolocation, protocol, ASN details, and derived temporal features, facilitating advanced statistical and ML analyses.
- The dataset supports practical research workflows in anomaly detection, protocol misuse studies, and defensive policy simulations with reproducible, scalable tools.
A high-resolution honeynet dataset is a comprehensive, time-granular, and context-rich collection of cyberattack event records captured by instrumented decoy systems (“honeypots”) designed to mimic vulnerable services and attract real-world adversaries. The Multi-Regional Cloud Honeynet Dataset (MURHCAD) exemplifies this approach, offering a global, multi-platform resource with detailed metadata, enabling reproducible and scalable analysis of cyber threat behavior across temporal, spatial, and protocol dimensions (Feito-Casares et al., 9 Jan 2026).
1. Dataset Structure and Deployment
MURHCAD was assembled over a continuous 72-hour window (June 9–11, 2025), capturing 132,425 discrete attack events from three honeypot platforms—Cowrie (Telnet/SSH), Dionaea (SMB and related services), and SentryPeer (SIP flood detection)—across four geographically distributed Microsoft Azure regions (Central India, Central US, Spain Central, South Africa North). Each of the four virtual machines hosted all three honeypot types, yielding 12 sensor instances. Spatial diversity was established by explicit VM placement on disparate cloud regions and event-level annotation with destination latitude/longitude and VM identifiers.
2. Temporal Resolution and Event Annotation
Event timestamps are recorded at 1-second granularity over the $T = 72$ h collection interval (00:00:00 UTC on June 9 to 23:59:59 UTC on June 11, 2025), supporting precise computation of inter-arrival times ($\Delta t_i = t_{i+1} - t_i$), hourly rates $\lambda_h$ (events recorded in hour $h$), and diurnal trends. The overall event rate is $132{,}425 / 72 \approx 1{,}839$ events/hour. Derived features include the hour of day, day, and weekday, facilitating analyses of temporal periodicity and attack rush-hour detection (peaks at 07:00 and 23:00 UTC daily). Maintenance-induced measurement gaps, which appear as intervals with no recorded events, are present and correspond to scheduled VM restarts.
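The temporal quantities described above can be sketched with the Python standard library alone. The function names and the synthetic timestamps below are illustrative assumptions, not part of the MURHCAD tooling; the sketch computes inter-arrival times, per-hour event counts, and the clock hours with no events (candidate maintenance gaps).

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def inter_arrival_seconds(timestamps):
    """Return the gaps, in seconds, between consecutive events after sorting."""
    ordered = sorted(timestamps)
    return [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]


def hourly_counts(timestamps):
    """Count events per clock hour by truncating each timestamp to the hour."""
    return Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )


def empty_hours(counts, start, hours):
    """List the clock hours in [start, start + hours) with no recorded events."""
    slots = (start + timedelta(hours=h) for h in range(hours))
    return [slot for slot in slots if counts.get(slot, 0) == 0]


# Synthetic timestamps for illustration only (not MURHCAD data).
start = datetime(2025, 6, 9, 0, 0, 0, tzinfo=timezone.utc)
offsets = (5, 7, 7, 3600 + 10, 3 * 3600 + 1)
events = [start + timedelta(seconds=s) for s in offsets]

gaps = inter_arrival_seconds(events)
counts = hourly_counts(events)
missing = empty_hours(counts, start, hours=4)
```

Whether an empty hour reflects a maintenance restart or a genuine lull cannot be decided from the counts alone, so such hours are best cross-checked against the deployment's maintenance schedule.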
3. Metadata Enrichment and Data Schema
Each event record consists of both canonical and derived fields:
- Core fields: UTC timestamp, attackType (honeypot platform), protocol (standardized label), srcIp/dstIp, srcPort/dstPort, srcASN/srcOrg, srcCountryName/dstCountryName, srcLat/srcLon, dstLat/dstLon, dstHostname, and dstIpInternal.
- Derived fields: temporal bucket (hour, day, weekday), a binary anomaly flag (set iff a threshold condition holds or the event falls in the top 1%), standardized protocol mapping, and entropy-based metrics (e.g., for the source IP distribution).
- Format and access: The raw data is distributed as JSON batches; the preprocessed, analysis-ready version is provided as CSV (HoneyNetEvents_Clean.csv) and, optionally, as Parquet for scalable analytics. Schemas adhere to Avro/Parquet conventions.
Data ingestion and loading are illustrated with Python (pandas) and R (readr/tidyverse) code snippets, supporting out-of-the-box integration into standard data science workflows.
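As a dependency-free illustration of the loading step, the sketch below parses CSV rows and derives the hour, day, and weekday buckets. It is not the dataset's own snippet: the `timestamp` column name, the timestamp format, and the sample rows are assumptions made for the example, while the remaining column names follow the core fields listed above.

```python
import csv
import io
from datetime import datetime, timezone

# Assumed timestamp layout; adjust to the format actually used in the CSV.
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def load_events(handle, ts_field="timestamp"):
    """Read event rows from a CSV handle and attach derived temporal fields."""
    events = []
    for row in csv.DictReader(handle):
        ts = datetime.strptime(row[ts_field], TS_FORMAT).replace(tzinfo=timezone.utc)
        row["ts"] = ts
        row["hour"] = ts.hour                 # hour of day, 0-23 (UTC)
        row["day"] = ts.date().isoformat()    # calendar day, YYYY-MM-DD
        row["weekday"] = ts.weekday()         # 0 = Monday ... 6 = Sunday
        row["dstPort"] = int(row["dstPort"])
        events.append(row)
    return events


# Two made-up rows standing in for HoneyNetEvents_Clean.csv.
SAMPLE = """timestamp,attackType,protocol,srcIp,dstPort
2025-06-09 07:15:02,SentryPeer,SIP,198.51.100.7,5060
2025-06-10 23:01:40,Cowrie,Telnet,203.0.113.9,23
"""

events = load_events(io.StringIO(SAMPLE))
```

For the real file, the same function can be called on `open("HoneyNetEvents_Clean.csv", newline="")`. With pandas, an equivalent load would typically use `pd.read_csv` with `parse_dates` on the timestamp column, whose name is again an assumption here.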
4. Statistical Characteristics and Attack Patterns
MURHCAD contains:
- $132{,}425$ total events, roughly 2,400 unique source IPs (inferred from the top 1% comprising ~24 addresses), and 13 recognized protocol labels.
- Protocol prevalence is highly skewed: SIP (41.6%, $55{,}060$ events, primarily via SentryPeer), SMB (27.2%, approximately 36,000 events, Dionaea), and Telnet (21.9%, approximately 29,000 events, Cowrie). Minor protocols (such as HTTPD and MySQLD) together constitute <10%.
- The top 1% of source IPs (~24 addresses) account for 15% of event volume.
- Source IPs originate from 95 countries, with spatial “hotspots” in the United States, Western Europe, and Southeast Asia.
- Ports: srcPort mean = 48,604 (σ = 15,328), dstPort mean = 2,693 (σ = 3,933), with the 75th percentile at 5060.
- Temporal metrics: the mean event hour (UTC) and its dispersion (in hours) are reported, with IQR = [6, 18].
- Skewness, entropy, and per-hour attack rates ($\lambda_h$, the number of events in hour $h$) are explicitly defined and provided for advanced feature engineering.
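Two of the concentration statistics listed above can be sketched as follows. The source does not state which entropy definition it uses, so Shannon entropy in bits is an assumption here, as are the function names and the synthetic counts.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Shannon entropy (in bits) of the empirical distribution in `counts`."""
    total = sum(counts.values())
    probs = (c / total for c in counts.values() if c > 0)
    return -sum(p * math.log2(p) for p in probs)


def top_share(counts, percent=1):
    """Fraction of all events contributed by the top `percent`% of sources."""
    k = max(1, len(counts) * percent // 100)  # at least one source
    top = sorted(counts.values(), reverse=True)[:k]
    return sum(top) / sum(counts.values())


# Synthetic example: 100 sources, one heavy hitter (not MURHCAD data).
per_source = Counter({f"198.51.100.{i}": 1 for i in range(1, 100)})
per_source["198.51.100.0"] = 101

entropy_bits = shannon_entropy(per_source)
share = top_share(per_source, percent=1)
```

A low entropy relative to $\log_2$ of the number of sources, together with a high top-1% share, indicates that a few addresses dominate the traffic.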
Notable biases are observed: SentryPeer collects SIP floods in North America and Southeast Asia, Cowrie attracts Telnet/SSH scans from Western Europe and the US, while Dionaea records SMB exploits tightly focused on European nodes.
5. Recommended Research Workflows and Applications
MURHCAD is engineered as a resource for:
- Anomaly detection: With metadata-rich and temporally resolved data, IsolationForest or similar algorithms can be applied to (hour, dstPort, protocol) feature matrices.
- Protocol misuse studies: Time-series clustering by protocol enables the examination of misuse and attack campaigns across regions and honeypot types.
- Threat intelligence: ASN and organizational enrichment facilitate high-volume ASN identification, country/region mapping, and behavioral profiling.
- Defensive policy simulation: Experiments with firewall rules (e.g., blocking the top 1% of source IPs) allow direct measurement of the resulting change in the hourly attack rate.
- Visualization: Built-in code snippets enable plotting of diurnal attack patterns and interactive geospatial mapping of source locations.
For data loading, feature extraction, and exploratory analysis, researchers are provided with Jupyter notebooks and infrastructure-as-code templates to ensure reproducibility.
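The defensive policy simulation mentioned above can be illustrated with a retrospective replay: drop every event from the busiest sources and compare the event rate before and after. This is a minimal sketch under stated assumptions, not the dataset's notebook code; the function name, the `percent` parameter, and the synthetic events are hypothetical.

```python
from collections import Counter


def simulate_block_top_sources(source_ips, window_hours, percent=1):
    """Replay events with the top `percent`% of source IPs blocked.

    `source_ips` holds one source IP string per event.
    """
    source_ips = list(source_ips)
    per_source = Counter(source_ips)
    k = max(1, len(per_source) * percent // 100)  # number of sources to block
    blocked = {ip for ip, _ in per_source.most_common(k)}
    remaining = [ip for ip in source_ips if ip not in blocked]
    return {
        "blocked": blocked,
        "rate_before": len(source_ips) / window_hours,
        "rate_after": len(remaining) / window_hours,
    }


# Synthetic events: one source sends 51 events, 99 others send one each.
events = ["198.51.100.0"] * 51 + [f"198.51.100.{i}" for i in range(1, 100)]
result = simulate_block_top_sources(events, window_hours=72, percent=1)
reduction = 1 - result["rate_after"] / result["rate_before"]
```

A replay of this kind assumes attackers do not adapt to the block, so the measured reduction is an optimistic estimate of what a deployed rule would achieve.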
6. Comparative Context and Best Practices
A comparison to representative honeynet datasets is summarized:
| Dataset | Temporal Resolution | Metadata Richness | Geographic Scope |
|---|---|---|---|
| Hornet 40/65 Niner | ≤5 min | Flow-record only, limited fields | Multi-region, cloud |
| CTU Honeynet | PCAP, detailed | Single-region, short duration | Single site |
| MURHCAD | 1 second | ASN, geolocation, protocol, host | 4 Azure regions |
MURHCAD distinguishes itself by combining high temporal granularity, rich event annotation, multi-region deployment, and focused protocol diversity. Researchers are advised to:
- Treat scheduled maintenance windows as missing data, or model those intervals explicitly.
- Counteract honeypot bias by aggregating across all sensors/platforms for balanced protocol/region representation.
- Normalize heavy-tailed distributions (such as source IP frequencies or event sizes) via log-transform for robust machine learning.
- Extend MURHCAD through longer deployments, inclusion of additional honeypot types (e.g., HTTP, DNS), or integration with “real” network flows for hybrid supervised tasks.
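The log-transform recommendation above can be sketched as follows, assuming the feature of interest is the per-source event count. The function name and the sample addresses are illustrative; `math.log1p` computes $\log(1 + n)$.

```python
import math
from collections import Counter


def log_scaled_frequencies(source_ips):
    """Map each source IP to log(1 + event count) to compress the heavy tail."""
    counts = Counter(source_ips)
    return {ip: math.log1p(n) for ip, n in counts.items()}


# Synthetic example: one source is 1000 times more active than the other.
sample = ["203.0.113.1"] * 1000 + ["203.0.113.2"]
features = log_scaled_frequencies(sample)
ratio = features["203.0.113.1"] / features["203.0.113.2"]
```

After the transform, the 1000-fold difference in raw counts shrinks to roughly a tenfold difference in feature value, which keeps a handful of heavy hitters from dominating distance-based or gradient-based models.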
7. Broader Significance and Future Developments
The high-resolution, multi-regional honeynet dataset model, as instantiated by MURHCAD, provides a critical benchmark for anomaly detection, threat intelligence, and policy development in modern cloud and distributed network environments. Its combination of per-second timestamps, annotated geolocation, and synchronized multiplatform logging enables both granular and strategic analyses of cyberattack trends globally. Such datasets are expected to underpin reproducibility and comparability in cyberthreat research moving forward, especially when integrated with open-source preprocessing code and standardized schemas (Feito-Casares et al., 9 Jan 2026).
This suggests that future datasets in the domain should aim for increased duration, greater protocol/service representation, and systematic enrichment to continue supporting advanced machine learning, time-series modeling, and robust empirical defense evaluation.