Real-IAD Dataset: Industrial Anomaly Detection

Updated 13 November 2025
  • Real-IAD dataset is a real-world acquisition and benchmarking resource for industrial anomaly detection, combining real and synthetic VM traces.
  • It collects synchronized time-series metrics at 1 Hz from cloud VMs and injects CPU contention faults to simulate realistic anomaly conditions.
  • Evaluation using the dataset shows an average F1-score of 83.7%, outperforming traditional methods by 11% in indirect anomaly detection.

The Real-IAD dataset is a real-world acquisition and benchmarking resource specifically developed for industrial anomaly detection (IAD). Designed to bridge the gap between synthetic laboratory benchmarks and complex, uncontrolled cloud-industrial environments, Real-IAD enables evaluation of indirect VMM anomaly detection algorithms based solely on VM-level resource monitoring. The dataset comprises time-series metrics from both actual cloud VMs and merged synthetic traces, supporting rigorous quantitative assessment of anomaly detectors in infrastructure-as-a-service settings (Jindal et al., 2021).

1. Dataset Acquisition and Composition

The Real-IAD dataset was collected on virtualized cloud infrastructure on Google Cloud Platform. Each experimental run deployed a host VM (n1-standard-4: 4 vCPUs, 15 GiB RAM, Ubuntu 18.04) running KVM (libvirt) as the Virtual Machine Monitor (VMM). Two guest VMs were nested on each host (VM₁: 2 vCPUs, 2 GiB RAM; VM₂: 1 vCPU, 1 GiB RAM), both hosting lightweight cloud-native web applications to induce realistic, steady CPU and network loads; these guests provided the "real" traces.

Resource consumption metrics were recorded at 1 Hz via the Prometheus node_exporter agent at both host and guest levels, with the synchronized time series ingested into a centralized Prometheus server. Anomaly events were injected on the VMM using stress-ng to create CPU-contention faults of random duration (1–3 min), separated by idle zones of at least 5 min. For each VMM, the dataset merges the two real VM resource traces with eight synthetic VM traces generated by a controlled Test Module, yielding a uniform group size (10 VMs/VMM) and a total of 54,000 records per VMM per run (10 VMs × 5,400 ticks).
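The grouping described above (two real traces plus eight synthetic ones per VMM) can be sketched in a few lines of NumPy; the trace generators below are illustrative placeholders standing in for the dataset's actual export pipeline:

```python
import numpy as np

TICKS_PER_VM = 5_400          # 90 min sampled at 1 Hz
REAL_VMS, SYNTH_VMS = 2, 8    # per-VMM group composition

def build_vmm_group(real_traces, synth_traces):
    """Stack 2 real and 8 synthetic CPU traces into one (10, 5400) group."""
    group = np.vstack(list(real_traces) + list(synth_traces))
    assert group.shape == (REAL_VMS + SYNTH_VMS, TICKS_PER_VM)
    return group

# placeholder traces standing in for real Prometheus exports
real = [np.random.uniform(20.0, 45.0, TICKS_PER_VM) for _ in range(REAL_VMS)]
synth = [np.random.normal(32.5, 5.1, TICKS_PER_VM) for _ in range(SYNTH_VMS)]
group = build_vmm_group(real, synth)
print(group.size)  # 54000 records per VMM per run
```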

2. Data Schema and Recorded Metrics

Metric acquisition proceeded at a fixed frequency ($\Delta t = 1$ s, $f_s = 1$ Hz) over 90-minute intervals ($n = 5{,}400$ ticks per VM). Four primary features were retained:

  • CPU utilization ($X_t^j$):

X_t^j = 100 \times \frac{\text{cpu\_busy\_ticks}_t^j}{\text{total\_ticks}_t^j}

  • Memory utilization ($M_t^j$):

M_t^j = 100 \times \frac{\text{used\_memory}_t^j}{\text{total\_memory}_t^j}

  • Disk I/O ($D_t^{\mathrm{read},j}$, $D_t^{\mathrm{write},j}$):

D_t^{\mathrm{read},j} = \frac{\Delta (\text{read\_bytes}_t^j)}{\Delta t},\quad D_t^{\mathrm{write},j} = \frac{\Delta (\text{write\_bytes}_t^j)}{\Delta t}

  • Network throughput ($N_t^{\mathrm{rx},j}$, $N_t^{\mathrm{tx},j}$):

N_t^{\mathrm{rx},j} = \frac{\Delta (\text{rx\_bytes}_t^j)}{\Delta t},\quad N_t^{\mathrm{tx},j} = \frac{\Delta (\text{tx\_bytes}_t^j)}{\Delta t}
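Each Δ-based metric above is simply a first difference of a cumulative counter divided by the sampling period. A minimal sketch, assuming 1 Hz sampling and illustrative counter values:

```python
import numpy as np

def cpu_util(busy_ticks, total_ticks):
    """CPU utilization X_t^j in percent, from per-tick counters."""
    return 100.0 * busy_ticks / total_ticks

def counter_rate(counter, dt=1.0):
    """Turn a cumulative byte counter sampled every dt seconds into bytes/s."""
    return np.diff(counter) / dt

# cumulative rx_bytes counter over five 1 Hz ticks (illustrative values)
rx_bytes = np.array([0.0, 1500.0, 4500.0, 4500.0, 9000.0])
print(counter_rate(rx_bytes))  # [1500. 3000.    0. 4500.]
print(cpu_util(13.0, 40.0))    # 32.5
```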

Each row in the CSV-formatted dataset includes VMM_ID, VM_ID, a timestamp (ISO 8601), each metric, and a ground-truth label (0 = normal, 1 = anomaly). Across all experimental runs, the released dataset contains 42 anomalous and 17 non-anomalous VMMs, each with 10 VMs, yielding a total record count of 3,186,000 ($(42+17) \times 54{,}000$). The complete dataset size is approximately 120 MB.
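The stated record count follows directly from the composition; the row below is a hypothetical example following the documented schema, not an actual dataset row:

```python
import csv
import io

# hypothetical header and row following the documented schema
header = ["VMM_ID", "VM_ID", "timestamp", "cpu_util", "mem_util",
          "disk_read", "disk_write", "net_rx", "net_tx", "label"]
row = ["vmm-03", "vm-1", "2021-05-01T12:00:00Z",
       "32.5", "41.2", "1.2e6", "3.4e5", "2.1e6", "1.8e6", "0"]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(header)
writer.writerow(row)

# total record count implied by the composition:
# 59 VMMs x 10 VMs x 5,400 ticks
total = (42 + 17) * 10 * 5_400
print(total)  # 3186000
```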

3. Anomaly Protocol and Labeling

Fault intervals on the VMM were randomly selected with durations $\tau_i \sim U[60, 180]$ s; ground-truth labels are interval-based:

y_t = \begin{cases} 1, & \text{if } t \text{ is within a host-injection interval} \\ 0, & \text{otherwise} \end{cases}
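The injection schedule above can be sketched as follows; the function name and the way the idle gap is randomized are illustrative assumptions, with durations drawn uniformly from [60, 180] s and idle zones of at least 300 s as stated:

```python
import numpy as np

rng = np.random.default_rng(0)
TICKS = 5_400  # 90 min at 1 Hz

def inject_faults(ticks, min_dur=60, max_dur=180, min_gap=300):
    """Interval-based labels: 1 inside a fault window, 0 otherwise."""
    y = np.zeros(ticks, dtype=int)
    t = 0
    while True:
        t += min_gap + int(rng.integers(0, min_gap))  # idle zone >= 5 min
        dur = int(rng.integers(min_dur, max_dur + 1))  # tau ~ U[60, 180] s
        if t + dur > ticks:
            break
        y[t:t + dur] = 1  # mark the fault interval
        t += dur
    return y

labels = inject_faults(TICKS)
print(int(labels.sum()), "anomalous ticks out of", labels.size)
```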

IAD labeling semantics: a VMM is flagged as anomalous at tick $t$ if $\geq 90\%$ of its VMs reported a change-point in their CPU time series within the previous 60 seconds, quantifying indirect anomaly impact propagation in virtualized infrastructure. Evaluation metrics are precision, recall, and F1-score, explicitly:

\text{precision} = \frac{TP}{TP+FP}

\text{recall} = \frac{TP}{TP+FN}

F_1 = 2\,\frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
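These definitions translate directly into code; a minimal self-contained sketch over per-tick labels:

```python
def prf1(y_true, y_pred):
    """Precision, recall, and F1 from binary per-tick labels."""
    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# toy ground-truth and predicted labels
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
p, r, f = prf1(y_true, y_pred)
print(p, r, f)  # 0.75 0.75 0.75
```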

4. Preprocessing, Synthetic Data, and Normalization

To harmonize real and synthetic data, all percentage metrics are clipped to $[0, 100]$. Synthetic VM traces are drawn from Gaussian distributions parameterized by real VM statistics in normal periods, then spliced with fault intervals following the same fault protocol. They are z-normalized within each experimental interval (90-min window) to ensure distributional consistency and support online anomaly detection algorithms. Global mean and variance statistics for anomaly detectors are computed using numerically stable one-pass algorithms (Knuth's/Welford's formulas).
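The one-pass mean/variance computation referenced above (Welford's algorithm) can be sketched as:

```python
class Welford:
    """Numerically stable one-pass mean/variance (Welford's algorithm)."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n           # running mean
        self.m2 += delta * (x - self.mean)    # running sum of squared deviations

    @property
    def variance(self):
        """Population variance of the values seen so far."""
        return self.m2 / self.n if self.n else 0.0

w = Welford()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    w.update(x)
print(round(w.mean, 6), round(w.variance, 6))  # 5.0 4.0
```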

5. Summary Statistics

Aggregated mean and variance across all VMs and all VMMs for CPU utilization are:

Period      μ_CPU (%)   σ_CPU (%)
Normal      32.5        5.1
Anomalous   11.2        3.4

The high-level dataset composition is summarized as:

Set                        #Anomalous VMMs   #Non-anomalous VMMs   #VMs/VMM   #Ticks/VM
Real-IAD (Exp-Synthetic)   42                17                    10         5,400

This quantifies the prevalence of anomalies and statistical separability of fault intervals based on resource metrics.

6. Benchmarking and Practical Utility

The dataset provides a quantitative foundation for benchmarking indirect VMM anomaly detection algorithms such as IAD, which operates strictly on guest VM resource monitoring data, circumventing direct VMM access constraints. Algorithms are evaluated with precision, recall, and F1 averaged across all experimental runs. The IAD method achieves an average F1-score of 83.7% and outperforms conventional machine learning methods by 11% (mean F1) (Jindal et al., 2021).

7. Data Release, Format, and Recommendations

The dataset is distributed in CSV (comma-separated values) files, compatible with standard statistical and machine learning toolchains. The schema supports direct ingestion for time-series analysis and anomaly labeling, and, given the modest file size (120 MB), is suitable for rapid prototyping. Code to reproduce the experimental environment and synthetic data generator is available on the referenced GitHub repository (https://github.com/[org]/Real-IAD-Dataset).

For extension or benchmarking, researchers should adhere to the existing anomaly injection and detection protocols, ensuring the use of all four recorded metrics and the evaluation window scheme. The dataset supports simulation of realistic anomaly propagation in cloud-virtualized industrial settings, facilitating advancements in indirect fault diagnosis methodologies.
