
MVTec AD: Anomaly Detection Benchmark

Updated 5 February 2026
  • MVTec AD dataset is a collection of high-resolution industrial images with pixel-precise annotations that capture diverse and subtle defect types.
  • It supports unsupervised anomaly detection by providing strictly partitioned training (normal) and test (mixed) sets to ensure robust model evaluation.
  • Extensions like MVTec AD 2 and MVTec 3D-AD introduce multi-modal data and lighting variations, further challenging current methods in real-world scenarios.

The MVTec AD dataset and its subsequent extensions, MVTec AD 2 and MVTec 3D-AD, constitute a family of rigorous benchmarks for unsupervised anomaly detection and localization in industrial visual inspection. Designed to expose the limitations of prevailing machine learning methods, these datasets provide high-resolution imagery (2D and 3D), pixel-precise ground truth annotations, and challenging scenarios featuring diverse, subtle, and real-world defect types. Their publicly available evaluation servers enforce strict blindness for test splits, supporting robust and comparable benchmarking across the research community (Heckler-Kram et al., 27 Mar 2025, Bergmann et al., 2021).

1. Original MVTec AD: Overview and Benchmarking

The original MVTec AD dataset comprises 5,354 images across 15 industrial categories, covering a broad spectrum of manufactured objects and textures, such as metal castings, bottles, and fabrics. All training and validation images represent anomaly-free instances, whereas test sets include both pristine and anomalous samples exhibiting defects such as scratches, dents, contaminations, and misprints.

Images are high-resolution (typically 1024×1024), captured under controlled bright-field illumination optimized for defect visibility. Anomalies span surface, structural, and contamination defects, with pixel-precise binary masks as ground truth. Label ambiguity, such as imprecise mask borders on tiny or low-contrast defects, has been noted as a factor that prevents models from achieving perfect segmentation scores (Zheng et al., 2022).

Benchmarking is performed using unsupervised anomaly detection methods—trained only on anomaly-free data—and evaluated via image-level and pixel-level metrics. With current approaches, segmentation AU-PRO (area under the per-region overlap) routinely reaches 92–97% at FPR≤0.3, indicating near-saturation on this dataset (Heckler-Kram et al., 27 Mar 2025).

2. From Saturation to Challenge: MVTec AD 2 Dataset

MVTec AD 2 was introduced to overcome the lack of discriminatory power in existing benchmarks, as small differences in segmentation AU-PRO (frequently <1 percentage point) hinder meaningful model comparison and research progress. It comprises 8,004 high-resolution images (2.6–5 MP) across eight carefully chosen industrial inspection scenarios, each designed to expose specific failure modes of current anomaly detection systems (Heckler-Kram et al., 27 Mar 2025):

  • Can: Detection of tiny print errors and scratches on reflective metal surfaces with variable rotation and low-contrast patterns.
  • Fabric: Printed patterns with high intra-class variance; defects include small holes, loose threads, and subtle color inconsistencies.
  • Fruit Jelly: Semi-transparent, variable-content objects subject to contamination and surface scratches.
  • Rice: Bulk arrangements with extremely low-contrast contaminants, including transparent plastics.
  • Sheet Metal: Dark-field illuminated, high-aspect-ratio parts with many small, randomly highlighted defects.
  • Vial: Fully transparent, refractive objects with defects ranging from foreign bodies to fill-level errors.
  • Wall Plugs: Overlapping, variably placed bulk parts with occlusions and border defects.
  • Walnuts: Highly variable natural shapes, cracks, contaminations, and bulk overlaps.

Splits are strictly organized: training and validation sets contain only normal images, while multi-modal test splits include both seen and unseen lighting conditions to assess robustness under distribution shift. Each test split is provided either with public ground truth (for a small set of example images) or as a "private" split whose ground truth is withheld and evaluated via the server.

3. Annotation Protocols, Data Organization, and Evaluation Metrics

Annotations consist of pixel-precise binary masks. The process entails on-site coarse delineation refined with object-specific heuristics per scenario. Subsequent manual review and automated consistency verification (to detect holes and spurious small regions) are standard. For each scenario, image data and ground-truth masks are organized by object type, split, and defect label.
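The automated consistency verification described above can be approximated with connected-component analysis. The following sketch flags masks containing interior holes or spuriously small regions; the size threshold and the fill-based hole test are illustrative assumptions, not the authors' actual heuristics.

```python
import numpy as np
from scipy import ndimage

def check_mask(mask, min_region_px=16):
    """Return a list of warnings for a binary ground-truth mask."""
    warnings = []
    labeled, n_regions = ndimage.label(mask > 0)
    for region_id in range(1, n_regions + 1):
        region = labeled == region_id
        size = int(region.sum())
        # Flag suspiciously small annotated regions.
        if size < min_region_px:
            warnings.append(f"region {region_id}: only {size} px")
        # A region has an interior hole if filling it adds pixels.
        if ndimage.binary_fill_holes(region).sum() > size:
            warnings.append(f"region {region_id}: contains holes")
    return warnings
```

A ring-shaped region would trigger the hole warning, while an isolated few-pixel speck would trigger the size warning.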

The principal evaluation metric is segmentation AU-PRO (“area under the per-region overlap”), defined as:

\mathrm{PRO}(t) = \frac{1}{k} \sum_{i=1}^{|D_{\text{test}}|} \sum_{j=1}^{k_i} \frac{|P_{\text{ano}}(t) \cap C_{i,j}|}{|C_{i,j}|}

where $C_{i,j}$ denotes the $j$-th ground-truth region in test image $i$, $P_{\text{ano}}(t)$ the predicted anomalous pixels at threshold $t$, and $k$ the total number of regions. The PRO curve is then integrated with respect to the false positive rate (FPR) up to 0.05:

\mathrm{AU\text{-}PRO} = \int_{\mathrm{FPR}=0}^{0.05} \mathrm{PRO}(\mathrm{FPR})\, d\mathrm{FPR}
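The metric above can be sketched in a few lines of numpy: sweep thresholds over the anomaly map, record the FPR and the mean per-region overlap at each, then integrate the clipped curve. Normalizing by the FPR limit (so a perfect detector scores 1.0) is an assumption here, consistent with AU-PRO being reported as a percentage.

```python
import numpy as np
from scipy import ndimage

def au_pro(anomaly_map, gt_mask, fpr_limit=0.05, n_thresholds=200):
    """Area under the per-region-overlap curve, integrated up to fpr_limit."""
    labeled, n_regions = ndimage.label(gt_mask > 0)
    regions = [labeled == r for r in range(1, n_regions + 1)]
    negatives = gt_mask == 0
    fprs, pros = [], []
    # Sweep thresholds from the highest anomaly score downwards.
    for t in np.linspace(anomaly_map.max(), anomaly_map.min(), n_thresholds):
        pred = anomaly_map >= t
        fprs.append((pred & negatives).sum() / negatives.sum())
        pros.append(np.mean([(pred & c).sum() / c.sum() for c in regions]))
        if fprs[-1] >= fpr_limit:
            break
    fprs, pros = np.array(fprs), np.array(pros)
    # Clip the curve at fpr_limit and integrate with the trapezoid rule,
    # normalizing so that PRO == 1 everywhere yields a score of 1.0.
    pro_at_limit = np.interp(fpr_limit, fprs, pros)
    keep = fprs < fpr_limit
    fprs = np.append(fprs[keep], fpr_limit)
    pros = np.append(pros[keep], pro_at_limit)
    area = np.sum((pros[1:] + pros[:-1]) / 2 * np.diff(fprs))
    return area / fpr_limit
```

An anomaly map equal to the ground-truth mask scores 1.0; an inverted map scores approximately 0.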

Secondary metrics include pixel-level F1 (using a threshold of $\mu_{\text{val}} + 3\sigma_{\text{val}}$ estimated from anomaly-free validation images), image-level F1 (an image is flagged as defective if any pixel surpasses the threshold), and runtime/memory profiling (batch size 1, RTX 2080 Ti, float32) to assess scalability.
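The thresholding rule for the secondary metrics can be sketched as follows: estimate the mean and standard deviation over all pixels of the anomaly-free validation maps, binarize at three standard deviations above the mean, and flag an image if any pixel exceeds the threshold. This is a minimal sketch of the stated rule, not the benchmark's reference implementation.

```python
import numpy as np

def fit_threshold(val_maps):
    """mu + 3*sigma over all pixels of anomaly-free validation anomaly maps."""
    scores = np.concatenate([m.ravel() for m in val_maps])
    return scores.mean() + 3 * scores.std()

def image_is_defective(anomaly_map, threshold):
    """Image-level decision: defective if any pixel exceeds the threshold."""
    return bool((anomaly_map > threshold).any())
```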

4. State-of-the-Art Benchmarks and Performance Landscape

Benchmark results on MVTec AD 2 (private test1, same-lighting, 256×256 input) highlight the sharp increase in challenge:

Method       AU-PRO@0.05 (%)   Pixel-F1 (%)   Image-F1 (%)
EfficientAD  30.8              N/A            N/A
PatchCore    28.8              ∼3.7           N/A
RD++         27.1              N/A            N/A
RD           26.4              N/A            N/A
MSFlow       24.3              21.8           N/A
SimpleNet    21.1              N/A            N/A
DSR          20.3              N/A            ∼78

Per-scenario PRO varies considerably: Vial (≈62%) is notably easier, while Can remains extremely challenging (4–9%). Threshold-dependent pixel-F1 and image-F1 metrics confirm that model rankings can shift with different binarization settings, underscoring the value of the threshold-independent AU-PRO. In contrast to the original MVTec AD, no method achieves over 31% AU-PRO@0.05, and even at the relaxed FPR=0.3 limit, state-of-the-art methods remain below 60% (Heckler-Kram et al., 27 Mar 2025).

5. Robustness to Real-World Distribution Shift

The inclusion of multi-lighting test splits (private test2) introduces real-world variability—over/under exposure, additional or shifted spotlights, color temperature changes—providing the first non-synthetic assessment of anomaly detection robustness under such shifts. Methods exhibit a range of robustness drops, with some (e.g., EfficientAD, MSFlow) showing over 10 percentage points decrease in AU-PRO, while RD and PatchCore are more resilient (≤ 3 pp drop). This property is critical for industrial deployment, where lighting conditions can frequently shift (Heckler-Kram et al., 27 Mar 2025).

6. Challenges in 3D Anomaly Detection: MVTec 3D-AD

MVTec 3D-AD targets the unsolved challenge of geometric anomaly detection using true three-dimensional data. High-resolution depth scans (1920×1200 px) from a structured-light camera capture ten diverse object categories, from natural (bagel, carrot, peach, potato) to deformable (foam, tire, rope) and rigid manufactured parts (cable gland, wooden dowel).

Key dataset properties include:

  • Modalities: Per-pixel (x, y, z) coordinates, aligned RGB, and derived depth maps.
  • Annotation: Precise 3D point-wise labels, projected to 2D masks with invalid-sensor regions labeled.
  • Defect diversity: 41 types (scratches, geometric deformities, holes, contaminations) across 948 anomalous test scans.

Benchmarked methods include voxel-based and depth-based autoencoders, f-AnoGANs, and pixel/voxel variation models. Metrics follow the original PRO definition, with integration up to FPR=0.3.

Quantitative results:

  • Best 3D-only (voxel f-AnoGAN): AU-PRO 0.583
  • Best 3D+RGB: AU-PRO 0.639 (voxel f-AnoGAN), with significant variance by class
  • Critical weaknesses: Existing autoencoder-based methods blur geometry and miss small or subtle defects; variation models are lightweight but less competitive when color is a strong cue

No method solves the benchmark—significant headroom remains, especially at industry-relevant, low-FPR regimes. Native point-cloud, hybrid geometry+texture embeddings, and noise-robust generative models are suggested as promising directions (Bergmann et al., 2021).

7. Practical Use, Access, and Ongoing Impact

All MVTec datasets are publicly available for research use (see official URLs). Evaluation for the AD 2 dataset is mediated by a server holding withheld ground truth; submissions consist of per-pixel anomaly maps and (optionally) decision thresholds. Use cases focus on driving methodological advances in:

  • Extremely small or low-contrast defect detection
  • Bulk and overlapping objects
  • Robustness to normal appearance variance
  • Lighting/condition invariant approaches without retraining
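Submissions to the evaluation server consist of per-pixel anomaly maps and, optionally, a decision threshold. The exact file format is not specified in this summary, so the folder layout, `.npy` maps, and JSON threshold file in the sketch below are purely illustrative assumptions.

```python
import json
from pathlib import Path
import numpy as np

def export_submission(anomaly_maps, out_dir, threshold=None):
    """Write one float32 anomaly map per test image, plus an optional
    decision threshold (hypothetical submission layout)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for image_id, amap in anomaly_maps.items():
        np.save(out / f"{image_id}.npy", amap.astype(np.float32))
    if threshold is not None:
        (out / "threshold.json").write_text(json.dumps({"threshold": threshold}))
```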

These benchmarks define state-of-the-art performance baselines in unsupervised anomaly detection for industrial vision and have revealed the insufficiency of earlier progress as judged by performance saturation on the original MVTec AD. By introducing new modalities, more natural variance, and distribution shifts, they continue to foster innovation in both algorithm design and industrial inspection resilience (Heckler-Kram et al., 27 Mar 2025, Bergmann et al., 2021).