Patch-SVDD for Anomaly Detection
- The paper introduces Patch SVDD, an unsupervised framework that achieves 0.921 detection and 0.957 segmentation AUROC on the MVTec AD benchmark.
- It employs dual-scale hierarchical encoding with 32×32 and 64×64 patches, using a pull-together loss and context-prediction to generate robust, fine-grained representations.
- The approach fuses multi-scale patch-level anomaly scores for both image-level detection and pixel-level segmentation, balancing local clustering against feature informativeness for superior performance.
Patch SVDD (Patch-level Support Vector Data Description) is an unsupervised anomaly detection and segmentation framework for images, designed to enable fine-grained localization of anomalies by extending the deep learning variant of SVDD to operate at the patch level and incorporating a self-supervised learning component. The method achieves state-of-the-art results on the MVTec AD benchmark, substantially improving both anomaly detection and segmentation performance relative to prior approaches (Yi et al., 2020).
1. Mathematical Foundation
Patch SVDD evolves from classical SVDD, which seeks the minimal-volume hypersphere that contains the embeddings of normal samples in feature space. In the deep SVDD model, standard kernel mappings are replaced with a neural encoder $f_\theta$, typically yielding the objective:

$$\mathcal{L}_{\text{SVDD}} = \sum_i \left\lVert f_\theta(x_i) - c \right\rVert_2^2,$$

where $f_\theta(x_i)$ is the learned embedding of sample $x_i$ and $c$ is the centroid of the normal embeddings.
A naïve patch-based extension directly applies this principle to overlapping image patches; however, heterogeneity among patches (e.g., object, background, texture) leads to high intra-class variation, making a single-centroid approach unsuitable for dense anomaly localization.
Patch SVDD circumvents this by introducing a “pull-together” loss between spatially adjacent patches:

$$\mathcal{L}_{\text{SVDD}'} = \sum_{i,i'} \left\lVert f_\theta(p_i) - f_\theta(p_{i'}) \right\rVert_2,$$

where $p_{i'}$ is a neighbor of $p_i$ in the surrounding 3×3 grid, thereby encouraging locally similar patch embeddings to cluster without enforcing a global unimodal structure.
To ensure meaningful, non-collapsed representations, Patch SVDD integrates a self-supervised context-prediction task. Given two patches $p_1, p_2$ from the 3×3 neighborhood with ground-truth relative position $y \in \{0, \dots, 7\}$, a classifier $C_\phi$ predicts $y$ from their embedding difference:

$$\mathcal{L}_{\text{SSL}} = \text{CrossEntropy}\left(y,\; C_\phi\big(f_\theta(p_1) - f_\theta(p_2)\big)\right).$$

The total loss for training is:

$$\mathcal{L} = \lambda\, \mathcal{L}_{\text{SVDD}'} + \mathcal{L}_{\text{SSL}}.$$
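As a concrete sketch, the two loss terms and their weighted sum can be written in a few lines of NumPy. The toy embeddings, random logits, and helper names below are illustrative stand-ins, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def pull_together_loss(emb_a, emb_b):
    """L_SVDD': mean L2 distance between embeddings of adjacent patches."""
    return np.mean(np.linalg.norm(emb_a - emb_b, axis=1))

def ssl_loss(logits, labels):
    """L_SSL: cross-entropy of the 8-way relative-position prediction."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(labels)), labels])

# Toy batch: 4 pairs of D=64 embeddings and their 8-way position logits.
emb_a = rng.normal(size=(4, 64))
emb_b = emb_a + 0.01 * rng.normal(size=(4, 64))  # neighbors: near-identical
logits = rng.normal(size=(4, 8))
labels = rng.integers(0, 8, size=4)

lam = 1.0  # per-class hyperparameter lambda
total = lam * pull_together_loss(emb_a, emb_b) + ssl_loss(logits, labels)
```

In practice the logits would come from the classifier $C_\phi$ applied to embedding differences, and both terms would be minimized jointly by the optimizer.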
2. Architecture and Self-Supervision
The encoder consists entirely of convolutional layers (no biases), followed by LeakyReLU activations. Patch SVDD employs a two-level hierarchical encoder, summarized as follows:
- "Small" encoder ($f_{\text{small}}$): receptive field 32, operates on 32×32 patches.
- "Big" encoder ($f_{\text{big}}$): processes 64×64 patches by subdividing them into four 32×32 sub-patches, embedding each with $f_{\text{small}}$, and aggregating via concatenation and convolution.
- Embedding dimensionality $D = 64$.
The self-supervised classifier is a two-layer MLP with 128 hidden units, operating on embedding differences. To suppress spurious cues, color channels are randomly perturbed during patch sampling.
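The hierarchical aggregation is easy to mock up: embed the four 32×32 sub-patches of a 64×64 patch, concatenate, and project. The random linear maps below merely stand in for the real convolutional encoders, so only the shapes (not the weights) are meaningful:

```python
import numpy as np

D = 64  # embedding dimensionality

def f_small(patch_32):
    """Stand-in for the small encoder: maps a 32x32x3 patch to R^D.
    A real implementation is a bias-free CNN; here a fixed random projection."""
    rng = np.random.default_rng(42)          # fixed "weights"
    W = rng.normal(size=(32 * 32 * 3, D))
    return patch_32.reshape(-1) @ W

def f_big(patch_64):
    """Big encoder: embed the four 32x32 sub-patches with f_small,
    then aggregate (here: concatenate and project back to R^D)."""
    subs = [patch_64[i:i + 32, j:j + 32] for i in (0, 32) for j in (0, 32)]
    concat = np.concatenate([f_small(s) for s in subs])  # shape (4*D,)
    rng = np.random.default_rng(7)
    W_agg = rng.normal(size=(4 * D, D))      # stand-in for the conv aggregation
    return concat @ W_agg

patch = np.random.default_rng(0).normal(size=(64, 64, 3))
z = f_big(patch)  # a single D-dimensional embedding for the 64x64 patch
```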
3. Training Methodology
Training is conducted exclusively on normal images resized to 256×256:
- For the big encoder ($K = 64$), extract overlapping 64×64 patches with stride 16.
- For the small encoder ($K = 32$), extract 32×32 patches with stride 4.
- Each iteration samples a patch $p_1$, a 3×3-grid neighbor $p_2$ (for $\mathcal{L}_{\text{SSL}}$), and a nearby patch (for $\mathcal{L}_{\text{SVDD}'}$).
- Optimization uses Adam with learning rate $10^{-4}$, batch size 256, for 50 epochs.
- No geometric augmentations (e.g., flips, rotations) are used.
- The hyperparameter $\lambda$ is selected per class: smaller values ($0.1$–$0.5$) for object classes, larger values for texture classes.
4. Anomaly Detection and Segmentation Pipeline
After training, all normal patches from the training set are encoded and stored for nearest-neighbor search, forming one embedding database per scale ($\mathcal{D}_{\text{small}}$ and $\mathcal{D}_{\text{big}}$).
For a test image:
- Overlapping patches are extracted with the same strides and embedded via both encoders.
- The anomaly score of a patch $p$ is its L2 distance to the nearest normal embedding, $A(p) = \min_{z \in \mathcal{D}} \lVert f_\theta(p) - z \rVert_2$, computed independently at the small and big scales.
- Pixel-level anomaly maps $A_{\text{small}}$ and $A_{\text{big}}$ are derived by distributing patch scores to the pixels they cover and averaging. They are fused via element-wise multiplication: $A_{\text{map}} = A_{\text{small}} \odot A_{\text{big}}$.
- The global image anomaly score is the maximum of the fused map, $\max_{x,y} A_{\text{map}}(x, y)$.
This design supports both image-level and fine-grained segmentation ROC analyses.
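A minimal sketch of the scoring step, assuming the patch embeddings are already computed; `patch_scores` and `image_score` are hypothetical helper names, and the data is random:

```python
import numpy as np

def patch_scores(test_embs, normal_db):
    """Anomaly score of each test patch = L2 distance to its nearest
    neighbor in the database of normal patch embeddings."""
    # (n_test, n_db) pairwise distances via broadcasting
    d = np.linalg.norm(test_embs[:, None, :] - normal_db[None, :, :], axis=2)
    return d.min(axis=1)

def image_score(map_small, map_big):
    """Fuse the two pixel-level maps multiplicatively; the image-level
    score is the maximum of the fused map."""
    fused = map_small * map_big
    return fused, fused.max()

rng = np.random.default_rng(0)
db = rng.normal(size=(100, 64))   # stand-in normal patch embeddings
test = rng.normal(size=(10, 64))  # stand-in test patch embeddings
s = patch_scores(test, db)
fused, g = image_score(rng.random((16, 16)), rng.random((16, 16)))
```

A real pipeline would use an approximate nearest-neighbor index instead of brute-force broadcasting, since the databases contain every normal patch.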
5. Experimental Results
Patch SVDD was evaluated on the MVTec AD dataset (15 industrial classes, comprising both objects and textures). Metrics are per-class AUROC for both detection and segmentation.
| Method | Detection AUROC | Segmentation AUROC |
|---|---|---|
| Deep SVDD (ICML ’18) | 0.592 | – |
| GEOM (NeurIPS ’18) | 0.672 | – |
| GANomaly (ACCV ’18) | 0.762 | – |
| ITAE (arXiv ’19) | 0.839 | – |
| Patch SVDD (ours) | 0.921 | 0.957 |
| L₂-AE | – | 0.804 |
| SSIM-AE | – | 0.818 |
| VE-VAE (CVPR ’20) | – | 0.861 |
| VAE Proj (ICLR ’20) | – | 0.893 |
Patch SVDD achieves a 9.8% relative improvement in detection AUROC and a 7.0% relative improvement in segmentation AUROC over the prior best entries.
6. Functional Analysis and Ablation Insights
- Replacing the single-center loss with the "pull-together" loss ($\mathcal{L}_{\text{SVDD}'}$) improves AUROC, and the further addition of the context-prediction task ($\mathcal{L}_{\text{SSL}}$) yields the highest performance.
- Ablation studies indicate that object classes, characterized by high intra-class patch variation, benefit more from self-supervised context prediction, while texture classes are less sensitive to this term.
- Feature visualizations (t-SNE) show uni-modal clusters when training without $\mathcal{L}_{\text{SSL}}$, and semantically meaningful, multi-modal clusters when both losses are used. The lowest intrinsic feature dimension occurs when both components are combined.
- Hierarchical multi-scale encoding (combining the 32×32 and 64×64 scales) outperforms either scale alone and single-level multi-branch architectures, suggesting that shared sub-encoders induce regularization and a useful inductive bias.
- The selection of $\lambda$ balances local clustering against feature informativeness; optimal values differ per class.
- Embedding dimensionality exhibits diminishing returns beyond $D = 64$.
- Surprisingly, nearest-neighbor anomaly detection on random CNN features, or even on raw pixel patches, can suffice for certain classes, since for one-layer convolutions the Euclidean distances in feature space and pixel space are closely related.
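The last point can be illustrated with a random linear map: distances after a one-layer (bias-free, linear) projection correlate strongly with raw pixel distances. The synthetic "patches" below are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic flattened patches with varying contrast, so pairwise distances
# spread out rather than concentrating at a single value.
patches = rng.normal(size=(50, 1024)) * rng.uniform(0.2, 2.0, size=(50, 1))

# A random one-layer linear feature map (no nonlinearity, no bias).
W = rng.normal(size=(1024, 64)) / np.sqrt(1024)
feats = patches @ W

# Compare pixel-space and feature-space distances to a reference patch.
pix_d = np.linalg.norm(patches[1:] - patches[0], axis=1)
feat_d = np.linalg.norm(feats[1:] - feats[0], axis=1)
corr = np.corrcoef(pix_d, feat_d)[0, 1]  # strong positive correlation
```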
7. Limitations and Potential Extensions
Patch SVDD requires maintaining large databases of normal patch embeddings and performing approximate nearest-neighbor search at inference time, which can be both memory- and compute-intensive. Hyperparameter tuning of is manual and dataset-dependent. Extending the approach to additional scales, or incorporating further self-supervised objectives (e.g., rotation or jigsaw prediction), are plausible routes for increased robustness and generalization (Yi et al., 2020).
Code and pretrained models are publicly available, facilitating adaptation and reproduction for diverse anomaly detection pipelines.