Intrinsic Normal Prototypes Extractor
- The Intrinsic Normal Prototypes (INPs) Extractor is a technique that learns a compact set of template vectors capturing the statistical modes of normal data for anomaly detection.
- It employs an encoder to extract features, a parameterized INP extraction module, and an INP-guided decoder that reconstructs only normal patterns.
- The method supports unsupervised and weakly supervised learning across domains such as medical imaging and industrial inspection, enabling robust anomaly scoring.
Intrinsic Normal Prototypes (INPs) Extractors constitute a class of architectures and algorithms that learn a compact set of template vectors representing the internal statistics or semantic modes of "normal" (non-anomalous) data in a learned latent space. These prototypes underpin prototype-based anomaly detection methods by functioning as an intrinsic “dictionary” of normality, enabling localized or global anomaly scoring without reliance on reference sets of normal data. INP Extractors are widely adopted in medical imaging, industrial inspection, and other domains requiring unsupervised or weakly supervised detection of rare deviations.
1. Definition and Theoretical Motivation
INPs are learnable template vectors—sometimes referred to as tokens, prototypes, or centroids—trained to capture the modes of normal feature patterns produced by a deep encoder. The primary motivation is the observation that even anomalous samples contain abundant normal regions or structures, and that the ability to synthesize prototypes from the same image (or batch) provides a robust, distribution-aligned internal normal reference. Unlike reference-based methods requiring an external support set, INP-based approaches are intrinsically support-free and adapt to statistical variation without explicit realignment, facilitating transferability, fine-grained localization, and handling of incomplete data scenarios (Luo et al., 4 Mar 2025, Luo et al., 4 Jun 2025, Wu et al., 24 Dec 2025, Zhao et al., 11 Sep 2025, Zhao et al., 5 Nov 2025).
2. Architectures and Algorithmic Variants
The INPs Extractor paradigm manifests in several architectures, but most utilize three key modules:
- Feature Encoder: A pre-trained CNN or Vision Transformer (ViT, often DINOv2-ViT-Base/14) that produces multi-scale feature maps or patch tokens from input data.
- INPs Extraction Module: A parameterized mechanism—usually a set of learnable query tokens followed by a cross-attention and feed-forward layer—that produces a small set of prototype vectors from the aggregated encoder output. In some formulations, prototypes are updated via optimal transport (OT) assignment (Trombetta et al., 18 Aug 2025), or learned by deep clustering and contrastive loss (Dong et al., 2024).
- INP-Guided Decoder: A transformer or U-Net decoder that reconstructs encoder features, using only the INPs as keys and values in each block's cross-attention. This constraint ensures that only normal patterns can be synthesized, making anomalous regions unreproducible and thus detectable via reconstruction error.
In table form:
| Component | Purpose | Key Algorithms |
|---|---|---|
| Encoder | Extracts latent features/tokens | DINOv2-ViT, ResNet50, parametric encoder |
| INPs Extractor | Synthesizes normality prototypes | Transformer cross-attention, OT, clustering |
| INP-Guided Decoder | Reconstructs normal content only | Attention over prototypes, U-Net, ViT-block |
Most recent methods adopt multi-layer architectures where INPs are injected at each decoding stage, and prototype extraction may be image-specific (extracted from test image) or dataset-global (shared across the dataset) depending on the task context (Luo et al., 4 Mar 2025, Wu et al., 24 Dec 2025, Zhao et al., 11 Sep 2025, Luo et al., 4 Jun 2025).
3. Mathematical Formulation
A canonical INPs Extractor comprises the following computational steps:
- Feature Extraction: Given an input (e.g., an image or MRI slice), the encoder produces multi-scale feature maps $\{F_l\}$; these are aggregated (sum or concatenation) into $F \in \mathbb{R}^{N \times C}$, where $N$ is the number of spatial patches and $C$ is the feature dimension.
- Prototype Extraction:
  - Initialize $M$ learnable prototype tokens $P_0 \in \mathbb{R}^{M \times C}$.
  - Compute cross-attention updates with the prototypes as queries and the patch features as keys and values,
$$P' = P_0 + \mathrm{Attn}(P_0 W_Q,\, F W_K,\, F W_V), \qquad P = P' + \mathrm{FFN}(P'),$$
where $\mathrm{Attn}(Q, K, V) = \mathrm{softmax}(QK^{\top}/\sqrt{C})\,V$, yielding $P = \{p_m\}_{m=1}^{M}$ as the set of INPs (Luo et al., 4 Mar 2025, Wu et al., 24 Dec 2025, Zhao et al., 11 Sep 2025).
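The cross-attention prototype update can be sketched in NumPy. This is a single-head, single-layer simplification; the projection names `Wq`, `Wk`, `Wv` are illustrative, and real extractors stack multiple heads and layers and add the feed-forward block:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def extract_inps(F, P0, Wq, Wk, Wv):
    """One cross-attention update of the prototype tokens.

    F  : (N, C) aggregated encoder patch features
    P0 : (M, C) learnable prototype (query) tokens
    Wq, Wk, Wv : (C, C) projection matrices (hypothetical names)
    Returns P : (M, C) updated intrinsic normal prototypes.
    """
    C = F.shape[1]
    Q = P0 @ Wq                                  # queries from prototypes
    K, V = F @ Wk, F @ Wv                        # keys/values from patch features
    A = softmax(Q @ K.T / np.sqrt(C), axis=-1)   # (M, N) attention weights
    return P0 + A @ V                            # residual update
```

Because the prototypes attend over all patches, each INP aggregates a soft cluster of normal patch features rather than copying any single patch.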
- Loss Functions:
- INP Coherence / Consistency Loss: Pulls encoder tokens toward the closest prototype,
$$\mathcal{L}_{\mathrm{coh}} = \frac{1}{N} \sum_{i=1}^{N} \min_{1 \le m \le M} \bigl(1 - \cos(f_i, p_m)\bigr).$$
- Reconstruction Loss: Compares original features to decoder output using cosine distance, adaptively weighted by patch difficulty,
$$\mathcal{L}_{\mathrm{rec}} = \frac{1}{N} \sum_{i=1}^{N} w_i \bigl(1 - \cos(f_i, \hat{f}_i)\bigr),$$
with the weighting $w_i$ increasing in the per-patch residual so that harder-to-reconstruct patches receive larger gradients (Wu et al., 24 Dec 2025).
- Distribution Alignment Loss: Keeps features distribution-invariant under missing modalities (for multi-modal data), e.g., by penalizing the discrepancy $\mathcal{L}_{\mathrm{align}} = D\bigl(\phi(x_{\mathrm{full}}), \phi(x_{\mathrm{masked}})\bigr)$ between features of the complete input and a modality-masked version, for a feature distance $D$ such as cosine or $\ell_2$.
Total loss is typically
$$\mathcal{L} = \mathcal{L}_{\mathrm{rec}} + \lambda_1 \mathcal{L}_{\mathrm{coh}} + \lambda_2 \mathcal{L}_{\mathrm{align}},$$
with the weights $\lambda_1, \lambda_2$ obtained by ablation (Wu et al., 24 Dec 2025).
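A minimal NumPy sketch of the coherence and adaptively weighted reconstruction losses; the exponent `gamma` and the mean-normalized weighting are illustrative assumptions, not the published formulas:

```python
import numpy as np

def cosine_dist(a, b):
    """Row-wise 1 - cosine similarity for (N, C) arrays."""
    an = a / np.linalg.norm(a, axis=-1, keepdims=True)
    bn = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return 1.0 - (an * bn).sum(axis=-1)

def coherence_loss(F, P):
    """INP coherence: mean distance of each token to its nearest prototype."""
    Fn = F / np.linalg.norm(F, axis=-1, keepdims=True)
    Pn = P / np.linalg.norm(P, axis=-1, keepdims=True)
    D = 1.0 - Fn @ Pn.T            # (N, M) pairwise cosine distances
    return D.min(axis=1).mean()

def weighted_recon_loss(F, F_hat, gamma=3.0):
    """Cosine reconstruction loss with soft-mining weights: patches with
    larger residuals receive larger weights (gamma is an assumed choice)."""
    d = cosine_dist(F, F_hat)                 # (N,) per-patch residual
    w = (d / (d.mean() + 1e-8)) ** gamma      # harder patches -> larger weight
    return (w * d).mean()
```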
- Alternative Prototyping: Some frameworks employ optimal-transport assignments of embeddings to grid-anchored prototypes, yielding local and global INPs. The cost function combines cosine (feature) and Euclidean (spatial) distances,
$$C_{im} = \alpha \bigl(1 - \cos(f_i, p_m)\bigr) + (1 - \alpha)\,\lVert x_i - g_m \rVert_2,$$
where $x_i$ is the spatial position of patch $i$, $g_m$ is the grid anchor of prototype $m$, and $\alpha$ balances the two terms (Trombetta et al., 18 Aug 2025).
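The OT-based assignment can be illustrated with a mixed cost matrix and a standard entropic Sinkhorn iteration. The balance `alpha`, the regularization `reg`, and the uniform marginals are assumptions of this sketch, not parameters from the cited work:

```python
import numpy as np

def ot_cost(feats, pos, proto_feats, proto_pos, alpha=0.5):
    """Transport cost mixing feature (cosine) and spatial (Euclidean) terms.

    feats       : (N, C) patch embeddings, pos       : (N, 2) patch coords
    proto_feats : (M, C) prototypes,       proto_pos : (M, 2) grid anchors
    """
    fn = feats / np.linalg.norm(feats, axis=-1, keepdims=True)
    pn = proto_feats / np.linalg.norm(proto_feats, axis=-1, keepdims=True)
    feat_cost = 1.0 - fn @ pn.T                   # (N, M) cosine distance
    spat_cost = np.linalg.norm(pos[:, None, :] - proto_pos[None, :, :], axis=-1)
    return alpha * feat_cost + (1.0 - alpha) * spat_cost

def sinkhorn(C, reg=0.1, iters=50):
    """Entropic-regularized OT with uniform marginals (standard Sinkhorn)."""
    N, M = C.shape
    K = np.exp(-C / reg)
    a, b = np.ones(N) / N, np.ones(M) / M
    v = np.ones(M)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]   # transport plan (N, M)
```

Each row of the resulting plan softly assigns a patch to prototypes, so the spatial term keeps prototypes anchored to their grid region.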
4. Training Strategies and Regularization
Robust INP learning depends on the following training regimes:
- Support-Free/Intrinsic Training: INPs are synthesized within each image or batch, using only normal data. No external support set is required (Luo et al., 4 Mar 2025, Zhao et al., 11 Sep 2025, Zhao et al., 5 Nov 2025).
- Augmenting with Pseudo-Anomalies: Synthetic defects are introduced into normal images, generating pseudo-anomaly masks. Coherence or purity losses use these masks to ensure INPs are only responsive to normal regions (Zhao et al., 11 Sep 2025, Zhao et al., 5 Nov 2025).
- Randomized Modality Masking: In incomplete multi-modal setups, modality inputs are randomly masked per iteration to encourage generalization and modality-invariance (Wu et al., 24 Dec 2025).
- Deep Clustering and Contrastive Regularization: In weakly supervised settings, prototypes are refined using both embedding clustering and contrastive loss to sharpen normal mode differentiation and filter possible contamination in unlabeled data (Dong et al., 2024).
- Mining-Based Weighting: Soft mining losses scale gradient flow to patches that exhibit higher reconstruction discrepancy, focusing the model's capacity on harder-to-reconstruct or less-represented normal patterns (Luo et al., 4 Mar 2025, Luo et al., 4 Jun 2025).
A single stochastic optimizer (AdamW or StableAdamW) is usually applied to all components with a moderate prototype count (up to $8$); performance saturates beyond a moderate number of prototypes (Luo et al., 4 Mar 2025, Luo et al., 4 Jun 2025).
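The randomized modality masking described above can be sketched in NumPy; the `(B, K, H, W)` array layout, the 50% drop probability, and the `keep_min` guard are illustrative assumptions, not values from the cited works:

```python
import numpy as np

def mask_modalities(x, rng, p_drop=0.5, keep_min=1):
    """Randomly drop modality channels to simulate incomplete inputs.

    x : (B, K, H, W) batch with K modality channels (e.g. MRI sequences).
    Each modality is dropped independently with probability p_drop, but at
    least `keep_min` modalities always survive per sample.
    Returns the masked batch and the boolean keep-mask of shape (B, K).
    """
    B, K = x.shape[:2]
    keep = rng.random((B, K)) >= p_drop
    for i in range(B):
        # re-enable random modalities until the minimum is satisfied
        while keep[i].sum() < keep_min:
            keep[i, rng.integers(K)] = True
    return x * keep[:, :, None, None], keep
```

Resampling the mask every iteration exposes the model to many modality subsets, which is what encourages modality-invariant features.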
5. Inference and Anomaly Scoring
At test time, INPs are either extracted from the test image or taken from the learned “global” prototypes, depending on whether the method is image-intrinsic or dataset-level. The reconstruction process (through the INP-guided decoder) produces feature reconstructions restricted to the space of normal patterns. Anomaly scoring employs the following principles:
- Pixel-wise/Token-wise Residuals: The anomaly score at each location is the residual (cosine or $\ell_2$ distance) between encoder features and decoder reconstructions, e.g., $s_i = 1 - \cos(f_i, \hat{f}_i)$.
- Postprocessing: Residuals are normalized or thresholded, sometimes using adaptive binarization (e.g., plateau-based threshold selection) to select salient connected components for segmentation (Zhao et al., 5 Nov 2025).
- Image-Level Aggregation: An image-level anomaly score may be computed as the average of the top patch-wise anomaly scores (Luo et al., 4 Mar 2025).
- Few-Shot/Zero-Shot Generalization: In semi-supervised or transfer settings, the INP representation supports novel-class discovery by clustering residuals or applying teacher–student self-supervision with mask-guided attention (Zhao et al., 5 Nov 2025).
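The residual map and top-K image-level aggregation can be sketched as follows; the spatial reshape layout and the `top_k` value are assumptions of the sketch:

```python
import numpy as np

def anomaly_map(F, F_hat, hw):
    """Per-patch cosine residual between encoder features and their
    INP-guided reconstruction, reshaped to a spatial map of shape hw."""
    Fn = F / np.linalg.norm(F, axis=-1, keepdims=True)
    Hn = F_hat / np.linalg.norm(F_hat, axis=-1, keepdims=True)
    s = 1.0 - (Fn * Hn).sum(axis=-1)   # (N,) residuals; higher = more anomalous
    return s.reshape(hw)

def image_score(smap, top_k=10):
    """Image-level score: mean of the top-K patch residuals."""
    flat = np.sort(smap.ravel())[::-1]
    return flat[:top_k].mean()
```

Because the decoder can only synthesize normal patterns, a patch the decoder fails to reproduce yields a large residual and dominates the top-K aggregate.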
6. Applications and Empirical Performance
INP Extractor frameworks have been validated in medical imaging (unified MRI anomaly detection across incomplete modalities) (Wu et al., 24 Dec 2025), industrial/IC defect segmentation (Zhao et al., 11 Sep 2025, Zhao et al., 5 Nov 2025), and universal anomaly detection benchmarks spanning single-class, multi-class, few-shot, and zero-shot regimes (Luo et al., 4 Mar 2025, Luo et al., 4 Jun 2025).
Selected empirical results:
| Method | Domain | Image AUROC | Pixel AUROC | SOTA claim | Reference |
|---|---|---|---|---|---|
| INP-Former | MVTec-AD (multi) | 99.7 | 98.5 | Yes | (Luo et al., 4 Mar 2025) |
| INP-Former++ | MVTec-AD (multi) | 99.8 | 98.7 | Yes | (Luo et al., 4 Jun 2025) |
| AnyAD (INPs) | BraTS2018/clinical | Maintained | Maintained | Yes (7 configs) | (Wu et al., 24 Dec 2025) |
| IC DefectNCD | IC manufacturing | Robust | Robust | Yes (multi-stage) | (Zhao et al., 5 Nov 2025) |
INP-based systems maintain generalization across varying input modalities, demonstrate robustness to missing data, and consistently outperform or match state-of-the-art baselines. Ablation studies confirm critical roles for INP extraction, coherence loss, and mining-based weighting in achieving top performance (Luo et al., 4 Mar 2025, Luo et al., 4 Jun 2025, Wu et al., 24 Dec 2025).
7. Extensions and Generalization
The INPs Extractor framework is adaptable beyond current image-based anomaly detection:
- Arbitrary Modality Handling: Distribution alignment enables INPs to generalize in multimodal or incomplete data contexts (Wu et al., 24 Dec 2025).
- Hierarchical and Spatially-Constrained Prototypes: Local/global structural priors (e.g., OT-based spatial anchoring) allow for fine-grained pattern modeling and context-sensitive anomaly localization (Trombetta et al., 18 Aug 2025).
- Classifier/Segmentor Variants: By substituting anomaly scoring with nearest-prototype classification, INPs Extractors naturally extend to prototype-based classification or segmentation in both supervised and unsupervised settings (Trombetta et al., 18 Aug 2025, Dong et al., 2024).
- Semi-Supervised Discovery: Combining residual localization from INPs with teacher–student frameworks and adaptive clustering enables unsupervised or semi-supervised discovery of novel categories (Zhao et al., 5 Nov 2025).
- General Prototype Learning: The underlying prototype learning infrastructure is applicable to domains where structure-anchored or mode-clustered representations are beneficial, including domain adaptation, novelty detection, and self-supervised representation learning.
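The nearest-prototype classifier variant mentioned above reduces to an argmax over cosine similarities. A minimal sketch, assuming tokens and prototypes share a latent space:

```python
import numpy as np

def nearest_prototype(F, protos):
    """Assign each token to its nearest prototype by cosine similarity,
    turning the INP dictionary into a simple classifier/segmenter."""
    Fn = F / np.linalg.norm(F, axis=-1, keepdims=True)
    Pn = protos / np.linalg.norm(protos, axis=-1, keepdims=True)
    return (Fn @ Pn.T).argmax(axis=1)   # (N,) prototype indices
```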
A plausible implication is that as encoder architectures advance and embedding spaces become more semantically structured, the role of INPs and their extractors will continue to expand, both as anomaly detectors and as universal, support-set-free priors in broader pattern analysis frameworks.