
PaAno: Patch-based Learning for Anomaly Detection

Updated 8 February 2026
  • The paper introduces a patch-based approach where time series are segmented into localized patches, enabling efficient detection of subtle, local anomalies.
  • It employs diverse encoder architectures such as CNNs, transformers, and MLP-Mixers along with multi-scale patching to capture both fine-grained and long-range dependencies.
  • Empirical results demonstrate that patch-based methods achieve state-of-the-art metrics on benchmarks like SWaT and SMAP while significantly reducing computational overhead.

Patch-based representation learning for time-series anomaly detection (PaAno) refers to model architectures and training paradigms that segment a time series into short, overlapping or non-overlapping patches, encode these temporally localized patterns via learnable representations, and leverage these patch embeddings for the identification of anomalous events. This approach harnesses the inductive bias that many time-series anomalies manifest as local, subtle deviations in short contiguous intervals. Recent advances have shown patch-based methods can rival or outperform heavyweight transformer and foundation model approaches in both efficiency and detection quality, owing to their capacity for disentangling multi-scale, local, and contextual dependencies while remaining computationally tractable (Park et al., 1 Feb 2026).

1. Patch Extraction and Patch-based Input Construction

The core of patch-based methods is the segmentation of the input time series into localized temporal segments or "patches." Given a multivariate or univariate sequence $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_T]$ with $\mathbf{x}_t \in \mathbb{R}^d$, a sliding or non-overlapping window of length $w$ is used to extract patches:

$$\mathcal{P} = \{ \mathbf{p}_t \mid \mathbf{p}_t = [\mathbf{x}_t, \mathbf{x}_{t+1}, \ldots, \mathbf{x}_{t+w-1}] \}, \quad t = 1, \ldots, T - w + 1.$$

Patch size $w$ and stride are critical hyperparameters (e.g., $w = 96$ for multivariate series in PaAno (Park et al., 1 Feb 2026), variable patch sizes for multi-scale modeling in (Zhang et al., 19 Apr 2025), and patch length $P_{\text{len}}$ with stride $S$ in PatchTrAD (Vilhes et al., 10 Apr 2025)).
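As a concrete illustration, sliding-window patch extraction can be sketched in a few lines of NumPy. The function name, the stride default, and the toy shapes are illustrative assumptions, not taken from any of the cited papers:

```python
import numpy as np

def extract_patches(x, w, stride=1):
    """Slice a (T, d) series into patches of length w.

    Returns an array of shape (num_patches, w, d), one patch per
    window start t = 0, stride, 2*stride, ... (illustrative sketch).
    """
    T = x.shape[0]
    starts = range(0, T - w + 1, stride)
    return np.stack([x[t:t + w] for t in starts])

# Toy example: T=10 timesteps, d=2 channels, patch length w=4, stride 2.
x = np.arange(20, dtype=float).reshape(10, 2)
patches = extract_patches(x, w=4, stride=2)
print(patches.shape)  # (4, 4, 2)
```

With stride equal to `w` the same function yields the non-overlapping variant.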

Multi-scale patching is often used to capture both fine-grained and long-range dependencies. For instance, CPatchBLS ensembles multiple Dual-PatchBLS models over a range of patch sizes $S_{\text{patch}}^i$ (Li et al., 2024), and TransDe concatenates patch representations from several scale channels after decomposition (Zhang et al., 19 Apr 2025). Certain models (e.g., PatchAD (Zhong et al., 2024)) use parallel patch sizes simultaneously to promote robustness to anomaly scale.
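The multi-scale variant simply repeats the window extraction at several patch lengths. A minimal self-contained sketch (the function name and dictionary layout are assumptions for illustration):

```python
import numpy as np

def multiscale_patches(x, patch_sizes, stride=1):
    """Extract patches at several lengths from a (T, d) series.

    Returns {w: array of shape (num_patches_w, w, d)}, one entry per
    patch length, mirroring the parallel multi-scale patching idea.
    """
    T = x.shape[0]
    return {
        w: np.stack([x[t:t + w] for t in range(0, T - w + 1, stride)])
        for w in patch_sizes
    }

# Three scales over a toy 12-step, 2-channel series.
x = np.arange(24, dtype=float).reshape(12, 2)
scales = multiscale_patches(x, patch_sizes=[2, 4, 8], stride=2)
for w, p in scales.items():
    print(w, p.shape)
```

Each scale's patches would then be fed to its own encoder (or scale channel) before the representations are concatenated or ensembled.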

2. Patch Encoder Architectures

Patch-based anomaly detectors employ a variety of encoder structures to map each patch into a latent vector. Representative architectural choices include:

  • CNN encoders: PaAno uses a 4-layer 1D CNN (kernel sizes $[7, 5, 3, 3]$) with instance normalization and global average pooling, projecting to a compact 64-dim embedding (Park et al., 1 Feb 2026). Similarly, TriP-LLM leverages a two-layer causal 1D conv followed by depthwise conv, mean-based residual, and linear projection (Yu et al., 31 Jul 2025).
  • Patchwise transformers: PatchTrAD flattens all patches (across channels) into tokens, applies a vanilla Transformer encoder (multi-head self-attention, $D_{\text{model}} = 128$), and processes all $M \times P_{\text{num}}$ patches in parallel (Vilhes et al., 10 Apr 2025). TransDe decomposes the input into trend/cyclical components, patches each, and processes them with lightweight multi-scale transformers (Zhang et al., 19 Apr 2025).
  • MLP-Mixer/Feedforward encoders: PatchAD uses MLP-mixer networks with Channel, Inter/Intra, and MixRep Mixers to promote cross-channel/global and intra-patch mixing (Zhong et al., 2024). CPatchBLS employs a shallow Broad Learning System on patch matrices (Li et al., 2024).
  • LLM/Backbone patch-embedding: TriP-LLM fuses three patch embedding streams (local, selective, global) and projects their gate fusion to produce LLM-ready tokens; the LLM remains frozen during finetuning (Yu et al., 31 Jul 2025). MOMEMTO passes pre-trained patch embeddings through memory gating before integrating into a foundation backbone (Yoon et al., 23 Sep 2025).

Auxiliary projection heads, gating, or positional encodings may be added on top of patch encodings for additional regularization or for adaptation to downstream modules.
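To make the encoding step concrete, the sketch below maps a single patch to a unit-norm embedding. It is a deliberately simplified stand-in for the CNN/transformer/mixer encoders above, not any paper's architecture: instance normalization per channel, one hypothetical learned linear projection, and L2 normalization so that cosine distances between patch embeddings are well defined:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_patch(patch, W, b):
    """Map a (w, d) patch to a fixed-size embedding (simplified sketch).

    Instance-normalize each channel over time, flatten, apply a
    learned linear projection (W, b are hypothetical parameters),
    then L2-normalize the result.
    """
    z = (patch - patch.mean(axis=0)) / (patch.std(axis=0) + 1e-6)
    e = z.reshape(-1) @ W + b
    return e / (np.linalg.norm(e) + 1e-12)

w, d, emb_dim = 8, 3, 64          # patch length, channels, embedding size
W = rng.normal(size=(w * d, emb_dim)) * 0.1
b = np.zeros(emb_dim)

embedding = encode_patch(rng.normal(size=(w, d)), W, b)
print(embedding.shape)  # (64,)
```

In practice the linear map would be replaced by the stacked conv/attention/mixer layers described above; the normalization-then-project-then-normalize pattern is what downstream cosine-based scoring relies on.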

3. Representation Learning Objectives and Regularization

Patch-based TSAD models leverage loss functions that yield embeddings particularly sensitive to deviations from normality while remaining robust to intra-class variation:

  • Contrastive and Metric Learning: PaAno enforces that temporally adjacent patches (positives) are closer in projected space than distant negatives (max-hard triplet loss with margin $\delta = 0.5$) (Park et al., 1 Feb 2026). PatchAD uses a symmetric KL-divergence and InfoNCE-inspired contrastive losses across inter/intra patch views, with further regularization by a dual-projection constraint (Zhong et al., 2024). TransDe deploys a pure KL-contrastive learning paradigm (between intra/inter-patch dependency matrices, using a stop-gradient strategy), with no explicit reconstruction (Zhang et al., 19 Apr 2025).
  • Pretext and Auxiliary Tasks: PaAno augments triplet learning with a pretext binary classification loss via a 2-head network, distinguishing temporal order among neighbor and randomly sampled patches (active in early training phase) (Park et al., 1 Feb 2026). NCAD applies a contextual-hypersphere classification loss between contextual and suspect patch embeddings (Carmona et al., 2021).
  • Memory-Augmented Representation: MOMEMTO's loss comprises both reconstruction (for patch-wise output) and memory regularization, penalizing excessive drift of memory items from their initial domain-averaged prototypes (Yoon et al., 23 Sep 2025).
  • Reconstruction-based Losses: PatchTrAD, CPatchBLS, TriP-LLM, and PatchAD also deploy patch-wise mean squared error losses that incentivize precise local reconstruction from learned representations (Vilhes et al., 10 Apr 2025, Li et al., 2024, Yu et al., 31 Jul 2025, Zhong et al., 2024).

Contrastive or discrepancy-based signals (e.g., dual-view KL in CPatchBLS, inter-intra discrepancy in PatchAD) are used to align or distinguish representations under different augmentation or architectural choices, increasing sensitivity to subtle anomalies.
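As an illustration of the metric-learning objective, a max-hard triplet loss over patch embeddings might look as follows. Only the margin formulation mirrors the description above; the Euclidean distance and the toy sampling setup are simplifying assumptions:

```python
import numpy as np

def triplet_loss(anchor, positive, negatives, margin=0.5):
    """Hinge triplet loss with max-hard negative mining (sketch).

    anchor, positive: embeddings of temporally adjacent patches.
    negatives: (n, dim) embeddings of distant patches; the hardest
    (closest) negative is selected. Distances are Euclidean here
    purely for illustration.
    """
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(negatives - anchor, axis=1).min()
    return max(0.0, d_pos - d_neg + margin)

rng = np.random.default_rng(1)
a = rng.normal(size=16)
p = a + 0.01 * rng.normal(size=16)      # near-duplicate neighbor patch
negs = rng.normal(size=(5, 16))         # distant, randomly sampled patches
print(triplet_loss(a, p, negs, margin=0.5))
```

The loss is zero whenever the hardest negative is already at least `margin` farther from the anchor than the positive, so gradients concentrate on violations.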

4. Patch-wise Anomaly Scoring and Inference

Anomaly detection hinges on aggregating evidence from patch representations around or covering each time point. Common procedures include:

  • Patch-matching against normal memory: At test time, a "memory bank" of patch embeddings from normal training data (optionally reduced by K-means) is queried using $k$-nearest neighbor cosine distances; the anomaly score for time $t$ averages over all patches covering $t$ and their neighbor distances (Park et al., 1 Feb 2026).
  • Patch-wise reconstruction error: Many methods (PatchTrAD, TriP-LLM, CPatchBLS) reconstruct the patch and use the squared error per patch, $e^{(m)} = \| x_{p,i}^{(m)} - \tilde{x}_{p,i}^{(m)} \|_2$; per-timepoint or per-segment anomaly scores are then summed across modalities/channels or aggregated over overlapping patches (Vilhes et al., 10 Apr 2025, Li et al., 2024, Yu et al., 31 Jul 2025).
  • Inter-patch discrepancy: PatchAD derives its score from the KL-divergence between inter- and intra-view patch embeddings (Zhong et al., 2024); CPatchBLS uses the symmetric KL between basic and SKP-perturbed BLS outputs (Li et al., 2024); TransDe fuses intra/inter dependencies to provide final scores (Zhang et al., 19 Apr 2025).
  • Memory-augmented refinement: MOMEMTO enhances a foundation model's encoder output using memory-read attention; anomaly score derives from reconstruction and distance to memory items (Yoon et al., 23 Sep 2025).

The table below compares key patch-based anomaly scoring mechanisms:

Model | Scoring Basis | Aggregation/Augmentation
PaAno | Patch→NN distance (cosine) | Average over patches covering time $t$
PatchTrAD | Patch reconstruction error | Sum/max across modalities
TriP-LLM | Patch reconstruction error | Merge overlaps; downstream MLP decoding
PatchAD | Inter-intra KL discrepancy | KL-sum/symmetric divergence
CPatchBLS | Dual-branch KL discrepancy | Scale-averaged KL divergence
MOMEMTO | Reconstruction + memory alignment | Patch-level, with memory-guided refinement
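A minimal sketch of the patch-matching route described above: take $k$-nearest-neighbor cosine distances from each test-patch embedding to a bank of normal-patch embeddings, then average over the patches covering each timestep. All names, the orthonormal toy memory, and $k = 1$ are illustrative assumptions:

```python
import numpy as np

def knn_anomaly_scores(patch_emb, memory, patch_starts, w, T, k=1):
    """Score each timestep via patch-to-memory k-NN cosine distance.

    patch_emb: (P, dim) unit-norm test-patch embeddings.
    memory:    (M, dim) unit-norm embeddings of normal training patches.
    Each patch covers timesteps [start, start + w); a timestep's score
    is the mean k-NN distance over all patches covering it.
    """
    sims = patch_emb @ memory.T                      # cosine similarity
    knn_dist = np.sort(1.0 - sims, axis=1)[:, :k].mean(axis=1)

    scores = np.zeros(T)
    counts = np.zeros(T)
    for dist, start in zip(knn_dist, patch_starts):
        scores[start:start + w] += dist
        counts[start:start + w] += 1
    return scores / np.maximum(counts, 1)

# Toy memory: 8 orthonormal "normal" patch embeddings.
memory = np.eye(8)
starts = np.arange(5)                                # stride-1 patches, w=4, T=8
patch_emb = memory[:5].copy()
patch_emb[2] = -memory[2]                            # one anomalous patch
scores = knn_anomaly_scores(patch_emb, memory, starts, w=4, T=8)
print(scores.round(2))
```

Because normal patches match a memory item exactly (distance 0) while the flipped patch does not, only the timesteps covered by the anomalous patch receive nonzero scores, weighted down by how many normal patches also cover them.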

5. Empirical Performance and Computational Efficiency

Patch-based approaches have achieved state-of-the-art performance on diverse benchmarks including TSB-AD, SMD, SMAP, SWaT, PSM, and WADI. For example:

  • PaAno achieves first place on all six main metrics in TSB-AD-U and TSB-AD-M, outperforming foundation models and Transformers with one-tenth the parameters and an order-of-magnitude faster runtime (Park et al., 1 Feb 2026).
  • PatchTrAD's mean ROC-AUC (across six multivariate and univariate datasets) reaches 0.814, outperforming PatchAD (0.811), TranAD (0.751), and classical LSTM-AE/USAD (Vilhes et al., 10 Apr 2025).
  • TransDe's F1 scores reach 98.04% on SWaT and 96.67% on SMAP, exceeding the best Transformer-based detectors (Zhang et al., 19 Apr 2025).
  • TriP-LLM achieves higher PATE (threshold-free) scores than all baselines across five benchmarks, with robust memory efficiency (e.g., 3.3× GPU memory reduction over channel-independence variants) (Yu et al., 31 Jul 2025).
  • CPatchBLS achieves ROC-AUC ≈ 99.81%, PR-AUC ≈ 98.25%, and PA-F1 ≈ 96.87%, with training/inference times far below those of deep models (Li et al., 2024).
  • MOMEMTO as a single cross-domain model achieves higher AUC-PR and VUS-PR than MOMENT and classic tree-based models, especially in few-shot settings (Yoon et al., 23 Sep 2025).

Efficiency gains stem from compact encoder design (CNN, BLS, MLP-Mixer), parallelizable patchwise inference, memory reduction techniques, and avoidance of heavy autoregressive or full-sequence modules.

6. Model Variants and Extensions

Patch-based representation learning for time-series anomaly detection encompasses a family of adaptable frameworks:

  • Memory-Augmented Patch Encoders: Incorporation of explicit memory banks (MOMEMTO, PaAno, TriP-LLM) enhances robustness to over-generalization and allows domain-agnostic representation sharing (Yoon et al., 23 Sep 2025, Park et al., 1 Feb 2026, Yu et al., 31 Jul 2025).
  • Multi-scale and Multi-view Integration: Simultaneous encoding at multiple patch lengths (PatchAD, CPatchBLS, TransDe) or with multiple architectural paths (TriP-LLM: patch, selection, global branches) increases sensitivity to anomalies at different scales and types (Zhong et al., 2024, Li et al., 2024, Zhang et al., 19 Apr 2025, Yu et al., 31 Jul 2025).
  • Contrastive Objectives beyond Reconstruction: Asymmetric and symmetric KL, InfoNCE, dynamic hypersphere contrast, or dual-branch discrepancy serve as core principles for calibrating the latent space, often yielding higher anomaly discrimination than pure reconstruction (Carmona et al., 2021, Zhang et al., 19 Apr 2025, Zhong et al., 2024).
  • Integration with Foundation/LLM Models: Feeding patch tokens into frozen LLM or TFM backbones (TriP-LLM, MOMEMTO) preserves pretraining capacity and reduces downstream finetuning cost, and architectural innovations (e.g., gate fusion, decoder heads) are optimized for resource and performance constraints (Yu et al., 31 Jul 2025, Yoon et al., 23 Sep 2025).

7. Challenges and Outlook

Patch-based TSAD has matured into a paradigm balancing local context modeling with high-throughput inference. Ongoing challenges and themes include:

  • Scaling to ultra-long and high-dimensional series, which can cause combinatorial growth in patch count and memory usage; remedies include smarter patch selection, coreset memory banks, and fused branch aggregation (Park et al., 1 Feb 2026, Yu et al., 31 Jul 2025).
  • Avoiding over-generalization, especially in large foundation models; memory regularization, prototype gating, and contrastive calibration are prominent countermeasures (Yoon et al., 23 Sep 2025).
  • Unifying semi-supervised and unsupervised anomaly detection within patch-based frameworks, with variants such as contextual anomaly injection and hypersphere classifiers bridging the gap (Carmona et al., 2021).
  • Exploiting multi-domain and transfer learning capacity by joint patch-memory modules across related time series domains (Yoon et al., 23 Sep 2025).
  • Moving beyond MSE-based losses toward adaptive contrastive and discrepancy-based objectives that prioritize representational discriminability over pure reconstruction fidelity (Zhang et al., 19 Apr 2025, Zhong et al., 2024).

Theoretical analysis of patch receptive field, inductive bias, and statistical power remains open for further research. Practical deployment will benefit from continued integration with efficient foundation model backbones and from architectures explicitly optimized for resource constraints.
