
Outformer: Zero-Shot Tabular Outlier Detection

Updated 10 February 2026
  • Outformer is a zero-shot foundation model-based approach for tabular outlier detection, leveraging synthetic priors (GMMs, SCMs, copulas) to generalize across diverse anomaly scenarios.
  • It utilizes a 10-layer Transformer with in-context learning, enabling plug-and-play inference without labeled outliers, hyperparameter tuning, or task-specific model selection.
  • Its self-evolving curriculum training and context ensembling notably boost performance, achieving state-of-the-art AUROC across more than 1,500 benchmark tasks.

Outformer is a zero-shot, foundation model-based approach for tabular outlier detection (OD) that leverages a Transformer architecture with in-context learning and universal pretraining on a mixture of synthetic distributions. It is designed to enable plug-and-play OD without requiring labeled outliers, hyperparameter tuning, or task-specific model selection. By fitting a prior predictive distribution using a mixture of synthetic generative processes—including Gaussian mixture models, copula-based densities, and structural causal models—Outformer generalizes across diverse anomaly detection scenarios, achieving state-of-the-art performance on over 1,500 benchmark OD tasks (Ding et al., 3 Feb 2026).

1. Model Architecture and Input Encoding

Outformer is a Prior-Data Fitted Network (PFN) tailored to the tabular OD setting, with the following structural features:

  • Transformer Backbone: 10-layer Transformer encoder, each with cross-attention from queries (test points) to context (inlier) points. Context self-attention is optionally present but often zero-initialized or disabled to focus model capacity on query-context interactions.
  • Token Embeddings: Every $d$-dimensional row $x$ is mapped to a learned $H$-dimensional embedding (row token), with $H = 512$. When $d < 100$, $x$ is scaled by $100/d$ and zero-padded up to 100 features; when $d > 100$, 100 features are randomly subsampled.
  • Feed-forward Layers and Output Head: Each Transformer block contains a feed-forward MLP with GELU activation and residual connections; after the final block, a two-layer MLP (dimensions $H \to H \to 2$) with ReLU activations outputs logits for inlier vs. outlier, followed by a softmax.
  • Inference Structure: At inference, unlabeled "training" data (inliers) serve as context for scoring queries. Up to 1,000 context rows and 100 features are subsampled, and test predictions are ensembled over 50 bagged contexts.
  • Parameter Count: Approximately 45 million trainable parameters.

This architecture enables Outformer to perform in-context anomaly scoring in a purely zero-shot manner.
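The fixed 100-feature input convention described above can be sketched as follows. This is a hypothetical illustration, assuming a simple scale-pad-or-subsample helper; the function name and padding order are assumptions, not the authors' code:

```python
import numpy as np

def preprocess_rows(X, target_dim=100, seed=0):
    """Sketch of Outformer-style feature handling: scale-and-pad when
    d < target_dim, random feature subsampling when d > target_dim."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    if d < target_dim:
        X_scaled = X * (target_dim / d)            # rescale by 100/d
        pad = np.zeros((n, target_dim - d))
        return np.hstack([X_scaled, pad])          # zero-pad to 100 features
    if d > target_dim:
        idx = rng.choice(d, size=target_dim, replace=False)
        return X[:, idx]                           # random feature subsample
    return X
```

The scaling by $100/d$ keeps the total input magnitude roughly comparable across dimensionalities before padding.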

2. Synthetic Prior Mixture Pretraining

Outformer’s pretraining corpus is generated using a controlled mixture of five synthetic priors, each yielding labeled tabular data with known inlier/outlier status. The overall joint is

$$p(x, y) = \sum_{p=1}^{P} \pi_p\, p_p(x, y), \qquad \pi_p = 1/P, \quad P = 5,$$

where each $p_p(x, y)$ encapsulates both inlier and outlier mechanisms. The synthetic prior families include:

  • Gaussian Mixture Models (GMMs): Inliers sampled from a mixture of up to five Gaussian components, with random means and diagonal covariances; correlated coordinates are induced via random linear transforms. Outliers arise from covariance inflation and Mahalanobis thresholding.
  • Structural Causal Models (SCMs): Directed acyclic graphs (DAGs) with nodes as continuous variables, realized via pruned MLPs; outliers arise from noise inflation (measurement outliers) or topological perturbation (structural outliers).
  • Copula-Based Distributions: By Sklar’s theorem, marginals are sampled from various base distributions (Gaussian, Beta, Exponential, Student-t, power law, log-logistic), and dependencies are structured using Gaussian or vine copulas. Outliers are generated by distorting marginal ranks (probabilistic outliers) or shuffling dependency structure (dependence outliers).

Training minimizes the expected cross-entropy between the model $q_\theta(y \mid x, D)$ and the true conditional $p(y \mid x, D)$:

$$L = \mathbb{E}_{(D \cup \{(x_i, y_i)\}) \sim p(D)} \left[ -\log q_\theta(y_i \mid x_i, D) \right].$$

This synthetic corpus enables Outformer to form a universal prior over anomaly detection tasks, facilitating robust transfer.
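As one concrete instance of these generative processes, a GMM prior task with covariance-inflated outliers can be sketched as below. Component counts, the inflation factor, and the omission of the Mahalanobis filtering step are illustrative simplifications, not the paper's exact generator:

```python
import numpy as np

def sample_gmm_task(n_in=200, n_out=20, d=5, k=3, inflate=9.0, seed=0):
    """Sketch of a GMM synthetic prior: inliers from a random
    diagonal-covariance mixture; outliers from the same components
    with inflated covariance."""
    rng = np.random.default_rng(seed)
    means = rng.normal(0.0, 2.0, size=(k, d))
    scales = rng.uniform(0.5, 1.5, size=(k, d))    # diagonal std-devs
    c_in = rng.integers(k, size=n_in)
    X_in = means[c_in] + rng.normal(size=(n_in, d)) * scales[c_in]
    c_out = rng.integers(k, size=n_out)
    X_out = means[c_out] + rng.normal(size=(n_out, d)) * scales[c_out] * np.sqrt(inflate)
    X = np.vstack([X_in, X_out])
    y = np.concatenate([np.zeros(n_in), np.ones(n_out)])   # 1 = outlier
    return X, y
```

Pretraining batches would interleave tasks like this with SCM- and copula-generated tasks under the uniform mixture weights above.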

3. Self-Evolving Curriculum Training

The diversity of the synthetic priors necessitates an adaptive training protocol. Outformer uses a self-evolving curriculum (SEC) organized as a non-stationary multi-armed bandit:

  • Arms: Each reflects a unique combination of prior type $\phi_p$ and feature-dimensionality bin $B_b$ ($P = 5$ priors and $K = 5$ bins, so $C = 25$ arms).
  • Sampling: At each step, minibatches are allocated by sampling arms via $\mathrm{Softmax}(Q_{t-1}(c)/\tau)$ (temperature $\tau \approx 0.5$).
  • Reward Signal: For each arm, the reward is the variance of the cross-entropy loss across datasets in that category, steering training toward tasks that are neither trivial nor hopelessly hard:

$$r_t(c) = \frac{1}{|D_c|} \sum_{i \in D_c} \left( l_{i,t} - \operatorname{mean}_{j \in D_c} l_{j,t} \right)^2.$$

  • Exponential Moving Average: Arm scores are updated as $Q_t(c) = \gamma r_t(c) + (1 - \gamma) Q_{t-1}(c)$ with $\gamma \approx 0.1$.
  • Pacing: Early in training, only the easiest $g_{a,b}(t)$ fraction of points, as sorted by loss, is backpropagated; the pacing schedule grows $g_{a,b}(t)$ from $N_b$ to $N$.

SEC measurably improves generalization, e.g., boosting AUROC on ADBench from $\approx 0.920$ (naïve mixed-prior training) to $\approx 0.926$, and on the SynBench GMM-inlier tasks from $0.873$ to $0.930$.
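The bandit mechanics described above (softmax arm sampling, variance reward, EMA score update) can be sketched as follows; the function names and interfaces are assumptions for illustration:

```python
import numpy as np

def sample_arm(Q, tau=0.5, rng=None):
    """Softmax sampling over arm scores Q (one arm per prior x dim-bin)."""
    rng = rng if rng is not None else np.random.default_rng()
    z = Q / tau
    p = np.exp(z - z.max())                        # numerically stable softmax
    p /= p.sum()
    return rng.choice(len(Q), p=p)

def update_arm(Q, c, losses, gamma=0.1):
    """Reward = within-category loss variance; EMA update of arm c's score."""
    r = np.var(losses)
    Q[c] = gamma * r + (1.0 - gamma) * Q[c]
    return Q
```

Using loss variance as reward (rather than mean loss) de-emphasizes arms where all tasks are uniformly easy or uniformly hard, since either extreme yields low variance.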

4. Zero-Shot Inference via In-Context Learning

At deployment, Outformer is frozen and no task-specific fine-tuning is required. Unlabeled training data are passed as context $C$; test points $x$ are treated as queries. The input tensor consists of context tokens and query tokens, concatenated along the Transformer sequence dimension. Only the queries attend to the context.

Scoring for a test example is

$$q_\theta(y = {+}1 \mid x, C),$$

with the softmaxed output interpreted as the anomaly probability for $x$. This method requires neither labeled outliers nor model retraining, realizing a true plug-and-play, universal OD system. Ensemble averaging over 50 random contexts further stabilizes predictions.
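Context bagging at inference can be sketched as below, with `model_fn` standing in for the frozen Transformer's softmax output $q_\theta(y = {+}1 \mid x, C)$; the wrapper name and interface are assumptions:

```python
import numpy as np

def zero_shot_scores(model_fn, X_train, X_test, n_ensembles=50,
                     max_context=1000, seed=0):
    """Average anomaly probabilities over bagged context subsamples.
    model_fn(context, queries) -> per-query scores from the frozen model."""
    rng = np.random.default_rng(seed)
    scores = np.zeros(len(X_test))
    for _ in range(n_ensembles):
        m = min(max_context, len(X_train))
        idx = rng.choice(len(X_train), size=m, replace=False)
        scores += model_fn(X_train[idx], X_test)   # p(y = +1 | x, C)
    return scores / n_ensembles
```

Averaging over independently subsampled contexts reduces the variance introduced by the row and feature subsampling described in Section 1.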

5. Experimental Protocol and Evaluation

Outformer is validated across three primary real-world benchmarks and one synthetic evaluation:

| Benchmark | Datasets (count) | Contamination $r$ | Source |
|-----------|------------------|-------------------|--------|
| ADBench | 57 | $[0.01, 0.2]$ | Numeric-only OD |
| OddBench | 690 | – | Semantic/tabular, mined via keywords ("fraud," etc.) |
| OvRBench | 756 | $[0.05, 0.2]$ | "One-vs-rest" anomalies from classification corpora |
| SynBench | 800 | Synthetic (various) | In-distribution (ID) tasks |

Metrics include per-dataset AUROC and AUPRC, with aggregate reporting by average rank, Elo rating, winrate, rescaled AUC (rAUC), and champion delta ($C_\Delta$). Notably:

  • Outformer achieves AUROC $\approx 0.926$ on ADBench, outperforming DTE-NP, TabPFN-OD, kNN, IForest, and all deep baselines.
  • Aggregate results across all 1,500+ datasets demonstrate competitive or superior performance: average rank $\approx 3.55$ (best), Elo $\approx 1122$, winrate $\approx 0.59$, rAUC $\approx 0.935$, $C_\Delta \approx 0.23$.
  • On SynBench (ID), AUROC is $\approx 0.994$ when SEC is active.
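For reference, the AUROC metric used throughout can be computed pairwise; this is a generic textbook (Mann-Whitney) implementation, not the paper's evaluation code:

```python
import numpy as np

def auroc(scores, labels):
    """Pairwise AUROC: the probability that a random outlier scores
    above a random inlier, counting ties as 0.5."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]                      # outliers
    neg = scores[labels == 0]                      # inliers
    diff = pos[:, None] - neg[None, :]             # all outlier-inlier pairs
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())
```

A perfect detector that ranks every outlier above every inlier scores 1.0; random scoring yields 0.5 in expectation.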

6. Ablation Studies and Analysis

A series of ablations clarify Outformer's critical components:

  • Mixture versus GMM-only: GMM-only training supports Gaussian-like OD transfer but fails outside its domain; mixed-prior training without SEC degrades GMM and ADBench performance, recoverable via SEC.
  • SEC Removal: Removing SEC reduces winrate by $\sim 5\%$ and rAUC by $\sim 1.3$ points.
  • Ensembling: Disabling context bagging decreases rAUC from $\approx 0.935$ to $\approx 0.917$.
  • Hyperparameters: Curriculum temperatures $\tau \in [0.3, 0.5]$ perform best; binary rewards or mis-set temperatures degrade outcomes.
  • Synthetic Prior Removal: Removing GMMs yields the largest ADBench rAUC drop ($0.986 \to 0.929$); deleting the copula priors also harms transfer. SCMs have less impact on SynBench but remain relevant for real data.

7. Limitations and Directions for Further Research

Outformer's current instantiation is limited to tabular data with continuous features; categorical or mixed-type data require extending the priors. Potential improvements include 2D (feature-to-feature) attention, mixture-of-experts backbones, and context-optimization strategies beyond simple subsampling. These directions would broaden applicability and further reinforce plug-and-play deployment (Ding et al., 3 Feb 2026).
