
Percentile-Based Anomaly Scoring

Updated 4 February 2026
  • Percentile-based anomaly scoring is a nonparametric method that computes anomaly scores from empirical ranks and p-values measured against a nominal distribution.
  • It employs techniques like K-NN distances, graph degree statistics, and learning-to-rank surrogates to achieve efficient computation and precise false-alarm control.
  • The approach offers strong theoretical guarantees such as asymptotic consistency and UMP properties, making it effective for applications like fraud detection and crowd analysis.

Percentile-based anomaly scoring is a nonparametric framework for anomaly detection in high-dimensional data, centered on empirical ranks or percentile statistics derived from nominal data distributions. It offers precise control over the false-alarm (type I error) rate by thresholding anomaly scores at specified percentiles, and it underlies several scalable, theoretically optimal algorithms in the modern anomaly detection literature. Its principal instantiations, most notably nearest-neighbor-based empirical $p$-values and their learned ranking-function surrogates, combine nonparametric density adaptivity with computational efficiency and rigorous statistical guarantees (Qian et al., 2014, Root et al., 2016, Qian et al., 2015).

1. Formal Problem Definition and Core Statistical Principle

Percentile-based anomaly detection methods consider a nominal training set $X = \{x_i\}_{i=1}^n \subset \mathbb{R}^d$ sampled i.i.d. from an unknown density $f$. For a desired false-alarm level $\alpha \in (0,1)$, the canonical objective is to construct a test that, for any query $\eta \in \mathbb{R}^d$, declares $\eta$ anomalous if its estimated $p$-value satisfies $p(\eta) \leq \alpha$, where

$$p(\eta) = P\{x : f(x) \leq f(\eta)\} = \int_{\{x : f(x) \leq f(\eta)\}} f(x)\, dx$$

This formulation implies that the acceptance region, the minimum-volume set carrying probability mass $1-\alpha$, is

$$U_{1-\alpha} = \{x : p(x) \geq \alpha\}$$

Anomalies are thus samples whose empirical $p$-values fall within the lowest $\alpha$-percentile (lowest-density) region (Qian et al., 2014).
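
To make the definition concrete, the following sketch (our illustration, not taken from the cited papers) estimates $p(\eta)$ by Monte Carlo for a one-dimensional standard normal, where $f(x) \leq f(\eta)$ reduces to $|x| \geq |\eta|$:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_pdf(x):
    # Standard normal density in one dimension.
    return np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)

def mc_p_value(eta, n=200_000):
    # Monte Carlo estimate of p(eta) = P{x : f(x) <= f(eta)}.
    x = rng.standard_normal(n)
    return float(np.mean(gaussian_pdf(x) <= gaussian_pdf(eta)))

# For the standard normal, f(x) <= f(eta) iff |x| >= |eta|, so the
# exact value is p(1.96) = 2 * (1 - Phi(1.96)), i.e. about 0.05.
print(mc_p_value(1.96))
```

Under the null, $p(\eta)$ is uniform on $[0, 1]$, which is what makes thresholding at $\alpha$ control the false-alarm rate.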

2. Methodologies: Empirical Percentile Estimation from Nearest-Neighbor Graphs

Operationally, percentile-based anomaly scoring techniques estimate $p(\eta)$ or its surrogates via nonparametric statistics derived from the nominal set. The most prominent constructions utilize:

  • $K$-Nearest-Neighbor (K-NN) Distance Statistics: For each $x \in X$, compute $D_{(k)}(x)$, the distance from $x$ to its $k$th nearest neighbor. Define a local score $G(x) = -\frac{1}{K} \sum_{k=1}^K D_{(k)}(x)$. The empirical percentile rank is

$$r(x) = \frac{1}{n} \sum_{j=1}^n \mathbf{1}\{G(x_j) \leq G(x)\}$$

For a test point $\eta$, $r(\eta)$ serves as its percentile-based anomaly score. Under mild regularity conditions and as $n \rightarrow \infty$, $r(x) \rightarrow p(x)$ (Qian et al., 2014, 0910.5461).

  • $K$-NN Graphs and Degree Statistics ($\epsilon$-neighborhoods): Alternatively, use the degree $N_S(x)$ of node $x$ in an $\epsilon$-neighborhood graph. For a query $\eta$, define

$$\hat{p}_\epsilon(\eta) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{N_S(\eta) \geq N_S(x_i)\}$$

These statistics empirically estimate the $p$-values, which are uniformly distributed under the null (0910.5461, Root et al., 2016).

  • Velocity-based Empirical Percentiles: In structured domains (e.g., crowd motion analysis), per-sample features such as normalized velocities are clustered, and empirical percentiles within semantic classes are used as thresholds for anomaly scores (AlGhamdi et al., 21 Oct 2025).
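
As a concrete illustration of the $K$-NN construction above, here is a minimal NumPy sketch using naive $O(n^2)$ pairwise distances; the function and variable names are ours, not from the cited papers:

```python
import numpy as np

def knn_score(X, Q, K, skip_self=False):
    # G(x): negative mean distance from each row of Q to its K nearest
    # points in X; skip_self drops the zero self-distance when Q is X.
    d = np.linalg.norm(Q[:, None, :] - X[None, :, :], axis=2)
    d.sort(axis=1)
    s = 1 if skip_self else 0
    return -d[:, s:s + K].mean(axis=1)

def empirical_rank(G_train, G_query):
    # r(eta) = fraction of nominal scores G(x_j) that are <= G(eta).
    return np.mean(G_train[None, :] <= G_query[:, None], axis=1)

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 2))                 # nominal sample
G_train = knn_score(X, X, K=10, skip_self=True)   # leave-one-out scores
Q = np.array([[0.0, 0.0], [4.0, 4.0]])            # typical vs. outlying query
r = empirical_rank(G_train, knn_score(X, Q, K=10))
print(r)  # the outlying query receives a rank near 0
```

Thresholding `r` at $\alpha$ then yields the detector; a production implementation would replace the brute-force distance matrix with a spatial index.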

3. Learning-to-Rank Approximations and Rank-SVM

Given the computational inefficiency of direct $K$-NN-based scoring at test time, percentile-based anomaly detectors frequently employ max-margin learning-to-rank surrogates in a reproducing kernel Hilbert space (RKHS). The process involves:

  • Quantizing the empirical percentiles $r(x_i)$ into discrete levels $r_q(x_i)$ and forming preference pairs:

$$\mathcal{P} = \{(i,j) : r_q(x_i) > r_q(x_j)\}$$

  • Training a Rank-SVM (pairwise hinge loss minimization):

$$\min_{g \in H,\, \xi_{ij} \geq 0} \;\; \frac{1}{2} \|g\|_H^2 + C \sum_{(i,j) \in \mathcal{P}} \xi_{ij} \quad \text{s.t.} \quad g(x_i) - g(x_j) \geq 1 - \xi_{ij} \;\; \forall (i,j) \in \mathcal{P}$$

The learned function $g(\cdot)$ serves as a fast test-time surrogate for the empirical anomaly score. For a test point $\eta$, the surrogate empirical $p$-value is

$$r(\eta) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{g(x_i) \leq g(\eta)\}$$

An anomaly is declared if $r(\eta) \leq \alpha$ (Qian et al., 2014, Root et al., 2016, Qian et al., 2015).
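
The pairwise objective can be sketched with a linear scoring function $g(x) = w^\top x$ trained by subgradient descent on the hinge losses. This is an illustrative toy, not the cited solvers: the RKHS/kernel machinery is omitted, and all names are our own assumptions.

```python
import numpy as np

def train_linear_rank_svm(X, ranks, C=1.0, lr=0.01, epochs=100, seed=0):
    # Minimize 0.5*||w||^2 + C * sum_{(i,j) in P} max(0, 1 - w.(x_i - x_j))
    # over preference pairs P = {(i, j) : ranks[i] > ranks[j]}.
    rng = np.random.default_rng(seed)
    pairs = [(i, j) for i in range(len(X)) for j in range(len(X))
             if ranks[i] > ranks[j]]
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        rng.shuffle(pairs)
        for i, j in pairs:
            diff = X[i] - X[j]
            grad = w / len(pairs)        # regularizer, amortized per pair
            if w @ diff < 1.0:           # hinge is active for this pair
                grad = grad - C * diff
            w -= lr * grad
    return w

rng = np.random.default_rng(2)
X = rng.standard_normal((30, 2))
true_score = X @ np.array([2.0, -1.0])   # a linear ranking is achievable
ranks = true_score.argsort().argsort()   # dense ranks 0..n-1
w = train_linear_rank_svm(X, ranks)
g = X @ w
pairs = [(i, j) for i in range(30) for j in range(30) if ranks[i] > ranks[j]]
concordance = np.mean([g[i] > g[j] for i, j in pairs])
print(concordance)  # fraction of correctly ordered pairs, close to 1.0
```

Because evaluating $g$ no longer touches the training set's nearest-neighbor structure, test-time scoring reduces to one inner product plus a binary search into the sorted training scores.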

4. Theoretical Guarantees: Optimality, Consistency, and False-Alarm Control

Percentile-based anomaly scoring methods possess rigorous statistical properties:

  • Asymptotic Consistency: For large $n$ and under weak regularity conditions, the empirical percentile functions $r(x)$ and their learned surrogates converge almost surely to the true $p$-value function $p(x)$. The decision region $\{\eta : r(\eta) \leq \alpha\}$ converges to the complement of the minimum-volume set $U_{1-\alpha}$ (Qian et al., 2014, 0910.5461, Root et al., 2016, Qian et al., 2015).
  • Uniformly Most Powerful (UMP): The test $\phi_n(\eta) = \mathbf{1}\{r(\eta) \leq \alpha\}$ is asymptotically uniformly most powerful at level $\alpha$ for mixture alternatives of the form $f = (1-\pi)f_0 + \pi f_1$ (0910.5461).
  • Explicit False-Alarm Rate Control: The empirical false-alarm (type I error) rate closely tracks the nominal threshold: $P_{\eta \sim f_0}(r(\eta) \leq \alpha) \approx \alpha$. This property is robust across synthetic and real-world datasets (Qian et al., 2014, 0910.5461, Root et al., 2016).
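
The false-alarm property can be checked empirically. This small simulation (our sketch, not the papers' experiments) scores fresh nominal queries with the leave-one-out $K$-NN rank and measures how often $r(\eta) \leq \alpha$:

```python
import numpy as np

def knn_G(X, Q, K, skip_self=False):
    # Negative mean distance to the K nearest training points.
    d = np.linalg.norm(Q[:, None, :] - X[None, :, :], axis=2)
    d.sort(axis=1)
    s = 1 if skip_self else 0
    return -d[:, s:s + K].mean(axis=1)

rng = np.random.default_rng(3)
n, dim, K, alpha = 500, 2, 15, 0.05
X = rng.standard_normal((n, dim))            # nominal training data
Q = rng.standard_normal((2000, dim))         # fresh queries, also nominal
G_train = knn_G(X, X, K, skip_self=True)     # leave-one-out training scores
G_query = knn_G(X, Q, K)
r = np.mean(G_train[None, :] <= G_query[:, None], axis=1)
false_alarm_rate = float(np.mean(r <= alpha))
print(false_alarm_rate)  # hovers around the nominal alpha = 0.05
```

No density model was fit anywhere; the rate control comes purely from the rank statistics being approximately uniform under the null.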

5. Algorithmic and Computational Considerations

Training and Testing Complexity

| Method | Training | Test time (per query) |
| --- | --- | --- |
| $K$-NN percentile (no learning) | $O(n^2 d + n^2)$ | $O(nd)$ |
| Rank-SVM approximation | $O(n^2 d + n^2 + T)$ | $O(s_g d + \log n)$ |

  • $s_g$ is the number of support pairs in the learned function; $T$ is the optimizer cost for Rank-SVM (Qian et al., 2014, Root et al., 2016).
  • Pure $K$-NN approaches are computationally prohibitive at test time for large $n$, while the Rank-SVM scoring-function surrogate reduces per-query complexity substantially, often by orders of magnitude.

Pseudocode Illustration

A canonical training-testing cycle using Rank-SVM is:

For each x_i in X:
    Compute G(x_i) = -(1/K) * sum_{k=1}^K distance(x_i, its k-th nearest neighbor)
    Compute empirical rank r(x_i) = (1/n) * sum_j 1{G(x_j) <= G(x_i)}
Quantize r(x_i); form preference pairs (i, j) with r_q(x_i) > r_q(x_j)
Train Rank-SVM over preference pairs to obtain scoring function g
Sort {g(x_i) : i = 1..n}
For each test point η:
    Compute score s = g(η)
    Estimate r(η) = (1/n) * sum_i 1{g(x_i) <= s}
    Declare anomaly if r(η) <= α
(Qian et al., 2014)

6. Empirical Performance and Domain Applications

Percentile-based anomaly scoring approaches have been systematically evaluated on both synthetic and real benchmark datasets, as well as structured spatiotemporal domains:

  • Synthetic Two-Component Gaussian Mixtures: Level-set contours induced by percentile scoring closely match the ground truth; AUC scores approach Bayes-optimality (Qian et al., 2014, Qian et al., 2015).
  • Real-World Benchmarks: On datasets including Shuttle, HTTP, SMTP, Banknote Authentication, Magic Gamma, and Forest CoverType, RankAD and related methods achieved the highest or near-highest AUC among the tested baselines. Test time is significantly lower than that of BP-KNNG and aK-LPE and competitive with one-class SVM, while the methods statistically outperform Isolation Forest and MassAD (Qian et al., 2014, Qian et al., 2015, Root et al., 2016).
  • Crowd Anomaly Detection: In dense-crowd video, empirical velocity distributions are used with percentile-based anomaly scoring to detect abnormal motion in real-time, with high precision and under 5% false positives on challenging scenes (AlGhamdi et al., 21 Oct 2025).

7. Adaptivity, Parameterization, and Methodological Scope

Percentile-based approaches inherit several desirable properties:

  • Adaptivity to Local Structure: By operating on local density surrogates (K-NN distances, graph degrees), the tests naturally adapt to variations in data geometry, manifold structure, and intrinsic dimensionality without tuning global density estimators (0910.5461, Root et al., 2016).
  • Minimal Parameter Burden: Core parameters are $K$ (number of nearest neighbors) or $\epsilon$ (graph radius); these can be set via standard heuristics (e.g., $K \sim n^{2/5}$). No explicit density model or bandwidth selection is needed.
  • Interpretable and Generalizable: The percentile threshold directly reflects the desired error rate. Distribution assumptions are minimal, yielding applicability across heterogeneous domains, including high-dimensional tabular data and structured motion analysis (AlGhamdi et al., 21 Oct 2025).
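
For instance, the $K \sim n^{2/5}$ heuristic gives the following rough settings (our arithmetic, rounding to the nearest integer):

```python
# K ~ n^{2/5} heuristic for choosing the number of nearest neighbors.
for n in (100, 1_000, 10_000, 100_000):
    print(n, round(n ** 0.4))  # 100 -> 6, 1000 -> 16, 10000 -> 40, 100000 -> 100
```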
