Percentile-Based Anomaly Scoring
- Percentile-based anomaly scoring is a nonparametric method that computes anomaly scores from empirical ranks and p-values measured against a nominal distribution.
- It employs techniques like K-NN distances, graph degree statistics, and learning-to-rank surrogates to achieve efficient computation and precise false-alarm control.
- The approach offers strong theoretical guarantees such as asymptotic consistency and UMP properties, making it effective for applications like fraud detection and crowd analysis.
Percentile-based anomaly scoring is a nonparametric framework for anomaly detection in high-dimensional data, centered on empirical ranks or percentile statistics derived from nominal data distributions. It offers precise control over the false-alarm (type I error) rate by thresholding anomaly scores at specified percentiles, and underlies several scalable, theoretically optimal algorithms in the modern anomaly detection literature. Its principal instantiations, most notably nearest-neighbor-based empirical p-values and their learned ranking-function surrogates, combine nonparametric density adaptivity with computational efficiency and rigorous statistical guarantees (Qian et al., 2014, 0910.5461, Root et al., 2016, Qian et al., 2015).
1. Formal Problem Definition and Core Statistical Principle
Percentile-based anomaly detection methods consider a nominal training set $\{x_1, \dots, x_n\}$ sampled i.i.d. from an unknown density $f_0$ on $\mathbb{R}^d$. For a desired false-alarm level $\alpha \in (0, 1)$, the canonical objective is to construct a test that, for a query point $\eta$, declares $\eta$ anomalous if its estimated $p$-value satisfies $\hat{p}(\eta) \le \alpha$, where

$$p(\eta) = \mathbb{P}_{x \sim f_0}\big(f_0(x) \le f_0(\eta)\big) = \int_{\{x \,:\, f_0(x) \le f_0(\eta)\}} f_0(x)\, dx.$$

This formulation implies that the acceptance region, i.e. the minimum-volume set capturing nominal probability mass $1-\alpha$, is

$$U_{1-\alpha} = \operatorname*{arg\,min}_{C} \Big\{ \lambda(C) \,:\, \int_C f_0(x)\, dx \ge 1-\alpha \Big\},$$

where $\lambda$ denotes volume (Lebesgue measure). Anomalies are thus samples whose empirical $p$-values fall within the $\alpha$-percentile (lowest-density) region (Qian et al., 2014).
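When the nominal density is known in closed form, the $p$-value above can be approximated directly by Monte Carlo, which makes the definition concrete. A minimal sketch, assuming a standard normal nominal density purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_value_mc(eta, f0, sampler, m=100_000):
    """Monte Carlo estimate of p(eta) = P_{x ~ f0}( f0(x) <= f0(eta) )."""
    x = sampler(m)
    return np.mean(f0(x) <= f0(eta))

# Illustrative nominal density: standard normal.
f0 = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
sampler = lambda m: rng.standard_normal(m)

print(p_value_mc(0.0, f0, sampler))  # near the mode: p close to 1
print(p_value_mc(3.0, f0, sampler))  # deep in the tail: p well below typical alpha
```

Thresholding this estimate at $\alpha$ reproduces the minimum-volume acceptance region for the chosen density.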
2. Methodologies: Empirical Percentile Estimation from Nearest-Neighbor Graphs
Operationally, percentile-based anomaly scoring techniques estimate $p(\eta)$ or its surrogates via nonparametric statistics derived from the nominal set. The most prominent constructions utilize:
- $K$-Nearest-Neighbor (K-NN) Distance Statistics: For each nominal sample $x_i$, compute $d_K(x_i)$, the distance to its $K$th nearest neighbor (or the average of its $K$ nearest-neighbor distances). Define a local score $G(x_i) = -d_K(x_i)$. The empirical percentile rank is
$$\hat{r}(x_i) = \frac{1}{n} \sum_{j=1}^{n} \mathbf{1}\{G(x_j) \le G(x_i)\}.$$
For a test point $\eta$, $\hat{r}(\eta) = \frac{1}{n} \sum_{j=1}^{n} \mathbf{1}\{G(x_j) \le G(\eta)\}$ serves as its percentile-based anomaly score. Under mild regularity and $K \to \infty$ with $K/n \to 0$, $\hat{r}(\eta) \to p(\eta)$ almost surely (Qian et al., 2014, 0910.5461).
- $\varepsilon$-Neighborhood Graphs and Degree Statistics: Alternatively, use the degree $d_\varepsilon(x_i)$ of node $x_i$ in an $\varepsilon$-neighborhood graph over the nominal set. For a query $\eta$, define
$$\hat{p}(\eta) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{d_\varepsilon(x_i) \le d_\varepsilon(\eta)\}.$$
These statistics empirically estimate the $p$-values, which are uniformly distributed under the null (0910.5461, Root et al., 2016).
- Velocity-based Empirical Percentiles: In structured domains (e.g., crowd motion analysis), per-sample features such as normalized velocities are clustered, and empirical percentiles within semantic classes are used as thresholds for anomaly scores (AlGhamdi et al., 21 Oct 2025).
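The K-NN construction above reduces to a few lines of array code. A sketch under illustrative assumptions (2-D Gaussian nominal data, K = 10, self-distances left in for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)

def knn_score(X, queries, K):
    """G(q) = -(mean distance from q to its K nearest nominal points)."""
    D = np.linalg.norm(queries[:, None, :] - X[None, :, :], axis=-1)
    D.sort(axis=1)
    return -D[:, :K].mean(axis=1)

def empirical_rank(G_nominal, g):
    """r(q) = (1/n) * #{ j : G(x_j) <= g }, the empirical percentile."""
    return np.searchsorted(np.sort(G_nominal), g, side="right") / len(G_nominal)

# Nominal data: 2-D Gaussian blob (illustrative); K = 10 chosen arbitrarily.
X = rng.standard_normal((500, 2))
G_nom = knn_score(X, X, K=10)   # self-distance included; negligible for a sketch
g_out = knn_score(X, np.array([[6.0, 6.0]]), K=10)[0]   # far outlier
g_in = knn_score(X, np.array([[0.0, 0.0]]), K=10)[0]    # dense region
print(empirical_rank(G_nom, g_out), empirical_rank(G_nom, g_in))
```

The outlier receives a rank near 0 (well inside any usual $\alpha$-percentile region), while the point in the dense core ranks above most of the nominal sample.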
3. Learning-to-Rank Approximations and Rank-SVM
Given the computational inefficiency of direct $K$-NN-based scoring at test time, percentile-based anomaly detectors frequently employ max-margin learning-to-rank surrogates in a reproducing kernel Hilbert space (RKHS). The process involves:
- Quantizing empirical percentiles $\hat{r}(x_i)$ into discrete levels $r_q(x_i)$ and forming preference pairs:
$$\mathcal{P} = \{(i, j) \,:\, r_q(x_i) > r_q(x_j)\}.$$
- Training a Rank-SVM (pairwise hinge loss minimization) over an RKHS $\mathcal{H}$:
$$\min_{g \in \mathcal{H}} \; \frac{1}{2}\|g\|_{\mathcal{H}}^2 + C \sum_{(i,j) \in \mathcal{P}} \max\big\{0,\; 1 - \big(g(x_i) - g(x_j)\big)\big\}.$$
The learned function $g$ serves as a fast test-time surrogate for the empirical anomaly score. For a test point $\eta$, the surrogate empirical $p$-value is
$$\hat{r}_g(\eta) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{g(x_i) \le g(\eta)\}.$$
An anomaly is declared if $\hat{r}_g(\eta) \le \alpha$ (Qian et al., 2014, Root et al., 2016, Qian et al., 2015).
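A kernelized Rank-SVM solver is beyond a short sketch, but the pairwise hinge objective can be illustrated with a linear scoring function trained by subgradient descent; the toy data, the hidden utility direction `u`, and all hyperparameters below are illustrative assumptions, not the papers' setup:

```python
import numpy as np

rng = np.random.default_rng(2)

def train_rank_svm_linear(X, levels, C=10.0, lr=0.1, epochs=300):
    """Subgradient descent on a *linear* pairwise hinge loss, a stand-in
    for the kernelised Rank-SVM of the papers:
        0.5*||w||^2 + C * mean_{(i,j)} max(0, 1 - w.(x_i - x_j))
    over preference pairs (i, j) with levels[i] > levels[j]."""
    n = len(X)
    pairs = np.array([(i, j) for i in range(n) for j in range(n)
                      if levels[i] > levels[j]])
    diffs = X[pairs[:, 0]] - X[pairs[:, 1]]        # x_i - x_j for each pair
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        active = diffs @ w < 1.0                   # pairs with active hinge
        hinge_grad = (diffs * active[:, None]).sum(axis=0) / len(diffs)
        w -= lr * (w - C * hinge_grad)             # regulariser + hinge terms
    return w

# Toy ranking problem: quantile levels along a hidden direction u.
X = rng.standard_normal((80, 2))
u = np.array([2.0, -1.0])                          # hidden utility (illustrative)
levels = np.digitize(X @ u, np.quantile(X @ u, [0.25, 0.5, 0.75]))
w = train_rank_svm_linear(X, levels)

s = X @ w                                          # learned scores
agree = np.mean([s[i] > s[j] for i in range(80) for j in range(80)
                 if levels[i] > levels[j]])
print(f"pairwise agreement with preferences: {agree:.3f}")
```

The learned scorer reproduces nearly all of the quantized preference orderings, which is the property the surrogate empirical $p$-value relies on.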
4. Theoretical Guarantees: Optimality, Consistency, and False-Alarm Control
Percentile-based anomaly scoring methods possess rigorous statistical properties:
- Asymptotic Consistency: For large $n$ and under weak regularity conditions, the empirical percentile functions and their learned surrogates converge almost surely to the true $p$-value function $p(\cdot)$. The decision region converges to the $(1-\alpha)$-level minimum-volume set (Qian et al., 2014, 0910.5461, Root et al., 2016, Qian et al., 2015).
- Uniformly Most Powerful (UMP): The test $\hat{p}(\eta) \le \alpha$ is asymptotically UMP of level $\alpha$ when anomalies follow a mixture of the nominal density and a uniform component (0910.5461).
- Explicit False-Alarm Rate Control: The empirical false-alarm (type I error) rate coincides with the nominal threshold: $\mathbb{P}(\hat{p}(\eta) \le \alpha \mid \eta \sim f_0) \to \alpha$. This property is robust across synthetic and real-world datasets (Qian et al., 2014, 0910.5461, Root et al., 2016).
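The false-alarm guarantee can be checked empirically: score fresh nominal points against a nominal training set and measure the fraction flagged at level $\alpha$. A sketch assuming the K-NN percentile score of §2, with illustrative sample sizes:

```python
import numpy as np

rng = np.random.default_rng(3)

def knn_score(X, queries, K, skip_self=False):
    """G(q) = -(mean distance to the K nearest nominal points)."""
    D = np.linalg.norm(queries[:, None, :] - X[None, :, :], axis=-1)
    D.sort(axis=1)
    cols = slice(1, K + 1) if skip_self else slice(0, K)  # drop self-distance 0
    return -D[:, cols].mean(axis=1)

# Nominal training set plus fresh nominal test points from the same density.
X_train = rng.standard_normal((400, 2))
X_test = rng.standard_normal((2000, 2))

G_nom = np.sort(knn_score(X_train, X_train, K=8, skip_self=True))
g_test = knn_score(X_train, X_test, K=8)
ranks = np.searchsorted(G_nom, g_test, side="right") / len(G_nom)

alpha = 0.05
fa_rate = np.mean(ranks <= alpha)   # fraction of nominal points flagged
print(f"empirical false-alarm rate at alpha={alpha}: {fa_rate:.3f}")
```

Up to Monte Carlo and finite-sample fluctuation, the flagged fraction tracks the nominal level $\alpha$.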
5. Algorithmic and Computational Considerations
Training and Testing Complexity
| Method | Training | Test Time (per point) |
|---|---|---|
| $K$-NN percentile, no learning | $O(n^2)$ distance computations for nominal ranks | $O(n)$ distances to the nominal set |
| Rank-SVM approximation | rank computation plus Rank-SVM optimization ($T_{\mathrm{SVM}}$) | $O(s)$ kernel evaluations |

- $s$ is the number of support pairs in the learned function; $T_{\mathrm{SVM}}$ is the optimizer cost for Rank-SVM (Qian et al., 2014, Root et al., 2016).
- Pure $K$-NN approaches are computationally prohibitive for large $n$ at test time, while the Rank-SVM/scoring function surrogates substantially reduce per-query complexity (often by orders of magnitude).
Pseudocode Illustration
A canonical training-testing cycle using Rank-SVM is:
```
For each x_i in X:
    Compute G(x_i) = -(1/K) * sum_{k=1}^K distance(x_i, k-th neighbor)
    Compute empirical rank r(x_i) = (1/n) * sum_j 1{G(x_j) <= G(x_i)}
Quantize r(x_i); form preference pairs (i, j) with r_q(x_i) > r_q(x_j)
Train Rank-SVM over preferences to obtain scoring function g
Sort {g(x_i) : i = 1..n}
For each test point η:
    Compute score s = g(η)
    Estimate r(η) = (1/n) * sum_i 1{g(x_i) <= s}
    Declare anomaly if r(η) <= α
```
6. Empirical Performance and Domain Applications
Percentile-based anomaly scoring approaches have been systematically evaluated on both synthetic and real benchmark datasets, as well as structured spatiotemporal domains:
- Synthetic Two-Component Gaussian Mixtures: Level-set contours induced by percentile scoring closely match the ground truth; AUC scores approach Bayes-optimality (Qian et al., 2014, Qian et al., 2015).
- Real-World Benchmarks: On datasets including Shuttle, HTTP, SMTP, Banknote Authentication, Magic Gamma, and Forest CoverType, RankAD and analogous methods achieved the highest or near-highest AUC among the tested baselines. Test time is significantly lower than that of BP-KNNG and aK-LPE and competitive with one-class SVM, while the methods statistically outperform Isolation Forest and MassAD (Qian et al., 2014, Qian et al., 2015, Root et al., 2016).
- Crowd Anomaly Detection: In dense-crowd video, empirical velocity distributions are used with percentile-based anomaly scoring to detect abnormal motion in real-time, with high precision and under 5% false positives on challenging scenes (AlGhamdi et al., 21 Oct 2025).
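The per-class percentile thresholding described for crowd analysis can be illustrated as follows; the class labels, speed distributions, and the 95th-percentile level are assumptions for the sketch, not details taken from the cited paper:

```python
import numpy as np

rng = np.random.default_rng(4)

def per_class_thresholds(speeds, labels, q=95):
    """Empirical q-th percentile of speed within each semantic class."""
    return {c: np.percentile(speeds[labels == c], q) for c in np.unique(labels)}

# Toy data: two motion classes with different typical speeds (m/s).
speeds = np.concatenate([rng.normal(1.4, 0.2, 500),    # "walk" class
                         rng.normal(3.5, 0.5, 500)])   # "run" class
labels = np.array(["walk"] * 500 + ["run"] * 500)
thr = per_class_thresholds(speeds, labels, q=95)

def is_anomalous(speed, label):
    """Flag a speed that exceeds its own class's empirical percentile."""
    return speed > thr[label]

print(is_anomalous(2.5, "walk"), is_anomalous(2.5, "run"))
```

The same speed can be anomalous for one class and normal for another, which is the point of conditioning the percentile on the semantic class.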
7. Adaptivity, Parameterization, and Methodological Scope
Percentile-based approaches inherit several desirable properties:
- Adaptivity to Local Structure: By operating on local density surrogates (K-NN distances, graph degrees), the tests naturally adapt to variations in data geometry, manifold structure, and intrinsic dimensionality without tuning global density estimators (0910.5461, Root et al., 2016).
- Minimal Parameter Burden: The core parameters are $K$ (number of nearest neighbors) or $\varepsilon$ (graph radius), which can be set via standard heuristics. No explicit density model or bandwidth selection is needed.
- Interpretable and Generalizable: The percentile threshold $\alpha$ directly reflects the desired error rate. Distributional assumptions are minimal, yielding applicability across heterogeneous domains, including high-dimensional tabular data and structured motion analysis (AlGhamdi et al., 21 Oct 2025).
References
- "A Rank-SVM Approach to Anomaly Detection" (Qian et al., 2014)
- "Anomaly Detection with Score functions based on Nearest Neighbor Graphs" (0910.5461)
- "Learning Minimum Volume Sets and Anomaly Detectors from KNN Graphs" (Root et al., 2016)
- "Learning Efficient Anomaly Detectors from K-NN Graphs" (Qian et al., 2015)
- "VelocityNet: Real-Time Crowd Anomaly Detection via Person-Specific Velocity Analysis" (AlGhamdi et al., 21 Oct 2025)