Numenta Anomaly Benchmark (NAB)
- NAB is an open-source framework for evaluating real-time anomaly detection on streaming univariate time-series, emphasizing early detection and low false-alarm rates.
- It provides a curated dataset of 58 labeled time-series with adaptive anomaly windows and a stream-aware scoring system to reward prompt detections.
- The benchmark ensures reproducible evaluations through a standardized Detector API, threshold optimization routines, and comparative analyses across multiple application profiles.
The Numenta Anomaly Benchmark (NAB) is an open-source, community-driven framework for evaluating real-time anomaly detection algorithms on streaming, univariate time-series with a pronounced emphasis on early detection, low false-alarm rates, and adaptation to non-stationarity. Developed to address the lack of standardized, repeatable evaluation for streaming anomaly detectors, NAB comprises a publicly available, hand-labeled corpus of challenging real-world and synthetic time-series spanning multiple application domains—including IT metrics, industrial sensors, and finance—and provides a rigorously designed, streaming-aware scoring methodology that rewards prompt and robust detection without access to ground-truth in deployment settings (Lavin et al., 2015).
1. Benchmark Structure and Datasets
NAB consists of 58 labeled univariate time-series streams, totaling 365,551 records, with individual file lengths ranging from 1,000 to 22,000 records (Lavin et al., 2015). Data domains include cloud/server metrics (CPU, memory, network), industrial measurements, financial/economic series, social media, synthetic control charts, lab sensors, and controls with no anomalies. Each stream is partitioned into an initial "auto-calibration" (probationary) period (the first 15% of the series, ignored for scoring and used for hyperparameter calibration) and an evaluation period for anomaly detection (Ahmad et al., 2016).
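The probationary split can be sketched as below; the 750-record cap for long files follows the convention used in NAB's utilities, but the exact cap should be checked against the NAB version in use:

```python
import math

def probation_length(n_records: int, probation_percent: float = 0.15) -> int:
    """Number of leading records ignored for scoring (auto-calibration).

    NAB caps the probationary period for long files; the 5,000-record
    reference point here (yielding a cap of 750) mirrors the benchmark's
    convention but is an assumption, not a guaranteed constant.
    """
    return int(min(math.floor(probation_percent * n_records),
                   probation_percent * 5000))

# A 22,000-record file is capped; a 1,000-record file uses the full 15%.
print(probation_length(22000))  # 750
print(probation_length(1000))   # 150
```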
Anomalies are labeled through a consensus procedure and are annotated as temporal windows—intervals during which any timely detection is counted as a true positive (TP). Both point and temporal anomalies are included; new, sustained regimes of behavior are relabeled as “normal” to reflect evolving baselines (Lavin et al., 2015, Karami et al., 13 Oct 2025). The corpus is curated to challenge detectors with noisy, multiscale, non-stationary, and quasi-periodic changes—mimicking operational data-stream environments.
2. Scoring Methodology and Evaluation Principles
Traditional metrics like static precision and recall are inadequate for assessing streaming anomaly detectors, as they are insensitive to detection latency and cannot capture the trade-off between early alerting and alarm fatigue (Lavin et al., 2015). NAB introduces a stream-aware scoring system based on pre-defined anomaly windows and time-sensitive detection reward functions.
Anomaly Windows and Rewards
For each anomaly label, NAB defines a "window" of adaptive length proportional to series size and anomaly frequency. Any detection inside this window is a TP (only the first counts), detections outside are false positives (FPs), and missed windows are false negatives (FNs).
The core scoring function for a detection at time t inside a window [t_s, t_e] is

r(t) = (t_e − t) / (t_e − t_s),

implementing a linear decay in reward from 1 at window onset to 0 at the window end (Ahmad et al., 2016). Detections outside any window receive zero reward and are penalized by a fixed false-positive cost A_FP; each undetected window incurs a false-negative cost A_FN. The total raw score is

S_raw = Σ_w r(t_w) − A_FP · N_FP − A_FN · N_FN,

where t_w is the first detection inside window w, and N_FP and N_FN count out-of-window detections and missed windows, respectively.
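A minimal sketch of this windowed scoring, assuming a linear in-window reward decaying from 1 to 0 and per-profile FP/FN costs (the variable names and default costs are illustrative, not NAB's internal constants):

```python
def raw_score(windows, detections, a_fp=0.11, a_fn=1.0):
    """Windowed NAB-style raw score with a linear in-window reward.

    windows:    list of (t_start, t_end) anomaly windows
    detections: sorted list of detection timestamps
    a_fp, a_fn: false-positive / false-negative costs (profile weights;
                the defaults here are illustrative)
    """
    score = 0.0
    matched = [False] * len(windows)
    for t in detections:
        for i, (ts, te) in enumerate(windows):
            if ts <= t <= te:
                if not matched[i]:                    # only the first hit counts
                    score += (te - t) / (te - ts)     # linear decay: 1 -> 0
                    matched[i] = True
                break                                 # in-window repeats are free
        else:
            score -= a_fp                             # detection outside any window
    score -= a_fn * matched.count(False)              # each missed window is an FN
    return score
```

Feeding two windows, an early hit on the first, a redundant in-window repeat, and one stray detection yields a score of 1.0 − A_FP − A_FN: full reward for the prompt hit, one FP penalty, one FN penalty for the missed window.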
Application Profiles and Normalization
NAB operationalizes different deployment cost models via "application profiles," which weight FNs and FPs differently (e.g., Standard, Low-FP, Low-FN; the results table in Section 4 reports scores under each). The final NAB score normalizes a detector's raw score to a [0, 100] scale using the perfect (100) and null (0) detector references:

S_NAB = 100 × (S_raw − S_null) / (S_perfect − S_null),

where S_null is the raw score of a detector that never alerts and S_perfect is that of a detector that flags the start of every window with no false positives.
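The normalization itself is a two-point affine rescaling; a sketch, assuming the null and perfect reference scores have been computed per profile:

```python
def nab_score(raw, raw_null, raw_perfect):
    """Rescale a raw score so the null detector maps to 0 and the perfect
    detector maps to 100. Detectors worse than the null baseline come out
    negative, which the benchmark permits."""
    return 100.0 * (raw - raw_null) / (raw_perfect - raw_null)

# Halfway between null and perfect lands at 50.
print(nab_score(5.0, -10.0, 20.0))  # 50.0
```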
This methodology robustly evaluates detectors under streaming constraints, quantifying both the detection earliness and the rate of false alarms (Lavin et al., 2015, Ahmad et al., 2016, Karami et al., 13 Oct 2025).
3. Detector Integration and Reproducible Workflow
NAB is distributed as an open-source framework (MIT License) on GitHub, offering [i] loaders for the full dataset, [ii] a Detector API supporting any language, [iii] reference implementations (HTM, Etsy Skyline, Twitter ADVec), [iv] threshold optimization routines (hill-climbing for a global threshold), and [v] a scoring engine with built-in support for all evaluation profiles (Lavin et al., 2015).
The canonical workflow for evaluating a detector in NAB consists of:
- Implementing an online streaming detector that processes (timestamp, value) pairs, returning an anomaly score in [0, 1] for each input.
- Plugging the detector into NAB’s wrapper interface and registering it.
- Running threshold optimization to select a single global threshold that maximizes the NAB score across all files (one parameterization must serve the entire corpus).
- Generating anomaly calls and computing scores for each series and for aggregate profiles.
- Producing breakdowns of TPs, FPs, FNs, and detailed timing statistics for post-hoc analysis.
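The first two steps amount to implementing the detector interface; the sketch below mimics the shape of NAB's Python reference detectors (in NAB proper one subclasses `nab.detectors.base.AnomalyDetector` and implements `handleRecord`; the class and the moving-average logic here are illustrative stand-ins, not part of the framework):

```python
import collections

class MovingAverageDetector:
    """Illustrative streaming detector with a NAB-like record interface:
    one call per record, returning an anomaly score in [0, 1]."""

    def __init__(self, window=50):
        self.history = collections.deque(maxlen=window)

    def handle_record(self, timestamp, value):
        """Score one streaming record against the recent history."""
        if len(self.history) < 2:
            self.history.append(value)   # warm-up: nothing to compare against
            return 0.0
        mean = sum(self.history) / len(self.history)
        spread = max(self.history) - min(self.history) or 1.0
        score = min(abs(value - mean) / spread, 1.0)
        self.history.append(value)
        return score
```

A flat stream scores near 0 at every record; the first large spike after warm-up saturates the score at 1.0.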
This design ensures cross-method comparability and reproducibility—a critical property for benchmarking in streaming analytics (Lavin et al., 2015, Ahmad et al., 2016).
4. Published Results and Algorithmic Comparisons
Quantitative performance on the full NAB corpus has been reported for a variety of algorithms, including Numenta’s Hierarchical Temporal Memory (HTM), Twitter ADVec, Etsy Skyline, Bayesian Change Point, and sliding threshold baselines:
| Algorithm | NAB Score | Low-FP | Low-FN |
|---|---|---|---|
| Perfect | 100.0 | 100.0 | 100.0 |
| HTM (Numenta) | 65.3 | 58.6 | 69.4 |
| Twitter ADVec | 47.1 | 33.6 | 53.5 |
| Etsy Skyline | 35.7 | 27.1 | 44.5 |
| Bayesian Change Point | 17.7 | 3.2 | 32.2 |
| Sliding Threshold | 15.0 | 0.0 | 30.1 |
| Random | 11.0 | 1.2 | 19.5 |
HTM achieves the highest aggregate NAB scores and demonstrates superior ability to detect both sharp and subtle (temporal) anomalies, adapting to changing baselines without alarm fatigue (Ahmad et al., 2016). Further studies establish that conformal k-NN methods, residual-based LSTM models, and forecasting-transformer hybrids (Informer) also perform competitively depending on stream complexity and resource constraints (Ishimtsev et al., 2017, Lee et al., 2020, Karami et al., 13 Oct 2025).
The decisive factor in detection performance is the forecasting model's quality rather than the downstream detection method; e.g., LSTM achieves an F1 of 0.688 (ranking top-2 in 81% of files), Informer delivers similar accuracy at 30% faster training, and classical models (Holt-Winters, SARIMA) remain cost-effective for synthetic/periodic data but degrade on operational streams (Karami et al., 13 Oct 2025).
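Residual-based detection of the kind these studies compare reduces to scoring each record by its forecast error; a minimal sketch, agnostic to the forecasting model producing the predictions:

```python
def residual_scores(values, forecasts):
    """Forecast-residual anomaly scores: the larger the miss between the
    model's forecast and the observed value, the higher the score.
    Normalized to [0, 1] by the peak error for illustration; a deployed
    detector would use a calibrated statistic instead."""
    errs = [abs(v - f) for v, f in zip(values, forecasts)]
    peak = max(errs) or 1.0          # avoid division by zero on a perfect fit
    return [e / peak for e in errs]
```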
5. Algorithmic Features and Lessons from the Benchmark
Detectors evaluated with NAB exhibit key differentiators:
- Continuous online learning (e.g., HTM, conformal k-NN): Models must adapt their internal parameters per-record, handling diurnal, workload, or regime shifts automatically without batch retraining or manual tuning (Ahmad et al., 2016, Ishimtsev et al., 2017).
- Temporal modeling: Sequence-extrapolative architectures (HTM, LSTM, Informer) predict future inputs in context, thereby detecting change-of-pattern anomalies that static or spatial-only methods (thresholding, ARIMA) consistently miss (Ahmad et al., 2016, Karami et al., 13 Oct 2025).
- Noise robustness: Effective approaches smooth anomaly likelihoods across rolling windows and suppress isolated spikes, reducing spurious alarms in noisy real-world streams (Ahmad et al., 2016, Karami et al., 13 Oct 2025).
- Parameter economy: Leading methods (HTM, conformal prediction) use minimal parameterization—global thresholds, window sizes—and require no per-series tuning, supporting fully automated deployment (Ahmad et al., 2016, Ishimtsev et al., 2017).
- Code and workflow standardization: The GitHub-based ecosystem with language-agnostic wrappers and built-in optimization tools streamlines method comparison and leaderboard reproducibility (Lavin et al., 2015).
- Scoring-specific insights: The time-decaying (or sigmoidal) reward mechanism strongly incentivizes early detection and penalizes duplicate or late alarms, shaping both model design and hyperparameter selection (Lavin et al., 2015).
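The noise-robustness point above—smoothing raw scores before thresholding—can be sketched with a simple rolling-mean filter (HTM's actual anomaly-likelihood computation models the distribution of recent scores; this is a simplified stand-in):

```python
import collections

def smooth_scores(scores, window=5):
    """Rolling-mean smoothing of raw anomaly scores: isolated one-record
    spikes are damped, while sustained elevations survive thresholding."""
    buf = collections.deque(maxlen=window)
    out = []
    for s in scores:
        buf.append(s)
        out.append(sum(buf) / len(buf))
    return out

# An isolated spike is damped below threshold; a sustained run is not.
spiky = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
run   = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
```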
6. Limitations, Open Questions, and Future Directions
NAB’s highly structured design also reveals methodological frontiers and challenges:
- The Gaussian modeling in HTM is a practical but imperfect fit for genuine anomaly-score distributions; alternative (e.g., nonparametric or heavy-tailed) statistical models may suppress false alarms further (Ahmad et al., 2016).
- Top-performing detectors often exhibit complementary error patterns; ensemble or hybrid approaches may leverage this diversity to achieve higher aggregate scores (Ahmad et al., 2016, Ishimtsev et al., 2017, Karami et al., 13 Oct 2025).
- The current normalization assumes a single global threshold and univariate processing; real-world deployments may require explicit multivariate or cross-stream modeling (Ahmad et al., 2016, Karami et al., 13 Oct 2025).
- While the detection pipeline is robust for point-wise and interval anomalies, more complex anomaly types might warrant extensions in labeling and scoring schemas (Karami et al., 13 Oct 2025).
- The exchangeability assumption underlying conformal prediction is theoretically violated in dependent time series; further analysis is required to formalize empirical calibration behavior (Ishimtsev et al., 2017).
- Benchmark representativity is bounded by its 58 series; extending the corpus to cover additional verticals, higher dimensions, or new anomaly classes is an ongoing community need (Lavin et al., 2015, Karami et al., 13 Oct 2025).
- A plausible implication is that holistic integration of forecasting, detection, and ensemble methodologies—coupled with resource-aware deployment planning—will dictate future improvements in operational real-time anomaly detection.
7. Impact and Guidance for Researchers
NAB is now a de facto standard for evaluating real-time, streaming anomaly detectors (Lavin et al., 2015, Ahmad et al., 2016, Karami et al., 13 Oct 2025). Its rigorous design supports:
- Methodological research—by exposing detectors to dynamic, labeled, noisy, and non-stationary time-series in a reproducible, comparable environment.
- Empirical validation—by providing exhaustive workflow automation, leaderboard baselines, and cost-profile selection.
- Practical deployment—via defaults that match unsupervised system constraints, automation of parameter tuning, and fine-grained analysis of performance trade-offs.
Researchers are encouraged to leverage NAB for transparent benchmarking, to develop independent or ensemble models suited to streaming operational data, and to contribute new datasets, detectors, or evaluation profiles to the evolving open-source corpus.
Key References: (Lavin et al., 2015, Ahmad et al., 2016, Ishimtsev et al., 2017, Lee et al., 2020, Karami et al., 13 Oct 2025)