Numenta Anomaly Benchmark (NAB)
- NAB is an open-source framework for evaluating real-time anomaly detection on streaming univariate time-series, emphasizing early detection and low false-alarm rates.
- It provides a curated dataset of 58 labeled time-series with adaptive anomaly windows and a stream-aware scoring system to reward prompt detections.
- The benchmark ensures reproducible evaluations through a standardized Detector API, threshold optimization routines, and comparative analyses across multiple application profiles.
The Numenta Anomaly Benchmark (NAB) is an open-source, community-driven framework for evaluating real-time anomaly detection algorithms on streaming, univariate time-series with a pronounced emphasis on early detection, low false-alarm rates, and adaptation to non-stationarity. Developed to address the lack of standardized, repeatable evaluation for streaming anomaly detectors, NAB comprises a publicly available, hand-labeled corpus of challenging real-world and synthetic time-series spanning multiple application domains—including IT metrics, industrial sensors, and finance—and provides a rigorously designed, streaming-aware scoring methodology that rewards prompt and robust detection without access to ground-truth in deployment settings (Lavin et al., 2015).
1. Benchmark Structure and Datasets
NAB consists of 58 labeled univariate time-series streams, totaling 365,551 records, with individual file lengths ranging from 1,000 to 22,000 records (Lavin et al., 2015). Data domains include cloud/server metrics (CPU, memory, network), industrial measurements, financial/economic series, social media, synthetic control charts, lab sensors, and controls with no anomalies. Each stream is partitioned into an initial "auto-calibration" (probationary) period (the first 15% of the series, ignored for scoring and used for hyperparameter calibration) and an evaluation period for anomaly detection (Ahmad et al., 2016).
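The probationary split can be sketched as below; the 750-record cap for long files follows the convention used in NAB's utilities, but the exact cap should be checked against the NAB version in use:

```python
import math

def probation_length(n_records: int, probation_percent: float = 0.15) -> int:
    """Number of leading records ignored for scoring (auto-calibration).

    NAB caps the probationary period for long files; the 5,000-record
    reference point here (yielding a cap of 750) mirrors the benchmark's
    convention but is an assumption, not a guaranteed constant.
    """
    return int(min(math.floor(probation_percent * n_records),
                   probation_percent * 5000))

# A 22,000-record file is capped; a 1,000-record file uses the full 15%.
print(probation_length(22000))  # 750
print(probation_length(1000))   # 150
```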
Anomalies are labeled through a consensus procedure and are annotated as temporal windows—intervals during which any timely detection is counted as a true positive (TP). Both point and temporal anomalies are included; new, sustained regimes of behavior are relabeled as “normal” to reflect evolving baselines (Lavin et al., 2015, Karami et al., 13 Oct 2025). The corpus is curated to challenge detectors with noisy, multiscale, non-stationary, and quasi-periodic changes—mimicking operational data-stream environments.
2. Scoring Methodology and Evaluation Principles
Traditional metrics like static precision and recall are inadequate for assessing streaming anomaly detectors, as they are insensitive to detection latency and cannot capture the trade-off between early alerting and alarm fatigue (Lavin et al., 2015). NAB introduces a stream-aware scoring system based on pre-defined anomaly windows and time-sensitive detection reward functions.
Anomaly Windows and Rewards
For each anomaly label, NAB defines a "window" of adaptive length proportional to series size and anomaly frequency. Any detection inside this window is a TP (only the first counts), detections outside are false positives (FPs), and missed windows are false negatives (FNs).
The core scoring function for a detection at time t inside a window [t_s, t_e] is

r(t) = (t_e − t) / (t_e − t_s),

implementing a linear decay in reward from 1 at window onset to 0 at the window end (Ahmad et al., 2016). Detections outside any window receive zero reward and are penalized by a fixed false-positive cost A_FP; each undetected window incurs a false-negative cost A_FN. The total raw score is

S_raw = Σ_w r(t_w) − A_FP · N_FP − A_FN · N_FN,

where t_w is the first detection inside window w, and N_FP and N_FN count out-of-window detections and missed windows, respectively.
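A minimal sketch of this windowed scoring, assuming a linear in-window reward decaying from 1 to 0 and per-profile FP/FN costs (the variable names and default costs are illustrative, not NAB's internal constants):

```python
def raw_score(windows, detections, a_fp=0.11, a_fn=1.0):
    """Windowed NAB-style raw score with a linear in-window reward.

    windows:    list of (t_start, t_end) anomaly windows
    detections: sorted list of detection timestamps
    a_fp, a_fn: false-positive / false-negative costs (profile weights;
                the defaults here are illustrative)
    """
    score = 0.0
    matched = [False] * len(windows)
    for t in detections:
        for i, (ts, te) in enumerate(windows):
            if ts <= t <= te:
                if not matched[i]:                    # only the first hit counts
                    score += (te - t) / (te - ts)     # linear decay: 1 -> 0
                    matched[i] = True
                break                                 # in-window repeats are free
        else:
            score -= a_fp                             # detection outside any window
    score -= a_fn * matched.count(False)              # each missed window is an FN
    return score
```

Feeding two windows, an early hit on the first, a redundant in-window repeat, and one stray detection yields a score of 1.0 − A_FP − A_FN: full reward for the prompt hit, one FP penalty, one FN penalty for the missed window.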
Application Profiles and Normalization
NAB operationalizes different deployment cost models via "application profiles," which weight FNs and FPs differently (e.g., Standard, Low-FP, Low-FN; the results table in Section 4 reports scores under each). The final NAB score normalizes a detector's raw score to a [0, 100] scale using the perfect (100) and null (0) detector references:

S_NAB = 100 × (S_raw − S_null) / (S_perfect − S_null),

where S_null is the raw score of a detector that never alerts and S_perfect is that of a detector that flags the start of every window with no false positives.
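The normalization itself is a two-point affine rescaling; a sketch, assuming the null and perfect reference scores have been computed per profile:

```python
def nab_score(raw, raw_null, raw_perfect):
    """Rescale a raw score so the null detector maps to 0 and the perfect
    detector maps to 100. Detectors worse than the null baseline come out
    negative, which the benchmark permits."""
    return 100.0 * (raw - raw_null) / (raw_perfect - raw_null)

# Halfway between null and perfect lands at 50.
print(nab_score(5.0, -10.0, 20.0))  # 50.0
```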
This methodology robustly evaluates detectors under streaming constraints, quantifying both the detection earliness and the rate of false alarms (Lavin et al., 2015, Ahmad et al., 2016, Karami et al., 13 Oct 2025).
3. Detector Integration and Reproducible Workflow
NAB is distributed as an open-source framework (MIT License) on GitHub, offering [i] loaders for the full dataset, [ii] a Detector API supporting any language, [iii] reference implementations (HTM, Etsy Skyline, Twitter ADVec), [iv] threshold optimization routines (hill-climbing for a global threshold), and [v] a scoring engine with built-in support for all evaluation profiles (Lavin et al., 2015).
The canonical workflow for evaluating a detector in NAB consists of:
- Implementing an online streaming detector that processes (timestamp, value) pairs, returning an anomaly score in [0, 1] for each input.
- Plugging the detector into NAB’s wrapper interface and registering it.
- Running threshold optimization to select a single global threshold that maximizes the NAB score across all files (one parameterization must serve the entire corpus).
- Generating anomaly calls and computing scores for each series and for aggregate profiles.
- Producing breakdowns of TPs, FPs, FNs, and detailed timing statistics for post-hoc analysis.
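The first two steps amount to implementing the detector interface; the sketch below mimics the shape of NAB's Python reference detectors (in NAB proper one subclasses `nab.detectors.base.AnomalyDetector` and implements `handleRecord`; the class and the moving-average logic here are illustrative stand-ins, not part of the framework):

```python
import collections

class MovingAverageDetector:
    """Illustrative streaming detector with a NAB-like record interface:
    one call per record, returning an anomaly score in [0, 1]."""

    def __init__(self, window=50):
        self.history = collections.deque(maxlen=window)

    def handle_record(self, timestamp, value):
        """Score one streaming record against the recent history."""
        if len(self.history) < 2:
            self.history.append(value)   # warm-up: nothing to compare against
            return 0.0
        mean = sum(self.history) / len(self.history)
        spread = max(self.history) - min(self.history) or 1.0
        score = min(abs(value - mean) / spread, 1.0)
        self.history.append(value)
        return score
```

A flat stream scores near 0 at every record; the first large spike after warm-up saturates the score at 1.0.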
This design ensures cross-method comparability and reproducibility—a critical property for benchmarking in streaming analytics (Lavin et al., 2015, Ahmad et al., 2016).
4. Published Results and Algorithmic Comparisons
Quantitative performance on the full NAB corpus has been reported for a variety of algorithms, including Numenta’s Hierarchical Temporal Memory (HTM), Twitter ADVec, Etsy Skyline, Bayesian Change Point, and sliding threshold baselines:
| Algorithm | NAB Score | Low-FP | Low-FN |
|---|---|---|---|
| Perfect | 100.0 | 100.0 | 100.0 |
| HTM (Numenta) | 65.3 | 58.6 | 69.4 |
| Twitter ADVec | 47.1 | 33.6 | 53.5 |
| Etsy Skyline | 35.7 | 27.1 | 44.5 |
| Bayesian Change Point | 17.7 | 3.2 | 32.2 |
| Sliding Threshold | 15.0 | 0.0 | 30.1 |
| Random | 11.0 | 1.2 | 19.5 |
HTM achieves the highest aggregate NAB scores and demonstrates superior ability to detect both sharp and subtle (temporal) anomalies, adapting to changing baselines without alarm fatigue (Ahmad et al., 2016). Further studies establish that conformal k-NN methods, residual-based LSTM models, and forecasting-transformer hybrids (Informer) also perform competitively depending on stream complexity and resource constraints (Ishimtsev et al., 2017, Lee et al., 2020, Karami et al., 13 Oct 2025).
The decisive factor in detection performance is the forecasting model's quality rather than the downstream detection method; e.g., LSTM achieves an F1 of 0.688 (ranking top-2 in 81% of files), Informer delivers similar accuracy at 30% faster training, and classical models (Holt-Winters, SARIMA) remain cost-effective for synthetic/periodic data but degrade on operational streams (Karami et al., 13 Oct 2025).
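Residual-based detection of the kind these studies compare reduces to scoring each record by its forecast error; a minimal sketch, agnostic to the forecasting model producing the predictions:

```python
def residual_scores(values, forecasts):
    """Forecast-residual anomaly scores: the larger the miss between the
    model's forecast and the observed value, the higher the score.
    Normalized to [0, 1] by the peak error for illustration; a deployed
    detector would use a calibrated statistic instead."""
    errs = [abs(v - f) for v, f in zip(values, forecasts)]
    peak = max(errs) or 1.0          # avoid division by zero on a perfect fit
    return [e / peak for e in errs]
```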
5. Algorithmic Features and Lessons from the Benchmark
Detectors evaluated with NAB exhibit key differentiators:
- Continuous online learning (e.g., HTM, conformal k-NN): Models must adapt their internal parameters per-record, handling diurnal, workload, or regime shifts automatically without batch retraining or manual tuning (Ahmad et al., 2016, Ishimtsev et al., 2017).
- Temporal modeling: Sequence-extrapolative architectures (HTM, LSTM, Informer) predict future inputs in context, thereby detecting change-of-pattern anomalies that static or spatial-only methods (thresholding, ARIMA) consistently miss (Ahmad et al., 2016, Karami et al., 13 Oct 2025).
- Noise robustness: Effective approaches smooth anomaly likelihoods across rolling windows and suppress isolated spikes, reducing spurious alarms in noisy real-world streams (Ahmad et al., 2016, Karami et al., 13 Oct 2025).
- Parameter economy: Leading methods (HTM, conformal prediction) use minimal parameterization—global thresholds, window sizes—and require no per-series tuning, supporting fully automated deployment (Ahmad et al., 2016, Ishimtsev et al., 2017).
- Code and workflow standardization: The GitHub-based ecosystem with language-agnostic wrappers and built-in optimization tools streamlines method comparison and leaderboard reproducibility (Lavin et al., 2015).
- Scoring-specific insights: The time-decaying (or sigmoidal) reward mechanism strongly incentivizes early detection and penalizes duplicate or late alarms, shaping both model design and hyperparameter selection (Lavin et al., 2015).
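The noise-robustness point above—smoothing raw scores before thresholding—can be sketched with a simple rolling-mean filter (HTM's actual anomaly-likelihood computation models the distribution of recent scores; this is a simplified stand-in):

```python
import collections

def smooth_scores(scores, window=5):
    """Rolling-mean smoothing of raw anomaly scores: isolated one-record
    spikes are damped, while sustained elevations survive thresholding."""
    buf = collections.deque(maxlen=window)
    out = []
    for s in scores:
        buf.append(s)
        out.append(sum(buf) / len(buf))
    return out

# An isolated spike is damped below threshold; a sustained run is not.
spiky = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
run   = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
```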
6. Limitations, Open Questions, and Future Directions
NAB’s highly structured design also reveals methodological frontiers and challenges:
- The Gaussian modeling in HTM is a practical but imperfect fit for genuine anomaly-score distributions; alternative (e.g., nonparametric or heavy-tailed) statistical models may suppress false alarms further (Ahmad et al., 2016).
- Top-performing detectors often exhibit complementary error patterns; ensemble or hybrid approaches may leverage this diversity to achieve higher aggregate scores (Ahmad et al., 2016, Ishimtsev et al., 2017, Karami et al., 13 Oct 2025).
- The current normalization assumes a single global threshold and univariate processing; real-world deployments may require explicit multivariate or cross-stream modeling (Ahmad et al., 2016, Karami et al., 13 Oct 2025).
- While the detection pipeline is robust for point-wise and interval anomalies, more complex anomaly types might warrant extensions in labeling and scoring schemas (Karami et al., 13 Oct 2025).
- The exchangeability assumption underlying conformal prediction is theoretically violated in dependent time series; further analysis is required to formalize empirical calibration behavior (Ishimtsev et al., 2017).
- Benchmark representativity is bounded by its 58 series; extending the corpus to cover additional verticals, higher dimensions, or new anomaly classes is an ongoing community need (Lavin et al., 2015, Karami et al., 13 Oct 2025).
- A plausible implication is that holistic integration of forecasting, detection, and ensemble methodologies—coupled with resource-aware deployment planning—will dictate future improvements in operational real-time anomaly detection.
7. Impact and Guidance for Researchers
NAB is now a de facto standard for evaluating real-time, streaming anomaly detectors (Lavin et al., 2015, Ahmad et al., 2016, Karami et al., 13 Oct 2025). Its rigorous design supports:
- Methodological research—by exposing detectors to dynamic, labeled, noisy, and non-stationary time-series in a reproducible, comparable environment.
- Empirical validation—by providing exhaustive workflow automation, leaderboard baselines, and cost-profile selection.
- Practical deployment—via defaults that match unsupervised system constraints, automation of parameter tuning, and fine-grained analysis of performance trade-offs.
Researchers are encouraged to leverage NAB for transparent benchmarking, to develop independent or ensemble models suited to streaming operational data, and to contribute new datasets, detectors, or evaluation profiles to the evolving open-source corpus.
Key References: (Lavin et al., 2015, Ahmad et al., 2016, Ishimtsev et al., 2017, Lee et al., 2020, Karami et al., 13 Oct 2025)