Outlier Ranking in Large-Scale Public Health Streams

Published 2 Jan 2024 in cs.AI | (2401.01459v1)

Abstract: Disease control experts inspect public health data streams daily for outliers worth investigating, like those corresponding to data quality issues or disease outbreaks. However, they can only examine a few of the thousands of maximally-tied outliers returned by univariate outlier detection methods applied to large-scale public health data streams. To help experts distinguish the most important outliers from these thousands of tied outliers, we propose a new task for algorithms to rank the outputs of any univariate method applied to each of many streams. Our novel algorithm for this task, which leverages hierarchical networks and extreme value analysis, performed the best across traditional outlier detection metrics in a human-expert evaluation using public health data streams. Most importantly, experts have used our open-source Python implementation since April 2023 and report identifying outliers worth investigating 9.1x faster than their prior baseline. Other organizations can readily adapt this implementation to create rankings from the outputs of their tailored univariate methods across large-scale streams.
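The ranking task described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: it omits the hierarchical network entirely and covers only a simple form of extreme value analysis, a Gumbel distribution fitted by the method of moments to each stream's historical block maxima. The function names, stream names, and numbers below are hypothetical, and the sketch assumes each stream's history is non-constant. The idea is that two points a thresholded univariate detector would both flag as "maximal" can still be ordered by how improbable each one is under its own stream's tail model.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def gumbel_tail_probability(block_maxima, value):
    """Return P(X > value) under a Gumbel fit (method of moments).

    block_maxima: historical maxima for one stream (assumed non-constant).
    """
    mean = statistics.fmean(block_maxima)
    std = statistics.stdev(block_maxima)
    beta = std * math.sqrt(6) / math.pi   # Gumbel scale
    mu = mean - EULER_GAMMA * beta        # Gumbel location
    z = (value - mu) / beta
    # Survival function 1 - exp(-exp(-z)), written with expm1 for accuracy
    # when the tail probability is very small.
    return -math.expm1(-math.exp(-z))


def rank_outliers(history, latest):
    """Order stream names from most to least surprising latest value."""
    tail = {
        name: gumbel_tail_probability(history[name], latest[name])
        for name in latest
    }
    return sorted(tail, key=tail.get)


# Hypothetical data: both latest values exceed every historical maximum,
# so a simple threshold rule would tie them.
history = {
    "county_a": [10, 12, 11, 13, 12, 11],
    "county_b": [100, 140, 90, 160, 120, 110],
}
latest = {"county_a": 20, "county_b": 170}

ranking = rank_outliers(history, latest)
print(ranking)
```

Ranking by tail probability rather than by raw value puts streams of very different scales on a common footing, which is what lets tied flags be ordered. A production version would likely need a more careful tail fit and some way to share information across related streams.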
