
MEDS-Tab: Automated tabularization and baseline methods for MEDS datasets

Published 31 Oct 2024 in cs.LG (arXiv:2411.00200v1)

Abstract: Effective, reliable, and scalable development of ML solutions for structured electronic health record (EHR) data requires the ability to reliably generate high-quality baseline models for diverse supervised learning tasks in an efficient and performant manner. Historically, producing such baseline models has been a largely manual effort: individual researchers would need to decide on the particular featurization and tabularization processes to apply to their individual raw, longitudinal data, and then train a supervised model over those data to produce a baseline result to compare novel methods against, all for just one task and one dataset. In this work, powered by complementary advances in core data standardization through the MEDS framework, we dramatically simplify and accelerate this process of tabularizing irregularly sampled time-series data, providing researchers the ability to automatically and scalably featurize and tabularize their longitudinal EHR data across tens of thousands of individual features, hundreds of millions of clinical events, and diverse windowing horizons and aggregation strategies, all before ultimately leveraging these tabular data to automatically produce high-caliber XGBoost baselines in a highly computationally efficient manner. This system scales to dramatically larger datasets than tabularization tools currently available to the community and enables researchers with any MEDS format dataset to immediately begin producing reliable and performant baseline prediction results on various tasks, with minimal human effort required. This system will greatly enhance the reliability, reproducibility, and ease of development of powerful ML solutions for health problems across diverse datasets and clinical settings.
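
As a rough illustration of what "tabularizing irregularly sampled time-series data" over several windowing horizons and aggregation strategies can mean, the sketch below flattens a handful of timestamped events into one feature row per subject and prediction time. It is not the MEDS-Tab implementation or API: the event tuples, the `WINDOWS` table, the `tabularize` function, and the `code/window/aggregation` feature names are all hypothetical and chosen only for this example, and it uses the Python standard library alone.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical toy events, loosely shaped like (subject_id, time, code, numeric_value).
EVENTS = [
    (1, datetime(2024, 1, 1, 8), "HR", 80.0),
    (1, datetime(2024, 1, 1, 20), "HR", 100.0),
    (1, datetime(2024, 1, 3, 9), "LAB//glucose", 5.5),
    (2, datetime(2024, 1, 2, 12), "HR", 60.0),
]

# Hypothetical look-back horizons; real pipelines would make these configurable.
WINDOWS = {"1d": timedelta(days=1), "7d": timedelta(days=7)}


def tabularize(events, subject_id, index_time, windows=WINDOWS):
    """Return one flat feature row for `subject_id` as of `index_time`.

    For every window and every code observed inside it, emit count, sum,
    and max features. Codes with no events in a window are simply absent
    from the row, i.e. the row is a sparse representation.
    """
    row = {}
    for name, width in windows.items():
        values = defaultdict(list)
        for sid, t, code, value in events:
            # Keep only this subject's events in (index_time - width, index_time].
            if sid == subject_id and index_time - width < t <= index_time:
                values[code].append(value)
        for code, vals in values.items():
            row[f"{code}/{name}/count"] = len(vals)
            row[f"{code}/{name}/sum"] = sum(vals)
            row[f"{code}/{name}/max"] = max(vals)
    return row


row = tabularize(EVENTS, subject_id=1, index_time=datetime(2024, 1, 3, 12))
for feature in sorted(row):
    print(feature, row[feature])
```

Rows like this, stacked over many subjects and prediction times, form the kind of wide, mostly empty table on which a gradient-boosted tree baseline could then be trained. The nested loops here are for readability only and would not scale to the dataset sizes the abstract describes.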
