Papers
Topics
Authors
Recent
Search
2000 character limit reached

bigMICE: Multiple Imputation of Big Data

Published 29 Jan 2026 in stat.CO and cs.DC | (2601.21613v1)

Abstract: Missing data is a prevalent issue in many applications, including large medical registries such as the Swedish Healthcare Quality Registries, potentially leading to biased or inefficient analyses if not handled properly. Multiple Imputation by Chained Equations (MICE) is a popular and versatile method for handling multivariate missing data but traditional implementations face significant challenges when applied to big data sets due to computational time and memory limitations. To address this, the bigMICE package was developed, adapting the MICE framework to big data using Apache Spark MLLib and Spark ML. Our implementation allows for controlling the maximum memory usage during the execution, enabling processing of very large data sets on a hardware with a limited memory, such as ordinary laptops. The developed package was tested on a large Swedish medical registry to measure memory usage, runtime and dependence of the imputation quality on sample size and on missingness proportion in the data. In conclusion, our method is generally more memory efficient and faster on large data sets compared to a commonly used MICE implementation. We also demonstrate that working with very large datasets can result in high quality imputations even when a variable has a large proportion of missing data. This paper also provides guidelines and recommendations on how to install and use our open source package.

Summary

No one has generated a summary of this paper yet.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Collections

Sign up for free to add this paper to one or more collections.