bigMICE: Scalable Imputation via Apache Spark
- bigMICE is an R package that extends the standard MICE imputation method to handle big data by embedding its iterative conditional-model framework into Apache Spark.
- It re-engineers the imputation process using Spark DataFrames, MLlib pipelines, and checkpointing to efficiently manage millions of rows and high-dimensional missing data.
- The package offers improved scalability and performance without compromising imputation quality, making it ideal for large-scale registry studies and real-world data applications.
bigMICE is an R package designed to extend Multiple Imputation by Chained Equations (MICE) to the big data context, embedding its iterative conditional-model framework directly into Apache Spark. Developed specifically to address the computational and memory bottlenecks of standard MICE in large-scale settings, bigMICE leverages Spark's distributed computing and off-heap memory management to efficiently process data sets containing millions of rows and dozens of columns on commodity hardware such as ordinary laptops. The package was benchmarked on large medical registries, delivering improved performance and scalability compared to traditional MICE implementations while preserving statistical validity for high-dimensional missing data (Morvan et al., 29 Jan 2026).
1. Computational Bottlenecks of Standard MICE
The standard MICE framework (Buuren & Groothuis-Oudshoorn, 2011) addresses multivariate missingness by generating $m$ completed datasets via iterations of chained conditional models, one per incomplete variable. At each iteration $t$ and for each variable $X_j$, a model for $X_j$ given the remaining variables $X_{-j}$ is fitted on the observed rows, followed by draws from the posterior predictive distribution:

$$x_j^{\text{mis}} \sim p\!\left(X_j \mid X_{-j}, \hat{\theta}_j\right).$$

This approach incurs two critical limitations for large tables (e.g., millions of rows):
- Memory overhead: R-based MICE implementations require storing the $m$ completed datasets, or all model objects plus the data, in RAM, which becomes infeasible as table size scales.
- CPU and runtime: Each of the per-variable regression/classification fits is single-threaded, or parallelized only over imputations rather than over rows, becoming prohibitive for GB–TB scale data (Morvan et al., 29 Jan 2026).
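The memory overhead is easy to quantify with back-of-the-envelope arithmetic. The sketch below (illustrative Python, assuming double-precision numeric storage and using the registry dimensions cited later in this article) estimates the RAM consumed just by holding the completed datasets:

```python
def completed_copies_gib(n_rows, n_cols, m, bytes_per_cell=8):
    """RAM needed to hold m completed copies of an n_rows x n_cols
    numeric table stored as 8-byte doubles (the standard pattern of
    keeping every imputed dataset in memory)."""
    return m * n_rows * n_cols * bytes_per_cell / 2**30

# At registry scale (14.6M rows x 10 columns, m = 5) the completed
# datasets alone occupy roughly 5.4 GiB, before any model objects.
print(round(completed_copies_gib(14_600_000, 10, m=5), 1))  # 5.4
```

On a 16 GB laptop this leaves little headroom once model objects, copies made during fitting, and R's own overhead are added, which is consistent with standard mice exhausting 16 GB around 7M rows in the benchmarks below.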
2. bigMICE Architecture and Implementation
bigMICE re-engineers the MICE workflow inside Spark using several architectural innovations:
- sparklyr bridge: Maintains an R interface, submitting computation to Spark clusters or local multicore setups.
- Spark DataFrames: Stores wide, tall tables as resilient distributed datasets, with Spark managing in-memory versus disk persistence via “off-heap” spilling.
- MLlib and Spark ML integration: Conditional models employ Spark learners (e.g., ml_linear_regression, ml_logistic_regression, ml_random_forest), enabling data-parallel model fitting over partitions.
- Checkpointing and persistence: After a configurable number of imputed variables, lineage is truncated by writing intermediate results to disk or HDFS, capping memory usage and preventing DAG stack overflow.
- Efficient pooling: Only essential model summaries (coefficients and variance estimates per imputation) are retained for Rubin’s pooling, rather than complete datasets.
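The pooling step is why only coefficients and variance estimates need to be retained per imputation: Rubin's rules combine them directly. A minimal sketch of that computation (illustrative Python, not the package's R implementation):

```python
import numpy as np

def rubin_pool(coefs, variances):
    """Pool per-imputation estimates with Rubin's rules.
    coefs: (m, k) point estimates; variances: (m, k) squared standard errors.
    Returns the pooled estimates and the total variance W + (1 + 1/m) * B."""
    coefs = np.asarray(coefs, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = coefs.shape[0]
    qbar = coefs.mean(axis=0)      # pooled point estimate
    W = variances.mean(axis=0)     # within-imputation variance
    B = coefs.var(axis=0, ddof=1)  # between-imputation variance
    return qbar, W + (1 + 1/m) * B

# Three imputations of a single coefficient:
est, tot = rubin_pool([[1.0], [1.2], [0.8]], [[0.04], [0.05], [0.03]])
```

Because these summaries are a few floats per coefficient per imputation, pooling costs essentially nothing compared to retaining $m$ full copies of the data.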
Spark memory control is managed by specifying driver memory (e.g., "10G") and the number of cores; Spark dynamically spills partitions to disk once the specified on-heap memory budget is reached. Recommended settings include spark.memory.fraction ≈ 0.8 and a judicious checkpointing frequency to balance RAM use against I/O overhead (Morvan et al., 29 Jan 2026).
3. Mathematical Core of MICE via Chained Equations
Let $X$ be an $n \times p$ data matrix with missing entries. For each variable $X_j$ at iteration $t$, and letting $X_{-j}$ denote the remaining variables, the update cycle consists of fitting the conditional model for $X_j \mid X_{-j}$ on observed rows and drawing replacements for the missing entries.

For continuous variables with linear models,

$$x_j^{\text{mis}} \sim \mathcal{N}\!\left(X_{-j}\hat{\beta}_j,\; \hat{\sigma}_j^2\right),$$

where $\hat{\sigma}_j^2$ is the residual variance. For categorical variables with $K$ classes, given predicted probabilities $p_1, \dots, p_K$, draw $u \sim \mathrm{Uniform}(0, 1)$ and assign the class

$$c = \min\Bigl\{k : \sum_{l=1}^{k} p_l \ge u\Bigr\}.$$

Each conditional model is implemented as a Spark ML pipeline: observed rows are filtered for fitting, predictions are computed on missing entries, and the imputed values are merged back.
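The categorical assignment rule is an inverse-CDF draw: sample $u$ uniformly and pick the first class whose cumulative predicted probability reaches $u$. An illustrative Python snippet (not the package's Spark implementation):

```python
import numpy as np

rng = np.random.default_rng(42)

def draw_class(probs, u=None):
    """Draw a class index from predicted probabilities by inverting the
    CDF: sample u ~ Uniform(0, 1) and return the smallest k whose
    cumulative probability reaches u."""
    probs = np.asarray(probs, dtype=float)
    if u is None:
        u = rng.uniform()
    return int(np.searchsorted(np.cumsum(probs), u))

draw_class([0.2, 0.5, 0.3], u=0.65)  # cumulative [0.2, 0.7, 1.0] -> class 1
```

Drawing from the predicted distribution, rather than taking the argmax class, is what preserves between-imputation variability for Rubin's pooling.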
Although the per-variable updates within a sweep are intrinsically sequential (forming a computational chain), bigMICE parallelizes each model fit over partitions and can launch the $m$ imputations in parallel. This achieves significant wall-time reduction compared to single-threaded baseline MICE workflows (Morvan et al., 29 Jan 2026).
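The sequential sweep structure can be made concrete with a minimal in-memory sketch. The following Python code (illustrative only; bigMICE runs each fit as a distributed Spark ML pipeline, and hypothetical simplifications here include OLS for every column and a plain prediction-plus-noise draw) produces one imputed dataset via chained equations:

```python
import numpy as np

rng = np.random.default_rng(0)

def mice_sweep(X, mask, maxit=5):
    """Produce one imputed dataset via chained equations, regressing
    each column on all the others with ordinary least squares.
    X: (n, p) float array; mask: True where a value is missing."""
    X = X.copy()
    # Initialize missing cells with the column means of observed values.
    for j in range(X.shape[1]):
        X[mask[:, j], j] = X[~mask[:, j], j].mean()
    for _ in range(maxit):
        for j in range(X.shape[1]):  # the intrinsically sequential sweep
            obs, mis = ~mask[:, j], mask[:, j]
            if not mis.any():
                continue
            # Design matrix: intercept plus all other columns (X_{-j}).
            A = np.c_[np.ones(X.shape[0]), np.delete(X, j, axis=1)]
            beta, *_ = np.linalg.lstsq(A[obs], X[obs, j], rcond=None)
            resid = X[obs, j] - A[obs] @ beta
            sigma = resid.std(ddof=A.shape[1])
            # Posterior-predictive-style draw: linear prediction plus noise.
            X[mis, j] = A[mis] @ beta + rng.normal(0.0, sigma, mis.sum())
    return X
```

Each pass through the inner loop depends on the imputations from the previous one, which is why the sweep cannot be parallelized across variables; bigMICE instead parallelizes within each fit (over data partitions) and across imputations.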
4. Installation, Configuration, and API Usage
System Requirements and Setup
bigMICE requires Java 8+, R 4.x, sparklyr, devtools, and optionally Hadoop HDFS for checkpointing. Installation steps:
```r
install.packages("sparklyr")
sparklyr::spark_install(version = "4.0.0")  # tested with Spark 4.0.0, sparklyr 1.9.1
install.packages("devtools")
devtools::install_github("bigcausallab/bigMICE")
```
```r
library(sparklyr); library(bigMICE)
conf <- spark_config()
conf$`sparklyr.shell.driver-memory` <- "10G"
conf$spark.memory.fraction <- 0.8
conf$`sparklyr.cores.local` <- 4
sc <- spark_connect("local", config = conf)
```
```r
spark_set_checkpoint_dir(sc, "/path/to/hdfs/dir")
```
User-Facing API
bigMICE exposes two principal functions:
- mice.spark(): Returns pooled analysis results and model parameters only (memory efficient).
- mice.spark.plus(): Additionally returns the imputed Spark DataFrames (high memory overhead).
Key parameters include:
- data: Spark DataFrame input.
- sc: Spark connection.
- variable_types: Mapping of columns to types (Continuous_float, Continuous_int, Binary, Nominal, Ordinal).
- analysis_formula: Regression formula describing the analysis model.
- m: Number of imputations (default 5).
- maxit: Chained-imputation iterations per imputation (default 5).
- checkpointing: Whether to use Spark/HDFS checkpointing (default TRUE).
- Others: predictorMatrix, method, seed, printFlag.
Example:
```r
library(sparklyr); library(dplyr); library(bigMICE)

conf <- spark_config()
conf$`sparklyr.shell.driver-memory` <- "10G"
conf$spark.memory.fraction <- 0.8
conf$`sparklyr.cores.local` <- 4
sc <- spark_connect("local", config = conf)

sdf <- spark_read_csv(sc, name = "mydata", path = "data.csv",
                      header = TRUE, infer_schema = TRUE, null_value = "NA")

variable_types <- c(age = "Continuous_int", bmi = "Continuous_float",
                    gender = "Binary", phb = "Nominal", tv = "Continuous_int")
analysis_formula <- as.formula("phb ~ age + gender + tv + bmi")

res <- bigMICE::mice.spark(
  data = sdf,
  sc = sc,
  variable_types = variable_types,
  analysis_formula = analysis_formula,
  m = 3,
  maxit = 2,
  checkpointing = FALSE
)

print(res)        # pooled results
res$model_params  # per-imputation parameters

res_full <- bigMICE::mice.spark.plus(...)  # additionally returns imputed datasets
imputed_dfs <- res_full$imputations
```
5. Performance, Scalability, and Imputation Quality
bigMICE was benchmarked on subsets of the Swedish National Diabetes Registry (14.6 M rows × 10 variables).
| Method | Max RAM (16 GB cap) | Runtime (14M rows, 1 imputation × 5 iter) | RMSE (GFR, MCAR 50%) |
|---|---|---|---|
| Standard mice | Exceeds 16 GB at ~7M rows | >150 min | 1.70 (1K rows) → 1.62 (8M rows) |
| bigMICE | ≤16 GB (all sizes) | ≈37 min (14M rows) | 1.62 (8M rows) |
- Standard mice’s RAM usage scales linearly beyond 1M rows, exceeding 16 GB at around 7M rows, and becomes intractable at larger scales.
- bigMICE maintains RAM usage within the JVM driver allocation at all tested sizes via disk spilling and checkpointing.
- Wall-time for bigMICE on 14M rows is approximately 37 minutes — about one-third of mice’s time on smaller subsets.
- Imputation quality (RMSE) remains stable with increasing sample size and missingness proportion. Across missingness fractions from 10% to 99.9%, RMSE holds between 1.61 and 1.63, rising only modestly to 1.71 at near-complete missingness (Morvan et al., 29 Jan 2026).
6. Practical Recommendations and Limitations
Recommended usage patterns:
- Memory settings: Set sparklyr.shell.driver-memory to at most the available RAM (e.g., "10G" for a 16 GB laptop) and spark.memory.fraction ≈ 0.8.
- Checkpointing: Enable checkpointing (default TRUE) and tune its frequency (e.g., every 10 variables) to balance Spark DAG size against disk I/O.
- Imputation count (m): For large samples, increasing m beyond 5 yields limited reduction in Monte Carlo error while increasing computational cost.
- Model selection: Prefer scalable Spark MLlib learners (linear/logistic models) for speed; random forests offer higher model complexity at the cost of slower runtimes.
- Data output: Use mice.spark.plus() sparingly, only when the completed imputed datasets are needed; otherwise rely on pooled results for efficiency.
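The diminishing return from raising m follows from the (1 + 1/m) inflation factor applied to the between-imputation variance in Rubin's total-variance formula; a quick Python check:

```python
def mc_variance_factor(m):
    """Monte Carlo inflation factor (1 + 1/m) that multiplies the
    between-imputation variance in Rubin's total-variance formula."""
    return 1 + 1 / m

# Going from m = 5 to m = 20 shrinks the factor only from 1.20 to 1.05,
# while quadrupling the imputation cost.
print(mc_variance_factor(5), mc_variance_factor(20))  # prints: 1.2 1.05
```

With millions of rows the between-imputation variance is already small, so the default m = 5 is usually a sensible cost/accuracy trade-off.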
In large-scale registry settings, reliable imputation is obtained for variables with high missingness (>70%) when sample size is in the millions. bigMICE thus enables statistically robust MICE-based inference at scale, with an API and configuration model accessible to R practitioners and familiar to users of existing MICE workflows (Morvan et al., 29 Jan 2026).