
bigMICE: Scalable Imputation via Apache Spark

Updated 6 February 2026
  • bigMICE is an R package that extends the standard MICE imputation method to handle big data by embedding its iterative conditional-model framework into Apache Spark.
  • It re-engineers the imputation process using Spark DataFrames, MLlib pipelines, and checkpointing to efficiently manage millions of rows and high-dimensional missing data.
  • The package offers improved scalability and performance without compromising imputation quality, making it ideal for large-scale registry studies and real-world data applications.

bigMICE is an R package designed to extend Multiple Imputation by Chained Equations (MICE) to the big data context, embedding its iterative conditional-model framework directly into Apache Spark. Developed specifically to address the computational and memory bottlenecks of standard MICE in large-scale settings, bigMICE leverages Spark's distributed computing and off-heap memory management to efficiently process data sets containing millions of rows and dozens of columns on commodity hardware such as ordinary laptops. The package was benchmarked on large medical registries, delivering improved performance and scalability compared to traditional MICE implementations while preserving statistical validity for high-dimensional missing data (Morvan et al., 29 Jan 2026).

1. Computational Bottlenecks of Standard MICE

The standard MICE framework (van Buuren & Groothuis-Oudshoorn, 2011) addresses multivariate missingness by generating $m$ completed datasets via $T$ iterations of chained conditional models for $p$ variables. At each iteration $t$ and for each variable $j$, a model $P(Y_j \mid Y_{-j}, \theta_j)$ is fitted, followed by a draw from the posterior predictive distribution:

$$Y_j^{(t+1)} \sim f_j\left(Y_{-j}^{(t)}, \theta_j^{(t)}\right) + \epsilon_j$$

This approach incurs two critical limitations for large tables (e.g., millions of rows):

  • Memory overhead: R-based MICE implementations must hold the $m$ completed datasets, or all model objects plus the data, in RAM, which becomes infeasible as table size grows.
  • CPU and runtime: Each of the $p \times T$ regression/classification fits is single-threaded or parallelized only over imputations, not over rows, becoming prohibitive for GB–TB scale data (Morvan et al., 29 Jan 2026).
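The cost structure above ($p$ conditional fits per sweep, $T$ sweeps per chain) can be made concrete with a toy chained-equations loop. The following Python/NumPy sketch is purely illustrative of the standard algorithm, not bigMICE code, and uses simple linear conditional models:

```python
import numpy as np

def mice_sweeps(Y, T=5, rng=None):
    """Toy single-chain MICE: T sweeps of chained linear imputations.

    Y: (n, p) float array with np.nan marking missing entries.
    Returns a completed copy of Y. Illustrative only.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    Y = Y.copy()
    miss = np.isnan(Y)                       # missingness mask, fixed up front
    col_means = np.nanmean(Y, axis=0)
    for j in range(Y.shape[1]):              # initialize with column means
        Y[miss[:, j], j] = col_means[j]
    for _ in range(T):                       # T sweeps ...
        for j in range(Y.shape[1]):          # ... of p conditional fits each
            if not miss[:, j].any():
                continue
            X = np.delete(Y, j, axis=1)      # predictors Y_{-j}
            X = np.column_stack([np.ones(len(X)), X])
            obs = ~miss[:, j]
            beta, *_ = np.linalg.lstsq(X[obs], Y[obs, j], rcond=None)
            resid = Y[obs, j] - X[obs] @ beta
            sigma = resid.std(ddof=X.shape[1]) if obs.sum() > X.shape[1] else 0.0
            # posterior-predictive-style draw: fit + Gaussian noise
            Y[miss[:, j], j] = X[miss[:, j]] @ beta + rng.normal(0.0, sigma, miss[:, j].sum())
    return Y
```

Every fit in the inner loop touches all observed rows, which is exactly where single-machine R implementations become CPU- and memory-bound at scale.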

2. bigMICE Architecture and Implementation

bigMICE re-engineers the MICE workflow inside Spark using several architectural innovations:

  • sparklyr bridge: Maintains an R interface, submitting computation to Spark clusters or local multicore setups.
  • Spark DataFrames: Stores wide, tall tables as resilient distributed datasets, with Spark managing in-memory versus disk persistence via “off-heap” spilling.
  • MLlib and Spark ML integration: Conditional models employ Spark learners (e.g., ml_linear_regression, ml_logistic_regression, ml_random_forest), enabling data-parallel model fitting over partitions.
  • Checkpointing and persistence: After every $K$ variables (default $K = 10$), lineage is truncated by writing intermediate results to disk or HDFS, capping memory usage and preventing DAG stack overflow.
  • Efficient pooling: Only essential model summaries (coefficients and variance estimates per imputation) are retained for Rubin’s pooling, rather than complete datasets.
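The pooling step retains only per-imputation coefficient and variance summaries, which are combined with Rubin's rules. A minimal sketch of that combination step (plain Python/NumPy, not the bigMICE internals):

```python
import numpy as np

def rubin_pool(estimates, variances):
    """Pool m per-imputation estimates via Rubin's rules.

    estimates: (m, k) coefficient estimates, one row per imputation.
    variances: (m, k) squared standard errors of those estimates.
    Returns the pooled estimate and total variance W + (1 + 1/m) * B.
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = estimates.shape[0]
    qbar = estimates.mean(axis=0)              # pooled point estimate
    W = variances.mean(axis=0)                 # within-imputation variance
    B = estimates.var(axis=0, ddof=1)          # between-imputation variance
    T = W + (1 + 1 / m) * B                    # total variance
    return qbar, T
```

Because only these $(m, k)$-sized summaries are needed, the $m$ completed datasets never have to be held in driver memory.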

Spark memory control is managed by specifying driver memory (e.g., "10G") and the number of cores. Spark dynamically spills partitions to disk once the specified on-heap memory budget is reached. Recommended settings include spark.memory.fraction ≈ 0.8 and a judicious checkpointing frequency to balance RAM use against I/O overhead (Morvan et al., 29 Jan 2026).

3. Mathematical Core of MICE via Chained Equations

Let $Y$ be an $n \times p$ data matrix with missing entries. For each variable $Y_j$ at iteration $t$, letting $Y_{-j}^{(t)}$ denote the remaining variables, the update cycle consists of, for all $j = 1, \ldots, p$:

$$\theta_j^{(t)} \sim \mathrm{posterior}\left(\theta_j \mid Y_j^{obs}, Y_{-j}^{(t)}\right)$$

$$Y_j^{mis,(t)} \sim f_j\left(Y_{-j}^{(t)}, \theta_j^{(t)}\right) + \epsilon_j$$

For continuous variables $Y_j$ fitted with linear models (residual variance $\sigma_j^2$):

$$\hat{H}_j = E[Y_j \mid Y_{-j}, \theta_j], \qquad \epsilon_j \sim N(0, \sigma_j^2), \qquad Y_j^{mis} \gets \hat{H}_j + \epsilon_j$$

For categorical variables with $K$ classes, given predicted probabilities $p_i = (p_{i1}, \dots, p_{iK})$, draw $U \sim \mathrm{Unif}(0,1)$ and assign the class by

$$Y_{ji} = c_k \iff \sum_{\ell < k} p_{i\ell} < U \leq \sum_{\ell \leq k} p_{i\ell}.$$

Each conditional model is implemented as a Spark ML pipeline: observed rows are filtered for fitting, predictions are computed on missing entries, and the imputed values are merged back.
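Both draw rules are short enough to sketch directly. The Python snippet below illustrates the formulas only (inverse-CDF class assignment and mean-plus-noise for continuous variables); it is not bigMICE code:

```python
import numpy as np

def draw_continuous(h_hat, sigma, rng):
    """Y_j^mis <- H_hat_j + eps with eps ~ N(0, sigma^2)."""
    return h_hat + rng.normal(0.0, sigma, size=np.shape(h_hat))

def draw_class(p_i, u):
    """Return (0-indexed) class k with sum_{l<k} p_il < u <= sum_{l<=k} p_il."""
    cdf = np.cumsum(p_i)
    # first index whose cumulative probability reaches u
    return int(np.searchsorted(cdf, u, side="left"))
```

For example, with probabilities (0.2, 0.5, 0.3), any $U \leq 0.2$ selects the first class, $0.2 < U \leq 0.7$ the second, and $U > 0.7$ the third.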

Although the $p$ updates per sweep are intrinsically sequential (forming a computational chain), bigMICE parallelizes each model fit over partitions and can launch the $m$ imputations in parallel. This achieves a significant wall-time reduction compared to single-threaded baseline MICE workflows (Morvan et al., 29 Jan 2026).

4. Installation, Configuration, and API Usage

System Requirements and Setup

bigMICE requires Java 8+, R 4.x, sparklyr, devtools, and optionally Hadoop HDFS for checkpointing. Installation steps:

install.packages("sparklyr")
sparklyr::spark_install(version="4.0.0")    # tested Spark 4.0.0, sparklyr 1.9.1
install.packages("devtools")
devtools::install_github("bigcausallab/bigMICE")
Memory-controlled Spark connection:
library(sparklyr); library(bigMICE)
conf <- spark_config()
conf$`sparklyr.shell.driver-memory` <- "10G"
conf$spark.memory.fraction <- 0.8
conf$`sparklyr.cores.local` <- 4
sc <- spark_connect("local", config=conf)
(Optional) Enable HDFS checkpointing:
spark_set_checkpoint_dir(sc, "/path/to/hdfs/dir")

User-Facing API

bigMICE exposes two principal functions:

  • mice.spark(): Returns pooled analysis results and model parameters only (memory efficient).
  • mice.spark.plus(): Additionally returns the imputed Spark DataFrames (higher memory overhead).

Key parameters include:

  • data: Spark DataFrame input.
  • sc: Spark connection.
  • variable_types: Mapping of columns to types (Continuous_float, Continuous_int, Binary, Nominal, Ordinal).
  • analysis_formula: Regression formula describing the analysis model.
  • m: Number of imputations (default 5).
  • maxit: Chained-imputation iterations per imputation (default 5).
  • checkpointing: Use of Spark/HDFS checkpointing (default TRUE).
  • Others: predictorMatrix, method, seed, printFlag.

Example:

library(sparklyr); library(dplyr); library(bigMICE)

conf <- spark_config()
conf$`sparklyr.shell.driver-memory` <- "10G"
conf$spark.memory.fraction <- 0.8
conf$`sparklyr.cores.local` <- 4
sc <- spark_connect("local", config = conf)

sdf <- spark_read_csv(sc, name="mydata", path="data.csv", header=TRUE, infer_schema=TRUE, null_value="NA")

variable_types <- c(age = "Continuous_int", bmi = "Continuous_float", gender = "Binary", phb = "Nominal", tv = "Continuous_int")
analysis_formula <- as.formula("phb ~ age + gender + tv + bmi")

res <- bigMICE::mice.spark(
  data = sdf,
  sc = sc,
  variable_types = variable_types,
  analysis_formula = analysis_formula,
  m = 3,
  maxit = 2,
  checkpointing = FALSE
)

print(res)                  # pooled results
res$model_params            # per-imputation parameters
res_full <- bigMICE::mice.spark.plus(...)   # imputed datasets
imputed_dfs <- res_full$imputations

5. Performance, Scalability, and Imputation Quality

bigMICE was benchmarked on subsets of the Swedish National Diabetes Registry (14.6 M rows × 10 variables).

| Method | Max RAM (16 GB cap) | Runtime (14M rows, 1 imputation × 5 iter) | RMSE (GFR, MCAR 50%) |
| --- | --- | --- | --- |
| Standard mice | Exceeds 16 GB at ~7M rows | >150 min | 1.70 (1K rows) → 1.62 (8M rows) |
| bigMICE | ≤16 GB (all sizes) | ≈37 min (14M rows) | 1.62 (8M rows) |
  • Standard mice’s RAM scales linearly past 1M rows, exceeds 16 GB around 7M rows, and becomes intractable at larger scales.
  • bigMICE keeps RAM usage within the JVM driver allocation at all tested sizes via disk spilling and checkpointing.
  • Wall-time for bigMICE on 14M rows is approximately 37 minutes, about one-third of mice’s time on much smaller subsets.
  • Imputation quality (RMSE) remains stable with increasing $n$ and with high missingness proportions. For $n = 1$M rows and missingness from 10% to 99.9%, RMSE stays between 1.61 and 1.63, with only a modest increase to 1.71 at near-complete missingness (Morvan et al., 29 Jan 2026).

6. Practical Recommendations and Limitations

Recommended usage patterns:

  • Memory settings: Set sparklyr.shell.driver-memory at or below available RAM (e.g., "10G" on a 16 GB laptop) and spark.memory.fraction ≈ 0.8.
  • Checkpointing: Enable checkpointing (default TRUE) and adjust its frequency (e.g., every 10 variables) to balance Spark DAG size against disk I/O.
  • Imputation count (m): For large $n$, raising $m$ beyond 5 yields limited Monte Carlo error reduction while increasing computational cost.
  • Model selection: Prefer scalable Spark MLlib learners (linear/logistic models) for speed; random forests offer greater model flexibility at the cost of slower runtimes.
  • Data output: Use mice.spark.plus() sparingly if completed imputed datasets are needed; otherwise, rely on pooled results for efficiency.
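The diminishing return from raising $m$ follows directly from Rubin's total-variance formula $T = W + (1 + 1/m)B$: only the $1/m$ term shrinks as $m$ grows. A quick numeric illustration (plain Python; the $W$ and $B$ values are hypothetical, chosen only to show the shape of the curve):

```python
# Total variance T(m) = W + (1 + 1/m) * B for illustrative W, B.
W, B = 1.0, 0.2   # hypothetical within/between-imputation variances

def total_variance(m):
    """Rubin total variance as a function of the number of imputations m."""
    return W + (1 + 1 / m) * B

for m in (2, 5, 10, 50):
    print(m, round(total_variance(m), 4))
# Going from m=5 to m=50 removes only B/5 - B/50 of total variance,
# a small fraction whenever B is modest relative to W.
```

With these values, $T(5) = 1.24$ while $T(50) = 1.204$: a tenfold increase in compute buys under a 3% reduction in total variance.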

In large-scale registry settings, reliable imputation is obtained for variables with high missingness (>70%) when sample size is in the millions. bigMICE thus enables statistically robust MICE-based inference at scale, with an API and configuration model accessible to R practitioners and familiar to users of existing MICE workflows (Morvan et al., 29 Jan 2026).
