A Unified Framework for Variable Selection in Model-Based Clustering with Missing Not at Random
Abstract: Model-based clustering integrated with variable selection is a powerful tool for uncovering latent structures within complex data. However, its effectiveness is often hindered by challenges such as identifying relevant variables that define heterogeneous subgroups and handling data that are missing not at random, a prevalent issue in fields like transcriptomics. While several notable methods have been proposed to address these problems, they typically tackle each issue in isolation, thereby limiting their flexibility and adaptability. This paper introduces a unified framework designed to address these challenges simultaneously. Our approach incorporates a data-driven penalty matrix into penalized clustering to enable more flexible variable selection, along with a mechanism that explicitly models the relationship between missingness and latent class membership. We demonstrate that, under certain regularity conditions, the proposed framework achieves both asymptotic consistency and selection consistency, even in the presence of missing data. This unified strategy significantly enhances the capability and efficiency of model-based clustering, advancing methodologies for identifying informative variables that define homogeneous subgroups in the presence of complex missing data patterns. The performance of the framework, including its computational efficiency, is evaluated through simulations and demonstrated using both synthetic and real-world transcriptomic datasets.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a concrete, actionable list of the main uncertainties and open problems that remain unresolved by the paper:
- Unspecified inference of MAR vs MNARz assignment per variable: The method assumes a known partition of variables into MAR and MNARz sets (D_MAR and D_MNAR), but provides no procedure to learn this partition from data, nor to assess uncertainty or misclassification effects when the partition is unknown.
- Restrictive MNARz mechanism: The MNARz model uses a single class-specific missingness probability ρ(ψ_k) shared across all variables (i.e., a product of identical Bernoulli terms per class), disallowing variable-specific (and potentially heterogeneous) class-dependent missingness; no extension is provided for per-variable, per-class probabilities ρ_{kd} or correlated missingness across variables.
- Missingness dependent on both class and values: The framework relies on the MNARz structural restriction (missingness depends on class or on value, but not both). It does not address identification, estimation, or robustness when missingness depends simultaneously on latent class and (possibly unobserved) values, which is common in practice.
- Learning the missingness model for the MAR block at scale: The MAR component introduces variable-specific models (e.g., logistic), but the paper does not detail how to estimate these in high dimensions (regularization, identifiability, stability) or how their uncertainty propagates to clustering and variable role assignment.
- Factorization and ignorability claims under mixed MAR/MNAR: The observed-data likelihood factorization (separating MNARz from MAR) is asserted but not fully derived for the general SRUW model with mixed mechanisms; edge cases and failure modes of this factorization are not analyzed.
- Ranking on single imputation under MNAR: Stage A ranks variables using a single, fast imputation, which can bias ranking when missingness is informative (MNAR). There is no assessment of ranking robustness nor alternatives (e.g., multiple imputation, EM-based pseudo-completed data, or direct penalized likelihood with missingness modeled).
- Ranking criterion focuses on mean separation only: The variable-ranking score counts nonzero component means along the penalty path, potentially missing variables that discriminate clusters via covariance/precision differences only; no covariance-aware ranking (e.g., based on changes in precision matrices) is provided.
- Pathwise label alignment: The ranking aggregates nonzero means across a (λ, ρ) path, but the paper does not address label alignment across EM runs (label switching across the path), which can invalidate the “nonzero count” score without an explicit label-matching scheme.
- Spectral weighting reduces adaptivity: The proposed spectral-based weighting for precision penalties yields a single scalar weight per cluster (P_{k,ij} constant over i,j), losing element-wise adaptivity compared to inverse partial-correlation weights; there is no analysis of when this helps or hurts, and no guidance on choosing the adjacency threshold τ_adj.
- Hyperparameter selection and sensitivity: The approach depends on multiple user-chosen knobs (λ, ρ paths, ξ for path endpoints, ε in spectral weights, τ_adj for graph construction), yet the paper does not supply data-driven selection rules, sensitivity analyses, or default heuristics with theoretical support.
- Initialization and local optima: EM is known to be sensitive to initialization and local maxima, especially with missing data. The paper proposes initialization heuristics but provides neither a systematic study (multi-start strategies, deterministic annealing, or convergence diagnostics) nor any convergence guarantees.
- Degeneracy and covariance constraints: Finite mixtures with missingness are prone to degeneracy (e.g., singular covariances). The paper does not analyze how the chosen parsimonious covariance families m interact with degeneracy risk under missingness or how to enforce safeguards.
- Asymptotic regime not specified for high-dimensionality: Theoretical results are “informal” and defer assumptions to the supplement, without clarifying whether selection consistency holds when p grows with n (p ≫ n settings common in transcriptomics), or only for fixed p.
- Joint selection of K, m under MNARz: There is no theoretical guarantee for consistent selection of the number of clusters K and covariance structure m under the mixed MAR/MNAR scenario; proofs assume known (K, m, r, l), limiting practical applicability.
- BIC form under MNARz not derived: The paper applies a BIC-type criterion but does not explicitly derive the penalty terms accounting for the MNARz parameters and mixed missingness; the impact of missingness modeling complexity on BIC consistency is not analyzed.
- One-pass role assignment suboptimality: Stage B assigns roles (S/R/U/W) via a single pass along the ranked list; there is no analysis of suboptimal role assignments due to ordering, nor strategies (e.g., look-ahead, local swaps, or stability selection) to mitigate ordering-induced errors.
- Handling extreme missingness and unbalanced classes: The method’s behavior under high missingness rates, class imbalance, and near-separable missingness patterns is not studied (e.g., breakdown points, identifiability thresholds, or sample complexity requirements).
- Non-Gaussian and domain-specific data models: The framework assumes Gaussian mixtures, yet transcriptomic data are often counts, zero-inflated, or heavy-tailed; there is no extension to mixtures of GLMs, copulas, or robust/transform-based alternatives, nor guidance on when Gaussian approximations suffice.
- Correlated missingness processes: The missingness model assumes independence across variables given the mechanism (MAR or MNARz). Real data often show correlated or block-wise missingness; no hierarchical or shared-parameter missingness models are considered.
- Uncertainty quantification and post-selection inference: The MLE-based approach yields point estimates without measures of uncertainty for cluster assignments, variable roles, or missingness parameters, and does not address post-selection inference or stability of selected sets.
- Computational complexity and scalability: The paper does not provide time/memory complexity, scaling behavior with (N, D, K) under missingness, or ablation showing the trade-offs of the two-stage pipeline (path length L, number of grid points, and number of restarts).
- Reproducibility specifics: Details on code availability, implementation choices (e.g., solvers for penalized precision updates under missingness), and reproducibility (random seeds, numerical tolerances) are not supplied.
- Sensitivity to misspecification of MAR/MNAR partition: No simulation study or theory quantifies the impact of mislabeling variables as MAR vs MNARz on clustering accuracy, variable selection, or identifiability.
- Estimation of the MAR missingness predictors: When MAR depends on high-dimensional observed covariates, the paper does not specify regularization or model selection for ρ_d(y^o; γ_{M_d}), nor its interaction with the SRUW role assignment (e.g., leakage of S variables into the missingness model).
- Integration/imputation details in EM: While the paper references EM with internal imputation, it does not detail how conditional expectations are computed under SRUW with mixed MAR/MNAR, nor how numerical integration issues are handled when patterns are complex.
- Lack of robustness to outliers: There is no analysis of robustness (e.g., heavy tails, leverage points) in either clustering or missingness modeling, and no robust penalties or loss functions are considered.
- Absence of covariance-aware variable scoring in Stage A: The precision-penalization is used for estimation but not for ranking; a principled score that integrates changes in both means and precisions across the path is not developed.
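As an illustration of the pathwise label alignment gap listed above, one possible label-matching scheme is sketched below. It is not taken from the paper: the function names `align_labels` and `align_path`, the choice to match clusters on their mean vectors, and the squared Euclidean cost are all assumptions made for this example. Each fit along the penalty path is relabeled to agree with the previously aligned fit by searching over all permutations of the K cluster labels.

```python
from itertools import permutations


def align_labels(ref_means, new_means):
    """Return perm such that new_means[perm[k]] is matched to ref_means[k].

    Hypothetical helper: matches clusters by minimizing the total squared
    Euclidean distance between mean vectors (an assumption, not the paper's
    method). Exhaustive search over K! permutations, so only for small K.
    """
    k = len(ref_means)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(k)):
        cost = sum(sq_dist(ref_means[i], new_means[perm[i]]) for i in range(k))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return list(best_perm)


def align_path(path_means):
    """Relabel every fit on the path to agree with the previous aligned fit.

    path_means: list of fits, each a list of K mean vectors (lists of floats).
    Returns a new list of fits with consistently ordered clusters.
    """
    aligned = [path_means[0]]
    for means in path_means[1:]:
        perm = align_labels(aligned[-1], means)
        aligned.append([means[j] for j in perm])
    return aligned


# Toy path with two fits; the second fit has its two cluster labels swapped.
fit_a = [[0.0, 0.0], [5.0, 5.0]]
fit_b = [[5.1, 4.9], [0.1, -0.1]]
aligned_path = align_path([fit_a, fit_b])
print(aligned_path[1])
```

In the toy path, the second fit is reordered so that its first cluster is again the one near the origin. Two caveats apply: when heavy penalization shrinks many means to zero, mean-based matching can become ambiguous, and for larger K the exhaustive search would normally be replaced by an assignment solver such as the Hungarian algorithm.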
