A Unified Framework for Variable Selection in Model-Based Clustering with Missing Not at Random

Published 25 May 2025 in stat.ME, cs.LG, math.ST, stat.AP, stat.ML, and stat.TH | (2505.19093v2)

Abstract: Model-based clustering integrated with variable selection is a powerful tool for uncovering latent structures within complex data. However, its effectiveness is often hindered by challenges such as identifying relevant variables that define heterogeneous subgroups and handling data that are missing not at random, a prevalent issue in fields like transcriptomics. While several notable methods have been proposed to address these problems, they typically tackle each issue in isolation, thereby limiting their flexibility and adaptability. This paper introduces a unified framework designed to address these challenges simultaneously. Our approach incorporates a data-driven penalty matrix into penalized clustering to enable more flexible variable selection, along with a mechanism that explicitly models the relationship between missingness and latent class membership. We demonstrate that, under certain regularity conditions, the proposed framework achieves both asymptotic consistency and selection consistency, even in the presence of missing data. This unified strategy significantly enhances the capability and efficiency of model-based clustering, advancing methodologies for identifying informative variables that define homogeneous subgroups in the presence of complex missing data patterns. The performance of the framework, including its computational efficiency, is evaluated through simulations and demonstrated using both synthetic and real-world transcriptomic datasets.

Summary

  • The paper introduces a unified framework that integrates signal-based variable selection with MNAR-aware clustering for complex missing data.
  • It employs adaptive penalization and a two-stage procedure to accurately rank variables and assign roles, ensuring consistent selection even under non-ignorable missingness.
  • Empirical evaluations in transcriptomics and synthetic experiments show robust cluster recovery and improved ARI despite increasing missingness levels.

A Unified Framework for Variable Selection in Model-Based Clustering with MNAR Data

Introduction and Motivation

Model-based clustering, particularly with finite mixture models such as Gaussian Mixture Models (GMMs), gives statistical clustering a probabilistic foundation. A persistent challenge in this setting is to perform variable selection (identifying which observed features are relevant for latent class separation) while also handling complex missing-data mechanisms robustly, especially when missingness is non-ignorable (Missing Not At Random, MNAR). In high-throughput domains such as transcriptomics, missingness frequently depends on latent class structure (e.g., through class-dependent dropout), which renders standard MAR-imputation pipelines inadequate and biases both cluster recovery and feature selection.

The paper proposes a unified, high-dimensional variable selection framework for model-based clustering under general missingness mechanisms, specifically allowing for MNAR data where missingness can depend on the latent class. The methodology builds on previous frameworks for SRUW (Signal–Redundant–Uninformative–Weak) structured variable selection, adaptive LASSO-like penalized likelihood for ranking and selection, and new results on classical and class-MNAR selection models, combining them into a cohesive theoretical and practical approach.

Framework Overview

SRUW Structure with Explicit Missingness Modeling

The core model decomposes variables into four role categories:

  • S (Signal): Directly clustering-relevant variables.
  • R (Redundant): Variables that are conditionally dependent (via regression) on S.
  • U (Uninformative): Variables independent of clustering.
  • W (Weak): Noise variables.

Clustering is performed via GMM on the S block, with regression for R using S as predictors, and separate modeling for U and W as noise. Crucially, the likelihood function is expanded to combine the observed variable structure and missingness patterns, and can be represented as a global GMM with specific block structures in the covariance and mean parameters.

Unified Treatment of MAR and MNAR Mechanisms

The framework generalizes the observed-data likelihood to model missingness mechanisms as a variable-wise partition:

  • Variables in $D_{\text{MAR}}$ follow a missing-at-random (MAR) pattern.
  • Variables in $D_{\text{MNAR}}$ follow a class-dependent MNARz mechanism, where missingness depends on the latent class assignment.

This is achieved by augmenting the observed data with missingness masks and factorizing the likelihood accordingly:

$$\ell(\cdot; Y^{o}, C) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k \, f^{o}_{k,\mathrm{MAR}}(y_n^o; \theta_k) \, f_{c}^{\mathrm{MNARz}}(c_n; \psi_k)$$

where $f_{c}^{\mathrm{MNARz}}(c_n; \psi_k)$ denotes the latent-class-dependent missingness model.
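
As a rough illustration of how this likelihood could be evaluated, the sketch below computes the observed-data log-likelihood for a small Gaussian mixture. It makes several simplifying assumptions that are not taken from the paper: covariances are diagonal (so marginalizing a missing coordinate amounts to skipping it), and the mask term uses one missingness probability per class shared across the variables in $D_{\text{MNAR}}$, i.e. a product of identical Bernoulli terms. The function name `observed_loglik` and its arguments are hypothetical.

```python
import math


def log_normal_pdf(x, mean, var):
    """Log-density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def observed_loglik(Y, C, weights, means, variances, rho, mnar_vars):
    """Observed-data log-likelihood with a class-dependent mask term.

    Y[n][d]    value of variable d for record n (ignored where C[n][d] == 1)
    C[n][d]    1 if the entry is missing, 0 if observed
    weights    mixing proportions pi_k
    means, variances
               per-class, per-variable parameters (diagonal covariance: assumption)
    rho[k]     class-specific missingness probability, strictly between 0 and 1
    mnar_vars  set of variable indices treated as MNARz
    """
    total = 0.0
    for y_n, c_n in zip(Y, C):
        per_class = []
        for k, pi_k in enumerate(weights):
            lp = math.log(pi_k)
            for d, (y, c) in enumerate(zip(y_n, c_n)):
                if c == 0:
                    # Observed entry contributes its Gaussian density.
                    lp += log_normal_pdf(y, means[k][d], variances[k][d])
                if d in mnar_vars:
                    # Bernoulli mask term for MNARz variables only.
                    lp += math.log(rho[k]) if c == 1 else math.log(1.0 - rho[k])
            per_class.append(lp)
        total += logsumexp(per_class)
    return total


# Toy usage: two classes, two variables, variable 1 is MNARz.
Y = [[0.1, None], [4.9, 5.2], [0.0, 0.3]]
C = [[0, 1], [0, 0], [0, 0]]
ll = observed_loglik(
    Y, C,
    weights=[0.5, 0.5],
    means=[[0.0, 0.0], [5.0, 5.0]],
    variances=[[1.0, 1.0], [1.0, 1.0]],
    rho=[0.4, 0.1],
    mnar_vars={1},
)
```

With full covariance matrices the observed-entry term would instead require the marginal Gaussian over the observed coordinates of each record; the diagonal case is used here only to keep the sketch short.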

Adaptive Penalization and Two-Stage Procedure

The implementation proceeds in two stages:

  • Stage A: Penalized GMM clustering with block-adaptive LASSO penalties for means and precision matrices, where penalty weights are determined using a spectral Laplacian construction from initial estimates; variables are ranked by their empirical inclusion frequency along the entire regularization path.
  • Stage B: Variable role assignment using a BIC-based criterion on the ranked list, with model selection for the number of clusters, covariance structure, and role assignment performed in a single scan.
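
The Stage A ranking rule can be sketched as follows. This is a minimal illustration, assuming the penalized mean estimates for every grid point are already available and that the data were centered so that a zero mean in every class marks a variable as excluded. The function name `rank_by_inclusion` and the tolerance `tol` are hypothetical, not part of the paper.

```python
def rank_by_inclusion(mean_path, tol=1e-8):
    """Rank variables by how often they have a nonzero class mean.

    mean_path  list over grid points; each element is a K x D nested list
               of penalized class means (centered data assumed)
    tol        magnitude below which a mean is treated as zero (assumption)

    Returns (order, freq): variable indices sorted by decreasing inclusion
    frequency, and the frequency of each variable.
    """
    n_vars = len(mean_path[0][0])
    counts = [0] * n_vars
    for means in mean_path:
        for d in range(n_vars):
            # A variable is "included" if any class has a nonzero mean.
            if any(abs(mu_k[d]) > tol for mu_k in means):
                counts[d] += 1
    freq = [c / len(mean_path) for c in counts]
    # sorted() is stable, so ties keep their original index order.
    order = sorted(range(n_vars), key=lambda d: -freq[d])
    return order, freq


# Toy path with 3 grid points, 2 classes, 3 variables.
path = [
    [[0.0, 0.5, 1.0], [0.0, 0.0, -1.0]],
    [[0.0, 0.0, 0.8], [0.0, 0.0, -0.7]],
    [[0.0, 0.0, 0.2], [0.0, 0.0, 0.0]],
]
order, freq = rank_by_inclusion(path)
```

In this toy path the third variable survives at every grid point and therefore heads the ranked list, while the first variable is never included and comes last.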

Practical High-Level Workflow

Given data matrix Y with missing mask C:
    1. Impute Y via single imputation for ranking.
    2. For each (λ, ρ) in grid:
        - Run penalized GMM-EM (adaptive weights) on imputed Y.
        - Record variable inclusion (mean ≠ 0) per λ.
    3. Rank variables by non-zero mean frequency.
    4. Role assignment:
        - Traverse ranked variables.
        - For each, fit unpenalized SRUW-MNARz on incomplete Y.
        - Assign role (Signal/Redundant/Uninformative/Weak) using BIC.
    5. Select final SRUW partition and parameters.
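
To make the BIC comparison in step 4 concrete, here is a deliberately reduced toy, not the paper's procedure: it scores a single complete variable under only two candidate roles, with class labels assumed known. The "Uninformative" candidate is one Gaussian; the "Signal" candidate has class-specific means and a shared variance. The sign convention (larger BIC is better) and all function names are assumptions made for this sketch; the actual Stage B fits SRUW-MNARz models on incomplete data and compares all four roles.

```python
import math


def gaussian_loglik(xs, means, var):
    """Sum of univariate Gaussian log-densities with per-point means."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * var) + (x - m) ** 2 / var)
        for x, m in zip(xs, means)
    )


def bic(loglik, n_params, n_obs):
    """BIC on the 'larger is better' scale (convention assumed here)."""
    return 2.0 * loglik - n_params * math.log(n_obs)


def signal_or_uninformative(xs, labels):
    """Pick the role with the larger BIC for one variable."""
    n = len(xs)
    # Uninformative: one Gaussian, 2 parameters (mean, variance).
    mu = sum(xs) / n
    var_u = sum((x - mu) ** 2 for x in xs) / n
    bic_u = bic(gaussian_loglik(xs, [mu] * n, var_u), 2, n)
    # Signal: class-specific means, shared variance, K + 1 parameters.
    classes = sorted(set(labels))
    class_mean = {
        k: sum(x for x, z in zip(xs, labels) if z == k) / labels.count(k)
        for k in classes
    }
    fitted = [class_mean[z] for z in labels]
    var_s = sum((x - m) ** 2 for x, m in zip(xs, fitted)) / n
    bic_s = bic(gaussian_loglik(xs, fitted, var_s), len(classes) + 1, n)
    role = "Signal" if bic_s > bic_u else "Uninformative"
    return role, bic_s, bic_u


labels = [0] * 5 + [1] * 5
separated = [0.1, -0.2, 0.0, 0.15, -0.05, 5.1, 4.8, 5.0, 5.2, 4.9]
flat = [0.1, -0.2, 0.0, 0.15, -0.05, -0.1, 0.2, 0.05, -0.15, 0.0]
role_sep, _, _ = signal_or_uninformative(separated, labels)
role_flat, _, _ = signal_or_uninformative(flat, labels)
```

The variable whose class means differ is labeled as signal, while the variable with identical class means gains nothing in likelihood from the extra parameter and is penalized into the uninformative role.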

Theoretical Guarantees

Identifiability and Consistency

  • Identifiability: The joint model (clustering + missingness mechanism + variable role structure) is shown to be generically identifiable under class-dependent MNARz missingness, provided the block-specific covariance structures admit sufficient non-degeneracy, following a generalization of the algebraic conditions established for SRUW and mixture models.
  • Selection consistency: The BIC-type model selection procedure is shown to consistently recover the true variable role partition as $N \to \infty$, even in the presence of MNAR data, under classical conditions for penalized MLE (restricted eigenvalue, bounded third derivative, and Hessian concentration for GMM likelihood).
  • Ranking consistency (two-stage LASSO-like): For the penalized GMM ranking stage, a restricted strong convexity (RSC) is established on the loss for high-probability error control; together with variable ranking separation (signal/noise gap), this ensures that the correct list of relevant variables appears before the noise in the ranked list with high probability, as confirmed via formal oracle inequalities.

Empirical Evaluation

Synthetic Experiments

Clustering and Imputation Performance

  • Simulated data with controlled cluster means and regression structure was used to compare the unified framework (SelvarMNARz) with pre-imputation pipelines using random forest (missForest), classical (VarSelLCM), and previous MAR-focused variable selection approaches.
  • Under increasing missingness, SelvarMNARz showed only marginal degradation in ARI and NRMSE, whereas impute-then-cluster pipelines degraded rapidly, especially in MNAR scenarios.
  • The improvement in ARI at high missingness is statistically significant (Welch t-test, $p < 0.001$) across 20 replications (Figure 1).

Figure 1: Comparison of four models under MAR and MNAR mechanisms over 20 replications; higher ARI and lower WNRMSE indicate better performance.
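
For readers unfamiliar with the ARI metric reported above, the following is a small self-contained sketch of the standard Adjusted Rand Index computed from the contingency table of two labelings. It is not the paper's evaluation code; the function name is hypothetical and at least two records are assumed.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same records.

    Assumes len(labels_true) == len(labels_pred) >= 2.
    """
    n = len(labels_true)
    cell = Counter(zip(labels_true, labels_pred))  # contingency table cells
    row = Counter(labels_true)                     # true cluster sizes
    col = Counter(labels_pred)                     # predicted cluster sizes
    sum_cells = sum(comb(c, 2) for c in cell.values())
    sum_rows = sum(comb(c, 2) for c in row.values())
    sum_cols = sum(comb(c, 2) for c in col.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        # Degenerate case, e.g. both labelings put everything in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


same_up_to_relabeling = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index is invariant to relabeling of clusters, which is why it suits comparisons between an estimated partition and the simulated ground truth.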

Variable and Cluster Recovery

  • The unified method exhibited stable recovery of the correct number of clusters and the set of relevant variables as missingness increased, a property not shared by other baseline methods.
  • Other methods, particularly those decoupling imputation from clustering, started missing clustering-relevant variables once missingness exceeded 20% (Figure 2).

    Figure 2: Proportions choosing correct relevant variables and cluster components over 20 replications.

Case Study: Arabidopsis Transcriptome

  • Applied to a transcriptomic dataset with 1,267 genes and 27 conditions, the method discovered 18 coherent clusters and consistently identified the early projects (P1–P4) as principal axes, with the later projects (P5–P7) found redundant after adjusting for MNAR patterns in missingness.
  • Within each cluster, regression $R^2$ diagnostics validated that late-stage stress variables' variation was well explained by early axes, in contrast to previous analyses that (under MAR assumptions) assigned variable roles differently.
  • This demonstrates the framework's practical ability to uncover biologically coherent groups and clarify the conditional structure in the presence of class-dependent missingness (Figure 3).

    Figure 3: Mean expression profiles across 18 clusters. Light region indicates irrelevant P.

Algorithmic and Computational Considerations

  • Initialization: Stability is enhanced by using robust single imputation for ranking and cluster-aware covariance estimation, followed by penalized EM path-solving for variable ranking.
  • Regularization grid: Maximal penalty values for mean and precision block parameters are computed via data-driven KKT thresholds, then a geometric sequence is used (with theoretical motivation) for path traversal.
  • Scalability: The overall algorithm scales polynomially as $O(M_{\text{grid}} M_{\text{EM}} (N K D^2 + K d^3))$, a significant improvement over the $O(D^5)$ complexity of classic stepwise selection. Empirically, runtime decreases with increasing missingness due to quadratic scaling with the number of observed entries per record.
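
The geometric regularization path mentioned above can be sketched as follows. The sketch assumes the maximal penalty `lam_max` has already been obtained (the KKT-based computation is not reproduced here), and the endpoint `ratio` and the number of grid points are hypothetical parameters chosen for illustration.

```python
def geometric_grid(lam_max, ratio=0.01, n_points=10):
    """Geometric sequence from lam_max down to ratio * lam_max.

    lam_max   largest penalty value (assumed given)
    ratio     smallest penalty as a fraction of lam_max (assumption)
    n_points  number of grid points (assumption)
    """
    if n_points < 1:
        raise ValueError("n_points must be at least 1")
    if n_points == 1:
        return [lam_max]
    # Constant multiplicative step so that the last point is ratio * lam_max.
    step = ratio ** (1.0 / (n_points - 1))
    return [lam_max * step ** i for i in range(n_points)]


grid = geometric_grid(1.0, ratio=0.01, n_points=3)
```

A geometric spacing places more grid points at small penalties, where the set of included variables tends to change most quickly.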

Discussion and Implications

Practical and Theoretical Impact

  • The unified SRUW-MNARz framework enables simultaneous, asymptotically consistent variable selection and clustering in high-dimensional data under non-ignorable missingness, directly extending the applicability of penalized model-based clustering pipelines from MAR to MNAR settings.
  • Automated, data-driven penalty selection and robust initialization make the method deployable in real-world contexts, especially where missingness is plausibly outcome/class-dependent (e.g., omics, biobank, sensor settings).
  • The approach demonstrates robustness to model misspecification and mixed missingness mechanisms; precise variable partitioning is consistently recovered even when the true missing pattern is a mixture.

Limitations and Future Directions

  • The current model assumes continuous, Gaussian data. Extension to mixed and categorical variables (e.g., latent class models for discrete features) is warranted.
  • Automated, model-based selection of the MAR/MNARz partition is not yet addressed; future work should seek iterative or Bayesian techniques for variable-wise missingness mechanism learning.
  • Improving computational efficiency for very large datasets and optimizing penalty selection via cross-validation or information-criterion learning could further enhance scalability.

Conclusion

This work establishes a unified and theoretically grounded methodology for joint variable selection and model-based clustering in the presence of MNAR data, with strong empirical and theoretical guarantees. The method provides immediate utility for high-dimensional, high-missingness data domains, particularly in genomics and large-scale biomedical inference, and opens promising directions for future research in robust, interpretable clustering under complex data-generating and missingness mechanisms.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concrete, actionable list of the main uncertainties and open problems that remain unresolved by the paper:

  • Unspecified inference of MAR vs MNARz assignment per variable: The method assumes a known partition of variables into MAR and MNARz sets (D_MAR and D_MNAR), but provides no procedure to learn this partition from data, nor to assess uncertainty or misclassification effects when the partition is unknown.
  • Restrictive MNARz mechanism: The MNARz model uses a single class-specific missingness probability $\rho(\psi_k)$ shared across all variables (i.e., product of identical Bernoulli terms per class), disallowing variable-specific (and potentially heterogeneous) class-dependent missingness; no extension is provided for per-variable, per-class probabilities $\rho_{kd}$ or correlated missingness across variables.
  • Missingness dependent on both class and values: The framework relies on the MNARz structural restriction (missingness depends on class or on value, but not both). It does not address identification, estimation, or robustness when missingness depends simultaneously on latent class and (possibly unobserved) values, which is common in practice.
  • Learning the missingness model for the MAR block at scale: The MAR component introduces variable-specific models (e.g., logistic), but the paper does not detail how to estimate these in high dimensions (regularization, identifiability, stability) or how their uncertainty propagates to clustering and variable role assignment.
  • Factorization and ignorability claims under mixed MAR/MNAR: The observed-data likelihood factorization (separating MNARz from MAR) is asserted but not fully derived for the general SRUW model with mixed mechanisms; edge cases and failure modes of this factorization are not analyzed.
  • Ranking on single imputation under MNAR: Stage A ranks variables using a single, fast imputation, which can bias ranking when missingness is informative (MNAR). There is no assessment of ranking robustness nor alternatives (e.g., multiple imputation, EM-based pseudo-completed data, or direct penalized likelihood with missingness modeled).
  • Ranking criterion focuses on mean separation only: The variable-ranking score counts nonzero component means along the penalty path, potentially missing variables that discriminate clusters via covariance/precision differences only; no covariance-aware ranking (e.g., based on changes in precision matrices) is provided.
  • Pathwise label alignment: The ranking aggregates nonzero means across a (λ, ρ) path, but the paper does not address label alignment across EM runs (label switching across the path), which can invalidate the “nonzero count” score without an explicit label-matching scheme.
  • Spectral weighting reduces adaptivity: The proposed spectral-based weighting for precision penalties yields a single scalar weight per cluster (P_{k,ij} constant over i,j), losing element-wise adaptivity compared to inverse partial-correlation weights; there is no analysis of when this helps or hurts, and no guidance on choosing the adjacency threshold τ_adj.
  • Hyperparameter selection and sensitivity: The approach depends on multiple user-chosen knobs (λ, ρ paths, ξ for path endpoints, ε in spectral weights, τ_adj for graph construction), yet the paper does not supply data-driven selection rules, sensitivity analyses, or default heuristics with theoretical support.
  • Initialization and local optima: EM is known to be sensitive to initialization and local maxima, especially with missing data. The paper proposes heuristics but provides no systematic study (multi-start strategies, deterministic annealing, or convergence diagnostics) or guarantees beyond heuristics.
  • Degeneracy and covariance constraints: Finite mixtures with missingness are prone to degeneracy (e.g., singular covariances). The paper does not analyze how the chosen parsimonious covariance families m interact with degeneracy risk under missingness or how to enforce safeguards.
  • Asymptotic regime not specified for high-dimensionality: Theoretical results are “informal” and defer assumptions to the supplement, without clarifying whether selection consistency holds when p grows with n (p ≫ n settings common in transcriptomics), or only for fixed p.
  • Joint selection of K, m under MNARz: There is no theoretical guarantee for consistent selection of the number of clusters K and covariance structure m under the mixed MAR/MNAR scenario; proofs assume known (K, m, r, l), limiting practical applicability.
  • BIC form under MNARz not derived: The paper applies a BIC-type criterion but does not explicitly derive the penalty terms accounting for the MNARz parameters and mixed missingness; the impact of missingness modeling complexity on BIC consistency is not analyzed.
  • One-pass role assignment suboptimality: Stage B assigns roles (S/R/U/W) via a single pass along the ranked list; there is no analysis of suboptimal role assignments due to ordering, nor strategies (e.g., look-ahead, local swaps, or stability selection) to mitigate ordering-induced errors.
  • Handling extreme missingness and unbalanced classes: The method’s behavior under high missingness rates, class imbalance, and near-separable missingness patterns is not studied (e.g., breakdown points, identifiability thresholds, or sample complexity requirements).
  • Non-Gaussian and domain-specific data models: The framework assumes Gaussian mixtures, yet transcriptomic data are often counts, zero-inflated, or heavy-tailed; there is no extension to mixtures of GLMs, copulas, or robust/transform-based alternatives, nor guidance on when Gaussian approximations suffice.
  • Correlated missingness processes: The missingness model assumes independence across variables given the mechanism (MAR or MNARz). Real data often show correlated or block-wise missingness; no hierarchical or shared-parameter missingness models are considered.
  • Uncertainty quantification and post-selection inference: The MLE-based approach yields point estimates without measures of uncertainty for cluster assignments, variable roles, or missingness parameters, and does not address post-selection inference or stability of selected sets.
  • Computational complexity and scalability: The paper does not provide time/memory complexity, scaling behavior with (N, D, K) under missingness, or ablation showing the trade-offs of the two-stage pipeline (path length L, number of grid points, and number of restarts).
  • Reproducibility specifics: Details on code availability, implementation choices (e.g., solvers for penalized precision updates under missingness), and reproducibility (random seeds, numerical tolerances) are not supplied.
  • Sensitivity to misspecification of MAR/MNAR partition: No simulation study or theory quantifies the impact of mislabeling variables as MAR vs MNARz on clustering accuracy, variable selection, or identifiability.
  • Estimation of the MAR missingness predictors: When MAR depends on high-dimensional observed covariates, the paper does not specify regularization or model selection for $\rho_d(y^o; \gamma_{Md})$, nor its interaction with the SRUW role assignment (e.g., leakage of S variables into the missingness model).
  • Integration/imputation details in EM: While the paper references EM with internal imputation, it does not detail how conditional expectations are computed under SRUW with mixed MAR/MNAR, nor how numerical integration issues are handled when patterns are complex.
  • Lack of robustness to outliers: There is no analysis of robustness (e.g., heavy tails, leverage points) in either clustering or missingness modeling, and no robust penalties or loss functions are considered.
  • Absence of covariance-aware variable scoring in Stage A: The precision-penalization is used for estimation but not for ranking; a principled score that integrates changes in both means and precisions across the path is not developed.
