
Structured Bottlenecks for Missing Data

Updated 30 January 2026
  • The paper introduces a structured bottleneck framework that compresses covariate blocks and preserves relevant information for robust treatment effect estimation.
  • The proposed method employs block-specific information bottleneck objectives and differentiable encoders to effectively handle systematic missingness at test time.
  • Empirical evaluations demonstrate state-of-the-art causal inference performance and estimation consistency under various missing data regimes.

Structured bottlenecks for missing data denote formal approaches that leverage explicit information-theoretic or statistical structures to enable robust estimation, causal inference, and model selection in the presence of patterned or block-wise missingness. Such bottlenecks are designed to compress observed covariates into discrete or low-dimensional codes while maximally retaining relevant information for downstream tasks, and to enable principled transfer of learned representations to prediction or inference when covariate blocks are systematically absent—especially during test or deployment. Key methodologies include deep information bottleneck objectives partitioned by block-wise missingness, discrete clustering, multi-source imputation, and optimal integration of block-specific estimating equations.

1. Structured Information Bottleneck Objectives

Structured bottlenecks arise from information bottleneck (IB) principles formalizing selective compression. In the Cause-Effect Deep Information Bottleneck (CEIB) approach (Parbhoo et al., 2018), the covariate vector $X \in \mathbb{R}^d$ is explicitly partitioned into two blocks: $X_1$ (available at train time, systematically missing at test time) and $X_2$ (always available). Discrete latent codes $V_1$ and $V_2$ are learned for $X_1$ and $X_2$, respectively; these are concatenated as $Z = (V_1, V_2)$. The structured IB objective optimally compresses each block while preserving information about the target $(Y, T)$:

$$\max_{\phi, \theta, \psi, \eta}\, -I_\phi(V_1; X_1) - I_\eta(V_2; X_2) + \lambda\, I_{\phi, \theta, \psi, \eta}(Z; (Y, T))$$

with block-specific compression terms

$$I_\phi(V_1; X_1) = \mathbb{E}_{p(x_1)}\, \mathrm{KL}\left(q_\phi(v_1 \mid x_1) \,\|\, p(v_1)\right)$$

and the analogous definition for $I_\eta(V_2; X_2)$. The relevance term $I(Z; (Y,T))$ is lower-bounded by the expected decoder log-likelihood over sampled cluster assignments. This split structure ensures that information discard and retention are controlled precisely per observed covariate block.
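As a rough numeric sketch, the block-specific compression terms can be estimated on a mini-batch as the average KL divergence between each categorical encoder posterior and its code prior. The posterior arrays `q1`, `q2`, the uniform priors, and the cluster counts below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def categorical_kl(q, p):
    """KL(q || p) between categorical distributions along the last axis."""
    q = np.clip(q, 1e-12, 1.0)
    p = np.clip(p, 1e-12, 1.0)
    return np.sum(q * np.log(q / p), axis=-1)

def structured_ib_compression(q1, q2, prior1, prior2):
    """Block-specific compression terms: E_x KL(q_phi(v1|x1) || p(v1))
    and the analogous term for block 2, averaged over a mini-batch."""
    I1 = categorical_kl(q1, prior1).mean()  # estimate of I_phi(V1; X1)
    I2 = categorical_kl(q2, prior2).mean()  # estimate of I_eta(V2; X2)
    return I1, I2

# Toy batch: 4 samples, K1 = 3 codes for block 1, K2 = 2 for block 2 (hypothetical sizes).
rng = np.random.default_rng(0)
q1 = rng.dirichlet(np.ones(3), size=4)
q2 = rng.dirichlet(np.ones(2), size=4)
I1, I2 = structured_ib_compression(q1, q2, np.ones(3) / 3, np.ones(2) / 2)
# The full objective to maximize would be: -I1 - I2 + lambda * relevance_lower_bound
```

Each KL term is nonnegative and vanishes exactly when the encoder posterior matches the prior, which is what makes the two terms act as per-block compression pressure.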

2. Encoder, Decoder, and Cluster Design

The CEIB method implements structured encoders for each block: $q_\phi(v_1 \mid x_1)$ and $q_\eta(v_2 \mid x_2)$, each mapping inputs to logits, followed by categorical sampling via Gumbel-softmax reparameterization to achieve differentiability. The bottleneck $Z = (V_1, V_2)$ thus indexes a $K_1 \times K_2$ discrete grid of equivalence-class clusters. Decoders estimate $p_\psi(t \mid z)$ for treatment assignment (Bernoulli output) and $p_\theta(y \mid t, z)$ for outcomes (Gaussian with $t$-dependent mean heads $f_2$ and $f_3$). This structure ensures interpretability: each cluster encodes a distinct treatment-outcome profile under covariate compression. By segmenting cluster codes according to block-wise missingness patterns, structured bottlenecks facilitate prediction with incomplete test-time data (Parbhoo et al., 2018).
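The Gumbel-softmax step that makes the categorical sampling differentiable can be sketched as follows; the logits, temperature, and code count are hypothetical, and a real encoder would produce the logits from $x_1$ or $x_2$:

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Differentiable relaxation of categorical sampling (Gumbel-softmax).
    Returns a point on the probability simplex; as tau -> 0 the samples
    approach one-hot vectors (hard cluster assignments)."""
    rng = rng or np.random.default_rng()
    # Gumbel(0, 1) noise via the inverse-CDF trick.
    g = -np.log(-np.log(rng.uniform(1e-12, 1.0, size=logits.shape)))
    y = (logits + g) / tau
    y = y - y.max(axis=-1, keepdims=True)  # numerical stability before exp
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

# Toy encoder logits for K1 = 3 codes (hypothetical); the joint code Z
# would index a K1 x K2 grid once v2 is sampled the same way.
logits = np.array([2.0, 0.5, -1.0])
v1_soft = gumbel_softmax(logits, tau=0.5, rng=np.random.default_rng(1))
```

Because the sample is a smooth function of the logits (given the noise), gradients of the decoder log-likelihood can flow back into the encoder parameters, which plain categorical sampling would block.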

3. Handling Systematic Block-wise Missingness at Test Time

CEIB explicitly supports transfer of the learned cluster structure when critical covariate blocks are absent at deployment. When $X_1$ is missing, the encoder $q_\eta(v_2 \mid x_2)$ assigns the test case to its $V_2$ cluster, while $V_1$ is imputed via several strategies:

  • Prior plug-in: Assign $v_1^*$ as the most probable code under the learned $p(v_1)$.
  • Cluster averaging: Average predicted outcomes or cluster effects across possible $v_1$, weighted by $p(v_1 = j)$.
  • Mode of joint clusters: Select $(v_1^*, v_2^*)$ maximizing joint cluster occupancy in training.

This maps each incomplete case to its most probable or averaged equivalence class, from which treatment effect estimates are read off. Aggregation over $v_1$ allows recovery of a purely $v_2$-dependent cluster effect. This structured test-time transfer yields reliable treatment effect estimation under systematically missing covariates, demonstrated to achieve state-of-the-art performance on causal inference benchmarks and a sepsis application (Parbhoo et al., 2018).
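The first two imputation strategies reduce to simple table lookups once training has produced a prior $p(v_1)$ and a per-cluster effect table. The prior values and the effect table `tau` below are made-up illustrations, not fitted quantities:

```python
import numpy as np

# Hypothetical learned quantities: prior over V1 and a treatment-effect
# estimate per (v1, v2) cluster on the K1 x K2 grid (K1 = 3, K2 = 2).
p_v1 = np.array([0.5, 0.3, 0.2])
tau = np.array([[1.0, 2.0],
                [0.5, 1.5],
                [0.0, 1.0]])

def prior_plugin_effect(v2):
    """Strategy 1: plug in the most probable code v1* under p(v1)."""
    return float(tau[np.argmax(p_v1), v2])

def cluster_average_effect(v2):
    """Strategy 2: average cluster effects over v1, weighted by p(v1 = j),
    yielding a purely v2-dependent effect."""
    return float(p_v1 @ tau[:, v2])

v2 = 1  # cluster assigned by q_eta(v2|x2) for an incomplete test case
est_plugin = prior_plugin_effect(v2)   # -> 2.0 (mode of p(v1) is j = 0)
est_avg = cluster_average_effect(v2)   # -> 0.5*2.0 + 0.3*1.5 + 0.2*1.0 = 1.65
```

The averaged estimate marginalizes out the missing block, which is exactly the "recovery of a purely $v_2$-dependent cluster effect" described above.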

4. Multi-source Block-wise Imputation via Estimating Equations

Alternatively, the multi-source block-wise imputation (MBI) framework addresses missingness in model selection by generating multiple conditional imputations from both complete and partially observed blocks, which form the basis for efficient estimation (Xue et al., 2019). Here, the data are partitioned into $R$ disjoint groups according to missing pattern; for each group $r$ and each imputation group $k \in \mathcal{G}(r)$, conditional means $\widehat{E}(X_{il} \mid \bm{X}_{i,J(r,k)})$ are fit for missing block variables. Completed covariate vectors are constructed per imputation and then enter estimating equations:

$$U_{i}^{(r,k)}(\bm\beta) = \bm{z}_i^{(k)\,\top}\left\{y_i - \mu_i^{(k)}(\bm\beta)\right\}$$

Stacking these yields the full system $U(\bm\beta)$, which is integrated via a penalized generalized method-of-moments (GMM) objective for joint estimation and variable selection:

$$\hat{\bm{\beta}} = \arg\min_{\bm{\beta}\in\mathbb{R}^p} \left\{ U(\bm{\beta})^\top \widehat{W}(\bm{\beta})^{-1} U(\bm{\beta}) + \sum_{j=1}^p p_\lambda(|\beta_j|) \right\}$$

where $p_\lambda(\cdot)$ denotes a nonconcave penalty, typically SCAD. This structured imputation utilizes all available block sources and achieves asymptotic efficiency gains over single complete-case imputation (Xue et al., 2019).
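The SCAD penalty referenced here has a standard closed form (Fan and Li's three-piece definition): linear up to $\lambda$, a quadratic taper up to $a\lambda$, then constant. A minimal vectorized sketch, with the conventional default $a = 3.7$:

```python
import numpy as np

def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty p_lambda(|beta_j|): lam*|b| for |b| <= lam,
    (2*a*lam*|b| - b^2 - lam^2) / (2*(a-1)) for lam < |b| <= a*lam,
    and lam^2*(a+1)/2 beyond a*lam (nonconcave, nearly unbiased for
    large coefficients, unlike the lasso's linear growth)."""
    b = np.abs(beta)
    return np.where(
        b <= lam,
        lam * b,
        np.where(
            b <= a * lam,
            (2 * a * lam * b - b**2 - lam**2) / (2 * (a - 1)),
            lam**2 * (a + 1) / 2,
        ),
    )

beta = np.array([0.05, 0.5, 5.0])
pen = scad_penalty(beta, lam=0.2)
# pen[0] = 0.2 * 0.05 = 0.01 (linear zone)
# pen[2] = 0.2**2 * 4.7 / 2 = 0.094 (flat zone: no extra shrinkage)
```

The flat tail is what distinguishes SCAD from convex penalties: large coefficients incur a constant penalty, so selected signals are not shrunk toward zero.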

5. Optimization and Hyperparameter Selection

Training structured bottleneck models involves differentiable Monte Carlo estimation of the decoder log-likelihood (for the relevance term) and closed-form computation of block-specific KL terms (for compression). Both the CEIB and MBI frameworks employ stochastic gradient descent, often Adam with a learning rate near $10^{-3}$, on mini-batches of block-partitioned data (Parbhoo et al., 2018, Xue et al., 2019). For penalized GMM objectives, conjugate-gradient minimization is applied, with principal-component extraction to stabilize sample covariance matrix inversion. The hyperparameter $\lambda$ balances the compression-relevance trade-off and is tuned via cross-validation on prediction or causal-effect error (e.g., ACE error for CEIB, a BIC-type criterion for MBI).
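The cross-validation loop for selecting $\lambda$ follows the usual grid-search pattern. The sketch below uses a closed-form ridge fit on toy data as a stand-in for the actual penalized GMM solve (the data, fold count, and grid are all illustrative assumptions; CEIB would score folds by ACE error and MBI by a BIC-type criterion rather than the MSE used here):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
beta_true = np.array([1.0, -2.0, 0.0, 0.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=200)

def fit_ridge(X, y, lam):
    """Closed-form ridge fit, standing in for the penalized objective."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_select_lambda(X, y, grid, k=5):
    """Pick lambda by k-fold cross-validation on held-out prediction error."""
    idx = np.arange(len(y))
    folds = np.array_split(idx, k)
    scores = []
    for lam in grid:
        errs = []
        for fold in folds:
            train = np.setdiff1d(idx, fold)
            b = fit_ridge(X[train], y[train], lam)
            errs.append(np.mean((y[fold] - X[fold] @ b) ** 2))
        scores.append(np.mean(errs))  # average held-out error for this lambda
    return grid[int(np.argmin(scores))]

grid = np.array([0.01, 0.1, 1.0, 10.0])
best_lam = cv_select_lambda(X, y, grid)
```

Only the outer loop changes across frameworks: the inner fit and the fold-scoring criterion are swapped for the model at hand.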

6. Practical Impact and Empirical Evaluation

Structured bottlenecks demonstrate practical effectiveness across simulation regimes and real-world applications. CEIB attains reliable and interpretable treatment effect estimates in incomplete covariate settings without sacrificing benchmark performance relative to competing approaches (Parbhoo et al., 2018). MBI achieves estimation and model selection consistency, selection sparsity, and asymptotic normality under both fixed and high-dimensional regimes (Xue et al., 2019). In biomedical applications, MBI selects biomarkers corroborated by external studies, with test RMSE reductions of 20–25% over single-imputation and competing methods, and is robust to Missing-At-Random, Missing-Completely-At-Random, and informative missingness (Xue et al., 2019). This suggests that exploiting missingness structure via bottlenecks or multi-source imputation provides substantial efficiency and accuracy gains.

7. Theoretical Guarantees and Limitations

Both CEIB and MBI frameworks offer theoretical guarantees on consistency, efficiency, and recovery of sparsity. In fixed-dimensional settings, the estimation error satisfies $\|\hat{\bm{\beta}} - \bm{\beta}^0\|_2 = O_p(N^{-1/2}\zeta_N)$, with improved covariance bounds compared to single imputation ($\bm{V}^{(1)} - \bm{V} \succeq 0$). For diverging dimensions, rate conditions ensure local minimizer existence and sparsity. A plausible implication is that strict block-wise partitioning and exploitation of all informative subsources are key to optimal statistical power in high-missingness regimes (Xue et al., 2019). Limitations include the need for suitable regularity in imputation models and identifiable covariance structures, as well as trade-offs in cluster granularity and interpretability depending on the compression parameterization.

Structured bottlenecks constitute a principled, theoretically supported approach to systematically missing data, enabling robust causal inference and model selection through explicit exploitation and transfer of data structure.
