High-dimensional Many-to-many-to-many Mediation Analysis

Published 3 Apr 2026 in stat.ME, q-bio.GN, q-bio.QM, stat.AP, and stat.ML | (2604.02886v1)

Abstract: We study high-dimensional mediation analysis in which exposures, mediators, and outcomes are all multivariate, and both exposures and mediators may be high-dimensional. We formalize this as a many (exposures)-to-many (mediators)-to-many (outcomes) (MMM) mediation analysis problem. Methodologically, MMM mediation analysis simultaneously performs variable selection for high-dimensional exposures and mediators, estimates the indirect effect matrix (i.e., the coefficient matrices linking exposure-to-mediator and mediator-to-outcome pathways), and enables prediction of multivariate outcomes. Theoretically, we show that the estimated indirect effect matrices are consistent and element-wise asymptotically normal, and we derive error bounds for the estimators. To evaluate the efficacy of the MMM mediation framework, we first investigate its finite-sample performance, including convergence properties, the behavior of the asymptotic approximations, and robustness to noise, via simulation studies. We then apply MMM mediation analysis to data from the Alzheimer's Disease Neuroimaging Initiative to study how cortical thickness of 202 brain regions may mediate the effects of 688 genome-wide significant single nucleotide polymorphisms (SNPs) (selected from approximately 1.5 million SNPs) on eleven cognitive-behavioral and diagnostic outcomes. The MMM mediation framework identifies biologically interpretable, many-to-many-to-many genetic-neural-cognitive pathways and improves downstream out-of-sample classification and prediction performance. Taken together, our results demonstrate the potential of MMM mediation analysis and highlight the value of statistical methodology for investigating complex, high-dimensional multi-layer pathways in science. The MMM package is available at https://github.com/THELabTop/MMM-Mediation.


Authors (6)

Summary

The paper introduces a novel MMM mediation framework that jointly models high-dimensional exposures, mediators, and outcomes using a multivariate LSEM approach.
It employs elastic net regularization and cross-fitted prediction techniques to ensure consistency, asymptotic normality, and effective variable selection.
Empirical studies, including Alzheimer’s disease applications, demonstrate accurate recovery of mediation paths and strong out-of-sample predictive performance.

High-dimensional Many-to-many-to-many Mediation Analysis: Framework, Theory, and Applications

Introduction

The paper "High-dimensional Many-to-many-to-many Mediation Analysis" (2604.02886) introduces and formalizes the many-to-many-to-many (MMM) mediation framework. This model generalizes classical mediation analysis by simultaneously considering multivariate exposures, mediators, and outcomes—each of which may be high-dimensional. The MMM framework enables rigorous estimation, inference, and variable selection in causal pathways linking multiple exposures to multiple mediators and, in turn, to multiple outcomes. It provides both theoretical guarantees (consistency, asymptotic normality, and error bounds for estimators) and empirical validation through comprehensive simulation studies and a substantive application to genetic-neuroimaging-cognitive relationships in Alzheimer's disease (AD).

MMM Mediation Model and Methodology

The MMM mediation setting is modeled by a multivariate linear structural equation model (LSEM) with exposures $\mathbf{x} \in \mathbb{R}^q$, mediators $\mathbf{m} \in \mathbb{R}^p$, and outcomes $\mathbf{y} \in \mathbb{R}^T$, where both $q$ and $p$ may exceed the sample size $n$. The model is:

$\begin{aligned} \mathbf{m}_i &= \bm{\alpha}^\top \mathbf{x}_i + \bm{\zeta}^\top \mathbf{z}_i + \bm{\epsilon}_i \\ \mathbf{y}_i &= \bm{\beta}^\top \mathbf{m}_i + \bm{\gamma}^\top \mathbf{x}_i + \bm{\eta}^\top \mathbf{z}_i + \bm{\xi}_i \end{aligned}$

where $(\bm{\alpha}, \bm{\beta}, \bm{\gamma}, \bm{\zeta}, \bm{\eta})$ specify path-coefficient matrices, and $\mathbf{z}_i$ denotes covariates. This configuration yields a matrix-valued global indirect (mediation) effect $\bm{\alpha} \bm{\beta}$ mapping exposures to outcomes via all mediators.
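
To see why $\bm{\alpha} \bm{\beta}$ plays the role of the indirect effect, one can substitute the mediator equation into the outcome equation. The following is a standard reduced-form calculation from the two structural equations, not a formula quoted from the paper:

$\begin{aligned} \mathbf{y}_i &= \bm{\beta}^\top \left( \bm{\alpha}^\top \mathbf{x}_i + \bm{\zeta}^\top \mathbf{z}_i + \bm{\epsilon}_i \right) + \bm{\gamma}^\top \mathbf{x}_i + \bm{\eta}^\top \mathbf{z}_i + \bm{\xi}_i \\ &= (\bm{\alpha} \bm{\beta} + \bm{\gamma})^\top \mathbf{x}_i + (\bm{\zeta} \bm{\beta} + \bm{\eta})^\top \mathbf{z}_i + \bm{\beta}^\top \bm{\epsilon}_i + \bm{\xi}_i \end{aligned}$

Since $\bm{\alpha}$ is $q \times p$ and $\bm{\beta}$ is $p \times T$, the product $\bm{\alpha} \bm{\beta}$ is $q \times T$, matching the shape of the direct-effect matrix $\bm{\gamma}$. The total effect of the exposures on the outcomes therefore splits into an indirect part $\bm{\alpha} \bm{\beta}$ and a direct part $\bm{\gamma}$, and entry $(j, t)$ of $\bm{\alpha} \bm{\beta}$ sums the contribution of exposure $j$ to outcome $t$ over all $p$ mediators.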

A joint penalized estimation procedure is employed, leveraging elastic net regularization in each LSEM stage to enforce sparsity and enable scalable estimation and selection in high-dimensional regimes. The estimation procedure is summarized in Algorithm 1 of the paper and includes cross-fitted out-of-sample prediction, where mediation parameters allow outcome prediction using only exposures and covariates, even when mediators are unavailable.
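
The two-stage idea can be illustrated with a minimal sketch. This is not the paper's Algorithm 1 or the API of the MMM package. It assumes centered data, omits the covariates $\mathbf{z}_i$ and the cross-fitting step, and uses a small hand-written coordinate-descent elastic net with illustrative tuning values (`lam`, `l1_ratio`). All function and variable names are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding operator induced by the L1 penalty."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def elastic_net(X, Y, lam=0.1, l1_ratio=0.9, n_iter=200):
    """Cyclic coordinate descent for
    (1/2n)||Y - XB||_F^2 + lam*l1_ratio*||B||_1 + 0.5*lam*(1-l1_ratio)*||B||_F^2.
    Each response column is fit separately; returns B with shape (d, k)."""
    n, d = X.shape
    B = np.zeros((d, Y.shape[1]))
    R = Y.copy()                              # residuals Y - X @ B (B starts at zero)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(d):
            R += np.outer(X[:, j], B[j])      # partial residual without feature j
            rho = X[:, j] @ R / n
            B[j] = soft_threshold(rho, lam * l1_ratio) / (col_sq[j] + lam * (1 - l1_ratio))
            R -= np.outer(X[:, j], B[j])      # restore full residual
    return B

# Synthetic data from the LSEM with sparse paths (illustrative values only).
rng = np.random.default_rng(0)
n, q, p, T = 200, 30, 20, 3
alpha = np.zeros((q, p))
alpha[[0, 1, 2], [0, 1, 2]] = 1.0             # exposure k drives mediator k, k = 0, 1, 2
beta = np.zeros((p, T))
beta[:3, :] = 0.8                             # first three mediators drive every outcome
X = rng.standard_normal((n, q))
M = X @ alpha + 0.5 * rng.standard_normal((n, p))
Y = M @ beta + 0.5 * rng.standard_normal((n, T))
train, test = slice(0, 150), slice(150, n)

# Stage 1: mediators on exposures. Stage 2: outcomes on mediators and exposures.
alpha_hat = elastic_net(X[train], M[train])
stage2 = elastic_net(np.hstack([M[train], X[train]]), Y[train])
beta_hat, gamma_hat = stage2[:p], stage2[p:]
indirect_hat = alpha_hat @ beta_hat           # q x T indirect-effect matrix

# Predict held-out outcomes from exposures only (no mediators at test time).
Y_pred = X[test] @ (indirect_hat + gamma_hat)
ss_res = ((Y[test] - Y_pred) ** 2).sum()
ss_tot = ((Y[test] - Y[test].mean(axis=0)) ** 2).sum()
r2 = 1.0 - ss_res / ss_tot
print("nonzero rows of indirect_hat:", np.flatnonzero(np.abs(indirect_hat).sum(axis=1) > 1e-8))
print("held-out R^2 from exposures only:", round(float(r2), 3))
```

The prediction step uses only the exposure matrix because the fitted mediator equation stands in for the unobserved mediators. In a fuller treatment the penalty levels would be tuned, for example by cross-validation, rather than fixed.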

Figure 1: Schematic of the MMM mediation framework: multivariate exposures, mediators, and outcomes linked via structured paths, with pipeline for coefficient estimation and effect interpretation.

Theoretical Properties

The MMM estimators for direct and indirect effects are shown to be consistent and asymptotically normal under mild regularity conditions. Error bounds for mean squared estimation error are derived explicitly. The theoretical analysis extends elastic net model selection and sign consistency theorems to the high-dimensional multivariate pathway context, using the Elastic Irrepresentable Condition (EIC) adapted to simultaneous high-dimensional settings for exposures and mediators.

Explicit formulas for the identification and estimation of the matrix-valued natural indirect and direct effects are provided under potential outcomes notation, with extensions of sequential ignorability to the MMM causal setting. Entrywise asymptotic normality is established for estimated indirect (mediation) effects, supporting large-sample inference at the individual path level.
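
As an illustration of what path-level inference can look like, the sketch below computes a Wald-type interval for a single entry $(j, t)$ of $\bm{\alpha} \bm{\beta}$ using the classical first-order delta method (a Sobel-type approximation). This is a textbook stand-in, not the variance formula derived in the paper. It assumes the coefficient estimates are approximately independent, and the standard errors are hypothetical inputs.

```python
import math

def indirect_effect_wald(alpha_row, beta_col, se_alpha, se_beta, z_crit=1.96):
    """Delta-method inference for sum_k alpha_row[k] * beta_col[k].

    Assumes independent coefficient estimates, so the variance of each product
    alpha_k * beta_k is approximated by beta_k^2 * se_alpha_k^2 + alpha_k^2 * se_beta_k^2.
    """
    est = sum(a * b for a, b in zip(alpha_row, beta_col))
    var = sum((b * sa) ** 2 + (a * sb) ** 2
              for a, b, sa, sb in zip(alpha_row, beta_col, se_alpha, se_beta))
    se = math.sqrt(var)
    z = est / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided normal p-value
    ci = (est - z_crit * se, est + z_crit * se)
    return est, se, z, p_value, ci

# Hypothetical estimates for one exposure and one outcome across two mediators.
est, se, z, p_value, ci = indirect_effect_wald(
    alpha_row=[0.5, 0.0], beta_col=[0.4, 0.3],
    se_alpha=[0.1, 0.1], se_beta=[0.1, 0.1],
)
print(f"estimate={est:.3f}, se={se:.4f}, z={z:.2f}, p={p_value:.4f}, ci={ci}")
```

When many paths are tested at once, as in a $q \times T$ indirect-effect matrix, some form of multiplicity control would normally accompany intervals of this kind.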

Simulation Studies

A comprehensive set of simulation experiments evaluates finite-sample parameter recovery, stability, Type I error behavior, and robustness to noise and sample size. These studies demonstrate:

Highly accurate recovery of coefficient and indirect-effect matrices under block-structured ground truths, attaining clear separation of true nonzero and null mediation pathways.
Bootstrap-based stability indices indicating reproducibility of results and resilience to sampling variability.
Robustness to elevated noise and stability across varying sample sizes; indirect-effect estimation is more sensitive because errors in both factors compound in the product $\bm{\alpha} \bm{\beta}$.
Empirical convergence of normalized estimation errors and parameter correlations to the ground truth at moderate-to-large $n$.
Empirical distributions of selected indirect effects closely approximating normality, corroborating the limiting theory.

Figure 2: Heatmaps of true and estimated $\bm{\alpha}$, $\bm{\beta}$, and $\bm{\alpha} \bm{\beta}$, with error convergence, stability, and empirical normality analyses across simulated regimes.
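
The error-convergence check described in this section can be mimicked on a small scale. The sketch below is not the paper's simulation design. It assumes a low-dimensional setting, a simple block-structured ground truth, and ordinary least squares as a stand-in estimator, and it tracks the relative Frobenius error of the estimated indirect-effect matrix as $n$ grows. All names and parameter values are illustrative.

```python
import numpy as np

def simulate(n, rng, q=6, p=4, T=2, noise=1.0):
    """Draw (X, M, Y) from a linear SEM with block-structured alpha and beta."""
    alpha = np.zeros((q, p))
    alpha[:3, :2] = 1.0                       # first 3 exposures -> first 2 mediators
    beta = np.zeros((p, T))
    beta[:2, :] = 0.5                         # first 2 mediators -> all outcomes
    X = rng.standard_normal((n, q))
    M = X @ alpha + noise * rng.standard_normal((n, p))
    Y = M @ beta + noise * rng.standard_normal((n, T))
    return X, M, Y, alpha @ beta

def estimate_indirect(X, M, Y):
    """Two-stage least squares estimate of the indirect-effect matrix alpha @ beta."""
    alpha_hat = np.linalg.lstsq(X, M, rcond=None)[0]
    coef = np.linalg.lstsq(np.hstack([M, X]), Y, rcond=None)[0]
    beta_hat = coef[: M.shape[1]]             # rows for mediators; the rest are direct effects
    return alpha_hat @ beta_hat

def relative_error(n, n_rep=20, seed=0):
    """Average relative Frobenius error of the estimate over n_rep replications."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(n_rep):
        X, M, Y, truth = simulate(n, rng)
        est = estimate_indirect(X, M, Y)
        errs.append(np.linalg.norm(est - truth) / np.linalg.norm(truth))
    return float(np.mean(errs))

errors = {n: relative_error(n) for n in (50, 200, 800)}
for n, err in errors.items():
    print(f"n={n:4d}  mean relative error of indirect effects: {err:.3f}")
```

Replacing `estimate_indirect` with a penalized estimator and raising `q` and `p` above `n` would move the sketch closer to the high-dimensional regime the paper studies.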

Application: Genetic–Neural–Cognitive Mediation in Alzheimer's Disease

The MMM framework is applied to the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset to elucidate polygenic–brain–cognition pathways. The exposures are 688 genome-wide significant SNPs, mediators are 202 regional cortical thickness measures, and outcomes are 11 AD-related diagnosis and cognitive/behavior endpoints. The framework is able to:

Recover interpretable and biologically coherent genetic–brain ($\bm{\alpha}$) maps, with dominant pathways spanning Default Mode, Control, Dorsal Attention, and Visual cortical networks, consistent with known AD vulnerability loci.
Localize brain-to-cognition mediator effects ($\bm{\beta}$) to Default Mode Network (DMN), temporo–parietal, and prefrontal circuits, matching areas repeatedly implicated in cognitive deterioration.
Uncover a compact, structured genetic–neural–cognitive mediation network, identifying both convergent and divergent SNP–ROI–cognition pathways, notably funneling polygenic signals through AD-relevant cortical hubs to diverse cognitive outcomes.

Figure 3: AD application: (a) experimental design, (b) estimated exposure–mediator effects, (c) spatial localization of mediator–outcome effects, (d) mediation network of strongest SNP–brain–outcome links, and (e) multivariate out-of-sample prediction performance.

In addition, out-of-sample prediction experiments show that the mediator representations and path coefficients identified by MMM support strong prediction of cognitive and diagnostic outcomes, even when only exposures and covariates are available at test time. This indicates that the estimated pathways generalize beyond the training sample.

Implications and Future Directions

The MMM methodology targets high-dimensional pathway analysis settings in which multiple correlated exposures, mediators, and outcomes interact. Practically, it yields interpretable variable selection and pathway identification, supports outcome prediction when mediators are unavailable at test time, and fits naturally into scientific inference for complex biological systems such as imaging genetics.

The paper establishes consistency, entrywise asymptotic normality, and explicit error bounds for the indirect-effect estimators, and the simulation results are consistent with these properties. Compared to prior work restricted to univariate or separately modeled multivariate exposures and outcomes, the MMM framework enables estimation and hypothesis testing on the entire path matrix in a single, coherent model.

Theoretically, future extensions include nonlinear and nonparametric MMM path models, incorporation of prior knowledge for biomarker or network selection (e.g., Bayesian variants), and longitudinal MMM mediation for dynamic pathway inference. Technically, integration with generalized estimating equations or PDE-based mixed-effects models could further expand MMM utility.

Conclusion

The MMM mediation framework provides a robust, theoretically justified, and empirically validated methodology for high-dimensional, multilayer mediation analysis. It facilitates variable selection, effect estimation, and outcome prediction in settings with multiple exposures, mediators, and outcomes. The approach is effective both in recovering interpretable structures in scientific data (as shown in complex neuroimaging-genomics-disease settings) and in rigorous statistical inference, opening avenues for more granular study of multivariate causal pathways in diverse high-dimensional scientific domains.
