Multi-Site Health Research Integrating Complementary Data Sources: A Scoping Review of Statistical Inference Methods for Vertically Partitioned Data

Published 3 Apr 2026 in stat.ME | (2604.02595v1)

Abstract: To address the multidimensional nature of health-related questions, advances in health research often require integrating information from various data sources within statistical analyses. When complementary information pertaining to the same set of individuals is distributed across different institutions, vertical methods make it possible to obtain analysis results without sharing or pooling individual-level data. To guide stakeholders toward a transparent use of vertical methods, this study aims to (1) identify existing vertical methods enabling statistical inference; and (2) characterize the methodological properties of these methods and the current extent of their use with health data. We conducted a scoping review using four interdisciplinary databases. We then systematically extracted the characteristics of identified vertical methods with respect to comparability with the pooled analysis, efficiency of communication schemes and confidentiality. We additionally screened studies that cited included articles to identify applications on vertically partitioned real-world health data. Among 2887 articles initially screened, 30 were included in the review. Inference for the linear and logistic regression frameworks was the most frequent statistical inference task undertaken by the proposed methods. Equivalence with the pooled analyses was not systematically addressed and most methods required multiple communications between participating parties. Almost all articles described their approach as privacy-preserving, although only a minority provided privacy assessments. The scope of existing approaches enabling statistical inference for vertically partitioned data is still relatively limited. Most existing methods do not concurrently achieve results equivalent to centralized analyses, high communication efficiency, and guaranteed protection of individual-level data.


Authors (5)

Summary

The paper provides a scoping review of distributed inference protocols applied to vertically partitioned health data, comparing encryption-based, secure multiparty computation (SMC), and vertically separable approaches.
The paper highlights that communication efficiency and scalability are critical challenges, with iterative protocols increasing operational overhead and limiting model applicability.
The paper underscores significant privacy limitations, noting the need for rigorous assessments and formal guarantees in real-world multi-site health research.

Statistical Inference Methods for Vertically Partitioned Data in Multi-Site Health Research

Overview and Context

Integrating multimodal health-related datasets across institutions is critical for addressing complex, multidimensional questions in contemporary health research. The vertical partitioning paradigm, where complementary features (clinical, genomic, socioeconomic, environmental) for a common cohort are distributed among independent custodians, presents unique analytical and regulatory challenges. The vertical setting has been studied far less than the horizontal one, and because legal, ethical, and technical constraints often preclude pooling individual-level data, it calls for distributed inference protocols that balance utility, efficiency, and confidentiality.

This paper provides a comprehensive scoping review of statistical inference methods applicable to vertically partitioned data, focusing on their comparability with pooled analyses, communication efficiency, confidentiality guarantees, and documented health analytics applications. Thirty articles passing stringent inclusion criteria are systematically examined, spanning parametric, semi-parametric, and non-parametric models, alongside probability-based and cryptographic approaches.

Methodological Landscape

Model Support and Statistical Tasks

The literature demonstrates a clear bias toward standard regression models: linear regression (30%), logistic regression (20%), and Cox proportional hazards regression (10%) dominate. Support for other parametric or semi-parametric models, for classical inference tasks (Welch t-test, chi-square test, sign test), and for causal inference is sporadic. Coverage of general GLMs, GLMMs, and multivariate and longitudinal models is sparse, which leaves a gap for practical, high-dimensional health analytics. Notably, numerical equivalence with centralized inference is inconsistently reported: some methods assert theoretical exactness, while others settle for empirical agreement up to several decimal places in simulation studies.

Protocols for Vertical Distribution

Three primary classes of algorithms emerge:

Encryption-based schemes (43%): these use homomorphic encryption or secret sharing for secure computation of statistics.
Secure multiparty computation (SMC) (37%): these incorporate randomness or protocol steps such that only global statistics can be reconstructed, not individual data.
Vertically separable quantities (13%): these directly exploit the algebraic separability of relevant statistics or likelihood components, which is often feasible only for models with closed-form solutions.
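To make the secret-sharing idea in the first class concrete, the following is a minimal sketch of additive secret sharing, assuming semi-honest, non-colluding parties and non-negative integer inputs smaller than the modulus. The names `PRIME`, `make_shares`, and `secure_sum`, as well as the example values, are illustrative assumptions and are not taken from any of the reviewed articles.

```python
import secrets

# Modulus for share arithmetic (an illustrative choice, not from the review).
PRIME = 2**61 - 1


def make_shares(value, n_parties):
    """Split an integer into n additive shares that sum to value mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % PRIME
    return shares + [last]


def secure_sum(private_values):
    """Return the sum of the private values using only shares and partial sums."""
    n = len(private_values)
    # Row i holds the shares produced by party i; column j is sent to party j.
    share_matrix = [make_shares(v, n) for v in private_values]
    # Each party j adds the shares it received and publishes only that partial sum.
    partial_sums = [
        sum(share_matrix[i][j] for i in range(n)) % PRIME for j in range(n)
    ]
    return sum(partial_sums) % PRIME


# Hypothetical integer-coded local statistics held by three custodians.
local_stats = [120, 45, 310]
total = secure_sum(local_stats)
```

Each individual share is uniformly random, so no single share reveals the value it was derived from; only the final total is reconstructed. Real-valued quantities would need a fixed-point encoding before sharing, which this sketch omits.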

The operational complexity and communication burden are highly protocol-dependent. Iterative SMC-based and encryption-driven protocols are the norm, especially for non-linear models (e.g., logistic, Cox), while closed-form models (e.g., linear regression) occasionally admit limited one-shot communication in two-node settings. In general, communication efficiency deteriorates sharply as the number of nodes grows, which is a major practical bottleneck.
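The closed-form case can be illustrated with ordinary least squares. In this sketch, which assumes two nodes, node A holds the columns `x_a` and node B holds `x_b` and the outcome `y`. The normal equations are assembled from within-node blocks plus the cross-node blocks and then compared with the pooled fit. All data and helper names are hypothetical, and the cross-node products are computed in the clear purely for illustration; an actual protocol would have to obtain them through a secure computation.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for r in range(n):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


# Hypothetical data: node A holds an intercept and one covariate,
# node B holds one covariate and the outcome.
x_a = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.0], [1.0, 3.5], [1.0, 4.0]]
x_b = [[2.0], [1.0], [4.0], [3.0], [5.0]]
y_col = [[3.1], [3.9], [7.2], [8.1], [10.4]]

# Within-node blocks: each can be computed locally.
aa = matmul(transpose(x_a), x_a)    # 2 x 2, node A only
bb = matmul(transpose(x_b), x_b)    # 1 x 1, node B only
b_y = matmul(transpose(x_b), y_col)  # 1 x 1, node B only

# Cross-node blocks: these need inputs from both nodes (computed in the
# clear here; a real protocol would protect this step).
ab = matmul(transpose(x_a), x_b)    # 2 x 1
a_y = matmul(transpose(x_a), y_col)  # 2 x 1

# Assemble the block normal equations and solve them.
gram = [aa[i] + ab[i] for i in range(2)] + [transpose(ab)[0] + bb[0]]
rhs = [r[0] for r in a_y] + [r[0] for r in b_y]
beta_blocks = solve(gram, rhs)

# Reference: the pooled analysis on the column-wise concatenated data.
x = [ra + rb for ra, rb in zip(x_a, x_b)]
beta_pooled = solve(
    matmul(transpose(x), x), [r[0] for r in matmul(transpose(x), y_col)]
)
```

Because the assembled blocks reproduce the pooled cross-product matrix, the two coefficient vectors agree. Only the cross-node blocks require any exchange, which is why a closed-form model can sometimes be handled in a single communication round.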

Confidentiality and Privacy Guarantees

Privacy-preserving computation is claimed by almost all reviewed methods; however, only a minority provide rigorous privacy assessments, formal cryptographic proofs, or consider the potential for reconstructing line-level data from intermediate quantities. Many approaches rely on the absence of direct data exchange as a surrogate for privacy, a practice that is insufficient for stringent custodians (e.g., Canadian provincial statistical agencies). Several methods further assume the availability or sharing of raw outcome variables across nodes, which constitutes a major limitation in real-world regulatory environments. The majority of privacy guarantees are predicated on the honest-but-curious (semi-honest) adversarial model, in which parties execute protocols faithfully but may try to learn additional information from the outputs they observe. Joint exploitation of aggregated outputs is rarely analyzed, and disclosure risks associated with final statistics (coefficients, confidence intervals) are typically ignored.
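A small hypothetical scenario shows why the absence of direct data exchange does not by itself protect line-level data. Suppose a node holding two covariates for an individual sends its partial linear predictor at two iterations, and the receiving party also learns the coefficients used at each iteration. The sketch below, which is an assumed setting and not an attack described in the reviewed articles, recovers both covariates by solving a 2 x 2 linear system. The function name and all values are illustrative.

```python
def recover_covariates(coefs_t1, coefs_t2, z_t1, z_t2):
    """Solve [[b1, b2], [c1, c2]] @ [x1, x2] = [z_t1, z_t2] by Cramer's rule."""
    b1, b2 = coefs_t1
    c1, c2 = coefs_t2
    det = b1 * c2 - b2 * c1
    if det == 0:
        raise ValueError("coefficient vectors are collinear; system not solvable")
    x1 = (z_t1 * c2 - b2 * z_t2) / det
    x2 = (b1 * z_t2 - c1 * z_t1) / det
    return x1, x2


# Hypothetical private record held by one node: (age, smoker indicator).
secret = (63.0, 1.0)

# Coefficients at two iterations, assumed known to the curious party.
coefs_t1 = (0.10, 0.50)
coefs_t2 = (0.12, 0.40)

# Intermediate quantities the node shares: its partial linear predictors.
z_t1 = coefs_t1[0] * secret[0] + coefs_t1[1] * secret[1]
z_t2 = coefs_t2[0] * secret[0] + coefs_t2[1] * secret[1]

recovered = recover_covariates(coefs_t1, coefs_t2, z_t1, z_t2)
```

The same argument extends to p covariates whenever p linearly independent coefficient vectors are observed, which is one reason a privacy assessment needs to cover intermediate quantities and not only the raw data.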

Communication Efficiency and Scalability

Communication patterns sharply influence practical feasibility. Only six articles propose one-shot schemes (a single communication round per node), and most of these are restricted to two-node cases or to specific models where algorithmic separability applies. Logistic and Cox regression inference, which are pivotal in health analytics, almost invariably require extensive iterative exchange, centralized coordination, or both. As a result, operational overheads are high: manual review, network latency, and non-automated infrastructures become barriers to adoption. Methods designed around iterative parameter updates or cryptographic primitives (e.g., homomorphic encryption) further increase resource demands.
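The need for repeated exchange in logistic regression can be seen in a deliberately unprotected sketch. It assumes two nodes that each hold one covariate and a coordinator that holds the outcome, and it fits the model by plain gradient descent. Every pass through the loop corresponds to one uplink (partial linear predictors) and one downlink (residuals). The data, the function name `fit_vertical_logistic`, and the learning rate and tolerance are assumptions made for illustration; no confidentiality mechanism is included, and sharing residuals in this way would itself need a privacy assessment.

```python
import math

# Hypothetical data: node A holds x_a, node B holds x_b, coordinator holds y.
x_a = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
x_b = [0.3, -0.8, 1.2, -0.4, 0.9, -1.1, 0.2, -0.6]
y = [0, 0, 1, 1, 0, 0, 1, 1]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def fit_vertical_logistic(x_a, x_b, y, lr=1.0, tol=1e-6, max_rounds=10000):
    """Gradient descent for logistic regression (no intercept) over two nodes."""
    n = len(y)
    beta_a = beta_b = 0.0
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        # Uplink: each node sends its partial linear predictor.
        part_a = [beta_a * v for v in x_a]
        part_b = [beta_b * v for v in x_b]
        # Coordinator: combine the parts and compute residuals.
        resid = [
            sigmoid(pa + pb) - yi for pa, pb, yi in zip(part_a, part_b, y)
        ]
        # Downlink: residuals go back; each node computes its own gradient.
        grad_a = sum(r * v for r, v in zip(resid, x_a)) / n
        grad_b = sum(r * v for r, v in zip(resid, x_b)) / n
        if max(abs(grad_a), abs(grad_b)) < tol:
            break
        beta_a -= lr * grad_a
        beta_b -= lr * grad_b
    return beta_a, beta_b, rounds


beta_a, beta_b, rounds = fit_vertical_logistic(x_a, x_b, y)
```

Unlike the closed-form linear case, there is no single set of summary statistics that yields the estimate, so `rounds` is necessarily greater than one. If each round also involves manual review or encrypted arithmetic, the cost per round multiplies accordingly.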

Real-World Applications and Uptake

Despite theoretical advances, adoption of vertical partitioning methods for distributed statistical inference in actual health analytics platforms is minimal. Only a handful of systems (e.g., PopMedNet, D-CLEF) and pilot studies reference or implement these methods, mainly for basic regression analyses. The scarcity of practical use points to unresolved operational, regulatory, and model-scope challenges: limited model support, a lack of formal privacy guarantees, and communication burdens collectively hinder uptake. The review argues that substantive engagement with data custodians and further methodological innovation are required for broader implementation.

Implications for AI and Future Directions

The reviewed methods provide foundational strategies relevant to federated analytics for sensitive biomedical data, but the current state does not concurrently optimize statistical fidelity, communication efficiency, and formal confidentiality. For AI and health informatics, the implications are as follows:

Strong numerical claims (exact equivalence) are feasible for linear models under certain protocols, but encryption-driven or SMC-based methods generally suffer from approximation errors.
Adoption will require development of scalable one-shot protocols for broader classes of models (including GLMMs, high-dimensional regression, and time-to-event analyses) and rigorous privacy risk assessments beyond non-exchange of line-level data.
Advances in privacy-preserving distributed optimization and federated learning will likely inform future protocol design, enabling richer model-fitting, causal inference, and multimodal fusion, all of which are vital for next-generation AI-driven health analytics.
From a theoretical perspective, characterizing conditions under which vertically partitioned algorithms achieve asymptotic and finite-sample equivalence to centralized estimators remains an open problem, as does operationalizing differential privacy or cryptographic guarantees at scale.

Conclusion

The literature on statistical inference methods for vertically partitioned data demonstrates a limited scope, with predominant focus on parametric regression and significant gaps in broader model support and operational practicality. Most extant approaches fail to achieve simultaneous equivalence with centralized analyses, communication efficiency, and robust privacy protection, restricting applicability to sensitive health research contexts. Comprehensive privacy assessments, extension to high-dimensional and longitudinal models, and resource-efficient communication protocols will be critical for future developments. Addressing these barriers has direct implications for the evolution of distributed AI and federated analytics in health informatics (2604.02595).
