
Multimodal Fusion for PROs

Updated 10 February 2026
  • Multimodal integration for PROs is the process of synthesizing diverse health data including clinical, sensor, and free-text inputs into a unified, actionable patient representation.
  • Advanced fusion techniques like cross-modal attention, joint embedding, and decision-level aggregation align structured and unstructured data to improve predictive accuracy.
  • Practical implementations demonstrate significant gains in patient outcome prediction, achieving higher AUC and reduced error through robust handling of missingness and asynchrony.

Multimodal integration for patient-reported outcomes (PROs) addresses the challenge of synthesizing heterogeneous data streams—including clinical variables, PROs, social determinants of health, biological measurements, and behavioral signals—into unified patient representations to support clinical prediction, monitoring, and individualized intervention. Central to these efforts are advanced architectures for feature encoding, cross-modal alignment, and statistically principled fusion, combined with rigorous handling of missingness, temporal asynchrony, and outcome-specific biases. The integration of PROs, particularly free-text narratives, into multimodal machine learning systems enables a more complete understanding of patient state while highlighting methodological challenges unique to healthcare data.

1. Core Principles of Multimodal Fusion Incorporating PROs

Multimodal integration in healthcare leverages diverse data modalities: structured EHR variables, social determinants, multi-omic measurements, sensor streams, and PROs (including free-text and structured surveys). Each modality is subjected to specialized preprocessing and feature extraction, using domain-appropriate encoders such as deep neural networks, LLMs, temporal CNNs, or autoencoders.

A unifying strategy is to project all modalities into a common $D$-dimensional latent space, producing a set of embeddings $\{h^{(m)}\}_{m=1}^{M}$ which represent the full richness of the patient record (Amoei et al., 2024, Shaik et al., 2023). Free-text PROs, processed via fine-tuned BERT/GPT LLMs, require subsequent pooling or attention-based readout to yield a fixed-size embedding $h^{(\mathrm{PRO})}$ commensurate with those derived from structured data.
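As a minimal illustration of this shared-latent-space idea, the sketch below projects three hypothetical modality vectors (structured EHR features, a sensor summary, and mean-pooled token states standing in for an LLM-encoded free-text PRO) into one $D$-dimensional space via random linear maps. All dimensions, encoders, and names here are placeholders, not any cited system's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # shared latent dimension (illustrative)

def project(x, W):
    """Stand-in for a learned modality-specific encoder."""
    return x @ W

# Hypothetical raw inputs: 40 structured EHR features, 12 sensor summaries.
ehr = rng.normal(size=40)
sensor = rng.normal(size=12)
W_ehr = rng.normal(size=(40, D)) / np.sqrt(40)
W_sensor = rng.normal(size=(12, D)) / np.sqrt(12)

# Free-text PRO: token embeddings from an LLM (random placeholders here),
# mean-pooled into a fixed-size vector before projection.
pro_tokens = rng.normal(size=(25, 768))  # 25 tokens, 768-dim hidden states
W_pro = rng.normal(size=(768, D)) / np.sqrt(768)

h_ehr = project(ehr, W_ehr)
h_sensor = project(sensor, W_sensor)
h_pro = project(pro_tokens.mean(axis=0), W_pro)

# All modality embeddings now live in the same D-dimensional space.
embeddings = np.stack([h_ehr, h_sensor, h_pro])
print(embeddings.shape)  # (3, 16)
```

Once all embeddings share a dimension, the fusion mechanisms discussed next (concatenation, attention, gating) can operate on them uniformly.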

Fusion architectures span simple concatenation, feature-level stacking, and more complex mechanisms such as cross-modal attention, hierarchical gating, and graph transformer modules. Alignment between structured and unstructured modalities is often enforced via joint embedding or contrastive losses, which encourage semantically similar clinical states and PRO narratives to occupy nearby regions of the latent space (Amoei et al., 2024).
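The contrastive-alignment idea can be sketched with a generic InfoNCE-style loss: paired structured and PRO embeddings for the same patient (the diagonal of the similarity matrix) are pulled together while mismatched pairs are pushed apart. The data below are synthetic and the formulation is a standard one, not the specific objective of any cited system.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """Contrastive loss aligning paired (structured, PRO) embeddings."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature           # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs lie on the diagonal: patient i's EHR with patient i's PRO.
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(1)
base = rng.normal(size=(8, 16))                    # shared patient state
z_struct = base + 0.05 * rng.normal(size=(8, 16))  # structured-data view
z_pro = base + 0.05 * rng.normal(size=(8, 16))     # PRO-narrative view

aligned = info_nce(z_struct, z_pro)        # correctly paired: low loss
shuffled = info_nce(z_struct, z_pro[::-1])  # mismatched pairs: high loss
print(aligned < shuffled)  # True
```

Minimizing such a loss during training is what drives corresponding clinical states and PRO narratives toward the same latent region.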

2. Fusion Architectures and Mathematical Formulations

Several principled mathematical frameworks underlie multimodal fusion for PRO-centric applications:

  • Early/Data-level Fusion: Direct vector concatenation or weighted summation of raw or lightly processed features, typically prior to deep representation learning (Shaik et al., 2023).
  • Feature-level Fusion: Canonical correlation analysis (CCA), joint embedding autoencoders, or multimodal transformers enable discovery of linear or nonlinear associations among modality-specific features. These approaches simultaneously optimize cross-modal coherence and downstream predictive utility.
  • Decision-level (Late) Fusion: An ensemble of modality-specific classifiers whose predictions are combined by fixed or learnable weights, voting, or belief-function aggregation. Statistical correlation-based weights (e.g., normalized Spearman correlation) can drive this process, ensuring that the modalities most relevant to the outcome dominate the final decision function (Gu et al., 2024).
  • Multi-Agent Paradigms: Multi-agent AI architectures instantiate one agent per clinical outcome, each optimizing a dedicated loss, with a meta-agent aggregating agent-level outputs via an attention mechanism to produce global action plans or risk scores (Amoei et al., 2024).
  • Supervised Coupled Matrix-Tensor Factorization (SCMTF): Integrates static covariates, temporal labs, and sparse PROs into a coupled low-rank decomposition, regularized to enforce sparsity and interpretability. Classifiers are trained on patient–phenotype loadings, supporting both imputation of missing PROs and prediction of outcomes such as medication persistence (Minoccheri et al., 24 Jun 2025).
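As a concrete instance of the decision-level strategy above, the following sketch fuses hypothetical per-modality risk scores with normalized relevance weights, standing in for the Spearman-correlation-derived weights mentioned; all numbers are illustrative.

```python
import numpy as np

def late_fusion(preds, weights):
    """Convex combination of modality-specific probability predictions."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the fused score stays a valid probability
    return np.average(preds, axis=0, weights=w)

# Hypothetical per-modality predicted risks for 4 patients.
p_ehr    = np.array([0.80, 0.30, 0.55, 0.10])
p_sensor = np.array([0.70, 0.40, 0.60, 0.20])
p_pro    = np.array([0.90, 0.20, 0.50, 0.05])

# Stand-in relevance weights (e.g., each modality's normalized Spearman
# correlation with the outcome on a validation set).
weights = [0.5, 0.2, 0.3]
fused = late_fusion(np.stack([p_ehr, p_sensor, p_pro]), weights)
print(np.round(fused, 3))  # [0.81  0.29  0.545 0.105]
```

Because the weights are estimated from outcome correlations rather than learned end-to-end, this scheme stays interpretable: each modality's influence on the final risk score is a single inspectable number.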

3. Alignment, Missingness, and Temporal Asynchrony

Aligning heterogeneous sources requires addressing modality-specific time scales (e.g., daily surveys vs. minute-level sensor data), missingness patterns, and data sparsity. Common strategies include time-window aggregation, timestamp synchronization, dynamic time-warping, and segment-based PRO–sensor alignment (Sun et al., 2024, Shaik et al., 2023, Liu et al., 30 Nov 2025).
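A minimal example of the time-window aggregation strategy above: minute-level sensor values are averaged into daily windows so they line up with once-daily PRO surveys. The helper and its data are hypothetical.

```python
import numpy as np

def daily_aggregate(timestamps_min, values, n_days):
    """Mean-aggregate minute-level sensor values into daily windows,
    aligning them with once-daily PRO surveys. Days with no readings
    stay NaN, preserving the missingness pattern."""
    day_idx = (np.asarray(timestamps_min) // (24 * 60)).astype(int)
    values = np.asarray(values, dtype=float)
    out = np.full(n_days, np.nan)
    for d in range(n_days):
        mask = day_idx == d
        if mask.any():
            out[d] = values[mask].mean()
    return out

# Heart-rate readings at assorted minutes over two days
# (minutes 1500 and 2000 fall on day 1).
ts = [10, 300, 1500, 2000]
vals = [60.0, 70.0, 80.0, 90.0]
print(daily_aggregate(ts, vals, 2))  # [65. 85.]
```

The resulting daily vector can then be concatenated row-wise with the corresponding PRO survey responses before fusion.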

Missing data, particularly in PRO streams (often 75–88% missing), are handled via masked losses, conditional variational autoencoders, tensor factorization masking (an $\Omega$-mask), or simple omission at the fusion stage. Advanced approaches incorporate explicit missingness tokens or engineered features—such as daily device-wearing percentage—enabling models to learn informative non-random missingness patterns (Liu et al., 30 Nov 2025).
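The masked-loss idea fits in a few lines: an observation mask (playing the role of the Ω-mask in tensor factorization) restricts the reconstruction error to observed PRO entries, so missing values contribute nothing to the gradient. A toy-data sketch:

```python
import numpy as np

def masked_mse(pred, target, observed_mask):
    """MSE over observed entries only; unobserved PROs are excluded
    rather than imputed with placeholders."""
    m = np.asarray(observed_mask).astype(bool)
    diff = pred[m] - target[m]
    return float(np.mean(diff ** 2))

pred   = np.array([[0.5, 0.2], [0.9, 0.4]])
target = np.array([[1.0, 0.0], [0.0, 0.6]])  # zeros are mere placeholders
mask   = np.array([[1, 0], [0, 1]])          # only two entries truly observed
loss = masked_mse(pred, target, mask)
print(loss)  # mean of (0.5-1.0)^2 and (0.4-0.6)^2 = 0.145
```

Without the mask, the placeholder zeros would dominate the loss and bias the model toward predicting missingness artifacts instead of true PRO values.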

Temporal asynchrony is directly addressed in token-based architectures by representing each observation as a “token” with its native timestamp and modality embedding. Self-attention models then operate over observed sequences, maintaining the full asynchrony of real-world healthcare monitoring (Liu et al., 30 Nov 2025).
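A toy version of the token-based scheme: each observation becomes a token built from a value projection, a modality embedding, and a sinusoidal encoding of its native timestamp, and single-head self-attention runs over the asynchronous sequence exactly as observed, with no resampling. Embedding sizes and encoders here are illustrative, not the cited model's.

```python
import numpy as np

rng = np.random.default_rng(2)
D = 8  # token dimension (illustrative)

def timestamp_encoding(t, d=D):
    """Sinusoidal encoding of a raw timestamp, so tokens keep their
    native time rather than being binned onto a grid."""
    freqs = 1.0 / (10000 ** (np.arange(d // 2) * 2 / d))
    ang = t * freqs
    return np.concatenate([np.sin(ang), np.cos(ang)])

modality_emb = {"sensor": rng.normal(size=D), "pro": rng.normal(size=D)}
w_val = rng.normal(size=D)  # stand-in for a learned value projection

# (timestamp in hours, modality, scalar value) — asynchronous by design.
obs = [(0.5, "sensor", 72.0), (6.2, "sensor", 88.0), (9.0, "pro", 3.0)]
tokens = np.stack([
    v * w_val + modality_emb[m] + timestamp_encoding(t) for t, m, v in obs
])

# Single-head self-attention over the observed token sequence.
scores = tokens @ tokens.T / np.sqrt(D)
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)
context = attn @ tokens
print(context.shape)  # (3, 8)
```

Because every token carries its own timestamp, a sparse daily PRO and a dense sensor stream enter the same attention computation without any forced synchronization.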

4. Outcome-Driven, Bias-Aware, and Explainable Systems

Multimodal integration architectures are optimized toward multi-objective outcomes, each assigned a tunable clinical importance weight in the loss function. Adversarial debiasing terms are incorporated to “scrub” protected attribute signals, though this may risk suppressing clinically relevant disparities, necessitating future adoption of causal inference-derived adjustments (Amoei et al., 2024).
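In its simplest form, the multi-objective weighting described above reduces to a convex combination of per-outcome losses; the sketch below uses hypothetical losses and clinician-set importance weights.

```python
import numpy as np

def multi_outcome_loss(losses, importance):
    """Combine per-outcome losses with tunable clinical importance weights,
    normalized so total loss scale is independent of the weight magnitudes."""
    w = np.asarray(importance, dtype=float)
    w = w / w.sum()
    return float(np.dot(w, losses))

# Hypothetical per-outcome losses (e.g., readmission, pain, mood agents).
losses = np.array([0.40, 0.25, 0.60])
importance = [2.0, 1.0, 1.0]  # clinician-assigned priorities
total = multi_outcome_loss(losses, importance)
print(total)  # 0.5*0.40 + 0.25*0.25 + 0.25*0.60 = 0.4125
```

Adversarial debiasing terms and other regularizers would be added to this scalar objective in the same weighted fashion.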

Model explainability mechanisms include:

  • Statistical weighting bar-plots and SHAP analysis for modality/feature contribution visualization (Gu et al., 2024).
  • Attention map extraction (in transformer or attention-based models) for interpretability of temporal and modal focus (Liu et al., 30 Nov 2025).
  • Outputting interpretable low-rank clinical phenotypes with clear loadings on (sparse) PRO and lab features (Minoccheri et al., 24 Jun 2025).
  • Integration of human-centered measures, such as movement features in pain assessment, to further enhance explainability and patient communication (Gu et al., 2024).

5. Practical Systems: Implementation and Evaluation

Operational systems integrating PROs and other modalities include:

  • ROAMM-EHR: An end-to-end pipeline linking smartwatch PRO/sensor collection, AWS-based middleware, and Epic EHR dashboards, deploying data synchronization, feature concatenation, cloud push, and clinician-facing visualization/alerting. It supports real-time symptom surveillance, severity-coded alerts, and data drill-down for post-surgical monitoring (Sun et al., 2024).
  • Cancer RPM Platform: Employs token-based transformers to integrate wearable, survey, demographic, and clinical event data in real time. The model handles asynchronous sampling, missingness, and high-dimensionality natively, achieving AUROC = 0.70 for adverse event prediction. Features derived from PROs (wellness check-ins, QoR-15 items) were identified as top predictors ahead of clinical deterioration (Liu et al., 30 Nov 2025).

Empirical results across studies consistently demonstrate improved predictive performance and actionable insights when PROs are fused with structured and sensor data. In digital-twin architectures, the inclusion of PROs led to +0.08 AUC improvement over baseline models and reduced mean error in PROM prediction by 29%, while agent-level mental health intervention recall improved by 15% (Amoei et al., 2024). Tensor-based PRO integration models imputed highly sparse PRO data with MAE ≈ 0.15 and achieved AUC > 0.80 in longitudinal medication persistence prediction (Minoccheri et al., 24 Jun 2025).

6. Key Challenges and Future Directions

Persistent obstacles include:

  • Noisy, Sparse, and Missing PROs: High rates of missingness and variable reporting fidelity challenge both representation learning and longitudinal monitoring. Imputation, uncertainty-aware language modeling, and dynamic time alignment are active research areas.
  • Bias and Real-World Disparities: Adversarial debiasing may inadvertently suppress legitimate heterogeneous effects. More rigorous approaches should incorporate causal inference frameworks to separate unjust biases from outcome-mediating covariate effects (Amoei et al., 2024).
  • Adaptive and Contextual Fusion: Current concatenation-plus-attention mechanisms treat all modalities uniformly. Potential advances include outcome-driven gating, hierarchical mixture-of-experts, and graph-based cross-modal transformers that enable selective routing and weighting per clinical context (Amoei et al., 2024, Shaik et al., 2023).
  • Interoperability and Security: Adherence to FHIR/SMART standards, robust encryption, and federated learning paradigms are essential for scalable, privacy-preserving integration across EHRs and institutional boundaries (Sun et al., 2024, Shaik et al., 2023).

A plausible implication is that as transformer architectures and statistical alignment losses mature, the integration of weakly structured or sparse patient voice data (PROs) into clinical decision-making will become standard practice, particularly in the context of remote monitoring, chronic disease management, and personalized medicine.

7. Significance and Outlook

Multimodal integration for PROs constitutes a foundational shift toward patient-centered, precision, and participatory medicine. Architectures unifying EHR, sensor, social, omic, and PRO data via robust mathematical and statistical pipelines consistently yield improvements in predictive accuracy, interpretability, and actionable insight. Patient-reported narratives, once marginalized due to noise or sparsity, are now essential, mathematically integrated channels for detecting emerging psychosocial needs and shaping real-world care strategies. With continued advances in cross-modal fusion, uncertainty quantification, and real-time embedded analytics, multimodal PRO-centric models are positioned to deliver continuous, adaptive, and equitable learning health systems (Amoei et al., 2024, Liu et al., 30 Nov 2025, Minoccheri et al., 24 Jun 2025, Sun et al., 2024, Shaik et al., 2023, Gu et al., 2024).
