Characterizing the Predictive Impact of Modalities with Supervised Latent-Variable Modeling

Published 19 Feb 2026 in cs.CV, cs.CL, and cs.LG | (2602.16979v1)

Abstract: Despite the recent success of Multimodal LLMs (MLLMs), existing approaches predominantly assume the availability of multiple modalities during training and inference. In practice, multimodal data is often incomplete because modalities may be missing, collected asynchronously, or available only for a subset of examples. In this work, we propose PRIMO, a supervised latent-variable imputation model that quantifies the predictive impact of any missing modality within the multimodal learning setting. PRIMO enables the use of all available training examples, whether modalities are complete or partial. Specifically, it models the missing modality through a latent variable that captures its relationship with the observed modality in the context of prediction. During inference, we draw many samples from the learned distribution over the missing modality to both obtain the marginal predictive distribution (for the purpose of prediction) and analyze the impact of the missing modalities on the prediction for each instance. We evaluate PRIMO on a synthetic XOR dataset, Audio-Vision MNIST, and MIMIC-III for mortality and ICD-9 prediction. Across all datasets, PRIMO obtains performance comparable to unimodal baselines when a modality is fully missing and to multimodal baselines when all modalities are available. PRIMO quantifies the predictive impact of a modality at the instance level using a variance-based metric computed from predictions across latent completions. We visually demonstrate how varying completions of the missing modality result in a set of plausible labels.


Authors (3)

Summary

The paper introduces PRIMO, a supervised latent-variable model that quantifies the predictive impact of missing modalities via latent sampling.
It employs evidence lower bounds and variance-based metrics to analyze prediction changes under incomplete modality data.
Evaluations on a synthetic XOR dataset, AV-MNIST, and MIMIC-III show performance comparable to unimodal baselines when a modality is fully missing and to multimodal baselines when all modalities are available.


Introduction

The paper “Characterizing the Predictive Impact of Modalities with Supervised Latent-Variable Modeling” addresses multimodal learning when some modalities are unavailable during training or inference. The authors introduce PRIMO, a supervised latent-variable imputation model designed to assess the predictive impact of missing modalities. Whereas most existing approaches assume that all modalities are available, PRIMO trains on both complete and partial examples and models the missing modality as a latent variable, which supports both prediction and impact analysis.

Methodology

PRIMO Architecture

PRIMO quantifies how a missing modality affects predictions by representing it with a latent variable that captures its relationship with the observed modality in the context of prediction. Rather than treating the missing modality purely as a reconstruction target, PRIMO focuses on how much predictions vary across plausible completions of it. At inference, it draws many samples from the learned latent distribution; averaging the resulting predictions gives the marginal predictive distribution, and the spread of those predictions indicates the impact of the missing modality for that instance.

Figure 1: An overview of PRIMO illustrating modality impact analysis via latent sampling.
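
The sampling step described above can be sketched as a Monte Carlo estimate. This is a minimal illustration, not the authors' implementation: the Gaussian conditional prior, the linear-softmax classifier, and all names (`sample_latents`, `predict_per_sample`, `W_mu`, `W_x`, `W_z`) are assumptions made for the example, and the weights are random rather than trained.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def sample_latents(x_obs, W_mu, log_sigma, num_samples, rng):
    """Draw z from a toy conditional prior p(z | x_obs): Gaussian with
    mean W_mu @ x_obs and per-dimension standard deviation exp(log_sigma)."""
    mean = W_mu @ x_obs                                        # (d_z,)
    noise = rng.standard_normal((num_samples, mean.shape[0]))  # (S, d_z)
    return mean + np.exp(log_sigma) * noise


def predict_per_sample(x_obs, z, W_x, W_z):
    """Class probabilities p(y | x_obs, z) for every latent sample, shape (S, C)."""
    logits = x_obs @ W_x + z @ W_z  # (C,) broadcast against (S, C)
    return softmax(logits)


# Hypothetical sizes and untrained random weights, for illustration only.
rng = np.random.default_rng(0)
d_x, d_z, num_classes, num_samples = 4, 2, 3, 256
W_mu = rng.normal(size=(d_z, d_x))
log_sigma = np.zeros(d_z)
W_x = rng.normal(size=(d_x, num_classes))
W_z = rng.normal(size=(d_z, num_classes))
x_obs = rng.normal(size=d_x)

z = sample_latents(x_obs, W_mu, log_sigma, num_samples, rng)
per_sample_probs = predict_per_sample(x_obs, z, W_x, W_z)  # one prediction per completion
marginal = per_sample_probs.mean(axis=0)                   # Monte Carlo estimate of p(y | x_obs)
```

Each row of `per_sample_probs` is the prediction under one latent completion of the missing modality; `marginal` is what would be used for the final prediction.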

Learning Objective

The learning process in PRIMO accommodates both complete and incomplete modality data. By modeling a conditional prior over the latent variable given the observed modality, the model can quantify the predictive value of the missing one. Training uses evidence lower bounds (ELBOs) that apply whether or not a given modality is present for an example. PRIMO’s ability to derive predictive distributions from partial data without relying entirely on generative reconstruction distinguishes it from prior multimodal approaches.
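
For orientation, a generic supervised conditional ELBO of this kind can be written as follows. This is a textbook-style sketch, not the paper's exact objective: the symbols $x_o$ (observed modality), $x_m$ (possibly missing modality), $z$ (latent variable), $q_\phi$ (approximate posterior), and $p_\theta$ (conditional prior and predictor) are notation assumed here.

```latex
% Complete example (x_o, x_m, y): the posterior may condition on the extra modality.
\log p_\theta(y \mid x_o)
  \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x_o, x_m, y)}\big[\log p_\theta(y \mid x_o, z)\big]
  - \mathrm{KL}\big(q_\phi(z \mid x_o, x_m, y)\,\|\,p_\theta(z \mid x_o)\big)

% Partial example (x_o, y): the same bound with a posterior that sees only x_o and y.
\log p_\theta(y \mid x_o)
  \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x_o, y)}\big[\log p_\theta(y \mid x_o, z)\big]
  - \mathrm{KL}\big(q_\phi(z \mid x_o, y)\,\|\,p_\theta(z \mid x_o)\big)
```

Both bounds hold for any choice of $q_\phi$, which is why complete and partial examples can contribute to the same training objective under this formulation.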

Evaluation Metrics

PRIMO utilizes a variance-based metric to evaluate how predictions change across latent completions. This allows for an instance-level assessment of modality impact, facilitating a fine-grained understanding of the multimodal dataset structure.
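
One plausible form of such a metric is sketched below. The exact definition used in the paper is not given in this summary, so the choice made here (variance of each class probability across latent samples, summed over classes) and the function name `predictive_impact` are assumptions for illustration.

```python
import numpy as np


def predictive_impact(per_sample_probs):
    """Sum over classes of the variance of predicted probabilities across
    latent completions. Input shape: (num_samples, num_classes)."""
    per_sample_probs = np.asarray(per_sample_probs, dtype=float)
    return float(per_sample_probs.var(axis=0).sum())


# Completions that all agree: the missing modality would not change the prediction.
stable = np.tile([0.9, 0.1], (4, 1))

# Completions that flip the predicted label: the missing modality matters.
unstable = np.array([[0.9, 0.1],
                     [0.1, 0.9],
                     [0.9, 0.1],
                     [0.1, 0.9]])

impact_stable = predictive_impact(stable)
impact_unstable = predictive_impact(unstable)
```

Under this definition the score is zero when every completion yields the same prediction and grows as completions disagree, so instances can be ranked by how much the missing modality could change their predicted label.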

Experiments

PRIMO is evaluated on one synthetic and two real datasets: a synthetic XOR dataset, Audio-Vision MNIST (AV-MNIST), and the clinical dataset MIMIC-III.

Synthetic XOR and AV-MNIST

On the synthetic XOR dataset, PRIMO matches unimodal baseline performance when a modality is missing, which is the expected behavior when the label depends jointly on both modalities. Results on AV-MNIST follow the same pattern: performance is comparable to unimodal baselines when a modality is fully missing and to multimodal baselines when all modalities are available.

Figure 2: Evaluation of PRIMO on the XOR dataset, highlighting its resilience in modality-missing scenarios.
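
To see why XOR is a useful test case, the sketch below builds a toy two-modality XOR dataset and measures how well the label can be predicted from one modality alone. The paper's exact construction is not described in this summary; the binary one-feature-per-modality setup and the helper names `make_xor_dataset` and `best_unimodal_accuracy` are assumptions for illustration.

```python
import numpy as np


def make_xor_dataset(n, rng):
    """Two binary 'modalities' x1, x2 and a label y = x1 XOR x2."""
    x1 = rng.integers(0, 2, size=n)
    x2 = rng.integers(0, 2, size=n)
    y = np.bitwise_xor(x1, x2)
    return x1, x2, y


def best_unimodal_accuracy(x, y):
    """Accuracy of the best rule that maps a single binary modality to the
    label: for each value of x, predict the majority label."""
    correct = 0
    for value in (0, 1):
        labels = y[x == value]
        if labels.size:
            ones = int(labels.sum())
            correct += max(ones, labels.size - ones)
    return correct / y.size


rng = np.random.default_rng(0)
x1, x2, y = make_xor_dataset(10_000, rng)

unimodal_acc = best_unimodal_accuracy(x1, y)            # close to chance
multimodal_acc = float((np.bitwise_xor(x1, x2) == y).mean())  # exact by construction
```

With one modality missing, no predictor can do much better than chance on this data, so every completion of the missing modality can flip the label. That makes XOR a clean setting for checking that a model's impact estimates are high when the missing modality is essential.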

MIMIC-III Dataset

PRIMO is also evaluated on MIMIC-III, a critical care dataset featuring modalities like patient demographics and clinical time-series data, for mortality prediction and ICD-9 code classification. The model's ability to differentiate modality importance by task suggests potential for targeted clinical decision support.

Figure 3: Predictive impact visualizations on MIMIC-III indicating the criticality of different modalities.

Implications and Future Work

PRIMO has implications for interpreting and deploying AI in fields where data completeness cannot be guaranteed. Because the model estimates how much acquiring a missing modality could change a prediction, it can inform decisions about data acquisition and its associated costs and risks, with practical applications in healthcare such as streamlining diagnostic processes.

Future research could explore extending PRIMO to accommodate more complex, larger-scale multimodal tasks or incorporate active learning techniques to further streamline feature acquisition strategies.

Conclusion

The paper presents a supervised latent-variable framework for multimodal learning with missing modalities. PRIMO quantifies the predictive impact of a missing modality at the instance level, giving insight into how much a prediction depends on each modality and supporting both model interpretability and practical deployment across domains.


