Jointly Modeling Inter- & Intra-Modality Dependencies for Multi-modal Learning
Abstract: Supervised multi-modal learning involves mapping multiple modalities to a target label. Previous studies in this field have concentrated on capturing, in isolation, either the inter-modality dependencies (the relationships between different modalities and the label) or the intra-modality dependencies (the relationships within a single modality and the label). We argue that these conventional approaches, which rely solely on either inter- or intra-modality dependencies, may not be optimal in general. We view the multi-modal learning problem through the lens of generative models, where we consider the target as a source of multiple modalities and the interaction between them. To that end, we propose the inter- & intra-modality modeling (I2M2) framework, which captures and integrates both the inter- and intra-modality dependencies, leading to more accurate predictions. We evaluate our approach on real-world healthcare and vision-and-language datasets with state-of-the-art models, demonstrating superior performance over traditional methods that focus on only one type of modality dependency.
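To make the idea of integrating both kinds of dependencies concrete, the following is a minimal sketch, not the authors' implementation. It assumes, purely for illustration, that each intra-modality (unimodal) classifier and one inter-modality (joint) classifier emit per-class logits, and that they are combined in a product-of-experts style by summing log-probabilities and renormalizing. The abstract does not specify the combination rule; the function names `log_softmax` and `combine_experts` are hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over a list of per-class scores."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]


def combine_experts(expert_logits: Sequence[Sequence[float]]) -> List[float]:
    """Combine classifiers by summing their log-probabilities per class.

    ASSUMPTION: a product-of-experts combination, chosen for illustration.
    Each entry of `expert_logits` holds the logits of one classifier
    (e.g. one per modality plus one joint model) for the same example.
    Returns normalized class probabilities.
    """
    if not expert_logits:
        raise ValueError("need at least one classifier")
    num_classes = len(expert_logits[0])
    total = [0.0] * num_classes
    for logits in expert_logits:
        if len(logits) != num_classes:
            raise ValueError("all classifiers must score the same classes")
        for k, log_prob in enumerate(log_softmax(logits)):
            total[k] += log_prob
    # Renormalize the summed log-probabilities into a distribution.
    return [math.exp(lp) for lp in log_softmax(total)]


# Toy binary example with made-up logits (not taken from the paper).
modality_a = [2.0, 0.0]   # intra-modality classifier on the first modality
modality_b = [0.0, 0.0]   # intra-modality classifier on the second modality
joint = [0.0, 1.0]        # inter-modality classifier on both modalities
probs = combine_experts([modality_a, modality_b, joint])
```

In this toy case the first modality favors class 0 more strongly than the joint model favors class 1, so the combined prediction leans toward class 0. A model restricted to the joint classifier alone would have predicted class 1.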