
Intermediate Multimodal Fusion

Updated 22 January 2026
  • Intermediate multimodal fusion is a method that combines modality-specific features at an intermediate stage to preserve specialized information while enabling rich cross-modal interactions.
  • It employs diverse fusion operators such as concatenation, gating, and attention, balancing modeling capacity, computational cost, and interpretability across various applications.
  • Empirical evaluations demonstrate that intermediate fusion can boost performance metrics by maintaining fine-grained alignments and enhancing robustness against modality-specific noise.

Intermediate multimodal fusion refers to a class of techniques in multimodal machine learning where modality-specific features, extracted by dedicated encoders, are merged at a feature or latent representation level before downstream prediction. Unlike early (data-level) fusion, which combines raw inputs, and late (decision-level) fusion, which integrates standalone modality outputs, intermediate fusion operates within the feature hierarchy, facilitating rich cross-modal interactions while retaining modality-specialized processing. This architectural paradigm is prominent across domains such as computer vision, medical imaging, biomedical signal analysis, time series forecasting, autonomous driving, and natural language processing, with an expanding suite of fusion operators, alignment mechanisms, and mutual learning strategies underpinning its empirical and theoretical advantages.

1. Principles and Formalization of Intermediate Fusion

Intermediate fusion defines a structural stage between initial modality encoding and final prediction, where feature representations $h_i = f_i(x_i)$ from each modality $i$ are combined by an explicit fusion function $\mathscr{F}$: $h = \mathscr{F}(h_1, h_2, \ldots, h_n)$, followed by a multimodal head $f(h)$ for prediction (Guarrasi et al., 2024, Li et al., 2024). The architecture preserves modality-specific learning before feature mixing, and it enables flexible information exchange (concatenation, gating, attention, tensor product, cross-attention, or learned calibration). This approach balances the expressivity of joint modeling with the modularity of unimodal pretraining, mitigating the risks of information loss inherent in early fusion and the late loss of interaction in output-level fusion. Structured notation formalizes this design: $i = \bullet\!\left( L^{[l]}_{\alpha_j}, L^{[m]}_{\alpha_k}, \ldots \right)_\rightarrow$, with $\bullet$ denoting the fusion operator, $L^{[\ell]}_{\alpha_j}$ indicating how many layers the input $\alpha_j$ has propagated through, and the subscript $\rightarrow$ marking the main fusion (Guarrasi et al., 2024).
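The formalization above can be sketched in a few lines of numpy. This is a minimal illustration, not any paper's implementation: the encoders are single random affine layers, the fusion operator $\mathscr{F}$ is concatenation, and all weights are random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, W):
    """Modality-specific encoder f_i: one affine layer with ReLU."""
    return np.maximum(W @ x, 0.0)

def fuse_concat(*feats):
    """Fusion operator F: here, plain concatenation of latent features."""
    return np.concatenate(feats)

def head(h, W_out):
    """Multimodal head f(h): a linear map to class logits."""
    return W_out @ h

# Two modalities with different raw dimensionalities.
x1, x2 = rng.standard_normal(32), rng.standard_normal(64)
W1, W2 = rng.standard_normal((16, 32)), rng.standard_normal((16, 64))
W_out = rng.standard_normal((3, 32))  # 16 + 16 fused dims -> 3 classes

h1, h2 = encoder(x1, W1), encoder(x2, W2)   # h_i = f_i(x_i)
h = fuse_concat(h1, h2)                     # h = F(h_1, h_2)
logits = head(h, W_out)                     # prediction f(h)
```

The key structural point is that `encoder` runs per modality before any mixing, so each backbone can be pretrained or swapped independently of the fusion stage.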

2. Taxonomy of Intermediate Fusion Techniques

A range of fusion methods and modules have been developed, each supporting a different trade-off between modeling capacity, computational overhead, and interpretability:

| Fusion Operation | Mathematical Formulation | Characteristic |
| --- | --- | --- |
| Concatenation | $h = [h_1 \Vert h_2 \Vert \dots \Vert h_n]$ | Simple and robust |
| Gating (SE, GMU) | $h = g \odot h_1 + (1-g) \odot h_2$, $g = \sigma(W[h_1; h_2] + b)$ | Adaptive modality weighting |
| Attention-based | $h_i' = \sum_j \alpha_{ij} h_j$, $\alpha_{ij} = \mathrm{softmax}(\cdot)$ | Contextual, dynamically learned fusion |
| Bilinear/Tensor Fusion | $h = [1; h_1] \otimes [1; h_2]$ or $h = f^T W g$ | Captures cross-modal multiplicative interactions |
| Graph-based Fusion | $H^{(l+1)} = \sigma(\widetilde{A} H^{(l)} W^{(l)})$ | Structured; suitable for variable or missing modalities |

Concatenation-based fusion remains the default for many scenarios, providing clarity and compatibility with pretrained backbones (Guarrasi et al., 2024, Joze et al., 2019). Gating mechanisms (e.g., MMTM (Joze et al., 2019), squeeze-and-excitation) recalibrate modalities based on global context. Attention modules—including cross-attention, co-attention, and transformer blocks—facilitate fine-grained, content-adaptive fusion but at increased computational cost (Huo et al., 2023, Li et al., 2024). Tensor-based operators (TFN, bilinear pooling) capture higher-order feature interactions, while graph neural networks address missing data or relational invariances.
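The gating row of the table can be made concrete with a GMU-style sketch. This is an illustrative numpy version under simplifying assumptions: the gate weights `W`, `b` are random stand-ins for learned parameters, and both modalities are assumed to share the latent dimension $d$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_fusion(h1, h2, W, b):
    """GMU-style gate: g = sigma(W [h1; h2] + b), h = g*h1 + (1-g)*h2."""
    g = sigmoid(W @ np.concatenate([h1, h2]) + b)
    return g * h1 + (1.0 - g) * h2, g

rng = np.random.default_rng(1)
d = 8
h1, h2 = rng.standard_normal(d), rng.standard_normal(d)
W, b = rng.standard_normal((d, 2 * d)), np.zeros(d)

h, g = gated_fusion(h1, h2, W, b)
```

Because each coordinate of `g` lies in $(0, 1)$, the fused feature is a per-dimension convex combination of the two modalities, which is what makes the gate's weights directly interpretable as modality importance.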

3. Exemplary Architectures and Domain-Specific Implementations

Intermediate fusion architectures are instantiated in diverse modalities and tasks:

  • Medical Imaging and Biomedical Applications: Multi-stage residual 3D fusion (CT+PET) with voxel-wise operations (Aksu et al., 21 Jan 2025), CNN-Transformer hybrids for ECG classification (Oladunni et al., 6 Aug 2025), dimensionality-reduced CNN fusion for stress detection (Bodaghi et al., 2024), and self/cross-attention networks for HSI+LiDAR (Huo et al., 2023). In each, modality-specific encoders (e.g., 2D ResNet + 3D ResNet, 1D-CNN + 2D-CNN) process raw data before fusion at the post-encoding/post-pooling stage, followed by a unified classifier.
  • Time Series and Mixed Data: Independent LSTM encoding of synchronous/irregular streams, followed by fusion via concatenation, gating, or feature sharing (Dietz et al., 2024). The model choice is guided by intermodal interaction strength: concatenation for robust integration, gating for adaptivity.
  • Vision-Language and Diffusion Models: U-shaped ViT backbones that perform early image-only processing, mid-stage cross-attention, and joint-fusion blocks for text-image generation, yielding gains in both sample efficiency and alignment metrics (Hu et al., 2024).
  • Autonomous Driving and 3D Detection: Intermediate fusion (e.g., mmFUSION) synchronizes image and LiDAR features via 3D convolutional attention, outperforming early (voxel-level) and late (RoI-proposal) strategies (Ahmad et al., 2023).
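Several of the architectures above rely on cross-attention between token sets of two modalities. The following numpy sketch shows the generic mechanism, not any specific paper's variant; the token counts, dimension, and random projection matrices `Wq`, `Wk`, `Wv` are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(Hq, Hkv, Wq, Wk, Wv):
    """Tokens of one modality (Hq) attend to tokens of another (Hkv)."""
    Q, K, V = Hq @ Wq, Hkv @ Wk, Hkv @ Wv
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)  # alpha_ij
    return A @ V, A

rng = np.random.default_rng(2)
d = 16
img_tokens = rng.standard_normal((10, d))   # e.g. image patches
txt_tokens = rng.standard_normal((5, d))    # e.g. word embeddings
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

fused, A = cross_attention(img_tokens, txt_tokens, Wq, Wk, Wv)
```

Each row of `A` is a probability distribution over the other modality's tokens, which is the source of the quadratic cost noted in Section 2 and of the attention-map interpretability discussed below.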

4. Empirical Benefits and Comparative Evaluations

Intermediate fusion consistently outperforms simple early and late fusion in scenarios where:

  • Cross-modal interactions are non-trivial and hierarchical information exchange is beneficial.
  • Fine alignment between spatial, temporal, or semantic features can be preserved (e.g., voxel-wise 3D fusion in imaging (Aksu et al., 21 Jan 2025, Ahmad et al., 2023)).
  • Robustness to modality-specific noise, missingness, and domain shift is required (e.g., stress detection, mental health phenotyping (Bodaghi et al., 2024, Barkat et al., 10 Jul 2025)).
  • The application necessitates both strong unimodal representation and emergent joint modeling (e.g., author intent detection (Islam et al., 28 Nov 2025), sentiment analysis, speech enhancement).

Quantitative gains include increases of 4–8 F1 or AUC points (Bangla author intent (Islam et al., 28 Nov 2025), medical image classification (Aksu et al., 21 Jan 2025, Li et al., 2024)), 1–2% accuracy in VQA/multimodal retrieval (Li et al., 2024), and improved interpretability and stability via attention weights or saliency alignment (Oladunni et al., 6 Aug 2025).

Ablation studies confirm that replacing intermediate fusion with concatenation, summation, or late fusion reduces performance, especially for complex, real-world multimodal tasks (Aksu et al., 21 Jan 2025, Huo et al., 2023, Sun et al., 14 Sep 2025).

5. Challenges and Best Practices

Challenges in intermediate fusion include:

  • Feature Alignment & Dimensionality: Different-sized embeddings may bias fusion, requiring projection or normalization layers (Guarrasi et al., 2024).
  • Computational Cost: Attention-based and bilinear modules can be expensive (O(d2)O(d^2) or higher), mitigated by low-rank approximations or multi-stage pruning (Hu et al., 2024).
  • Overfitting in Low-Data Regimes: Sophisticated multi-stage or attention blocks may over-parameterize on small biomedical datasets (Li et al., 2024).
  • Interpretability: Attention and gating weights provide some post hoc insight, but tensor interactions complicate mechanistic tracing.
  • Handling Missing Modalities: Simple concatenation fails if modalities are absent; attention or graph-based models enable more robust aggregation (Guarrasi et al., 2024, Li et al., 2024).

Emergent best practices include dimensionality equalization, late insertion of fusion layers for high-level alignment, layer-wise ablation, and explicit empirical justification for chosen fusion operators (Joze et al., 2019, Liang et al., 27 Jul 2025).
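The dimensionality-equalization practice can be sketched as a projection-then-normalize step before concatenation. This is a schematic numpy example under stated assumptions: in a real model the projections would be learned linear layers, whereas here they are random matrices, and the L2 normalization is one of several plausible equalization choices.

```python
import numpy as np

def equalize_and_fuse(feats, target_dim, rng):
    """Project each modality embedding to a common dimension, then
    L2-normalize so no single modality dominates the concatenation."""
    projected = []
    for h in feats:
        P = rng.standard_normal((target_dim, h.shape[0])) / np.sqrt(h.shape[0])
        z = P @ h
        projected.append(z / (np.linalg.norm(z) + 1e-8))
    return np.concatenate(projected)

rng = np.random.default_rng(3)
# Three modalities with very different embedding sizes.
feats = [rng.standard_normal(128), rng.standard_normal(32),
         rng.standard_normal(512)]
h = equalize_and_fuse(feats, target_dim=64, rng=rng)
```

Without this step, the 512-dimensional modality would contribute most of the fused vector's coordinates and norm, biasing the downstream head toward it regardless of its informativeness.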

6. Extensions: Adaptive, Hierarchical, and Mutual Learning Fusion

Recent trends expand the scope of intermediate fusion:

  • Hierarchical and Multi-stage Fusion: Fusion repeated at multiple feature depths captures information across abstraction levels (e.g., residual multi-stage 3D convolutional fusion in CT+PET (Aksu et al., 21 Jan 2025), MMTM “slow fusion” (Joze et al., 2019)).
  • Adaptive and Mutual Learning: Fusion operators are learned, not fixed—using bottleneck compression (Auto-Fusion (Sahu et al., 2019)), GAN-regularized latent spaces (GAN-Fusion (Sahu et al., 2019)), or soft mutual learning within a cohort of intermediate-fusion models (Meta Fusion (Liang et al., 27 Jul 2025)).
  • Equilibrium Fusion: Deep equilibrium methods model feature fusion as a dynamical fixed-point, recursively and adaptively integrating intra- and inter-modal cues at all depths, enhancing generalization and performance (Ni et al., 2023).
  • Attention-Driven and Graph-Based Fusion: Transformer-based joint encoding, cross-attention, and graph neural network aggregation allow for token- or instance-level fusion, highly performant but computationally demanding (Huo et al., 2023, Li et al., 2024).
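The equilibrium-fusion idea can be illustrated with a fixed-point iteration. This is a toy numpy sketch, not the method of Ni et al. (2023): the update $z \leftarrow \tanh(Wz + U_1 h_1 + U_2 h_2)$, the contraction trick of rescaling $W$ to small spectral norm, and all matrices are illustrative assumptions.

```python
import numpy as np

def equilibrium_fuse(h1, h2, W, U1, U2, tol=1e-8, max_iter=500):
    """Find z* = tanh(W z* + U1 h1 + U2 h2) by fixed-point iteration;
    convergence is guaranteed here because W is scaled to be a contraction."""
    inject = U1 @ h1 + U2 @ h2
    z = np.zeros(W.shape[0])
    for _ in range(max_iter):
        z_next = np.tanh(W @ z + inject)
        if np.linalg.norm(z_next - z) < tol:
            return z_next
        z = z_next
    return z

rng = np.random.default_rng(4)
d = 12
W = rng.standard_normal((d, d))
W *= 0.5 / np.linalg.norm(W, 2)   # spectral norm 0.5 -> contraction map
U1, U2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
h1, h2 = rng.standard_normal(d), rng.standard_normal(d)

z_star = equilibrium_fuse(h1, h2, W, U1, U2)
```

The fused representation is defined implicitly by the fixed-point equation rather than by a fixed number of fusion layers, which is what lets equilibrium methods integrate cross-modal cues at effectively all depths.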

7. Applications, Future Directions, and Outlook

Intermediate fusion has been deployed across the domains surveyed above, including medical imaging, biomedical signal analysis, time series forecasting, vision-language generation, and autonomous driving.

Emerging research targets scalable multi-modality in high-dimensional fusion, missing-data imputation, self-supervised cross-modal pretraining, and joint interpretability-robustness metrics. Theoretical efforts are focusing on signal-plus-noise decomposition and generalization error guarantees, particularly for mutual information sharing and equilibrium-based architectures (Liang et al., 27 Jul 2025, Ni et al., 2023).

Open questions include the optimal positioning and type of fusion module within deep architectures, generalizability across domain shifts, and the trade-off between computational scalability and alignment of semantically meaningful feature interactions.


References: (Joze et al., 2019; Huo et al., 2023; Li et al., 2024; Dietz et al., 2024; Aksu et al., 21 Jan 2025; Bodaghi et al., 2024; Sun et al., 14 Sep 2025; Liang et al., 27 Jul 2025; Oladunni et al., 6 Aug 2025; Guarrasi et al., 2024; Ahmad et al., 2023; Barkat et al., 10 Jul 2025; Sahu et al., 2019; Hu et al., 2024; Li et al., 2022; Islam et al., 28 Nov 2025; Ni et al., 2023).
