
Deep Equilibrium Multimodal Fusion

Updated 25 December 2025
  • Deep Equilibrium Multimodal Fusion is a dynamic recursive approach that uses fixed-point iterations to adaptively integrate multiple modalities.
  • It replaces static early, mid, or late fusion methods with an implicit infinite-depth network that captures both cross-modal and intra-modal interactions.
  • The method leverages Anderson acceleration and implicit differentiation to ensure stable convergence, making it a plug-and-play replacement in multimodal architectures.

Deep Equilibrium Multimodal Fusion (DEQ fusion) is a paradigm for multimodal representation learning that formulates the fusion process as the joint equilibrium of recursive unimodal projections and a dynamic, purify-then-combine fusion operator. In contrast to conventional early-, mid-, or late-fusion strategies, where the fusion operation is statically predetermined and parameterized by a fixed number of layers, DEQ fusion models cross-modal and intra-modal interactions through an implicit, infinite-depth network that adaptively fuses modal information at all representational levels via fixed-point iteration. This methodology not only unifies modalities in a richly expressive and stable manner but also provides a plug-and-play replacement for existing fusion modules in a wide array of multimodal frameworks. Empirical evaluation demonstrates consistent state-of-the-art performance across diverse modalities and tasks (Ni et al., 2023).

1. Problem Formulation

Given $N$ modalities, each represented by a modality-specific feature vector $x_i \in \mathbb{R}^{d_i}$ (collectively $x = (x_1, \dots, x_N)$), the task of multimodal fusion is to produce a unified representation $z_{\mathrm{fuse}} \in \mathbb{R}^d$ that: (i) integrates complementary information from each $x_i$; (ii) suppresses noise and redundancy; (iii) captures both intra- and inter-modality correlations from low-level to high-level abstractions.

Traditional fusion approaches—early fusion (raw concatenation), mid fusion (feature-level interaction), or late fusion (decision-level merging)—employ a static architecture that may not adapt to complex, context-dependent modality interactions. DEQ fusion instead seeks a dynamic, recursive strategy: output representations are the equilibria of iterative purification and fusion steps, ensuring mutual stabilization and rich expressiveness for both per-modality and fused outputs.

2. Mathematical Framework

2.1 Deep Equilibrium Model Recapitulation

A deep equilibrium (DEQ) model replaces an explicit $L$-layer network with a root-finding problem over a weight-tied residual operator $f_\theta$, seeking a fixed point $z^* = f_\theta(z^*; x)$. This is equivalent to solving $g_\theta(z; x) = f_\theta(z; x) - z = 0$ for $z^*$. Gradients of a downstream loss $\mathcal{L}$ are obtained using the implicit function theorem:

$$\frac{\partial\mathcal{L}}{\partial\theta} = -\frac{\partial\mathcal{L}}{\partial z^*}\left(J_{g_\theta}(z^*)\right)^{-1}\frac{\partial f_\theta(z^*; x)}{\partial\theta}$$
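
To make this recapitulation concrete, the following minimal NumPy sketch solves a toy weight-tied layer by naive fixed-point iteration; the operator $f_\theta$, its weights, and the dimensions are illustrative stand-ins, not the paper's architecture.

```python
import numpy as np

# Toy weight-tied operator f_theta(z; x) = tanh(W z + U x); W is scaled
# small so the map is a contraction and naive iteration converges.
rng = np.random.default_rng(0)
d = 8
W = rng.normal(scale=0.15 / np.sqrt(d), size=(d, d))
U = rng.normal(size=(d, d))
x = rng.normal(size=d)

def f_theta(z, x):
    return np.tanh(W @ z + U @ x)

# Solve z* = f_theta(z*; x): iterating one shared layer emulates infinite depth.
z = np.zeros(d)
for _ in range(200):
    z = f_theta(z, x)

# At the equilibrium, the residual g_theta(z*) = f_theta(z*) - z* vanishes.
residual = np.linalg.norm(f_theta(z, x) - z)
```

In practice the naive loop is replaced by an accelerated root-finder, which is what Section 3.1 below describes.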

2.2 Unimodal Projections

For each modality $i$, a "purification" operator $f_{\theta_i}$ refines $x_i$ recursively:

$$z_i^{[j+1]} = f_{\theta_i}(z_i^{[j]}; x_i), \quad z_i^{[0]} = 0$$

Here, $f_{\theta_i}(\cdot)$ consists of a succession of GroupNorm, ReLU nonlinearities, and learned weights/biases, transforming $z_i^{[j]}$ and $x_i$ into cleaned, higher-level representations. Its fixed point $z_i^*$ is the unimodal equilibrium.
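
As an illustrative sketch (not the paper's exact layer), this recursion can be written in NumPy with a small affine map, injection of $x_i$, ReLU, and a parameter-free GroupNorm; the weights and dimensions below are assumptions chosen so that plain iteration converges.

```python
import numpy as np

# Illustrative unimodal purification operator f_theta_i:
# affine transform of z, injection of x_i, ReLU, then GroupNorm.
d, groups = 8, 2
x_i = np.linspace(0.5, 2.0, d)            # stand-in modality feature
rng = np.random.default_rng(1)
W = rng.normal(scale=0.02, size=(d, d))   # small weights keep the map contractive
b = np.zeros(d)

def group_norm(z, g=groups, eps=1e-5):
    # Per-group standardization (no learned affine, for simplicity).
    zg = z.reshape(g, -1)
    zg = (zg - zg.mean(axis=1, keepdims=True)) / np.sqrt(zg.var(axis=1, keepdims=True) + eps)
    return zg.reshape(-1)

def purify(z, x):
    return group_norm(np.maximum(W @ z + b + x, 0.0))

# Iterate z_i^{[j+1]} = f_theta_i(z_i^{[j]}; x_i) from z_i^{[0]} = 0.
z = np.zeros(d)
for _ in range(200):
    z = purify(z, x_i)

residual = np.linalg.norm(purify(z, x_i) - z)  # small at the unimodal equilibrium
```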

2.3 Purify-then-Combine Fusion Operator

At each recursion $j$, a provisional fused state $z_{\mathrm{fuse}}^{[j]}$ is updated as follows:

  • Gating: For each $i$, $\alpha_i = \sigma(\theta_\alpha(z_{\mathrm{fuse}}^{[j]} + z_i^{[j+1]}) + b_\alpha)$.
  • Purification: $z'_i = \alpha_i \odot z_{\mathrm{fuse}}^{[j]}$ (element-wise gating).
  • Combination and Injection: Compute $x_{\mathrm{fuse}} = \sum_k w_k x_k$ (learned weighted sum), then

$$\hat{z} = \theta_{\mathrm{fuse}}\sum_i z'_i + b_{\mathrm{fuse}}, \quad z_{\mathrm{fuse}}^{[j+1]} = \mathrm{GroupNorm}(\mathrm{ReLU}(\hat{z} + x_{\mathrm{fuse}}))$$

This fusion operator thus models non-linear, dynamic modality selection and interaction at each step.
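To make the three steps concrete, the following NumPy sketch performs a single fusion update for two modalities; all weights ($\theta_\alpha$, $\theta_{\mathrm{fuse}}$, $w_k$), the dimensionality, and the two-group GroupNorm are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

# One purify-then-combine update for N = 2 modalities (illustrative sizes/weights).
rng = np.random.default_rng(2)
d, N = 8, 2

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def group_norm(z, g=2, eps=1e-5):
    zg = z.reshape(g, -1)
    zg = (zg - zg.mean(axis=1, keepdims=True)) / np.sqrt(zg.var(axis=1, keepdims=True) + eps)
    return zg.reshape(-1)

theta_alpha = rng.normal(scale=0.1, size=(d, d))
b_alpha = np.zeros(d)
theta_fuse = rng.normal(scale=0.1, size=(d, d))
b_fuse = np.zeros(d)
w = np.array([0.6, 0.4])                         # learned modality weights w_k

x = [rng.normal(size=d) for _ in range(N)]       # modality features x_i
z_uni = [rng.normal(size=d) for _ in range(N)]   # unimodal states z_i^{[j+1]}
z_fuse = rng.normal(size=d)                      # current fused state z_fuse^{[j]}

# Gating: alpha_i = sigma(theta_alpha (z_fuse^{[j]} + z_i^{[j+1]}) + b_alpha)
alpha = [sigmoid(theta_alpha @ (z_fuse + zi) + b_alpha) for zi in z_uni]
# Purification: z'_i = alpha_i * z_fuse^{[j]} (element-wise)
z_pur = [a * z_fuse for a in alpha]
# Combination and injection: z_fuse^{[j+1]} = GroupNorm(ReLU(z_hat + x_fuse))
x_fuse = sum(wk * xk for wk, xk in zip(w, x))
z_hat = theta_fuse @ sum(z_pur) + b_fuse
z_fuse_next = group_norm(np.maximum(z_hat + x_fuse, 0.0))
```

Each gate $\alpha_i$ lies in $(0, 1)$ element-wise, so a modality's contribution can be softly suppressed or amplified at every recursion.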

2.4 Joint Equilibrium Formulation

The DEQ fusion problem seeks solutions to all residuals simultaneously:

$$z_i^* = f_{\theta_i}(z_i^*; x_i) \quad \text{for } i = 1, \dots, N; \qquad z_{\mathrm{fuse}}^* = f_{\mathrm{fuse}}(z_{\mathrm{fuse}}^*; x)$$

Unimodal projections act independently; only the fusion operator introduces modality interconnections.

2.5 Implicit Modeling of Cross-Layer Correlations

At shallow iterations, $z_{\mathrm{fuse}}^{[j]}$ encodes local, low-level modality alignment; as $j \to \infty$, high-level semantic interactions are iteratively composed into a unified representation. The equilibrium $z_{\mathrm{fuse}}^*$ thus encodes information from the effective infinite-depth stack of both unimodal and fused projections.

3. Solver, Training, and Regularization

3.1 Fixed-Point Solver: Anderson Acceleration

Solving $g(z) = 0$ for (possibly high-dimensional) $z$ is performed with Anderson acceleration, which generalizes Broyden's method and is suited to vector-valued fixed-point problems. At each iteration, Anderson acceleration uses the $M$ most recent pairs $\{(z^{(j)}, y^{(j)})\}$, where $y^{(j)} = f(z^{(j)})$, to extrapolate the next update, enhancing convergence. The convergence criterion is typically $\|z^{(k+1)} - z^{(k)}\| / \|z^{(k)}\| < \epsilon$ for a tolerance $\epsilon$.
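
A compact NumPy sketch of Anderson acceleration is given below; the memory size, damping factor, and least-squares regularization (`lam`) are conventional choices, not values taken from the paper.

```python
import numpy as np

def anderson(f, z0, m=5, max_iter=50, tol=1e-8, beta=1.0, lam=1e-4):
    """Anderson acceleration for the fixed point z = f(z).

    Keeps the last m (iterate, f(iterate)) pairs and solves a small
    regularized least-squares problem for the mixing weights (m >= 2).
    """
    d = z0.size
    Z = np.zeros((m, d))               # recent iterates z^{(j)}
    F = np.zeros((m, d))               # recent values  y^{(j)} = f(z^{(j)})
    Z[0], F[0] = z0, f(z0)
    Z[1], F[1] = F[0], f(F[0])
    z_new = Z[1]
    for k in range(2, max_iter):
        n = min(k, m)
        G = F[:n] - Z[:n]              # residuals y^{(j)} - z^{(j)}
        H = G @ G.T + lam * np.eye(n)  # regularized normal equations
        a = np.linalg.solve(H, np.ones(n))
        a /= a.sum()                   # mixing weights sum to one
        z_new = beta * (a @ F[:n]) + (1.0 - beta) * (a @ Z[:n])
        Z[k % m], F[k % m] = z_new, f(z_new)
        res = np.linalg.norm(F[k % m] - z_new) / (np.linalg.norm(z_new) + 1e-9)
        if res < tol:
            break
    return z_new

# Demo on a contractive toy map (illustrative weights).
rng = np.random.default_rng(0)
d = 16
W = rng.normal(scale=0.2 / np.sqrt(d), size=(d, d))
b = rng.normal(size=d)
f = lambda z: np.tanh(W @ z + b)
z_star = anderson(f, np.zeros(d))
```

The small `lam` term keeps the least-squares system well conditioned once successive residuals become nearly parallel close to the equilibrium.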

3.2 Implicit Differentiation

Once the equilibrium $z^*$ is found, parameter gradients are computed indirectly via the implicit function theorem. Each $\theta_i$ and $\theta_{\mathrm{fuse}}$ is updated using derivatives governed by the equilibrium dynamics. For the modality parameters:

$$\frac{\partial\mathcal{L}}{\partial\theta_i} = -\frac{\partial\mathcal{L}}{\partial z_{\mathrm{fuse}}^*}\frac{\partial z_{\mathrm{fuse}}^*}{\partial z_i^*}\left(J_{g_{\theta_i}}(z_i^*)\right)^{-1}\frac{\partial f_{\theta_i}(z_i^*; x_i)}{\partial\theta_i}$$
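
The implicit gradient can be checked numerically on a toy single-modality layer; the quadratic loss, operator, and sizes below are illustrative assumptions. The key point is that the gradient is obtained by solving one linear system with $(I - \partial f/\partial z^*)^\top$ rather than backpropagating through solver iterations.

```python
import numpy as np

# Toy DEQ layer f(z; W) = tanh(W z + x) with loss L = 0.5 ||z*||^2.
d = 4
rng = np.random.default_rng(3)
W = rng.normal(scale=0.3 / np.sqrt(d), size=(d, d))
x = rng.normal(size=d)

def f(z, W_):
    return np.tanh(W_ @ z + x)

def solve_fp(W_, iters=500):
    z = np.zeros(d)
    for _ in range(iters):
        z = f(z, W_)
    return z

z_star = solve_fp(W)
dL_dz = z_star                                   # dL/dz* for L = 0.5 ||z*||^2

# Implicit function theorem: dL/dW = dL/dz* (I - df/dz*)^{-1} df/dW.
s = 1.0 - np.tanh(W @ z_star + x) ** 2           # tanh'(u) at the equilibrium
J_f = s[:, None] * W                             # df/dz* (d x d Jacobian)
v = np.linalg.solve((np.eye(d) - J_f).T, dL_dz)  # v^T = dL/dz* (I - J_f)^{-1}
grad_W = np.outer(v * s, z_star)                 # dL/dW_{kl} = v_k s_k z*_l

# Central-difference check on a single entry.
def loss(W_):
    z = solve_fp(W_)
    return 0.5 * np.sum(z ** 2)

eps = 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[0, 1] += eps
Wm[0, 1] -= eps
num = (loss(Wp) - loss(Wm)) / (2 * eps)          # should match grad_W[0, 1]
```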

3.3 Losses and Regularization

Standard downstream losses (e.g., cross-entropy, MAE) are used. Jacobian regularization ($\|J_g(z^*)\|_F^2$, with weight in the range 0.01–20) controls the tradeoff between expressivity and stability. Dropout and early stopping are applied on smaller datasets (e.g., BRCA) to mitigate overfitting. Learning rates are decoupled, commonly using lower rates (e.g., $10^{-4}$) for fusion and higher rates (e.g., $10^{-3}$) for encoders.

4. Architectural and Computational Details

4.1 Integration and Hyperparameters

DEQ fusion layers can directly replace previous fusion blocks, such as concatenation plus MLP, bilinear pooling, or attention modules. Upstream/unimodal encoders and downstream prediction heads require no modification. Common hyperparameters include:

  • Latent dimension $d$ set to the backbone feature size (e.g., 512 or 1024)
  • Anderson memory $M = 5$–$10$
  • Solver steps $K = 50$–$100$ (typically $K \approx 20$–$30$ at inference)
  • Convergence tolerance $\epsilon = 10^{-3}$
  • Jacobian regularization weight in $[0.01, 20]$
  • Batch size as permitted by hardware, since memory usage is nearly constant per step

4.2 Computational Complexity

Each solver iteration entails $N$ unimodal block evaluations and one fused block evaluation, yielding cost $O(KC)$, where $K$ is the number of solver steps and $C$ is the per-layer computation. Despite the iterative computation, memory usage is approximately constant: intermediate states need not be stored, only the current activations and solver state, which supports larger effective depths without the memory penalties typical of explicit deep stacks.

5. Experimental Evaluation

Empirical validation encompasses five benchmarks across various modalities and tasks, with original unimodal encoders retained and only the fusion module swapped for DEQ fusion.

  • BRCA (mRNA, DNAm, miRNA; MM-Dynamics backbone): Acc, wF1, mF1 of 87.7, 88.0, 84.5 baseline vs. 89.1, 89.7, 87.6 with DEQ fusion (+1.4% Acc, +1.7% macro-F1)
  • MM-IMDB (Poster, Text; late-fusion baseline): μF1, mF1 of 59.02, 50.27 baseline vs. 61.52, 53.38 with DEQ fusion (+2.50pp μF1, +3.11pp mF1)
  • CMU-MOSI (Audio, Text; CM-BERT backbone): Acc-7, Acc-2, F1 of 44.9, 84.5, 84.5 baseline vs. 46.1, 85.4, 85.4 with DEQ fusion (SOTA on all metrics; correlation and MAE also improve)
  • SUN RGB-D (RGB, Point Cloud; ImVoteNet backbone): mAP@0.25, mAP@0.5 of 61.9, 45.6 baseline vs. 62.7, 46.4 with DEQ fusion (+0.8 mAP points on both)
  • VQA-v2 (Image, Text; Mutan and MCAN backbones): overall accuracy of 63.73 (Mutan) and 67.02 (MCAN) baseline vs. 64.57 and 67.38 with DEQ fusion (consistent improvement)

Ablative studies (BRCA) indicate that disabling iterative equilibrium, fusion, or unimodal purification degrades performance. Gating is essential to peak accuracy. Convergence is reliable within 20 Anderson steps, rapidly stabilizing the equilibrium.

6. Analysis, Significance, and Limitations

DEQ fusion exhibits several distinctive properties:

  • Adaptive recursion: Instead of prespecified network depth, the fixed-point formulation recurses only as needed for each instance, emulating an effectively infinite-depth network.
  • Joint stability: By equilibrating unimodal and fusion outputs, feature representations become mutually consistent and less prone to drift or instability, even as the feature combination process recursively evolves.
  • Dynamic gating: Modality-specific gating adaptively weights contributions in every iteration, allowing the model to ignore irrelevant or redundant features dynamically.
  • Hierarchical correlation modeling: Information from all recursion depths and their interactions is folded into the final equilibrium state, allowing representations to encode both fine-grained and abstract multimodal interactions.

DEQ fusion is particularly effective where modalities interact in complex, nonlinear ways, and where a static fusion architecture tends to underfit or overfit. Nevertheless, the approach introduces additional solver overhead per inference step (although warm-starting and acceleration reduce the practical impact), demands careful regularization and learning-rate control, and may require further stabilization (e.g., spectral norm constraints) for arbitrarily complex operators.

Potential research avenues include combining DEQ fusion with large, pretrained multimodal backbones, integrating faster or learned solving strategies, and hybrid implicit–explicit architectures where only selected fusion layers are equilibrated (Ni et al., 2023).

In summary, Deep Equilibrium Multimodal Fusion delivers a dynamic, recursive, and plug-and-play framework for modality unification, attaining state-of-the-art results across diverse multimodal challenges while leveraging a mathematically principled equilibrium architecture.

References

Ni et al. (2023). Deep Equilibrium Multimodal Fusion.
