Deep Equilibrium Multimodal Fusion
- Deep Equilibrium Multimodal Fusion is a dynamic recursive approach that uses fixed-point iterations to adaptively integrate multiple modalities.
- It replaces static early, mid, or late fusion methods with an implicit infinite-depth network that captures both cross-modal and intra-modal interactions.
- The method leverages Anderson acceleration and implicit differentiation to ensure stable convergence, making it a plug-and-play replacement in multimodal architectures.
Deep Equilibrium Multimodal Fusion (DEQ fusion) is a paradigm for multimodal representation learning that formulates fusion as the joint equilibrium of recursive unimodal projections and a dynamic purify-then-combine fusion operator. In contrast to conventional early-, mid-, or late-fusion strategies, where the fusion operation is statically predetermined and parameterized by a fixed, finite number of layers, DEQ fusion models cross-modal and intra-modal interactions through an implicit, effectively infinite-depth network that adaptively fuses modal information at all representational levels via fixed-point iteration. This not only unifies modalities in a richly expressive and stable manner but also provides a plug-and-play replacement for existing fusion modules in a wide array of multimodal frameworks. Empirical evaluation demonstrates consistent state-of-the-art performance across diverse modalities and tasks (Ni et al., 2023).
1. Problem Formulation
Given $n$ modalities, each represented by a modality-specific feature vector $x_i$ (collectively $\{x_i\}_{i=1}^{n}$), the task of multimodal fusion is to produce a unified representation $z_{\text{fuse}}$ that: (i) integrates complementary information from each $x_i$; (ii) suppresses noise and redundancy; (iii) captures both intra- and inter-modality correlations from low-level to high-level abstractions.
Traditional fusion approaches—early fusion (raw concatenation), mid fusion (feature-level interaction), or late fusion (decision-level merging)—employ a static architecture that may not adapt to complex, context-dependent modality interactions. DEQ fusion instead seeks a dynamic, recursive strategy: output representations are the equilibria of iterative purification and fusion steps, ensuring mutual stabilization and rich expressiveness for both per-modality and fused outputs.
2. Mathematical Framework
2.1 Deep Equilibrium Model Recapitulation
A deep equilibrium (DEQ) model replaces an explicit $L$-layer network with a root-finding problem over a weight-tied residual operator $f_\theta$, seeking a fixed point $z^* = f_\theta(z^*, x)$. This is equivalent to solving $g_\theta(z) = f_\theta(z, x) - z = 0$ for $z^*$. Gradients of a downstream loss $\ell$ are obtained using the implicit function theorem:

$$\frac{\partial \ell}{\partial \theta} = \frac{\partial \ell}{\partial z^*} \left( I - \frac{\partial f_\theta(z^*, x)}{\partial z^*} \right)^{-1} \frac{\partial f_\theta(z^*, x)}{\partial \theta}.$$
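The fixed-point view can be sketched in a few lines of NumPy. The operator below, a single tanh layer with randomly initialized illustrative weights scaled to keep the map contractive, stands in for any weight-tied residual block:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
# Illustrative weight-tied operator f_theta(z, x) = tanh(W z + U x + b).
# The 0.3 scaling keeps the map contractive so plain iteration converges.
W = 0.3 * rng.standard_normal((d, d)) / np.sqrt(d)
U = rng.standard_normal((d, d)) / np.sqrt(d)
b = rng.standard_normal(d)
x = rng.standard_normal(d)

def f(z):
    return np.tanh(W @ z + U @ x + b)

z = np.zeros(d)
for _ in range(200):
    z_next = f(z)
    if np.linalg.norm(z_next - z) < 1e-10:
        break
    z = z_next

# At the equilibrium z*, the residual g(z) = f(z) - z vanishes.
print(np.linalg.norm(f(z) - z))
```

This mirrors the root-finding formulation: the loop halts once $z$ is (numerically) a root of $g_\theta$.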
2.2 Unimodal Projections
For each modality $i$, a “purification” operator $f_{\theta_i}$ refines the unimodal state $z_i$ recursively:

$$z_i^{[t+1]} = f_{\theta_i}\big(z_i^{[t]}, x_i\big), \quad t = 0, 1, 2, \ldots$$

Here, $f_{\theta_i}$ consists of a succession of GroupNorm, ReLU nonlinearities, and learned weights/biases, transforming $z_i^{[t]}$ and $x_i$ into cleaned, higher-level representations. Its fixed point $z_i^*$ is the unimodal equilibrium.
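A minimal stand-in for such a purification operator is sketched below. The exact layer stacking is an assumption for illustration (here GroupNorm is applied to the injected features and the recursion is scaled to stay contractive; the paper's precise composition may differ):

```python
import numpy as np

rng = np.random.default_rng(1)
d, groups = 8, 2
# Illustrative (not the paper's) parameters; scaling keeps the map contractive.
W = 0.3 * rng.standard_normal((d, d)) / np.sqrt(d)
U = rng.standard_normal((d, d)) / np.sqrt(d)
b = rng.standard_normal(d)
x = rng.standard_normal(d)            # modality-i input features x_i

def group_norm(v, groups, eps=1e-5):
    """Per-group standardization of a 1-D feature vector."""
    vg = v.reshape(groups, -1)
    vg = (vg - vg.mean(axis=1, keepdims=True)) / np.sqrt(vg.var(axis=1, keepdims=True) + eps)
    return vg.reshape(-1)

def purify(z):
    # ReLU + learned affine on the state, with the GroupNorm-ed modality
    # input injected so the fixed point stays anchored to x_i.
    return W @ np.maximum(z, 0.0) + U @ group_norm(x, groups) + b

z = np.zeros(d)
for _ in range(100):
    z = purify(z)        # z_i^{t+1} = f_{theta_i}(z_i^t, x_i)

print(np.linalg.norm(purify(z) - z))  # residual at the unimodal equilibrium
```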
2.3 Purify-then-Combine Fusion Operator
At each recursion step $t$, a provisional fused state $z_{\text{fuse}}^{[t]}$ is updated as follows:
- Gating: for each modality $i$, compute a soft gate $g_i^{[t]} = \sigma\big(W_{g,i}\,[z_i^{[t]}; z_{\text{fuse}}^{[t]}] + b_{g,i}\big)$.
- Purification: $\tilde{z}_i^{[t]} = g_i^{[t]} \odot z_i^{[t]}$ (element-wise gating).
- Combination and injection: compute the learned sum $s^{[t]} = \sum_{i=1}^{n} w_i\,\tilde{z}_i^{[t]}$, then inject it into the fused update $z_{\text{fuse}}^{[t+1]} = f_{\theta_{\text{fuse}}}\big(s^{[t]}, z_{\text{fuse}}^{[t]}\big)$.
This fusion operator thus models non-linear, dynamic modality selection and interaction at each step.
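One update of a purify-then-combine step can be sketched as follows. The gate parameterization and all weights (`W_g`, `w`, `W_f`) are hypothetical, illustrative choices, not the paper's exact layers:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 8, 3                                      # latent dim, number of modalities
z = [rng.standard_normal(d) for _ in range(n)]   # current unimodal states z_i
z_f = np.zeros(d)                                # current fused state
# Illustrative parameters: per-modality gate weights over [z_i ; z_fuse],
# scalar combination weights, and a small fused-update matrix.
W_g = [rng.standard_normal((d, 2 * d)) / np.sqrt(2 * d) for _ in range(n)]
w = np.full(n, 1.0 / n)
W_f = 0.3 * rng.standard_normal((d, d)) / np.sqrt(d)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fuse_step(z, z_f):
    # Gating: g_i scores each feature's relevance given z_i and z_fuse.
    g = [sigmoid(W_g[i] @ np.concatenate([z[i], z_f])) for i in range(n)]
    # Purification: element-wise gating suppresses noisy/redundant features.
    purified = [g[i] * z[i] for i in range(n)]
    # Combination and injection: learned sum, then a nonlinear fused update.
    s = sum(w[i] * purified[i] for i in range(n))
    return np.tanh(W_f @ s)

for _ in range(50):
    z_f = fuse_step(z, z_f)   # iterate the fused state toward its equilibrium
```

Because the gates are recomputed at every iteration from the current fused state, modality weighting is dynamic rather than fixed once at fusion time.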
2.4 Joint Equilibrium Formulation
The DEQ fusion problem seeks a joint solution at which all residuals vanish simultaneously:

$$z_i^* = f_{\theta_i}\big(z_i^*, x_i\big), \quad i = 1, \ldots, n, \qquad z_{\text{fuse}}^* = f_{\theta_{\text{fuse}}}\big(s^*, z_{\text{fuse}}^*\big),$$

where $s^* = \sum_{i=1}^{n} w_i\, g_i^* \odot z_i^*$ is the purified-then-combined state at equilibrium.
Unimodal projections act independently; only the fusion operator introduces modality interconnections.
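Concretely, the joint system can be packed into one stacked state vector whose residual must vanish. The shapes and tanh blocks below are illustrative stand-ins for the actual operators:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 4, 2
x = [rng.standard_normal(d) for _ in range(n)]
# Illustrative stand-ins for f_theta_i and f_theta_fuse (not the paper's layers).
W_i = [0.3 * rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n)]
W_f = 0.3 * rng.standard_normal((d, (n + 1) * d)) / np.sqrt((n + 1) * d)

def joint_step(state):
    """One synchronous update of the stacked state [z_1; ...; z_n; z_fuse]."""
    z = [state[i * d:(i + 1) * d] for i in range(n)]
    # Unimodal purification: each z_i evolves independently of other modalities.
    z_new = [np.tanh(W_i[i] @ z[i] + x[i]) for i in range(n)]
    # Fusion: the only place where modalities (and z_fuse itself) interact.
    z_f_new = np.tanh(W_f @ np.concatenate(z_new + [state[n * d:]]))
    return np.concatenate(z_new + [z_f_new])

state = np.zeros((n + 1) * d)
for _ in range(200):
    state = joint_step(state)

# All residuals vanish simultaneously at the joint equilibrium.
print(np.linalg.norm(joint_step(state) - state))
```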
2.5 Implicit Modeling of Cross-Layer Correlations
At shallow iterations, $z_{\text{fuse}}^{[t]}$ encodes local, low-level modality alignment; as $t \to \infty$, high-level semantic interactions are iteratively composed into a unified representation. The equilibrium $z_{\text{fuse}}^*$ thus encodes information from an effectively infinite-depth stack of both unimodal and fused projections.
3. Solver, Training, and Regularization
3.1 Fixed-Point Solver: Anderson Acceleration
Solving for the (possibly high-dimensional) joint equilibrium $z^*$ is performed with Anderson acceleration, a multisecant method closely related to Broyden’s method and well suited to vector-valued fixed-point problems. At each iteration, Anderson acceleration uses the $m$ most recent pairs $\big(z^{[k]}, f(z^{[k]})\big)$ to extrapolate the next update, enhancing convergence. The convergence criterion is typically a relative residual $\|f(z^{[t]}) - z^{[t]}\| / \|z^{[t]}\| < \varepsilon$ for tolerance $\varepsilon$.
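A minimal NumPy sketch of Anderson acceleration (regularized least-squares form, with the damping/mixing parameter omitted), applied to a toy contractive map standing in for the fusion residual:

```python
import numpy as np

def anderson(f, z0, m=5, max_iter=50, tol=1e-6, lam=1e-8):
    """Anderson acceleration for z = f(z) on 1-D arrays (minimal sketch)."""
    Z, F = [z0], [f(z0)]
    for _ in range(max_iter):
        hist = min(m, len(Z))
        # Columns are the most recent residuals g_j = f(z_j) - z_j.
        G = np.stack([F[-j] - Z[-j] for j in range(1, hist + 1)], axis=1)
        # Minimize ||G a|| subject to sum(a) = 1, with a small Tikhonov
        # term for numerical stability: a ∝ (G^T G + lam I)^{-1} 1.
        H = G.T @ G + lam * np.eye(hist)
        a = np.linalg.solve(H, np.ones(hist))
        a /= a.sum()
        # Extrapolate using the same mixture of past evaluations f(z_j).
        z_new = sum(ai * F[-j] for ai, j in zip(a, range(1, hist + 1)))
        Z.append(z_new)
        F.append(f(z_new))
        rel_res = np.linalg.norm(F[-1] - Z[-1]) / (np.linalg.norm(Z[-1]) + 1e-12)
        if rel_res < tol:
            break
    return Z[-1]

# Toy contractive fixed-point problem with illustrative random weights.
rng = np.random.default_rng(0)
d = 16
A = 0.3 * rng.standard_normal((d, d)) / np.sqrt(d)
b = rng.standard_normal(d)
g = lambda z: np.tanh(A @ z + b)
z_star = anderson(g, np.zeros(d))
print(np.linalg.norm(g(z_star) - z_star))  # residual at the equilibrium
```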
3.2 Implicit Differentiation
Once the equilibrium $z^*$ is found, parameter gradients are computed indirectly via the implicit function theorem. Each $\theta_i$ and $\theta_{\text{fuse}}$ is updated using derivatives governed by the equilibrium dynamics. For all modality parameters $\theta_i$:

$$\frac{\partial \ell}{\partial \theta_i} = \frac{\partial \ell}{\partial z^*} \left( I - \frac{\partial f_\theta(z^*, x)}{\partial z^*} \right)^{-1} \frac{\partial f_{\theta_i}(z^*, x_i)}{\partial \theta_i}.$$
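This can be checked numerically on a toy one-layer DEQ, $f(z; W) = \tanh(Wz + x)$, whose Jacobian at the equilibrium is available in closed form. The implicit gradient is compared against a finite-difference derivative taken through the full solver (all weights below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
W = 0.3 * rng.standard_normal((d, d)) / np.sqrt(d)
x = rng.standard_normal(d)

def solve_fp(W):
    """Fully relax z = tanh(W z + x) by plain iteration (toy DEQ layer)."""
    z = np.zeros(d)
    for _ in range(500):
        z = np.tanh(W @ z + x)
    return z

def loss(W):
    return 0.5 * np.sum(solve_fp(W) ** 2)

# Implicit function theorem: with J = diag(1 - z*^2) W the Jacobian of f
# at z*, solve (I - J^T) u = dl/dz*, then dl/dW_{cb} = u_c (1 - z*_c^2) z*_b.
z_star = solve_fp(W)
J = np.diag(1.0 - z_star ** 2) @ W
u = np.linalg.solve(np.eye(d) - J.T, z_star)   # dl/dz* = z* for this loss
grad_implicit = np.outer((1.0 - z_star ** 2) * u, z_star)

# Check one entry against a finite difference through the solver itself.
eps = 1e-6
W_pert = W.copy()
W_pert[0, 1] += eps
fd = (loss(W_pert) - loss(W)) / eps
print(abs(fd - grad_implicit[0, 1]))  # agreement up to finite-difference error
```

Note that the gradient never requires storing the solver's intermediate iterates, only the equilibrium itself.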
3.3 Losses and Regularization
Standard downstream losses (e.g., cross-entropy, MAE) are used. Jacobian regularization (a penalty on the Jacobian norm $\|J_f\|$, with weight in the range 0.01–20) controls the tradeoff between expressivity and stability. Dropout and early stopping are applied on smaller datasets (e.g., BRCA) to mitigate overfitting. Learning rates are decoupled, commonly with a lower rate for the fusion module and a higher rate for the encoders.
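A common way to implement such a penalty in the DEQ literature is a Hutchinson-style stochastic estimate of $\|J\|_F^2$ built from Jacobian-vector products; the sketch below (illustrative toy layer, finite-difference JVPs) assumes that form rather than reproducing the paper's exact regularizer:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = 0.3 * rng.standard_normal((d, d)) / np.sqrt(d)
x = rng.standard_normal(d)
f = lambda z: np.tanh(W @ z + x)        # toy DEQ layer
z = rng.standard_normal(d)              # evaluation point (e.g., near z*)

def jac_frob_sq_estimate(f, z, samples=500, eps=1e-4):
    """Hutchinson estimator: E_u ||J u||^2 = ||J||_F^2 for u ~ N(0, I)."""
    total = 0.0
    for _ in range(samples):
        u = rng.standard_normal(len(z))
        # Jacobian-vector product J u via central finite differences.
        jvp = (f(z + eps * u) - f(z - eps * u)) / (2 * eps)
        total += np.sum(jvp ** 2)
    return total / samples

est = jac_frob_sq_estimate(f, z)
exact = np.sum((np.diag(1.0 - f(z) ** 2) @ W) ** 2)  # ||J||_F^2 in closed form
print(est, exact)  # stochastic estimate vs analytic value
```

In training, the estimate (computed with autodiff JVPs rather than finite differences) is simply added to the task loss with the chosen weight.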
4. Architectural and Computational Details
4.1 Integration and Hyperparameters
DEQ fusion layers can directly replace previous fusion blocks, such as concatenation plus MLP, bilinear pooling, or attention modules. Upstream/unimodal encoders and downstream prediction heads require no modification. Common hyperparameters include:
- Latent dimension set to backbone feature size (e.g., 512 or 1024)
- Anderson memory $m$ of up to 10 past iterates
- Maximum solver steps up to 100 (typically under 30 at inference)
- Convergence tolerance $\varepsilon$ on the relative residual
- Jacobian regularization weight in the range 0.01–20
- Batch size as permitted by hardware, since memory usage is nearly constant per step
4.2 Computational Complexity
Each solver iteration entails $n$ unimodal block evaluations and one fused block evaluation, yielding cost $O\big(T(n+1)C\big)$ per forward pass, where $C$ is the per-layer computation and $T$ the number of solver iterations. Despite the iterative computation, memory usage is approximately constant: intermediate states need not be stored, only the current activations and solver state, which supports large effective depth without the memory penalties typical of explicit deep stacks.
5. Experimental Evaluation
Empirical validation encompasses five benchmarks across various modalities and tasks, with original unimodal encoders retained and only the fusion module swapped for DEQ fusion.
| Dataset | Modalities | SOTA Backbone | Metrics | Baseline | +DEQ Fusion | Main Gain |
|---|---|---|---|---|---|---|
| BRCA | mRNA, DNAm, miRNA | MM-Dynamics | Acc, wF1, mF1 | 87.7, 88.0, 84.5 | 89.1, 89.7, 87.6 | +1.4pp Acc, +1.7pp wF1, +3.1pp mF1 |
| MM-IMDB | Poster, Text | Late-fusion baseline | μF1, mF1 | 59.02, 50.27 | 61.52, 53.38 | +2.50pp μF1, +3.11pp mF1 |
| CMU-MOSI | Audio, Text | CM-BERT | Acc-7, Acc-2, F1 | 44.9, 84.5, 84.5 | 46.1, 85.4, 85.4 | SOTA on all three; correlation and MAE also improve |
| SUN RGB-D | RGB, Point-Cloud | ImVoteNet | [email protected], [email protected] | 61.9, 45.6 | 62.7, 46.4 | +0.8 mAP points both |
| VQA-v2 | Image, Text | Mutan, MCAN | Yes/No, Number, Other, Overall | Mutan: 63.73, MCAN: 67.02 | Mutan: 64.57, MCAN: 67.38 | Consistent accuracy improvement |
Ablative studies (BRCA) indicate that disabling iterative equilibrium, fusion, or unimodal purification degrades performance. Gating is essential to peak accuracy. Convergence is reliable within 20 Anderson steps, rapidly stabilizing the equilibrium.
6. Analysis, Significance, and Limitations
DEQ fusion exhibits several distinctive properties:
- Adaptive recursion: Instead of prespecified network depth, the fixed-point formulation recurses only as needed for each instance, emulating an effectively infinite-depth network.
- Joint stability: By equilibrating unimodal and fusion outputs, feature representations become mutually consistent and less prone to drift or instability, even as the feature combination process recursively evolves.
- Dynamic gating: Modality-specific gating adaptively weights contributions in every iteration, allowing the model to ignore irrelevant or redundant features dynamically.
- Hierarchical correlation modeling: Information from all recursion depths and their interactions are folded into the final equilibrium state, allowing representations to encode both fine-grained and abstract multimodal interactions.
DEQ fusion is particularly effective where modalities interact in complex, nonlinear ways, and where static fusion architecture shows either underfitting or overfitting tendencies. Nevertheless, the approach introduces additional solver overhead per inference step (although warm-starting and acceleration reduce practical impact), demands careful regularization and learning rate control, and may require further stabilization (e.g., spectral norm constraints) for arbitrarily complex operators.
Potential research avenues include combining DEQ fusion with large, pretrained multimodal backbones, integrating faster or learned solving strategies, and hybrid implicit–explicit architectures where only selected fusion layers are equilibrated (Ni et al., 2023).
In summary, Deep Equilibrium Multimodal Fusion delivers a dynamic, recursive, and plug-and-play framework for modality unification, attaining state-of-the-art results across diverse multimodal challenges while leveraging a mathematically principled equilibrium architecture.