
Integrity-weighted Cross-modal Completion

Updated 28 November 2025
  • Integrity-weighted Cross-modal Completion is a dynamic fusion mechanism that prioritizes modalities based on input integrity and reconstruction quality.
  • It employs adaptive cross-modal attention to integrate language, acoustic, and visual cues, avoiding static concatenation methods.
  • Empirical evaluations show that the method significantly improves robustness and sentiment prediction accuracy even under severe modality missingness.

Integrity-guided Adaptive Fusion (IF) is a dynamic multimodal fusion mechanism that directs attention and modality selection during prediction based on learned measures of input completeness ("integrity") and reconstruction quality. Developed as a core component within the Senti-iFusion framework for robust Multimodal Sentiment Analysis (MSA) under conditions of inter- and intra-modality missingness, IF addresses real-world scenarios where language, acoustic, or visual inputs may be partially observed or corrupted. IF eschews simple concatenation or fixed-weight averaging, instead leveraging modality integrity and adaptive cross-modal attention to maximize both the reliability and informativeness of fused predictions (Li et al., 21 Nov 2025).

1. Theoretical Foundation and Design Objectives

A primary objective of Integrity-guided Adaptive Fusion is to improve MSA robustness by dynamically prioritizing modalities that are empirically most "complete" and semantically well-recovered at fusion time. The mechanism fulfills three guiding principles:

  1. Dynamic Modality Prioritization: At each prediction step, select the dominant modality whose observed tokens are least masked or corrupted and whose reconstructed embedding most closely aligns with ground-truth features.
  2. Fallback to Auxiliary Modalities: When the dominant modality's integrity or feature quality is suboptimal, IF enables the model to rely more heavily on semantic cues available in auxiliary modalities.
  3. Cross-modal Attention Mechanism: Rather than relying on concatenation or static weighting, IF applies cross-modal attention, letting data-driven resemblance between query and key representations determine fusion weights.

Completeness is operationalized via a learned integrity score $\hat{I}_m \in [0,1]$ for each modality $m$, predicting token unmasking at inference. Quality is enforced via dual reconstruction losses at the semantic and feature levels.

2. Mathematical Formulation

2.1 Integrity Scoring and Loss

For each modality $m$, the integrity estimation head (2-layer Transformer + linear projection) computes

\hat{I}_m = \mathcal{E}^{ie}_m(\mathrm{Concat}(X_{ie}, \hat{u}_m))

where $\hat{u}_m$ are incomplete embeddings and $X_{ie}$ is a learned [IE] token.

The module is supervised by

\mathcal{L}_{ie} = \frac{1}{N}\sum_{k=1}^{N} \|\hat{I}_m^k - I_m^k\|_2^2,

with $I_m^k = 1 - \text{(fraction of masked tokens)}$.
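As a minimal sketch of this supervision (assuming predicted scores and per-sample mask fractions arrive as plain arrays; the function name is illustrative), the integrity loss reduces to an MSE against one minus the masked fraction:

```python
import numpy as np

def integrity_loss(pred_integrity, mask_fraction):
    """MSE between predicted integrity scores and the ground-truth
    integrity I_m^k = 1 - (fraction of masked tokens)."""
    pred = np.asarray(pred_integrity, dtype=float)
    target = 1.0 - np.asarray(mask_fraction, dtype=float)
    return float(np.mean((pred - target) ** 2))
```

With predictions [0.8, 0.5] and mask fractions [0.25, 0.5], the targets are [0.75, 0.5] and the loss is 0.00125.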

2.2 Reconstruction Quality Regularization

No explicit scalar quality score $q_m$ is defined; instead, dual-depth completion is enforced via:

  • Feature-level MSE: $\mathcal{L}_{mse}^g = \frac{1}{N}\sum_m \|\tilde{u}_m - u_m\|^2$
  • Semantic-level MSE: $\mathcal{L}_{mse}^s = \frac{1}{N}\sum_m \|\hat{h}_m^s - h_m^s\|^2$
  • Feature- and semantic-level MI losses:

\mathcal{L}_{mi}^g = -\sum_m MI(\tilde{u}_m, u_m)

\mathcal{L}_{mi}^s = -\sum_m MI(\hat{h}_m^s, h_m^s)

These drive shared and completed features toward maximally reconstructing both fine-grained and semantically structured cues required for sentiment prediction.
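The two MSE terms can be written out directly (a hedged illustration with illustrative names; the MI terms require a mutual-information estimator such as InfoNCE and are omitted here):

```python
import numpy as np

def dual_depth_mse(u_tilde, u, h_hat_s, h_s):
    """Feature-level and semantic-level MSE terms of the dual-depth
    completion objective; MI terms are left to a separate estimator."""
    l_feat = float(np.mean((np.asarray(u_tilde) - np.asarray(u)) ** 2))
    l_sem = float(np.mean((np.asarray(h_hat_s) - np.asarray(h_s)) ** 2))
    return l_feat, l_sem
```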

2.3 Dominant Modality Selection

Batchwise mean integrity guides selection:

\mu_m = \frac{1}{B} \sum_{i=1}^{B} \hat{I}_m^{(i)}

The dominant modality is $dom = \arg\max_{m \in \{l,a,v\}} \mu_m$, with the remaining modalities as auxiliaries.
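Batchwise selection is a simple argmax over mean scores; a sketch assuming per-sample integrity scores keyed by modality name:

```python
import numpy as np

def select_dominant(integrity):
    """integrity: dict mapping modality name -> per-sample scores (B,).
    Returns the dominant modality (highest batch-mean integrity)
    and the list of auxiliary modalities."""
    means = {m: float(np.mean(scores)) for m, scores in integrity.items()}
    dom = max(means, key=means.get)
    return dom, [m for m in means if m != dom]
```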

2.4 Adaptive Cross-Modal Attention for Fusion

Dominant features $h_{dom}^1 = \tilde{h}_{dom}$ are processed by two Transformer layers, yielding $h_{dom}^3$. For three fusion steps:

  • Queries $Q_{dom}$ from transformed dominant features
  • Keys/values $K_m, V_m$ from auxiliary modality surrogates $\hat{h}_m^{sur}$
  • Attention weights:

\gamma_m = \mathrm{softmax}(Q_{dom} K_m^\top / \sqrt{d_k})

  • Fused representations updated as:

h_{fuse}^j = h_{fuse}^{j-1} + \sum_{m \in \{a_1,a_2\}} \gamma_m V_m

Following fusion, special [CLS]-style tokens are prepended and the sequence is passed to a cross-modal Transformer and linear prediction head.
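One fusion round amounts to a residual cross-attention update over the auxiliary surrogates; a numpy sketch (the projection matrices $W_Q$, $W_K$, $W_V$ are assumed to have been applied already, and shapes are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fusion_round(h_fuse, q_dom, aux_kv):
    """Residual update h_fuse += sum_m gamma_m @ V_m with
    gamma_m = softmax(Q K_m^T / sqrt(d_k)).
    aux_kv: list of (K_m, V_m) pairs from auxiliary surrogates."""
    d_k = q_dom.shape[-1]
    out = h_fuse.copy()
    for k_m, v_m in aux_kv:
        gamma = softmax(q_dom @ k_m.T / np.sqrt(d_k))  # (T_q, T_k)
        out = out + gamma @ v_m                        # residual add
    return out
```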

2.5 Two-Stage Progressive Training

  • Stage 1: Minimize $\mathcal{L}_{stage1} = \alpha \cdot \mathcal{L}_{ie} + \beta \cdot \mathcal{L}_{rec}$ (with frozen predictor)
  • Stage 2: Optimize

\mathcal{L}_{stage2} = \alpha \cdot \mathcal{L}_{ie} + \beta \cdot \mathcal{L}_{rec} + \sigma \cdot \mathcal{L}_{pred}

Only after the initial stage are all model parameters, including fusion and predictor, trained jointly.
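The staged objective amounts to gating the prediction term on the training stage; a sketch with the loss weights reported in Section 4 (the function name is hypothetical):

```python
def staged_loss(l_ie, l_rec, l_pred, stage, alpha=0.9, beta=0.4, sigma=1.0):
    """Stage 1 trains integrity + completion only; stage 2 adds the
    prediction loss and trains all parameters end-to-end."""
    loss = alpha * l_ie + beta * l_rec
    if stage == 2:
        loss += sigma * l_pred
    return loss
```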

3. Architectural Composition and Data Flow

The mechanistic workflow of IF within Senti-iFusion is as follows:

  1. Multimodal inputs $(U_l, U_a, U_v)$ undergo random inter- and intra-modality masking to produce $\tilde{U}$.
  2. Embedding encoders $\mathcal{E}_{emb}^m$ generate modality-specific embeddings $\hat{u}_m$.
  3. The integrity estimation module $\mathcal{E}_{ie}^m$ outputs integrity scores $\hat{I}_m$.
  4. The completion module:
    • Disentangles each $\hat{u}_m$ into shared ($\hat{h}_m^s$) and private ($\hat{h}_m^p$) features.
    • Constructs the surrogate $\hat{h}_m^{sur} = \hat{I}_m \cdot \hat{u}_m + (1-\hat{I}_m)\cdot(\hat{h}_{m_1}^s+\hat{h}_{m_2}^s)$.
    • Decodes surrogates and re-encodes to apply dual-depth losses.
  5. Integrity-guided Adaptive Fusion:
    • Aggregates $\hat{I}_m$ scores over the batch and selects the dominant modality.
    • Processes this dominant modality through Transformer layers and applies attention with auxiliary modalities using learned keys/values.
    • Updates the intermediate fused representation for three fusion rounds.
  6. The concatenated [CLS]-like tokens and representations are input to a cross-modal Transformer, with the resulting pooled token mapped to the final sentiment prediction.
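The surrogate construction in step 4 is an integrity-gated convex combination; a numpy sketch (names illustrative):

```python
import numpy as np

def build_surrogate(i_m, u_m, h_s_aux1, h_s_aux2):
    """hat_h^sur = I_m * hat_u_m + (1 - I_m) * (h^s_{m1} + h^s_{m2}):
    high-integrity inputs keep their own embedding; low-integrity ones
    fall back on the shared features of the other two modalities."""
    u_m = np.asarray(u_m, dtype=float)
    fallback = np.asarray(h_s_aux1, dtype=float) + np.asarray(h_s_aux2, dtype=float)
    return i_m * u_m + (1.0 - i_m) * fallback
```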

A summary of module sequence and functions is given below:

| Step | Module | Key Output / Action |
|------|--------|---------------------|
| 1 | Embedding Encoder ($\mathcal{E}_{emb}^m$) | Incomplete modality embeddings ($\hat{u}_m$) |
| 2 | Integrity Estimation ($\mathcal{E}_{ie}^m$) | Integrity scores ($\hat{I}_m$) |
| 3 | Completion/Disentanglement | Shared and private features; surrogates |
| 4 | Dual-depth Loss Application | Semantic- and feature-level consistency |
| 5 | IF Module | Fusion via dominant selection + cross-attention |
| 6 | Prediction Head | Sentiment prediction ($\hat{y}$) |

4. Algorithmic Steps and Implementation

The pseudocode for IF is as follows:

for m in {l, a, v}:
    hat_u_m = E_emb^m(Concat([E_], tilde_U_m))
    hat_I_m = E_ie^m(Concat([I_], hat_u_m))  # integrity score
    (hat_h_m^s, hat_h_m^p) = (E^s_m(hat_u_m), E^p_m(hat_u_m))

Compute L_rec^{enc} via similarity & difference losses

for m in {l, a, v}:
    hat_h_m^{sur} = hat_I_m * hat_u_m + (1 - hat_I_m) * sum_{n != m} hat_h_n^s
    tilde_u_m = D_m(hat_h_m^{sur})
    hat_h_m^{s,rec} = E^s_m(tilde_u_m)

Compute dual-depth L_rec^{dec} (MSE + MI)

if epoch <= pretrain_epochs:
    L = alpha * L_ie + beta * L_rec
    # update integrity & completion modules only
else:
    # Adaptive fusion
    mu_m = mean over batch of hat_I_m
    dom = argmax_m mu_m
    h_dom = hat_h_dom^{sur}
    for i in {2, 3}:
        h_dom = E_dom^i(h_dom)
    h_fuse = h_dom
    for j in {1, 2, 3}:
        Q = h_dom @ W_Q
        for each auxiliary m != dom:
            K_m = hat_h_m^{sur} @ W_K^m
            V_m = hat_h_m^{sur} @ W_V^m
            gamma_m = softmax(Q @ K_m.T / sqrt(d_k))
        h_fuse = h_fuse + sum_m gamma_m * V_m
    H_dom = Concat([D_], h_dom)
    H_fuse = Concat([F_], h_fuse)
    h_pred = E_pred(H_dom, H_fuse)
    y_hat = Linear(h_pred)
    L_pred = MSE(y_hat, y_true)
    L = alpha * L_ie + beta * L_rec + sigma * L_pred
    # update all modules end-to-end

Exact training configurations include input length $T=8$, hidden dimension $d=128$, batch size 64, AdamW optimizer with learning rate $10^{-4}$ and weight decay $10^{-4}$, cosine annealing, warm-up, and early stopping. Loss weights: $\alpha=0.9$, $\beta=0.4$, $\sigma=1.0$; decoder weights $\lambda_{mse}^g=0.5$, $\lambda_{mi}^g=0.4$, $\lambda_{mse}^s=0.3$, $\lambda_{mi}^s=0.2$ (Li et al., 21 Nov 2025).
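The weighted reconstruction objective implied by these decoder weights can be written out as a small helper (a sketch using the reported lambda values; the function name is illustrative):

```python
def completion_loss(mse_g, mi_g, mse_s, mi_s,
                    lam_mse_g=0.5, lam_mi_g=0.4, lam_mse_s=0.3, lam_mi_s=0.2):
    """Weighted sum of the feature-level and semantic-level MSE and MI
    terms, with the decoder weights reported above as defaults."""
    return (lam_mse_g * mse_g + lam_mi_g * mi_g
            + lam_mse_s * mse_s + lam_mi_s * mi_s)
```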

5. Empirical Results and Ablation Findings

Empirical evaluations demonstrate that IF consistently enhances robustness under modality missingness, particularly compared to static fusion approaches. Under a 50% drop rate on the MOSI and MOSEI datasets:

  • Omitting integrity-weighted surrogates ("w/o IIR") increases MAE from 1.1554 to 1.1740 and lowers F1.
  • Eliminating the integrity loss ("w/o $\mathcal{L}_{ie}$") reduces Acc-7 by ~3% (MOSI).
  • Removing the dual-depth completion loss ("w/o $\mathcal{L}_{rec}^{dec}$") decreases F1 by 0.01 to 0.02.
  • Bypassing the two-stage strategy degrades both MAE and accuracy.

Robustness analysis under extreme missingness (drop rate approaching 0.9) reveals that IF prevents the model from collapsing to majority-class prediction, preserving fine-grained performance (lower MAE, higher Acc-5 and F1). These ablations establish the essentiality of integrity-guided fusion: it is the component that enables reliable adaptation to modality reliability at test time, outperforming competitive baselines in all missing-modality settings (Li et al., 21 Nov 2025).

6. Implications and Significance

Integrity-guided Adaptive Fusion redefines modality fusion as an integrity- and quality-aware, sequence-level adaptive process. The mechanism’s dynamic selection and attention-driven fusion repeatedly prove essential for fine-grained sentiment prediction in the presence of substantial multimodal noise or missingness. A plausible implication is that related multimodal inference tasks (e.g., action recognition, emotion detection) may similarly benefit from incorporating both explicit integrity estimation and cross-modal completion with attention, rather than relying on static or imputed representations.

The modular two-stage training—first focusing on stable integrity/quality modeling, then end-to-end fusion—contributes to training stability and overall accuracy. As demonstrated, IF maintains granular predictive performance even as competing techniques degrade under increasing input uncertainty (Li et al., 21 Nov 2025).
