Integrity-weighted Cross-modal Completion
- Integrity-weighted Cross-modal Completion is a dynamic fusion mechanism that prioritizes modalities based on input integrity and reconstruction quality.
- It employs adaptive cross-modal attention to integrate language, acoustic, and visual cues, avoiding static concatenation methods.
- Empirical evaluations show that the method significantly improves robustness and sentiment prediction accuracy even under severe modality missingness.
Integrity-guided Adaptive Fusion (IF) is a dynamic multimodal fusion mechanism that directs attention and modality selection during prediction based on learned measures of input completeness ("integrity") and reconstruction quality. Developed as a core component within the Senti-iFusion framework for robust Multimodal Sentiment Analysis (MSA) under conditions of inter- and intra-modality missingness, IF addresses real-world scenarios where language, acoustic, or visual inputs may be partially observed or corrupted. IF eschews simple concatenation or fixed-weight averaging, instead leveraging modality integrity and adaptive cross-modal attention to maximize both the reliability and informativeness of fused predictions (Li et al., 21 Nov 2025).
1. Theoretical Foundation and Design Objectives
A primary objective of Integrity-guided Adaptive Fusion is to improve MSA robustness by dynamically prioritizing modalities that are empirically most "complete" and semantically well-recovered at fusion time. The mechanism fulfills three guiding principles:
- Dynamic Modality Prioritization: At each prediction step, select the dominant modality whose observed tokens are least masked or corrupted and whose reconstructed embedding most closely aligns with ground-truth features.
- Fallback to Auxiliary Modalities: When the dominant modality's integrity or feature quality is suboptimal, IF enables the model to rely more heavily on semantic cues available in auxiliary modalities.
- Cross-modal Attention Mechanism: Rather than relying on concatenation or static weighting, IF applies cross-modal attention, letting data-driven resemblance between query and key representations determine fusion weights.
Completeness is operationalized via a learned integrity score $\hat{I}_m$ for each modality $m \in \{l, a, v\}$, predicting the unmasked token ratio at inference. Quality is enforced via dual reconstruction losses at the semantic and feature levels.
2. Mathematical Formulation
2.1 Integrity Scoring and Loss
For each modality $m \in \{l, a, v\}$, the integrity estimation head (a 2-layer Transformer plus linear projection) computes
$$\hat{I}_m = E^m_{ie}\big(\mathrm{Concat}([\mathrm{IE}], \hat{u}_m)\big),$$
where $\hat{u}_m$ are the incomplete embeddings and $[\mathrm{IE}]$ is a learned token.
The module is supervised by a regression loss $\mathcal{L}_{ie}$ between the predicted score $\hat{I}_m$ and the ground-truth unmasked token ratio $I_m$, summed over modalities.
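As a minimal sketch in plain Python, the integrity supervision might look as follows; the regression target (the unmasked token ratio) and the plain MSE form are assumptions here, since the paper's actual head is a 2-layer Transformer over a learned [IE] token:

```python
def unmasked_ratio(mask):
    """Fraction of observed (unmasked) tokens -- the supervision
    target assumed here for the integrity score."""
    return sum(mask) / len(mask)

def integrity_loss(pred_scores, masks):
    """Mean squared error between predicted integrity scores and
    the ground-truth unmasked ratios, averaged over samples."""
    targets = [unmasked_ratio(m) for m in masks]
    return sum((p - t) ** 2 for p, t in zip(pred_scores, targets)) / len(masks)
```

A perfectly calibrated score (e.g. predicting 0.75 for a sequence with three of four tokens observed) drives this loss to zero.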
2.2 Reconstruction Quality Regularization
No explicit scalar quality score is defined; instead, dual-depth completion is enforced via:
- Feature-level MSE: $\mathcal{L}^{feat}_{rec} = \sum_m \mathrm{MSE}(\tilde{u}_m, u_m)$, comparing decoded surrogates with the complete embeddings.
- Semantic-level MSE: $\mathcal{L}^{sem}_{rec} = \sum_m \mathrm{MSE}(\hat{h}^{s,rec}_m, \hat{h}^{s}_m)$, comparing re-encoded shared features with the originals.
- Feature- and semantic-level mutual-information (MI) losses that align shared representations across modalities.
These drive shared and completed features toward maximally reconstructing both fine-grained and semantically structured cues required for sentiment prediction.
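A toy rendering of the dual-depth objective; the `mse` helper, the vector layout, and the weights are illustrative, and the MI terms are omitted:

```python
def mse(a, b):
    """Elementwise mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def dual_depth_loss(decoded, target_feat, sem_rec, sem_target,
                    w_feat=1.0, w_sem=1.0):
    """Feature-level MSE (decoded surrogate vs. complete embedding)
    plus semantic-level MSE (re-encoded shared feature vs. original).
    Weights are placeholders, not the paper's settings."""
    return w_feat * mse(decoded, target_feat) + w_sem * mse(sem_rec, sem_target)
```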
2.3 Dominant Modality Selection
Batchwise mean integrity guides selection:
$$\mu_m = \frac{1}{B}\sum_{b=1}^{B} \hat{I}^{(b)}_m.$$
The dominant modality is $dom = \arg\max_m \mu_m$, with the remaining modalities as auxiliaries.
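The selection rule can be sketched directly (the dictionary-of-lists batch layout is an assumption for illustration):

```python
def select_dominant(batch_scores):
    """batch_scores: {modality: [per-sample integrity scores]}.
    Returns (dominant modality, auxiliary modalities), where the
    dominant modality maximizes the batchwise mean integrity."""
    mu = {m: sum(s) / len(s) for m, s in batch_scores.items()}
    dom = max(mu, key=mu.get)
    return dom, [m for m in mu if m != dom]
```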
2.4 Adaptive Cross-Modal Attention for Fusion
Dominant features are processed by two Transformer layers, yielding $h_{dom}$. For three fusion steps:
- Queries from transformed dominant features: $Q = h_{dom} W_Q$
- Keys/values from auxiliary modality surrogates: $K_m = \hat{h}^{sur}_m W^m_K$, $V_m = \hat{h}^{sur}_m W^m_V$
- Attention weights: $\gamma_m = \mathrm{softmax}\!\left(Q K_m^{\top} / \sqrt{d_k}\right)$
- Fused representations updated as: $h_{fuse} \leftarrow h_{fuse} + \sum_{m \neq dom} \gamma_m V_m$
Following fusion, special [CLS]-style tokens are prepended and the sequence is passed to a cross-modal Transformer and linear prediction head.
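A plain-Python sketch of one attention update, assuming a single query vector rather than the paper's full multi-head Transformer attention; the projections $W_Q$, $W_K$, $W_V$ are assumed to have been applied already:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_modal_attend(q, keys, values, d_k):
    """Scaled dot-product attention of a dominant-modality query
    vector over one auxiliary modality's key/value vectors."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
              for k in keys]
    gamma = softmax(scores)
    dim = len(values[0])
    return [sum(g * v[i] for g, v in zip(gamma, values)) for i in range(dim)]

def fuse_round(h_fuse, q, aux_kv, d_k):
    """One fusion round: residual update of h_fuse with attention
    outputs summed over auxiliaries (aux_kv = [(keys, values), ...])."""
    out = list(h_fuse)
    for keys, values in aux_kv:
        ctx = cross_modal_attend(q, keys, values, d_k)
        out = [o + c for o, c in zip(out, ctx)]
    return out
```

When one auxiliary key dominates the similarity, the attended output collapses onto that key's value, which is the data-driven resemblance behavior the mechanism relies on.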
2.5 Two-Stage Progressive Training
- Stage 1: Minimize $\mathcal{L} = \alpha \mathcal{L}_{ie} + \beta \mathcal{L}_{rec}$ (with frozen predictor)
- Stage 2: Optimize $\mathcal{L} = \alpha \mathcal{L}_{ie} + \beta \mathcal{L}_{rec} + \sigma \mathcal{L}_{pred}$
Only after the initial stage are all model parameters, including fusion and predictor, trained jointly.
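The schedule can be expressed as a small helper; the weight defaults and epoch threshold here are placeholders, not the paper's settings:

```python
def total_loss(epoch, pretrain_epochs, l_ie, l_rec, l_pred,
               alpha=1.0, beta=1.0, sigma=1.0):
    """Two-stage loss schedule: stage 1 trains the integrity and
    completion modules with the predictor frozen; stage 2 adds the
    prediction loss and trains everything end-to-end."""
    if epoch <= pretrain_epochs:
        return alpha * l_ie + beta * l_rec
    return alpha * l_ie + beta * l_rec + sigma * l_pred
```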
3. Architectural Composition and Data Flow
The mechanistic workflow of IF within Senti-iFusion is as follows:
- Multimodal inputs $U_m$ undergo random inter- and intra-modality masking to produce $\tilde{U}_m$.
- Embedding encoders generate modality-specific embeddings $\hat{u}_m$.
- The integrity estimation module outputs integrity scores $\hat{I}_m$.
- The completion module:
- Disentangles each $\hat{u}_m$ into shared features $\hat{h}^s_m$ and private features $\hat{h}^p_m$.
- Constructs surrogate $\hat{h}^{sur}_m = \hat{I}_m \hat{u}_m + (1 - \hat{I}_m) \sum_{n \neq m} \hat{h}^s_n$.
- Decodes surrogates and re-encodes to apply dual-depth losses.
- Integrity-guided Adaptive Fusion:
- Aggregates scores over the batch and selects the dominant modality.
- Processes this dominant modality through Transformer layers and applies attention with auxiliary modalities using learned keys/values.
- Updates the intermediate fused representation for three fusion rounds.
- The concatenated [CLS]-like tokens and representations are input to a cross-modal Transformer, with the resulting pooled token mapped to the final sentiment prediction.
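The completion step in the workflow above, the integrity-weighted surrogate, can be sketched as follows (list-based vectors stand in for tensors):

```python
def surrogate(i_m, u_m, shared_others):
    """Integrity-weighted surrogate for one modality: keep the
    observed embedding in proportion to its integrity score i_m,
    and fill the remainder with the sum of the other modalities'
    shared features."""
    fill = [sum(h[k] for h in shared_others) for k in range(len(u_m))]
    return [i_m * u + (1.0 - i_m) * f for u, f in zip(u_m, fill)]
```

At full integrity (i_m = 1) the surrogate reduces to the observed embedding; at zero integrity it is rebuilt entirely from cross-modal shared features.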
A summary of module sequence and functions is given below:
| Step | Module | Key Output / Action |
|---|---|---|
| 1 | Embedding Encoder ($E^m_{emb}$) | Incomplete modality embeddings ($\hat{u}_m$) |
| 2 | Integrity Estimation ($E^m_{ie}$) | Integrity scores ($\hat{I}_m$) |
| 3 | Completion/Disentanglement | Shared and private features; surrogates ($\hat{h}^{sur}_m$) |
| 4 | Dual-depth Loss Application | Semantic and feature-level consistency |
| 5 | IF Module | Fusion via dominant selection + cross-attention |
| 6 | Prediction Head | Sentiment prediction ($\hat{y}$) |
4. Algorithmic Steps and Implementation
The pseudocode for IF is as follows:
```
for m in {l, a, v}:
    hat_u_m = E_emb^m(Concat([E_], tilde_U_m))
    hat_I_m = E_ie^m(Concat([I_], hat_u_m))      # integrity score
    (hat_h_m^s, hat_h_m^p) = (E^s_m(hat_u_m), E^p_m(hat_u_m))
Compute L_rec^{enc} via similarity & difference losses

for m in {l, a, v}:
    hat_h_m^{sur} = hat_I_m * hat_u_m + (1 - hat_I_m) * sum_{n != m} hat_h_n^s
    tilde_u_m = D_m(hat_h_m^{sur})
    hat_h_m^{s,rec} = E^s_m(tilde_u_m)
Compute dual-depth L_rec^{dec} (MSE + MI)

if epoch <= pretrain_epochs:
    L = alpha * L_ie + beta * L_rec
    # update integrity & completion modules only
else:
    # Adaptive fusion
    mu_m  = mean over batch of hat_I_m
    dom   = argmax_m mu_m
    h_dom = hat_h_dom^{sur}
    for i in {2, 3}:
        h_dom = E_dom^i(h_dom)
    h_fuse = h_dom
    for j in {1, 2, 3}:
        Q = h_dom @ W_Q
        for each auxiliary m != dom:
            K_m = hat_h_m^{sur} @ W_K^m
            V_m = hat_h_m^{sur} @ W_V^m
            gamma_m = softmax(Q @ K_m.T / sqrt(d_k))
        h_fuse = h_fuse + sum_m gamma_m * V_m
    H_dom  = Concat([D_], h_dom)
    H_fuse = Concat([F_], h_fuse)
    h_pred = E_pred(H_dom, H_fuse)
    y_hat  = Linear(h_pred)
    L_pred = MSE(y_hat, y_true)
    L = alpha * L_ie + beta * L_rec + sigma * L_pred
    # update all modules end-to-end
```
Exact training configurations include batch size 64 and an AdamW optimizer with cosine annealing, warm-up, and early stopping; the input length, hidden dimension, learning rate, weight decay, loss weights $\alpha$, $\beta$, $\sigma$, and decoder weights are set as reported in (Li et al., 21 Nov 2025).
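For reference, a sketch of the cosine-annealing schedule with linear warm-up (shape only; the paper's actual base learning rate, step counts, and warm-up length are not reproduced here):

```python
import math

def lr_at(step, total_steps, base_lr, warmup_steps=0):
    """Cosine-annealed learning rate with optional linear warm-up:
    ramps linearly to base_lr, then decays to zero following a
    half-cosine over the remaining steps."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))
```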
5. Empirical Results and Ablation Findings
Empirical evaluations demonstrate that IF consistently enhances robustness under modality missingness, particularly when compared to static fusion approaches. Under modality dropping on the MOSI and MOSEI datasets:
- Omitting integrity-weighted surrogates ("w/o IIR") increases MAE and lowers F1.
- Eliminating the integrity loss ("w/o $\mathcal{L}_{ie}$") reduces Acc-7 by 3% (MOSI).
- Removing the dual-depth completion loss ("w/o $\mathcal{L}^{dec}_{rec}$") decreases F1 by 0.01–0.02.
- Bypassing the two-stage strategy degrades both MAE and accuracy.
Robustness analysis under extreme missingness (high drop rates) reveals that IF prevents the model from collapsing to majority-class prediction, preserving fine-grained performance (lower MAE, higher Acc-5 and F1). These ablations establish that integrity-guided fusion is the component enabling reliable adaptation to modality reliability at test time, outperforming competitive baselines in all missing-modality settings (Li et al., 21 Nov 2025).
6. Implications and Significance
Integrity-guided Adaptive Fusion redefines modality fusion as an integrity- and quality-aware, sequence-level adaptive process. The mechanism’s dynamic selection and attention-driven fusion repeatedly prove essential for fine-grained sentiment prediction in the presence of substantial multimodal noise or missingness. A plausible implication is that related multimodal inference tasks (e.g., action recognition, emotion detection) may similarly benefit from incorporating both explicit integrity estimation and cross-modal completion with attention, rather than relying on static or imputed representations.
The modular two-stage training—first focusing on stable integrity/quality modeling, then end-to-end fusion—contributes to training stability and overall accuracy. As demonstrated, IF maintains granular predictive performance even as competing techniques degrade under increasing input uncertainty (Li et al., 21 Nov 2025).