
MoMo: Shared Encoder for Multimodal Fusion

Updated 5 February 2026
  • The paper introduces a novel shared Transformer encoder that jointly processes text and images, achieving parameter-efficient multimodal integration.
  • MoMo employs modality tokens/features and a stagewise curriculum with masked, contrastive, and matching losses to optimize cross-modal learning.
  • Empirical results show that MoMo outperforms traditional dual-encoder systems, with notable improvements such as +81% and +94% recall gains in low-data regimes.

A shared encoder model, often referred to as "MoMo", denotes a class of architectures in which a single parameterized Transformer encoder is used to process multiple input modalities (notably text and image) jointly. The MoMo paradigm contrasts with classical dual-encoder systems that use independent networks per modality, instead seeking parameter efficiency, improved cross-modal generalization, and robust scaling behavior in data-limited regimes. Several variants have been proposed, distinguished by their modality-integration method, pretraining protocols, and task scope (Roy et al., 3 Mar 2025, Chada et al., 2023, Bao et al., 2021).

1. Unified Architecture and Modality Integration

A MoMo model is defined by its commitment to a single, shared encoder backbone. For multimodal input $x_M$ associated with modality $M$ (e.g., $M \in \{\text{text}, \text{image}\}$), the data is first tokenized: text is segmented into word pieces, images into non-overlapping patches (Roy et al., 3 Mar 2025, Chada et al., 2023). Each token is projected to a vector embedding, typically $e_M \in \mathbb{R}^{s \times d}$.
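The image-patch tokenization is easy to make concrete. Below is a minimal NumPy sketch of splitting an image into flattened, non-overlapping patch tokens; the patch size of 16 and 224×224 resolution are common ViT-style defaults assumed here for illustration, not values taken from the papers:

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened non-overlapping patch tokens."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image must tile evenly"
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (num_patches, p*p*C)
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, patch * patch * C)

# A 224x224 RGB image with 16x16 patches yields 196 tokens of dimension 768.
tokens = patchify(np.zeros((224, 224, 3)), patch=16)
print(tokens.shape)  # (196, 768)
```

Each flattened patch would then be linearly projected to the model dimension $d$, just as word pieces are embedded on the text side.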

To inform the encoder about modality, two main mechanisms are used:

  • Modality Feature Vectors: A learnable vector $v_M \in \mathbb{R}^{d_\text{mod}}$ is concatenated to each token's embedding, producing $h_M^0 = [\,[e_M^1; v_M], \ldots, [e_M^s; v_M]\,]$ and yielding sequence inputs of shape $s \times (d + d_\text{mod})$ (Roy et al., 3 Mar 2025).
  • Modality Tokens: A special token $m_M \in \mathbb{R}^d$ is prepended, resulting in $h_M^0 = [m_M, e_M^1, \ldots, e_M^s]$ and an input of length $s + 1$ (Roy et al., 3 Mar 2025, Chada et al., 2023).

The shared Transformer encoder stack $E$ with $L$ identical layers processes $h_M^0$ regardless of modality. The output at the "[CLS]" position, $z_M = E^L(h_M^0)[\text{CLS}]$, is used as a pooled representation for downstream objectives.
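The modality-token variant and the pooled readout can be sketched as follows. This is an illustrative NumPy toy, not code from the papers: a single shared weight matrix stands in for the full Transformer stack $E$, and the initialization is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # embedding dimension (toy value)

# One learnable modality token per modality (hypothetical initialization).
modality_tokens = {"text": rng.normal(size=d), "image": rng.normal(size=d)}

def add_modality_token(embeddings: np.ndarray, modality: str) -> np.ndarray:
    """Prepend the modality token m_M, giving a sequence of length s + 1."""
    m = modality_tokens[modality][None, :]          # (1, d)
    return np.concatenate([m, embeddings], axis=0)  # (s + 1, d)

# Stand-in for the shared encoder E: the SAME weights process both
# modalities (a real model would be a multi-layer Transformer).
W = rng.normal(size=(d, d))
def shared_encoder(h0: np.ndarray) -> np.ndarray:
    return np.tanh(h0 @ W)

text_emb = rng.normal(size=(5, d))   # s = 5 word-piece embeddings
h0 = add_modality_token(text_emb, "text")
z = shared_encoder(h0)[0]            # pooled representation at position 0
print(h0.shape, z.shape)             # (6, 8) (8,)
```

The modality-feature-vector variant differs only in concatenating $v_M$ along the feature axis of every token instead of prepending a token.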

An optional architectural refinement is to wrap the shared encoder with shallow modality-specific layers (either before or after the shared stack), introducing a limited degree of specialization which can further improve retrieval or classification accuracy (Roy et al., 3 Mar 2025).

A third approach—embodied in the Mixture-of-Modality-Experts (MoME/MoMo) block—routes each token through a modality-specific feedforward sub-network (expert), with hard-gating based on the sequence type and Transformer depth. Lower layers use pure vision or language experts depending on token origin; upper layers can switch to a cross-modal expert for deep fusion (Bao et al., 2021).
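The hard-gated expert routing can be sketched as follows, with illustrative expert names and a simple depth threshold; in the real MoME block these experts replace the feed-forward sub-layer inside each Transformer block, alongside shared self-attention.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8  # toy model dimension

def ffn(W1, W2, x):
    """A two-layer feed-forward expert with ReLU."""
    return np.maximum(x @ W1, 0.0) @ W2

# Separate expert weights for vision, language, and cross-modal tokens
# (hypothetical shapes: hidden width 4*d is a common Transformer default).
experts = {
    name: (rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d)))
    for name in ("vision", "language", "cross")
}

def mome_layer(tokens: np.ndarray, modality: str, layer: int, n_layers: int) -> np.ndarray:
    """Hard-gate tokens to one expert by sequence type and depth:
    lower layers use the modality-specific expert, top layers the
    cross-modal expert for deep fusion."""
    name = "cross" if layer >= n_layers - 2 else modality
    W1, W2 = experts[name]
    return ffn(W1, W2, tokens)

x = rng.normal(size=(5, d))
out_low = mome_layer(x, "vision", layer=0, n_layers=12)   # vision expert
out_top = mome_layer(x, "vision", layer=11, n_layers=12)  # cross-modal expert
```

The gating is "hard" in that each token deterministically visits exactly one expert per layer, so routing adds no extra compute relative to a plain feed-forward sub-layer.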

2. Pretraining Objectives and Optimization

Shared encoder models employ a composite training regime integrating several losses:

  • Masked Modeling Losses: Masked image modeling (MIM) with $\ell_2$ pixel reconstruction over random patch subsets, and masked language modeling (MLM) with cross-entropy over masked word pieces. These objectives promote unimodal representation robustness (Chada et al., 2023, Bao et al., 2021).
  • Cross-Modal Masked Modeling: In the context of concatenated image and text, tokens or patches from both modalities are masked simultaneously and reconstructed, enforcing joint-contextual learning (Chada et al., 2023).
  • Contrastive Loss: With paired (image, text) data, dual projections $z_I$, $z_T$ are trained via a symmetric contrastive loss. The canonical form is:

$$\mathcal{L}_\text{con} = -\frac{1}{N}\sum_{i=1}^{N} \left[ \log\frac{\exp(\langle z_I^i, z_T^i \rangle / \tau)}{\sum_{j=1}^{N} \exp(\langle z_I^i, z_T^j \rangle / \tau)} + \log\frac{\exp(\langle z_T^i, z_I^i \rangle / \tau)}{\sum_{j=1}^{N} \exp(\langle z_T^i, z_I^j \rangle / \tau)} \right]$$

where $\tau$ is a trainable temperature (Roy et al., 3 Mar 2025, Chada et al., 2023, Bao et al., 2021).

  • Image–Text Matching Loss: A binary classifier is applied to the [CLS] embedding of the joint image–text input to distinguish matched from hard-negative pairs (Chada et al., 2023, Bao et al., 2021).
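Both pair-level objectives can be sketched in NumPy. This is an illustrative implementation rather than the papers' code: here $\tau$ is fixed instead of learnable, and the matching head is a single logistic unit rather than a full classifier head.

```python
import numpy as np

def symmetric_contrastive_loss(z_i, z_t, tau=0.07):
    """Symmetric InfoNCE over N paired (image, text) projections: matched
    pairs sit on the diagonal of the N x N similarity matrix."""
    z_i = z_i / np.linalg.norm(z_i, axis=1, keepdims=True)
    z_t = z_t / np.linalg.norm(z_t, axis=1, keepdims=True)
    logits = z_i @ z_t.T / tau
    n = logits.shape[0]

    def diag_xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_prob = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_prob[np.arange(n), np.arange(n)]

    # Image->text term (rows) plus text->image term (columns), averaged over N.
    return float((diag_xent(logits) + diag_xent(logits.T)).mean())

def itm_loss(z_cls, labels, w, b=0.0):
    """Binary image-text matching loss on pooled [CLS] embeddings:
    label 1 for a true pair, 0 for a hard negative."""
    p = 1.0 / (1.0 + np.exp(-(z_cls @ w + b)))
    eps = 1e-9
    return float(-(labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps)).mean())

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 16))
aligned = symmetric_contrastive_loss(z, z)         # perfectly matched pairs
shuffled = symmetric_contrastive_loss(z, z[::-1])  # mismatched pairs
```

As a sanity check, perfectly aligned pairs yield a much lower contrastive loss than mismatched ones, since the matched similarity then dominates each softmax.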

During pretraining, these objectives may be phased in a curriculum (pure vision, then vision+text, then joint multimodal), with a gradient accumulation scheme for simultaneous optimization across modalities to prevent catastrophic forgetting (Chada et al., 2023, Bao et al., 2021).

Optimization is typically performed with AdamW, using large-scale batches when allowed by compute, heavy data augmentation, mixed-precision arithmetic, and learning rate schedules with warm-up and cosine decay (Roy et al., 3 Mar 2025, Chada et al., 2023).

3. Training Procedures and Stagewise Curriculum

Shared encoder models are pretrained in multiple successive stages, each warm-started from the previous:

  1. Image-Only Stage: Masked patch prediction on large image collections (e.g., ImageNet-1K, PMC-OA) for several epochs, solidifying visual representations (Chada et al., 2023, Roy et al., 3 Mar 2025, Bao et al., 2021).
  2. Unimodal Joint Stage: Simultaneous but modality-separated MIM and MLM, using separate batches or segment embeddings, warm-started from the vision backbone (Chada et al., 2023).
  3. Multimodal Joint Stage: Masked modeling and contrastive/matching losses computed over paired and unpaired text–image data. Random mixing and gradient accumulation are used at each step to preserve knowledge from all modalities (Chada et al., 2023, Bao et al., 2021).
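The gradient-accumulation scheme in the multimodal stage can be illustrated with a toy objective; the squared-error gradient and all names below are purely illustrative stand-ins for per-modality losses on the shared parameters.

```python
import numpy as np

rng = np.random.default_rng(4)
w = rng.normal(size=4)  # stand-in for the shared encoder's parameters

def grad_for_batch(w, batch):
    """Stand-in per-modality gradient (here: gradient of a squared error
    pulling w toward the batch mean)."""
    return 2.0 * (w - batch.mean(axis=0))

def accumulated_step(w, batches, lr=0.1):
    """One optimizer step from gradients accumulated across a mixed set of
    modality batches (e.g. image-only, text-only, paired), so no single
    modality's update overwrites the others within a step."""
    g = np.zeros_like(w)
    for batch in batches:
        g += grad_for_batch(w, batch)
    return w - lr * g / len(batches)

batches = [rng.normal(size=(8, 4)) for _ in range(3)]  # three modality batches
w_new = accumulated_step(w, batches)
```

Accumulating before stepping is equivalent to averaging the per-modality gradients, which is what mitigates catastrophic forgetting relative to alternating single-modality updates.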

In variants with expert gating (e.g., VLMo), parameters of the newly-introduced language (or vision) experts are first frozen and only updated for their dedicated modality during their stage. Joint fine-tuning unfreezes all parameters for multimodal learning in later stages (Bao et al., 2021).
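A minimal sketch of this stagewise freezing, with illustrative parameter names and stage definitions patterned on the description above (the actual VLMo training recipe is more involved):

```python
# Hypothetical parameter groups of an expert-gated shared encoder.
params = {
    "shared.attention": "trainable",
    "expert.vision":    "trainable",
    "expert.language":  "trainable",
    "expert.cross":     "trainable",
}

def set_stage(params, stage):
    """Freeze everything except the groups the current stage trains."""
    train = {
        # Vision pretraining updates shared attention and the vision expert.
        "vision_pretrain":   {"shared.attention", "expert.vision"},
        # Language pretraining updates only the new language expert,
        # keeping shared attention and the vision expert frozen.
        "language_pretrain": {"expert.language"},
        # Joint fine-tuning unfreezes all parameters.
        "multimodal_joint":  set(params),
    }[stage]
    return {name: ("trainable" if name in train else "frozen")
            for name in params}

stage2 = set_stage(params, "language_pretrain")
```

In a real framework the same effect is achieved by toggling each parameter group's gradient flag between stages.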

4. Empirical Performance and Generalization

MoMo-type shared encoder models demonstrate superior or competitive empirical performance across several modalities and benchmarks while being data- and parameter-efficient.

  • Parameter Efficiency: MoMo-Base achieves strong multimodal and unimodal results (e.g., 71.3% multimodal, 83.6% visual average, 81.8% language average) on diverse tasks with 110M parameters, outperforming FLAVA (241M parameters) given considerably less pretraining data (27M vs. 70M pairs) (Chada et al., 2023).
  • Retrieval and Classification: On in-domain and cross-domain medical retrieval, shared encoders with modality features surpass CLIP-style dual encoders (both at parity and with more parameters) in 7/8 and 5/8 tasks, respectively. Their advantage grows as labeled data shrinks (e.g., +81% and +94% Recall@200 over modality-specific baselines at 0.66M training pairs) (Roy et al., 3 Mar 2025).
  • Ablations: Early modality-specific specialization (attached before the shared encoder) consistently yields +2–3 points in retrieval; improper or missing modality identification degrades all performance dimensions (Roy et al., 3 Mar 2025). Simultaneous multimodal-and-unimodal training is critical for balanced task retention (Chada et al., 2023).
  • Scaling: Larger shared encoders yield increasing returns, with MoMo-Large (335M parameters) achieving higher VQA, COCO, and ImageNet scores, and comparable scaling seen in VLMo as switchable experts and data size increase (Chada et al., 2023, Bao et al., 2021).

A summary of key retrieval results across data scales is presented here (metrics: Recall@200; from (Roy et al., 3 Mar 2025)):

| Train Size (M) | Model | Img→Txt@200 | Txt→Img@200 | ΔRel Img→Txt | ΔRel Txt→Img |
|---|---|---|---|---|---|
| 1.74 | Modality-Specific | 0.4643 | 0.4663 | | |
| 1.74 | Shared Encoder | 0.4718 | 0.4721 | +1.62% | +1.24% |
| 0.66 | Modality-Specific | 0.1333 | 0.1220 | | |
| 0.66 | Shared Encoder | 0.2413 | 0.2369 | +81.02% | +94.18% |
| 0.33 | Modality-Specific | 0.0207 | 0.0186 | | |
| 0.33 | Shared Encoder | 0.0242 | 0.0241 | +16.91% | +29.57% |
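As a sanity check, the ΔRel columns follow directly from the raw Recall@200 values, computed as (shared - specific) / specific:

```python
# Recomputing the relative gains from the reported Recall@200 values.
rows = {
    # train size (M): (specific i2t, specific t2i, shared i2t, shared t2i)
    1.74: (0.4643, 0.4663, 0.4718, 0.4721),
    0.66: (0.1333, 0.1220, 0.2413, 0.2369),
    0.33: (0.0207, 0.0186, 0.0242, 0.0241),
}
for size, (spec_i, spec_t, shar_i, shar_t) in rows.items():
    gain_i = 100 * (shar_i - spec_i) / spec_i
    gain_t = 100 * (shar_t - spec_t) / spec_t
    print(f"{size}M pairs: +{gain_i:.2f}% img->txt, +{gain_t:.2f}% txt->img")
```

Note the gains are non-monotonic in data size: the shared encoder's advantage peaks at 0.66M pairs, where the modality-specific baseline degrades sharply.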

5. Design Variants and Extensions

Several architectural enhancements have been introduced:

  • Expert Routing: VLMo’s MoME block deploys separate FFNs for vision, language, and cross-modal inputs, hard-gating tokenwise for efficient yet flexible modality fusion. Bottom layers process tokens by source (vision/language), upper layers converge to joint cross-modal experts for deep interaction (Bao et al., 2021).
  • Modular Encoders for Lifelong Learning: Inspired by advances in modular multilingual NMT, hybrid designs allow fixed shared trunks with plug-and-play per-modality or per-language heads/tails, supporting graceful extension and minimal negative interference (Escolano et al., 2020).
  • Decoder Usage: Certain variants employ shallow decoders during pretraining (mask prediction), which are discarded at fine-tuning, emphasizing efficient representation learning in the encoder (Chada et al., 2023).

Most tested MoMo frameworks currently handle only text and image, but have “intrinsic extensibility” to other data types by learning additional modality IDs or segment tags (Roy et al., 3 Mar 2025, Chada et al., 2023).

6. Advantages, Limitations, and Outlook

The MoMo shared encoder paradigm presents several advantages:

  • Parameter and Compute Efficiency: Single-encoder models are substantially smaller and faster than dual/triple-path systems, leading to lower inference cost and feasible large-batch training (Roy et al., 3 Mar 2025, Chada et al., 2023).
  • Generalization in Low-Data Regimes: Shared inductive biases and cross-modal co-training confer a significant generalization edge, particularly with limited annotated data (Roy et al., 3 Mar 2025).
  • Seamless Modality Integration: Incorporation of new modalities, in principle, is straightforward via dedicated tokens or vectors, allowing extension to non-vision/text domains (Roy et al., 3 Mar 2025, Escolano et al., 2020).

Limitations are also evident:

  • Underfitting Highly Specialized Features: Purely shared backbones may insufficiently model exclusive modality dynamics; early modality-specific layers can partially mitigate this (Roy et al., 3 Mar 2025).
  • Limited Modal Scope: Most current instantiations address only bimodal (vision-language) data; extension to audio, video, or structured modalities remains open (Chada et al., 2023).
  • Curriculum and Masking Schedules: Uniform masking ratios and naive training schedules may not be optimal for every task or modality. Learned or task-adaptive masking curricula could offer gains (Chada et al., 2023).
  • Application Scope: Generative or multi-task settings with a shared encoder are as yet underexplored (Roy et al., 3 Mar 2025).

Empirical results suggest that shared encoder models, when trained with a carefully designed curriculum and appropriate regularization, achieve robust cross-modal performance, competitive with or superior to much larger modular systems (Chada et al., 2023, Bao et al., 2021). The shared encoder research trajectory continues toward richer modality tags, more sophisticated expert gating, and comprehensive multimodal scalability.
