
MoMo: Shared Encoder for Multimodal Fusion

Updated 5 February 2026
  • The paper introduces a novel shared Transformer encoder that jointly processes text and images, achieving parameter-efficient multimodal integration.
  • MoMo employs modality tokens/features and a stagewise curriculum with masked, contrastive, and matching losses to optimize cross-modal learning.
  • Empirical results show that MoMo outperforms traditional dual-encoder systems, with notable improvements such as +81% and +94% recall gains in low-data regimes.

A shared encoder model, often referred to as "MoMo", denotes a class of architectures in which a single parameterized Transformer encoder is used to process multiple input modalities (notably text and image) jointly. The MoMo paradigm contrasts with classical dual-encoder systems that use independent networks per modality, instead seeking parameter efficiency, improved cross-modal generalization, and robust scaling behavior in data-limited regimes. Several variants have been proposed, distinguished by their modality-integration method, pretraining protocols, and task scope (Roy et al., 3 Mar 2025, Chada et al., 2023, Bao et al., 2021).

1. Unified Architecture and Modality Integration

A MoMo model is defined by its commitment to a single, shared encoder backbone. For multimodal input $x_M$ associated with modality $M$ (e.g., $M \in \{\text{text}, \text{image}\}$), the data is first tokenized: text is segmented into word pieces, images into non-overlapping patches (Roy et al., 3 Mar 2025, Chada et al., 2023). Each token is projected to a vector embedding, typically $e_M \in \mathbb{R}^{s \times d}$.
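The image-patch tokenization is easy to make concrete. Below is a minimal NumPy sketch of splitting an image into flattened, non-overlapping patch tokens; the patch size of 16 and 224×224 resolution are common ViT-style defaults assumed here for illustration, not values taken from the papers:

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened non-overlapping patch tokens."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image must tile evenly"
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (num_patches, p*p*C)
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, patch * patch * C)

# A 224x224 RGB image with 16x16 patches yields 196 tokens of dimension 768.
tokens = patchify(np.zeros((224, 224, 3)), patch=16)
print(tokens.shape)  # (196, 768)
```

Each flattened patch would then be linearly projected to the model dimension $d$, just as word pieces are embedded on the text side.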

To inform the encoder about modality, two main mechanisms are used:

  • Modality Feature Vectors: A learnable vector $v_M \in \mathbb{R}^{d_\text{mod}}$ is concatenated to each token's embedding, producing $h_M^0 = [\,[e_M^1; v_M], \ldots, [e_M^s; v_M]\,]$ and yielding sequence inputs of shape $s \times (d + d_\text{mod})$ (Roy et al., 3 Mar 2025).
  • Modality Tokens: A special token $m_M \in \mathbb{R}^d$ is prepended, resulting in $h_M^0 = [m_M, e_M^1, \ldots, e_M^s]$ and an input of length $s + 1$ (Roy et al., 3 Mar 2025, Chada et al., 2023).

The shared Transformer encoder stack $E$ with $L$ identical layers processes $h_M^0$ regardless of modality. The output at the "[CLS]" position, $z_M = E^L(h_M^0)[\text{CLS}]$, is used as a pooled representation for downstream objectives.
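The modality-token variant and the pooled readout can be sketched as follows. This is an illustrative NumPy toy, not code from the papers: a single shared weight matrix stands in for the full Transformer stack $E$, and the initialization is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # embedding dimension (toy value)

# One learnable modality token per modality (hypothetical initialization).
modality_tokens = {"text": rng.normal(size=d), "image": rng.normal(size=d)}

def add_modality_token(embeddings: np.ndarray, modality: str) -> np.ndarray:
    """Prepend the modality token m_M, giving a sequence of length s + 1."""
    m = modality_tokens[modality][None, :]          # (1, d)
    return np.concatenate([m, embeddings], axis=0)  # (s + 1, d)

# Stand-in for the shared encoder E: the SAME weights process both
# modalities (a real model would be a multi-layer Transformer).
W = rng.normal(size=(d, d))
def shared_encoder(h0: np.ndarray) -> np.ndarray:
    return np.tanh(h0 @ W)

text_emb = rng.normal(size=(5, d))   # s = 5 word-piece embeddings
h0 = add_modality_token(text_emb, "text")
z = shared_encoder(h0)[0]            # pooled representation at position 0
print(h0.shape, z.shape)             # (6, 8) (8,)
```

The modality-feature-vector variant differs only in concatenating $v_M$ along the feature axis of every token instead of prepending a token.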

An optional architectural refinement is to wrap the shared encoder with shallow modality-specific layers (either before or after the shared stack), introducing a limited degree of specialization which can further improve retrieval or classification accuracy (Roy et al., 3 Mar 2025).

A third approach—embodied in the Mixture-of-Modality-Experts (MoME/MoMo) block—routes each token through a modality-specific feedforward sub-network (expert), with hard-gating based on the sequence type and Transformer depth. Lower layers use pure vision or language experts depending on token origin; upper layers can switch to a cross-modal expert for deep fusion (Bao et al., 2021).
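The hard-gated expert routing can be sketched as follows, with illustrative expert names and a simple depth threshold; in the real MoME block these experts replace the feed-forward sub-layer inside each Transformer block, alongside shared self-attention.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8  # toy model dimension

def ffn(W1, W2, x):
    """A two-layer feed-forward expert with ReLU."""
    return np.maximum(x @ W1, 0.0) @ W2

# Separate expert weights for vision, language, and cross-modal tokens
# (hypothetical shapes: hidden width 4*d is a common Transformer default).
experts = {
    name: (rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d)))
    for name in ("vision", "language", "cross")
}

def mome_layer(tokens: np.ndarray, modality: str, layer: int, n_layers: int) -> np.ndarray:
    """Hard-gate tokens to one expert by sequence type and depth:
    lower layers use the modality-specific expert, top layers the
    cross-modal expert for deep fusion."""
    name = "cross" if layer >= n_layers - 2 else modality
    W1, W2 = experts[name]
    return ffn(W1, W2, tokens)

x = rng.normal(size=(5, d))
out_low = mome_layer(x, "vision", layer=0, n_layers=12)   # vision expert
out_top = mome_layer(x, "vision", layer=11, n_layers=12)  # cross-modal expert
```

The gating is "hard" in that each token deterministically visits exactly one expert per layer, so routing adds no extra compute relative to a plain feed-forward sub-layer.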

2. Pretraining Objectives and Optimization

Shared encoder models employ a composite training regime integrating several losses:

  • Masked Modeling Losses: Masked image modeling (MIM) with $\ell_2$ pixel reconstruction over random patch subsets, and masked language modeling (MLM) with cross-entropy over masked word pieces. These objectives promote unimodal representation robustness (Chada et al., 2023, Bao et al., 2021).
  • Cross-Modal Masked Modeling: In the context of concatenated image and text, tokens or patches from both modalities are masked simultaneously and reconstructed, enforcing joint-contextual learning (Chada et al., 2023).
  • Contrastive Loss: With paired (image, text) data, dual projections $z_I$, $z_T$ are trained via a symmetric contrastive loss. The canonical form is:

$$\mathcal{L}_\text{con} = -\frac{1}{N}\sum_{i=1}^{N} \left[ \log\frac{\exp(\langle z_I^i, z_T^i \rangle / \tau)}{\sum_{j=1}^{N} \exp(\langle z_I^i, z_T^j \rangle / \tau)} + \log\frac{\exp(\langle z_T^i, z_I^i \rangle / \tau)}{\sum_{j=1}^{N} \exp(\langle z_T^i, z_I^j \rangle / \tau)} \right]$$

where $\tau$ is a trainable temperature (Roy et al., 3 Mar 2025, Chada et al., 2023, Bao et al., 2021).

  • Image–Text Matching Loss: A binary classifier is applied to the [CLS] embedding of the joint image–text input to distinguish matched from hard-negative pairs (Chada et al., 2023, Bao et al., 2021).
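Both pair-level objectives can be sketched in NumPy. This is an illustrative implementation rather than the papers' code: here $\tau$ is fixed instead of learnable, and the matching head is a single logistic unit rather than a full classifier head.

```python
import numpy as np

def symmetric_contrastive_loss(z_i, z_t, tau=0.07):
    """Symmetric InfoNCE over N paired (image, text) projections: matched
    pairs sit on the diagonal of the N x N similarity matrix."""
    z_i = z_i / np.linalg.norm(z_i, axis=1, keepdims=True)
    z_t = z_t / np.linalg.norm(z_t, axis=1, keepdims=True)
    logits = z_i @ z_t.T / tau
    n = logits.shape[0]

    def diag_xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_prob = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_prob[np.arange(n), np.arange(n)]

    # Image->text term (rows) plus text->image term (columns), averaged over N.
    return float((diag_xent(logits) + diag_xent(logits.T)).mean())

def itm_loss(z_cls, labels, w, b=0.0):
    """Binary image-text matching loss on pooled [CLS] embeddings:
    label 1 for a true pair, 0 for a hard negative."""
    p = 1.0 / (1.0 + np.exp(-(z_cls @ w + b)))
    eps = 1e-9
    return float(-(labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps)).mean())

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 16))
aligned = symmetric_contrastive_loss(z, z)         # perfectly matched pairs
shuffled = symmetric_contrastive_loss(z, z[::-1])  # mismatched pairs
```

As a sanity check, perfectly aligned pairs yield a much lower contrastive loss than mismatched ones, since the matched similarity then dominates each softmax.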

During pretraining, these objectives may be phased in a curriculum (pure vision, then vision+text, then joint multimodal), with a gradient accumulation scheme for simultaneous optimization across modalities to prevent catastrophic forgetting (Chada et al., 2023, Bao et al., 2021).

Optimization is typically performed with AdamW, using large-scale batches when allowed by compute, heavy data augmentation, mixed-precision arithmetic, and learning rate schedules with warm-up and cosine decay (Roy et al., 3 Mar 2025, Chada et al., 2023).

3. Training Procedures and Stagewise Curriculum

Shared encoder models are pretrained in multiple successive stages, each warm-started from the previous:

  1. Image-Only Stage: Masked patch prediction on large image collections (e.g., ImageNet-1K, PMC-OA) for several epochs, solidifying visual representations (Chada et al., 2023, Roy et al., 3 Mar 2025, Bao et al., 2021).
  2. Unimodal Joint Stage: Simultaneous but modality-separated MIM and MLM, using separate batches or segment embeddings, warm-started from the vision backbone (Chada et al., 2023).
  3. Multimodal Joint Stage: Masked modeling and contrastive/matching losses computed over paired and unpaired text–image data. Random mixing and gradient accumulation are used at each step to preserve knowledge from all modalities (Chada et al., 2023, Bao et al., 2021).
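The gradient-accumulation scheme in the multimodal stage can be illustrated with a toy objective; the squared-error gradient and all names below are purely illustrative stand-ins for per-modality losses on the shared parameters.

```python
import numpy as np

rng = np.random.default_rng(4)
w = rng.normal(size=4)  # stand-in for the shared encoder's parameters

def grad_for_batch(w, batch):
    """Stand-in per-modality gradient (here: gradient of a squared error
    pulling w toward the batch mean)."""
    return 2.0 * (w - batch.mean(axis=0))

def accumulated_step(w, batches, lr=0.1):
    """One optimizer step from gradients accumulated across a mixed set of
    modality batches (e.g. image-only, text-only, paired), so no single
    modality's update overwrites the others within a step."""
    g = np.zeros_like(w)
    for batch in batches:
        g += grad_for_batch(w, batch)
    return w - lr * g / len(batches)

batches = [rng.normal(size=(8, 4)) for _ in range(3)]  # three modality batches
w_new = accumulated_step(w, batches)
```

Accumulating before stepping is equivalent to averaging the per-modality gradients, which is what mitigates catastrophic forgetting relative to alternating single-modality updates.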

In variants with expert gating (e.g., VLMo), parameters of the newly-introduced language (or vision) experts are first frozen and only updated for their dedicated modality during their stage. Joint fine-tuning unfreezes all parameters for multimodal learning in later stages (Bao et al., 2021).
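A minimal sketch of this stagewise freezing, with illustrative parameter names and stage definitions patterned on the description above (the actual VLMo training recipe is more involved):

```python
# Hypothetical parameter groups of an expert-gated shared encoder.
params = {
    "shared.attention": "trainable",
    "expert.vision":    "trainable",
    "expert.language":  "trainable",
    "expert.cross":     "trainable",
}

def set_stage(params, stage):
    """Freeze everything except the groups the current stage trains."""
    train = {
        # Vision pretraining updates shared attention and the vision expert.
        "vision_pretrain":   {"shared.attention", "expert.vision"},
        # Language pretraining updates only the new language expert,
        # keeping shared attention and the vision expert frozen.
        "language_pretrain": {"expert.language"},
        # Joint fine-tuning unfreezes all parameters.
        "multimodal_joint":  set(params),
    }[stage]
    return {name: ("trainable" if name in train else "frozen")
            for name in params}

stage2 = set_stage(params, "language_pretrain")
```

In a real framework the same effect is achieved by toggling each parameter group's gradient flag between stages.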

4. Empirical Performance and Generalization

MoMo-type shared encoder models demonstrate superior or competitive empirical performance across several modalities and benchmarks while being data- and parameter-efficient.

  • Parameter Efficiency: MoMo-Base achieves strong multimodal and unimodal results (e.g., 71.3% multimodal, 83.6% visual average, 81.8% language average) on diverse tasks with 110M parameters, outperforming FLAVA (241M parameters) given considerably less pretraining data (27M vs. 70M pairs) (Chada et al., 2023).
  • Retrieval and Classification: On in-domain and cross-domain medical retrieval, shared encoders with modality features surpass CLIP-style dual encoders (both at parity and with more parameters) in 7/8 and 5/8 tasks, respectively. Their advantage grows as labeled data shrinks (e.g., +81% and +94% Recall@200 over modality-specific baselines at 0.66M training pairs) (Roy et al., 3 Mar 2025).
  • Ablations: Early modality-specific specialization (attached before the shared encoder) consistently yields +2–3 points in retrieval; improper or missing modality identification degrades all performance dimensions (Roy et al., 3 Mar 2025). Simultaneous multimodal-and-unimodal training is critical for balanced task retention (Chada et al., 2023).
  • Scaling: Larger shared encoders yield increasing returns, with MoMo-Large (335M parameters) achieving higher VQA, COCO, and ImageNet scores, and comparable scaling seen in VLMo as switchable experts and data size increase (Chada et al., 2023, Bao et al., 2021).

A summary of key retrieval results across data scales is presented here (metrics: Recall@200; from (Roy et al., 3 Mar 2025)):

| Train Size (M) | Model | Img→Txt@200 | Txt→Img@200 | ΔRel Img→Txt | ΔRel Txt→Img |
|---|---|---|---|---|---|
| 1.74 | Modality-Specific | 0.4643 | 0.4663 | | |
| 1.74 | Shared Encoder | 0.4718 | 0.4721 | +1.62% | +1.24% |
| 0.66 | Modality-Specific | 0.1333 | 0.1220 | | |
| 0.66 | Shared Encoder | 0.2413 | 0.2369 | +81.02% | +94.18% |
| 0.33 | Modality-Specific | 0.0207 | 0.0186 | | |
| 0.33 | Shared Encoder | 0.0242 | 0.0241 | +16.91% | +29.57% |
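As a sanity check, the ΔRel columns follow directly from the raw Recall@200 values, computed as (shared - specific) / specific:

```python
# Recomputing the relative gains from the reported Recall@200 values.
rows = {
    # train size (M): (specific i2t, specific t2i, shared i2t, shared t2i)
    1.74: (0.4643, 0.4663, 0.4718, 0.4721),
    0.66: (0.1333, 0.1220, 0.2413, 0.2369),
    0.33: (0.0207, 0.0186, 0.0242, 0.0241),
}
for size, (spec_i, spec_t, shar_i, shar_t) in rows.items():
    gain_i = 100 * (shar_i - spec_i) / spec_i
    gain_t = 100 * (shar_t - spec_t) / spec_t
    print(f"{size}M pairs: +{gain_i:.2f}% img->txt, +{gain_t:.2f}% txt->img")
```

Note the gains are non-monotonic in data size: the shared encoder's advantage peaks at 0.66M pairs, where the modality-specific baseline degrades sharply.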

5. Design Variants and Extensions

Several architectural enhancements have been introduced:

  • Expert Routing: VLMo’s MoME block deploys separate FFNs for vision, language, and cross-modal inputs, hard-gating tokenwise for efficient yet flexible modality fusion. Bottom layers process tokens by source (vision/language), upper layers converge to joint cross-modal experts for deep interaction (Bao et al., 2021).
  • Modular Encoders for Lifelong Learning: Inspired by advances in modular multilingual NMT, hybrid designs allow fixed shared trunks with plug-and-play per-modality or per-language heads/tails, supporting graceful extension and minimal negative interference (Escolano et al., 2020).
  • Decoder Usage: Certain variants employ shallow decoders during pretraining (mask prediction), which are discarded at fine-tuning, emphasizing efficient representation learning in the encoder (Chada et al., 2023).

Most tested MoMo frameworks currently handle only text and image, but have “intrinsic extensibility” to other data types by learning additional modality IDs or segment tags (Roy et al., 3 Mar 2025, Chada et al., 2023).

6. Advantages, Limitations, and Outlook

The MoMo shared encoder paradigm presents several advantages:

  • Parameter and Compute Efficiency: Single-encoder models are substantially smaller and faster than dual/triple-path systems, leading to lower inference cost and feasible large-batch training (Roy et al., 3 Mar 2025, Chada et al., 2023).
  • Generalization in Low-Data Regimes: Shared inductive biases and cross-modal co-training confer a significant generalization edge, particularly with limited annotated data (Roy et al., 3 Mar 2025).
  • Seamless Modality Integration: Incorporation of new modalities, in principle, is straightforward via dedicated tokens or vectors, allowing extension to non-vision/text domains (Roy et al., 3 Mar 2025, Escolano et al., 2020).

Limitations are also evident:

  • Underfitting Highly Specialized Features: Purely shared backbones may insufficiently model exclusive modality dynamics; early modality-specific layers can partially mitigate this (Roy et al., 3 Mar 2025).
  • Limited Modal Scope: Most current instantiations address only bimodal (vision-language) data; extension to audio, video, or structured modalities remains open (Chada et al., 2023).
  • Curriculum and Masking Schedules: Uniform masking ratios and naive training schedules may not be optimal for every task or modality. Learned or task-adaptive masking curricula could offer gains (Chada et al., 2023).
  • Application Scope: Generative or multi-task settings with a shared encoder are as yet underexplored (Roy et al., 3 Mar 2025).

Empirical results suggest that shared encoder models, when trained with a carefully designed curriculum and appropriate regularization, achieve robust cross-modal performance, competitive with or superior to much larger modular systems (Chada et al., 2023, Bao et al., 2021). The shared encoder research trajectory continues toward richer modality tags, more sophisticated expert gating, and comprehensive multimodal scalability.
