
Multimodal-to-Text Generation Models

Updated 6 February 2026
  • Multimodal-to-text generation models are neural architectures that encode and align heterogeneous signals, transforming images, video, audio, and structured data into natural language.
  • They integrate modality-specific encoders with unified token stream transformers or conditional prefixing to support tasks like image captioning, video dialogue, and citation-grounded summarization.
  • Recent advances emphasize efficient cross-modal fusion, dynamic expert selection, and robust mechanisms for faithful content grounding and attribution.

Multimodal-to-text generation models are neural architectures that map heterogeneous multimodal signals—such as images, video, audio, structured data, or sequences thereof—to natural-language text. Unlike unimodal text generation, these models must simultaneously encode, align, and fuse information from two or more distinct input modalities within a unified generative process. Advances in this area are central to tasks such as image captioning, video-based dialog, music captioning, experience-driven story and lyric generation, multimodal citation-grounded summarization, and more. This article surveys fundamental principles, contemporary architectures, training methodologies, evaluation paradigms, and open challenges for multimodal-to-text generation, referencing recent academic contributions.

1. Foundational Architectures and Modal Alignment

Multimodal-to-text generation models can be organized into several major design paradigms depending on how they process and align inputs:

a. Unified Token Stream Transformers:

Transformer-based models increasingly adopt a joint tokenization strategy, where visual, textual, and, more recently, music or motion inputs are quantized into shared discrete token streams for early fusion. For example, ANOLE (Chern et al., 2024) and the unified model in (Huang et al., 2021) (building on Chameleon and LXMERT/X-LXMERT, respectively) concatenate image and text tokens (with modality markers) into a single autoregressive sequence processed by a standard Transformer decoder. This enables native multimodal-to-text and text-to-image generation via shared embeddings, causal self-attention, and a unified language modeling objective.
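
The early-fusion scheme above can be sketched as follows; the marker token IDs and helper name are hypothetical illustrations, not taken from ANOLE or Chameleon:

```python
# Hypothetical sketch of early fusion into a single token stream.
# BOI/EOI ("begin/end of image") marker token IDs are illustrative.
BOI, EOI = 50001, 50002

def build_token_stream(image_tokens, text_tokens):
    """Concatenate quantized image tokens and text tokens, delimiting the
    image span with modality markers, so a single autoregressive decoder
    can process both modalities as one sequence."""
    return [BOI] + list(image_tokens) + [EOI] + list(text_tokens)

stream = build_token_stream([101, 102, 103], [7, 8, 9])
# stream == [50001, 101, 102, 103, 50002, 7, 8, 9]
```

A standard causal language-modeling loss can then be applied over the whole stream, which is what makes text-to-image and multimodal-to-text generation share one objective.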

b. Modality-specific Encoding and Prefixing:

Other models employ modality-specific encoders (e.g., ResNet for vision, ViT for visual tokens, or audio classifiers), project all encodings into a common language embedding space, and concatenate them as a conditional prefix for a pretrained or jointly trained text decoder. This conditional-prefix approach, exemplified by MAnTiS (Sollami et al., 2021) and the video-centric Vx2Text (Lin et al., 2021), reduces architectural complexity by keeping the downstream decoder unchanged and leveraging language modeling advances directly.
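
A minimal sketch of this prefix-conditioning idea, with a toy linear projection; all names, shapes, and the weight layout are illustrative, not the actual MAnTiS or Vx2Text code:

```python
def project(features, weight):
    """Linearly map one modality feature vector into the decoder's
    embedding dimension (weight is a list of output-dim columns)."""
    return [sum(f * w for f, w in zip(features, col)) for col in weight]

def build_prefix_input(modality_features, weight, text_embeddings):
    """Project each modality vector and prepend the results as a
    conditional prefix to the text token embeddings."""
    prefix = [project(f, weight) for f in modality_features]
    return prefix + text_embeddings

# With an identity projection, the image feature simply leads the sequence:
seq = build_prefix_input([[3, 4]], [[1, 0], [0, 1]], [[5, 6]])
# seq == [[3, 4], [5, 6]]
```

Because only the projection is new, the decoder's weights and decoding loop stay untouched, which is the efficiency argument the paragraph above makes.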

c. Multi-stream and Experience-aware Architectures:

Innovations such as Multi-Modal Experience Inspired AI Creation (MMTG) (Cao et al., 2022) extend the paradigm to ordered, sequential experiences: image–text pairs observed over time are processed by multi-stream encoder–decoders whose cross-modal, temporal, and attention mechanisms explicitly model the dependencies between input sequence order and the generated narrative. The design fuses each experience's latent representation across the output sentences via two-dimensional attention with a temporal regularizer, enabling order-sensitive modeling of narrative flow.

d. Shared Generative-Embedding Models:

Models such as MM-GEM (Ma et al., 2024) demonstrate that a single LLM backbone with lightweight projection adapters and a PoolAggregator can simultaneously solve both embedding and generative (captioning) objectives, including at fine-grained region level, without significant conflict or capacity interference between tasks.

2. Training Objectives and Multimodal Fusion Methods

Most multimodal-to-text generation models pursue autoregressive maximum-likelihood training over the textual output sequence, often conditioned on learned or projected representations of one or more non-text inputs. The following mechanisms are prevalent:

  • Cross-entropy Language Modeling:

The standard loss is the negative log likelihood of the ground-truth tokens given the multimodal context (e.g., Stage 3 GPT-2/StoryGen (Jiang, 2020); text head in TextHarmony (Zhao et al., 2024); Stage 3 T5-style captioning in UniMuMo (Yang et al., 2024)).

Architectures such as MM-GEM (Ma et al., 2024) interleave contrastive embedding losses (image–text, text–image) with cross-entropy captioning loss, yielding a single training signal:

\mathcal{L}_{\text{MM-GEM}} = \mathcal{L}_{\text{emb}} + \mathcal{L}_{\text{gen}}

where embedding and generative heads are decoupled with lightweight adapters but share the main LLM.
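
The combined objective can be sketched with toy scalar losses; the function names are hypothetical, and the contrastive term is a simplified InfoNCE over one row of an image–text similarity matrix rather than MM-GEM's actual implementation:

```python
import math

def cross_entropy(probs, target_idx):
    """Generative loss: negative log-likelihood of the ground-truth token."""
    return -math.log(probs[target_idx])

def contrastive_loss(sim_row, pos_idx, temperature=1.0):
    """Embedding loss: InfoNCE-style softmax over similarities, where
    pos_idx marks the matching image-text pair in the batch."""
    logits = [s / temperature for s in sim_row]
    z = sum(math.exp(l) for l in logits)
    return -math.log(math.exp(logits[pos_idx]) / z)

def combined_loss(probs, target_idx, sim_row, pos_idx):
    """L = L_emb + L_gen, the single training signal described above."""
    return contrastive_loss(sim_row, pos_idx) + cross_entropy(probs, target_idx)
```

With uniform token probabilities over two candidates and two equal similarities, each term reduces to log 2, so the combined loss is 2 log 2.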

  • Attention and Fusion Schemes:

Unified token streams enable early-fusion cross-modal attention (e.g., ANOLE; Huang et al., 2021), while more traditional models use cross-modal fusion via prefixing (MAnTiS; Sollami et al., 2021) or concatenation after projection (Vx2Text). Two-level granularity—combining dense (continuous) and discrete features—enhances semantic coverage (Huang et al., 2021).

  • Dynamic Expert and Modality Selection:

TextHarmony (Zhao et al., 2024) employs Slide-LoRA—multiple parallel low-rank adaptation modules in Transformer layers gated by a dynamically-learned coefficient—to enable parameter-efficient, dynamically adaptable separation of text- and vision-specific generation without sacrificing the benefits of unified modeling.
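
The gating idea behind Slide-LoRA can be sketched as follows, assuming precomputed base-layer and adapter outputs; the function names, shapes, and softmax gate are illustrative simplifications, not TextHarmony's exact routing:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def gated_adapter_output(base_out, adapter_outs, gate_logits):
    """Blend parallel low-rank adapter outputs with a dynamically learned
    gate, then add the blend to the frozen base layer's output."""
    gates = softmax(gate_logits)
    blended = [sum(g * a[i] for g, a in zip(gates, adapter_outs))
               for i in range(len(base_out))]
    return [b + d for b, d in zip(base_out, blended)]
```

The gate logits would be produced per input (e.g., from the hidden state), so a text-heavy input can route mass to text-specialized adapters without touching the shared backbone.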

3. Modalities Addressed: Images, Video, Music, and Beyond

As the field expands, modern architectures move well beyond static image captioning:

  • Video + Audio + Text:

Vx2Text (Lin et al., 2021) processes video frames, raw audio, speech transcripts, and dialog history as streams of tokens for generative video-based dialog and QA. Differentiable tokenization via Gumbel–Softmax enables end-to-end training, with all fusion performed in the language space.
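
A minimal, framework-free sketch of the Gumbel–Softmax relaxation; a real implementation would operate on tensors with gradient tracking (e.g., a deep-learning framework's built-in version), and the helper name here is illustrative:

```python
import math
import random

def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Draw a relaxed (soft) one-hot sample over token categories.
    As temperature -> 0 this approaches a hard discrete token choice
    while remaining differentiable under automatic differentiation."""
    # Gumbel(0, 1) noise via inverse transform sampling.
    gumbels = [-math.log(-math.log(rng.random())) for _ in logits]
    scores = [(l + g) / temperature for l, g in zip(logits, gumbels)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

Each output is a probability vector over the token vocabulary, which is why all downstream fusion can happen "in the language space" as the paragraph above describes.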

  • Sequential Multimodal Experiences:

MMTG (Cao et al., 2022) attacks the problem of generating poetry or lyrics based on a temporally ordered sequence of multimodal experiences, explicitly encoding both sequence and modality dependencies, and employs a curriculum negative sampling strategy for efficient optimization.

  • Music and Motion:

UniMuMo (Yang et al., 2024) generalizes token-based fusion to music and motion. Both music and motion are encoded and quantized via a shared codebook, enabling a single Transformer architecture to process and generate music, motion, and text, including cross-modal captioning (music→text, motion→text).
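
Codebook quantization of a continuous frame can be sketched as a nearest-neighbor lookup; this helper is an illustrative simplification, not UniMuMo's actual tokenizer:

```python
def quantize(vector, codebook):
    """Map a continuous feature frame (e.g., a music or motion embedding)
    to the index of its nearest codebook entry, i.e., its discrete token
    in the shared vocabulary."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))

token = quantize([0.9, 0.1], [[0.0, 0.0], [1.0, 0.0]])
# token == 1 (the frame is closest to the second codebook entry)
```

Because music and motion share one codebook, their token indices live in the same vocabulary as text tokens after embedding, letting a single Transformer attend across all three.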

  • Document/Scientific Multimodal:

MAnTiS (Sollami et al., 2021) and MCiteBench (Hu et al., 4 Mar 2025) bring multimodal-to-text generation to complex contexts such as product descriptions (image+title) and scientific discourse (image, table, paragraph, etc. with citation grounding).

4. Controllability, Faithfulness, and Attribution

Robust multimodal-to-text generation must provide mechanisms for both controllability and faithfulness:

  • Plug-and-play Multimodal Controllability:

ZeroGen (Tu et al., 2023) introduces a zero-shot paradigm wherein both token-level (textual oracle, e.g. GloVe similarity) and sentence-level (multimodal oracle, e.g. CLIP image-text alignment) controls are imposed directly at decoding, without additional training. Dynamic control weights at each time step calibrate the influence of content-specific and modality-specific signals, yielding both high-quality captioning and attribute-controlled news generation.
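
Decoding-time control of this kind can be sketched as a per-step logit adjustment; the helpers below are hypothetical simplifications of ZeroGen's oracles, which in practice score tokens with GloVe or CLIP similarities:

```python
def controlled_logits(lm_logits, control_scores, weight):
    """Fuse the base LM's logits with oracle control scores for one
    decoding step; `weight` is the dynamic control strength."""
    return [l + weight * c for l, c in zip(lm_logits, control_scores)]

def greedy_step(lm_logits, control_scores, weight):
    """Pick the next token index after applying the control signal."""
    scores = controlled_logits(lm_logits, control_scores, weight)
    return max(range(len(scores)), key=scores.__getitem__)

# With the control active, the oracle-preferred token wins:
assert greedy_step([2.0, 1.0], [0.0, 2.0], weight=1.0) == 1
# With weight 0, decoding falls back to the unmodified LM:
assert greedy_step([2.0, 1.0], [0.0, 2.0], weight=0.0) == 0
```

Since no parameters are updated, the same base model serves every control condition, which is what makes the approach zero-shot.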

  • Faithful and Salient Generation:

A hybrid critic–generator framework (Hashem et al., 2024) augments LMMs with a vision critic (BLIP-2 + LoRA) that categorizes draft text features as salient, non-salient, or hallucinated, pruning non-grounded content and appending missing salient features in post-editing. This yields significant improvements in BLEU, METEOR, ROUGE, BERTScore, and CLIPScore, and is essential for domains requiring faithful grounded descriptions, such as advertising.

  • Citation and Attribution in Multimodal Scientific QA:

MCiteBench (Hu et al., 4 Mar 2025) evaluates MLLMs' abilities to generate answer text with fine-grained, per-sentence citations to multimodal evidence (text, figures, tables). Analyses reveal a strong textual modality bias; even large models under-attend to visual content when assigning citations, resulting in lower attribution fidelity for visual versus textual evidence.

5. Datasets, Evaluation, and Benchmarking

Model capabilities are measured on diverse datasets and by a range of metrics:

  • Generation Benchmarks:

MS-COCO (image captioning, BLEU-4, CIDEr, SPICE) (Huang et al., 2021, Ma et al., 2024), Flickr8K/Flickr30k (Jiang, 2020, Ma et al., 2024), MusicQA (Yang et al., 2024), HumanML3D (motion captioning) (Yang et al., 2024), FACAD (fashion captioning) (Sollami et al., 2021).

  • Faithfulness and Saliency:

CLIPScore (image–text alignment), BERTScore, BLEU, METEOR, ROUGE, custom saliency matching (Hashem et al., 2024).

  • Citation Quality:

MCiteBench measures Citation F₁, Source Reliability (F₁, Exact Match), and LLM-judged answer accuracy. Experience-aware lyric/story generation models are further evaluated by metrics capturing output diversity and order-sensitivity (NNR), and by detailed human relevance/coherence/meaning/overall panel ratings (Cao et al., 2022).
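
A set-overlap reading of per-sentence citation F₁ can be sketched as follows; this is a simplified illustration, and MCiteBench's exact scoring protocol may differ:

```python
def citation_f1(predicted, gold):
    """F1 between the evidence IDs a sentence cites and the gold set."""
    pred, gold = set(predicted), set(gold)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Citing one correct figure plus one spurious table: precision 0.5, recall 1.0.
score = citation_f1(["fig1", "tab2"], ["fig1"])
# score == 2/3
```

Averaging this over all answer sentences, separately for textual and visual evidence, is one way to expose the modality bias the benchmark reports.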

  • Region-level Fine-grained Evaluation:

MM-GEM (Ma et al., 2024) supports region-level captioning and retrieval, measured by Recall@1 on DCI and R-Precision (for HumanML3D).

6. Practical Considerations, Limitations, and Future Directions

Parameter and Computational Efficiency:

Recent models demonstrate parameter efficiency via modular adapters (Slide-LoRA (Zhao et al., 2024)), head-only fine-tuning (ANOLE (Chern et al., 2024)), and multi-stage minimal retraining for both global and region-level tasks (MM-GEM (Ma et al., 2024)).

Limitations:

  • Modular, multi-stage pipelines (as in StoryGen (Jiang, 2020)) are not end-to-end differentiable, leading to error propagation.
  • Token-based models can struggle with fine spatial or temporal alignment (either visual or in music/motion streams (Yang et al., 2024)).
  • Faithful grounding and citation in the presence of numerous distractors and modality-mismatch remains unsolved (MCiteBench (Hu et al., 4 Mar 2025)).
  • Visual-text fusion is still subject to textual grounding bias (Hu et al., 4 Mar 2025).


7. Comparative Summary and Research Outlook

The table below summarizes representative models and distinguishing features:

| Model | Modalities | Fusion/Tokenization | Faithfulness/Control | Evaluation Highlights |
| --- | --- | --- | --- | --- |
| ANOLE (Chern et al., 2024) | Image, Text | Shared token stream | Head-only image tuning | Qualitative, interleaved gen. |
| MM-GEM (Ma et al., 2024) | Image, Text | PoolAggregator + shared LLM | Embedding + gen. dual objective | CIDEr 110.9 (COCO), region R@1 |
| Vx2Text (Lin et al., 2021) | Video, Audio, Text | Differentiable tokenizer | Joint caption/QA/dialog | SOTA, AVSD/TVC/TVQA |
| TextHarmony (Zhao et al., 2024) | Image, Text | Slide-LoRA, shared LLM | Dynamic MoE expert gating | OCRBench, NED, FID, CLIP |
| ZeroGen (Tu et al., 2023) | Image, Text | Decoding-time oracles | Plug-and-play control | B@4 15.5, CIDEr 55.4 |
| MMTG (Cao et al., 2022) | Seq. image+text | 2D cross-modal attention | Curriculum negative sampling | BLEU-2 0.076, BERTScore 0.595 |
| UniMuMo (Yang et al., 2024) | Music, Motion, Text | Shared codebook + Transformer | Parallel music–motion gen. | BLEU@1 0.261, MMDist 2.958 |
| MCiteBench (Hu et al., 4 Mar 2025) | Text, Figure, Table | Multimodal encoding | Citation F₁, Source EM | C-F₁ ∼0.84 (text), EM <0.1 (fig.) |

Ongoing research emphasizes the need for unified architectures offering flexible, faithful, and controllable multimodal-to-text generation, robust to modality imbalance and capable of fine-grained grounding with attribution, as well as end-to-end differentiable training and efficient fine-tuning. Emerging benchmarks—such as MCiteBench—will be critical for rigorous, modality-aware evaluation and for driving progress on fine-grained grounding in real-world scientific and technical applications.
