A Survey on Large Language Models in Multimodal Recommender Systems

Published 14 May 2025 in cs.IR and cs.CL (arXiv:2505.09777v1)

Abstract: Multimodal recommender systems (MRS) integrate heterogeneous user and item data, such as text, images, and structured information, to enhance recommendation performance. The emergence of LLMs introduces new opportunities for MRS by enabling semantic reasoning, in-context learning, and dynamic input handling. Compared to earlier pre-trained LLMs (PLMs), LLMs offer greater flexibility and generalisation capabilities but also introduce challenges related to scalability and model accessibility. This survey presents a comprehensive review of recent work at the intersection of LLMs and MRS, focusing on prompting strategies, fine-tuning methods, and data adaptation techniques. We propose a novel taxonomy to characterise integration patterns, identify transferable techniques from related recommendation domains, provide an overview of evaluation metrics and datasets, and point to possible future directions. We aim to clarify the emerging role of LLMs in multimodal recommendation and support future research in this rapidly evolving field.

Summary

  • The paper presents a taxonomy of LLM integration mechanisms—prompting, training adaptation, and modality fusion—that overcome traditional recommender system challenges.
  • It details key prompting strategies (hard, soft, hybrid, and control logic) to enable adaptive, efficient multi-task recommendations.
  • The survey highlights empirical advances in semantic reasoning, cross-modal alignment, and LLM-based evaluation, driving future research directions.

LLMs in Multimodal Recommender Systems: A Comprehensive Survey

Introduction and Motivation

The paper "A Survey on LLMs in Multimodal Recommender Systems" (2505.09777) offers an exhaustive review of how LLMs—including models like GPT-3, PaLM, and LLaMA—are reshaping the landscape of Multimodal Recommender Systems (MRS). MRS are characterized by the integration of heterogeneous data types such as text, images, structured data, and behavioral logs to enhance personalized recommendation pipelines. Traditional approaches, notably collaborative filtering and modality-specific neural encoders, are often hindered by common issues—cold-start, data sparsity, and modality misalignment. The increasing prevalence and flexibility of LLMs enable semantic reasoning, in-context learning, and adaptive input handling, promising significant improvements in these domains.

Unlike prior surveys focused mainly on encoder architectures or loss-based fusion, this work introduces a taxonomy centered around LLM-specific integration mechanisms: prompting, training strategies, and data adaptation. This survey prioritizes novel LLM-driven paradigms and includes methods from related recommendation branches such as sequential and knowledge-aware recommendation, thus broadening the scope and applicability.

Prompting Strategies in LLM-Based MRS

Prompting is highlighted as a primary interface for LLMs in MRS, facilitating adaptation and rapid task transfer without retraining. The paper identifies multiple prompting strategies:

  • Hard Prompting: Employs fixed template instructions crafted via domain expertise. These are generally interpretable and suitable for black-box LLMs but are fragile to phrasing variations and scale poorly for complex multimodal inputs.
  • Soft Prompting: Involves injection of trainable, continuous embeddings (soft tokens) into the model input, enabling task-specific conditioning while keeping the base model weights frozen. This paradigm supports parameter-efficient adaptation and can encode semantic user/item IDs.
  • Hybrid Prompting: Combines structured hard prompts with learned soft prompts, balancing interpretability and adaptability, effective for tasks requiring both personalisation and domain alignment.
  • Control Logic Prompting: Implements multi-step, branching, or conditional flows to coordinate complex LLM reasoning across heterogeneous tasks (feature extraction, ranking, KG construction, etc.).

    Figure 1: Classification of prompting strategies for LLMs in multimodal recommender systems.

These strategies are pivotal for addressing cold-start and multi-task scenarios, reducing infrastructure overhead, and supporting zero/few-shot learning.
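As an illustrative sketch (not code from the paper), the hard and hybrid strategies can be pictured as template construction; the `[SOFT_i]` markers stand in for positions where a model runtime would inject learned continuous embeddings rather than literal tokens. All template wording and function names here are assumptions for illustration.

```python
# Illustrative sketch: a hard prompt built from a fixed template, and a
# hybrid prompt that reserves placeholder slots for learned soft tokens.

HARD_TEMPLATE = (
    "You are a recommender. The user recently interacted with: {history}. "
    "Candidate items: {candidates}. "
    "Rank the candidates by how well they match the user's preferences."
)

def build_hard_prompt(history, candidates):
    """Fill the fixed template with user history and candidate items."""
    return HARD_TEMPLATE.format(
        history="; ".join(history),
        candidates="; ".join(candidates),
    )

def build_hybrid_prompt(history, candidates, n_soft_tokens=4):
    """Prepend placeholder soft-token markers; a trainable embedding layer
    would replace each [SOFT_i] with a learned continuous vector."""
    soft_prefix = " ".join(f"[SOFT_{i}]" for i in range(n_soft_tokens))
    return soft_prefix + " " + build_hard_prompt(history, candidates)

prompt = build_hybrid_prompt(
    ["red running shoes", "sports socks"],
    ["trail sneakers", "leather boots"],
)
```

The hard template stays interpretable and works against black-box APIs, while the soft prefix is what gets optimised during parameter-efficient adaptation.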

Training and Adaptation Protocols

The survey presents diverse training protocols developed to mitigate computational and accessibility constraints posed by LLMs:

  • Parameter Adaptation: Fine-tuning subsets of LLM parameters or external networks using adapters (e.g., LoRA, QLoRA), instruction tuning, or hybrid pipelines. These methods provide high flexibility but require access to model weights.
  • Zero-Tuning Usage: Utilizes frozen LLMs in encoder mode or as black-box generators via prompting, facilitating profile generation, attribute extraction, or content enrichment with minimal adaptation.
  • Pretrained MLLMs: Leverages Multimodal LLMs (MLLMs) in frozen or lightly tuned states to process and integrate text, image, and structured inputs efficiently.
  • Agent-Based Approaches: Instantiates LLMs as autonomous agents capable of reasoning, planning, and interacting with tools or external environments (often via structured and recursive prompting).

    Figure 2: Overview of training strategies for adapting LLMs in multimodal recommender systems.

These adaptation routes improve scalability, support dynamic input handling, and can reduce inference latency.
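The core arithmetic behind adapter methods such as LoRA can be sketched in a few lines: the frozen base weight is augmented with a trainable low-rank update, so only a small fraction of parameters is tuned. This is a minimal NumPy sketch of the idea, with illustrative dimensions, not an implementation from the paper or any specific library.

```python
import numpy as np

# Low-rank adaptation sketch: y = W0 x + (alpha / r) * B A x.
# W0 stays frozen; only A and B (r * (d_in + d_out) parameters) are trained,
# instead of the full d_in * d_out matrix.

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 32, 4, 8

W0 = rng.normal(size=(d_out, d_in))         # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))  # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection (zero init)

def adapted_forward(x):
    """Base path plus low-rank update."""
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B initialised to zero, the adapter starts as an exact no-op,
# so fine-tuning begins from the pretrained model's behaviour:
assert np.allclose(adapted_forward(x), W0 @ x)
```

The zero-initialised up-projection is what makes this safe to bolt onto a frozen LLM: at step zero the adapted model is identical to the base model.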

Data Type Adaptation and Modality Integration

Efficient conversion and alignment of non-text modalities for LLM consumption is critical. The paper categorizes adaptation techniques as follows:

  • Knowledge Graph (KG) Conversion: Transforms heterogeneous data into structured KG triplets, enabling semantic alignment and improved interpretability.
  • Semantic ID Conversion: Maps IDs to semantic token spaces via learned embeddings or natural language prompts, enhancing generalization and cold-start resilience.
  • Tabular-to-Text & Image Summarization: Converts structured/tabular data and image content to textual prompts or summaries, facilitating compatibility with text-only LLMs.
  • Behavioral Summarization: Reformulates user interaction histories into natural language, supporting preference modeling and downstream reasoning.
  • Prompt-Based & Adapted Multimodal Fusion: Aligns heterogeneous inputs via prompt engineering or fused embedding projections, enabling efficient multimodal reasoning.
  • Code-like Structural Conversion: Serializes inputs in formats like JSON or Python objects, offering a human-readable and programmatically parsable interface.

    Figure 3: Strategies for adapting diverse modalities for integration with LLMs.

This taxonomy is crucial for enabling cross-modal learning without extensive architectural changes.
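Two of the adaptation routes above, tabular-to-text conversion and code-like structural conversion, can be sketched on the same item record. The field names and templates below are hypothetical examples, chosen only to illustrate the two serialisation styles.

```python
import json

# Sketch: the same tabular item row rendered two ways for LLM consumption.

item = {"item_id": 42, "title": "Trail Sneakers", "category": "Footwear",
        "price": 89.99, "rating": 4.6}

def tabular_to_text(row):
    """Render a structured item row as a natural-language sentence."""
    return (f"{row['title']} is a {row['category']} item priced at "
            f"${row['price']:.2f} with an average rating of {row['rating']}.")

def to_structured_prompt(row):
    """Serialise the same row as JSON for code-like structural prompting."""
    return json.dumps(row, sort_keys=True)

text_view = tabular_to_text(item)
json_view = to_structured_prompt(item)
```

The textual rendering suits text-only LLMs and reads naturally in a prompt, while the JSON form keeps field boundaries explicit and remains machine-parsable on the way back out of the model.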

Disentangle, Alignment, and Fusion Mechanisms

The survey analyzes fundamental MRS methods through the lens of LLMs:

  • Disentanglement: Explicitly separates modality-specific and shared information in representation spaces, using contrastive learning, attention mechanisms, clustering, VAE-based methods, and architectural decompositions.

    Figure 4: Disentanglement mechanisms and architectural categories.

  • Alignment: Projects heterogeneous modality representations into a joint semantic space, employing contrastive losses, variational inference, attention, adapter/projection networks, post-hoc refinement, RL-enhanced policies, and NLP-based instruction tuning.

    Figure 5: Classification and mechanisms of alignment strategies in multimodal systems.

  • Fusion: Integrates modalities at different stages—early, intermediate, or late—often leveraging attention/co-attention or gating mechanisms, balancing modality independence with global aggregation.

This holistic treatment clarifies the synergy between LLM-driven prompting, adaptation, and modality integration, highlighting trade-offs in interpretability, resource demand, and performance.
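The contrastive losses used for alignment can be made concrete with a small InfoNCE-style sketch: matched text/image embedding pairs are pulled together in the joint space, mismatched pairs pushed apart. This is a generic illustration in NumPy under assumed embedding shapes, not the specific loss of any surveyed method.

```python
import numpy as np

# Symmetric InfoNCE-style contrastive loss over a batch of matched
# text/image embedding pairs; the diagonal of the similarity matrix
# holds the positive pairs.

def info_nce(text_emb, image_emb, temperature=0.07):
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = (t @ v.T) / temperature          # pairwise cosine similarities
    n = logits.shape[0]
    # cross-entropy with the matched pair as the target, in both directions
    lp_t2v = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    lp_v2t = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    idx = np.arange(n)
    return -(lp_t2v[idx, idx].mean() + lp_v2t[idx, idx].mean()) / 2

rng = np.random.default_rng(1)
base = rng.normal(size=(8, 16))
aligned_loss = info_nce(base, base + 0.01 * rng.normal(size=(8, 16)))
random_loss = info_nce(base, rng.normal(size=(8, 16)))
```

Well-aligned pairs yield a lower loss than random pairings, which is exactly the gradient signal that drives the projection networks toward a shared semantic space.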

Evaluation Metrics and Datasets

The paper compiles a comprehensive set of evaluation metrics, covering both traditional recommendation measures (Precision@K, Recall@K, NDCG, MAP, HR@K, AUC) and NLP-derived metrics (BLEU, ROUGE, METEOR, BERTScore, BLEURT, CLIPScore, Perplexity). The growing adoption of LLM-based evaluation protocols signals a shift toward scalable, preference-sensitive assessment. An extensive catalog of datasets for multimodal recommender research is also provided, spanning domains such as e-commerce, fashion, academic, food, and multimedia.
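For reference, two of the listed ranking metrics are easy to state exactly. The following minimal implementations (illustrative, not taken from the paper) compute Recall@K and NDCG@K for a single user with binary relevance.

```python
import math

def recall_at_k(ranked_items, relevant, k):
    """Fraction of relevant items retrieved in the top-k of the ranking."""
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(ranked_items, relevant, k):
    """Discounted cumulative gain at k, normalised by the ideal ordering."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked_items[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0

ranking = ["a", "b", "c", "d"]
relevant = {"a", "c"}
r2 = recall_at_k(ranking, relevant, 2)   # only "a" hit in top-2 -> 0.5
```

System-level scores are then averaged over users; the NLP-derived metrics in the list apply instead when the LLM generates explanatory or conversational output rather than a ranked list.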

Several strong claims and emerging directions are emphasized:

  • KGs as a Bridge for Multimodal Integration: KG conversion is positioned as a scalable solution for grounding LLMs, mitigating hallucination, and supporting structured reasoning.
  • Soft Prompting & Adapter Fusion: The combination of soft prompting and adapter-based tuning is projected to become a standard for modular and efficient multimodal integration, though further work is needed on fusion order and interference effects.
  • Attention Mechanism Synergy: The dynamic fusion of coarse and fine-grained attention based on critical topics is recommended for optimal feature integration.
  • Masking and Summarization: Masked contrastive learning and LLM-based summarization remain vital for alignment, particularly in low-resource or unsupervised settings.
  • MLLM Utilization: Despite inference overhead, hybrid usage of MLLMs for pretraining or enrichment paired with lightweight deployment architectures is advocated.
  • LLM-Based Evaluation: Automated evaluation by LLM judges is noted as a scalable substitute for human assessment, yet introduces risks in reproducibility and alignment.
  • Structured Inputs and Agent-Based Reasoning: Trends indicate increasing adoption of structured (JSON, Python class) prompts and retrieval-augmented generation (RAG) in agent orchestration, with implications for interpretability and controllability.

The survey asserts that, although integration of LLMs in MRS is expanding the design space, substantial gaps remain in deployment efficiency, real-time adaptation, and evaluation methodologies.

Conclusion

This survey systematically delineates the transformative role LLMs play in multimodal recommender systems. By transitioning from classic encoder-centric paradigms to reasoning, prompt-driven, and modular adaptation strategies, LLMs grant unprecedented flexibility in semantic representation, cross-modal reasoning, and real-time personalization. The paper’s taxonomy and synthesis provide a rigorous framework for understanding LLM–MRS interactions, highlighting both strong empirical gains and unresolved bottlenecks. Future developments in multimodal recommendation will depend on further research in efficient adaptation, agentic planning, dynamic data fusion, and scalable evaluation—grounded in the technical insights this survey presents.
