
Large Multimodal Models (LMMs)

Updated 19 February 2026
  • Large Multimodal Models (LMMs) are neural architectures that integrate diverse modalities such as images, text, and more through cross-attention and fusion modules for unified reasoning.
  • They employ specialized adapters, projection layers, and benchmark evaluations to enhance tasks like visual question answering, diagrammatic mathematical reasoning, and embodied control.
  • Recent advancements focus on low-resource language adaptation, continual learning with minimal forgetting, and improved explainability, while addressing challenges like spatial grounding and prompt sensitivity.

Large Multimodal Models (LMMs) are a class of neural architectures designed for simultaneous processing, alignment, and reasoning over multiple types of input modalities, most notably vision (images, video), language (text), and in some specialized cases, additional streams such as audio, radio frequency signals, or structured tabular data. By tightly integrating visual and textual representations—often via transformers with cross-attention mechanisms or multimodal fusion modules—LMMs enable complex joint reasoning and instruction following that is unattainable by unimodal LLMs. Their development, evaluation, and application span a wide range of tasks, including general visual question answering (VQA), multimodal retrieval-augmented generation (RAG), mathematical reasoning with diagrams, embodied agent control, low-resource language transfer, and autonomous systems.

1. Architectural Foundations and Multimodal Integration

LMMs typically combine a vision encoder—such as a Vision Transformer (ViT) or convolutional neural network (CNN)—with an LLM (e.g., GPT, LLaMA), linked by a projection or adapter module that translates high-dimensional visual features into the LLM’s embedding space. Vision and text tokens are concatenated or interleaved, enabling the transformer stack to apply self- and cross-attention across all modalities (Yang et al., 2023). This architecture supports a broad range of input configurations: single images with text, interleaved image–text sequences, multiple images per prompt (for multi-image reasoning), and, in more advanced settings, other data types such as LiDAR or radio signals (Xu et al., 2024, Zhang et al., 13 Jan 2026).
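The projector-plus-concatenation pattern can be sketched in a few lines. This is a minimal numeric illustration, not any particular model's implementation: the dimensions, weight initializations, and the ReLU (standing in for the usual GELU) are arbitrary stand-ins.

```python
import numpy as np

def project_vision_tokens(vis_feats, W1, b1, W2, b2):
    # Two-layer MLP projector: vision-encoder features (d_vis) -> LLM embedding space (d_llm).
    h = np.maximum(vis_feats @ W1 + b1, 0.0)  # ReLU stand-in for the usual GELU
    return h @ W2 + b2

rng = np.random.default_rng(0)
n_patches, d_vis, d_llm = 16, 64, 128        # toy sizes; real models are far larger
vis = rng.standard_normal((n_patches, d_vis))  # one feature vector per image patch
W1 = rng.standard_normal((d_vis, d_llm)) * 0.02
b1 = np.zeros(d_llm)
W2 = rng.standard_normal((d_llm, d_llm)) * 0.02
b2 = np.zeros(d_llm)

vis_tokens = project_vision_tokens(vis, W1, b1, W2, b2)  # now in LLM embedding space
text_tokens = rng.standard_normal((8, d_llm))            # stand-in text embeddings
sequence = np.concatenate([vis_tokens, text_tokens])     # joint input to the transformer stack
```

Once projected, the vision tokens are indistinguishable in shape from text tokens, which is what lets self- and cross-attention operate uniformly over the combined sequence.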

Cross-modal fusion is realized via stacked attention layers, where queries, keys, and values may be drawn from any modality. For example, in wireless systems, encoders {E_RF, E_CSI, E_vis, E_tab} map various sensing and symbolic channels into a joint token space, allowing unified processing (Xu et al., 2024). More specialized LMMs for spatial and relational reasoning implement targeted interventions at the level of individual attention heads to transmit and manipulate concepts such as spatial relations (Fu et al., 2 Oct 2025).

Several LMMs are equipped with explicit adapters (e.g., two-layer MLPs), cross-attention bridges (as in Q-Formers), or LoRA-based adapters to facilitate low-rank, parameter-efficient adaptation for domain-specific tasks (Liu et al., 2024, Yang et al., 23 Oct 2025). Some models keep all backbone weights frozen and achieve significant performance gains by training only small segmentation or fusion heads, thereby preserving general instruction-following ability (Wu et al., 2024).
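The low-rank adaptation idea is simple to state concretely. The sketch below shows the standard LoRA parameterization (frozen weight plus a scaled low-rank product); the dimensions and scaling constant are illustrative, not tied to any model in the cited papers.

```python
import numpy as np

def lora_effective_weight(W, A, B, alpha):
    # LoRA: W_eff = W + (alpha / r) * B @ A, where only A (r x d_in)
    # and B (d_out x r) are trained and the backbone weight W stays frozen.
    r = A.shape[0]
    return W + (alpha / r) * (B @ A)

d_out, d_in, r = 1024, 1024, 8
W = np.zeros((d_out, d_in))                          # frozen backbone weight (stand-in)
A = np.random.default_rng(0).standard_normal((r, d_in)) * 0.01  # trainable
B = np.zeros((d_out, r))                             # B starts at zero, so W_eff == W initially
W_eff = lora_effective_weight(W, A, B, alpha=16)

full_params = d_out * d_in          # parameters in the full matrix
lora_params = r * (d_in + d_out)    # trainable parameters under LoRA
```

With rank 8 on a 1024×1024 matrix, the adapter trains roughly 1.6% of the full matrix's parameters, which is why LoRA heads can be swapped per domain without touching the backbone.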

2. Evaluation Paradigms and Benchmarks

LMMs are assessed via a portfolio of benchmarks probing instance-level reasoning, spatial understanding, factual knowledge, and agentic interaction. Notable examples include:

  • MMKC-Bench: Designed to expose LMM susceptibilities to multimodal knowledge conflicts in retrieval-augmented generation pipelines, distinguishing among intra-memory (parametric), context-memory, and inter-context conflicts across entity recognition, visual semantics, and entity factual knowledge (2505.19509).
  • MMR: Evaluates deep reading and spatial comprehension in text-rich images through 11 tasks, revealing universal limitations of current models in spatial relation, text localization, and instruction following for structured output, even where extractive VQA is trivial (Chen et al., 2024).
  • MINED: Probes and updates time-sensitive knowledge, benchmarking six dimensions (cognition, temporal awareness, trustworthiness, implicit temporal understanding, reasoning, robustness). Performance reveals severe model deficits on historical intervals, implicit temporal concepts, and dynamic knowledge updating (Jiang et al., 22 Oct 2025).
  • VisioMath and CMM-Math: Address multimodal mathematical reasoning with diagrams, multi-image contexts, and fine-grained visual ambiguities, showing persistent errors in geometric, multi-image, and symbolically complex problems even for state-of-the-art LMMs (Li et al., 7 Jun 2025, Liu et al., 2024).
  • VisualAgentBench: Benchmarks LMMs as general interactive agents across embodied (robotics, GUI, visual design) scenarios, measuring planning ability, trajectory following (via behavior cloning), and robustness in multi-turn environments (Liu et al., 2024).

Extensive quantitative metrics are employed—accuracy, task-specific scoring (BLEU, CIDEr, SPICE), spatial overlap (IoU), Levenshtein similarity for text outputs, and custom composite metrics such as OAR/CAR/IAR for conflict behavior or CEM for exact temporal match (2505.19509, Jiang et al., 22 Oct 2025, Chen et al., 2024).
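Two of the simpler metrics above, spatial overlap (IoU) and Levenshtein similarity for text outputs, can be implemented directly; this is a generic sketch of the standard definitions, not the exact scoring code of any cited benchmark.

```python
def iou(a, b):
    # Intersection-over-union of two [x1, y1, x2, y2] boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def levenshtein_similarity(s, t):
    # 1 - edit_distance / max_len, used to score free-form text outputs.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return 1.0 - prev[-1] / max(len(s), len(t), 1)
```

For example, boxes [0, 0, 2, 2] and [1, 1, 3, 3] overlap with IoU 1/7, and "kitten" vs "sitting" (edit distance 3) scores 1 − 3/7 ≈ 0.571.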

3. Task Specialization and Domain Adaptation

LMMs have been tailored for domains and regimes beyond general web imagery and language, with architectural and training modifications:

  • Wireless and 6G Systems: LMMs act as universal foundation models for AI-native networks, ingesting RF waveforms, CSI matrices, control tables, and images. Neuro-symbolic modules and RAG are used for causal physical grounding and dynamic reasoning, yielding enhancements in rationale quality and mathematical inference over vanilla LLMs (Xu et al., 2024, Yang et al., 23 Oct 2025).
  • Low-Resource Languages: Adaptation strategies include visual enhancement (e.g., image-guided translation, visual disambiguation), synthetic data augmentation (back-translation, diffusion models), cross-modal and cross-lingual transfer (using adapters or contrastive CLIP-like approaches), and advanced fusion mechanisms (early, late, and architectural) (Lupascu et al., 8 Feb 2025). Empirical results show notable BLEU, CIDEr, and accuracy improvements on image captioning, sentiment, and hate speech tasks.
  • Continual Learning and Catastrophic Forgetting: Continual instruction tuning has been benchmarked, revealing that standard LMMs suffer from substantial forgetting when exposed to sequential tasks. Data replay and model expansion (task-specific Q-Former heads) are more robust than regularization-based methods, especially in settings starting from unimodal pretraining (He et al., 2023).
  • Object Detection and Visual Grounding: LMM-Det eliminates specialist detection heads by adjusting training data distributions and inference prompts, leveraging pseudo-labels for proposal-rich supervision and boosting general-purpose LMMs’ recall and precision. Grounding in frozen LMMs (e.g., F-LMM) exploits internal attention maps with minimal parameter growth and no degradation in conversational or general VQA capabilities (Li et al., 24 Jul 2025, Wu et al., 2024).
  • Agentic and Embodied Control: Architectures for embodied driving, robotics, or agentic web/mobile interaction combine LMMs as perception modules with deep RL or planning policies, integrating visual, kinematic, and environmental data streams for closed-loop action (Zhang et al., 13 Jan 2026, Ishaq et al., 18 Mar 2025, Liu et al., 2024).
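The early- vs late-fusion distinction mentioned for low-resource adaptation can be sketched as follows. All dimensions and weights here are arbitrary stand-ins for a toy two-modality classifier.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
img_feat = rng.standard_normal(64)               # per-modality feature vectors (stand-ins)
txt_feat = rng.standard_normal(32)
W_shared = rng.standard_normal((96, 3)) * 0.1    # one joint head over concatenated features
W_img = rng.standard_normal((64, 3)) * 0.1       # per-modality heads for late fusion
W_txt = rng.standard_normal((32, 3)) * 0.1

# Early fusion: concatenate features first, then apply a single shared classifier.
early = softmax(np.concatenate([img_feat, txt_feat]) @ W_shared)

# Late fusion: classify each modality independently, then combine the distributions.
late = (softmax(img_feat @ W_img) + softmax(txt_feat @ W_txt)) / 2
```

Architectural fusion, the third variant mentioned above, instead interleaves the modalities inside the encoder layers (e.g., via cross-attention), which this toy sketch does not attempt to show.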

4. Conflict Resolution, Reasoning, and Knowledge Updating

A core challenge for LMMs is the arbitration among conflicting sources of information, particularly under RAG. MMKC-Bench demonstrates that LMMs overwhelmingly privilege parametric knowledge (OAR) over retrieved external evidence (CAR), especially for perceptual conflicts (entity recognition, visual semantics), and less so for factual attribute conflicts (EK) (2505.19509). Context-memory and inter-context conflicts increase CAR marginally, but internal knowledge remains dominant, regardless of model scale.
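The source does not spell out the formulas behind OAR and CAR; a plausible reading is that OAR counts cases where the model keeps its original parametric answer despite conflicting retrieved context, and CAR counts cases where it adopts the context's answer. The sketch below uses hypothetical field names under that assumption.

```python
def conflict_rates(records):
    # Hedged sketch: each record holds the model's answer without retrieval
    # ("parametric"), the answer asserted by the retrieved context ("context"),
    # and the answer produced with that context in the prompt ("final").
    n = len(records)
    oar = sum(r["final"] == r["parametric"] for r in records) / n  # kept own answer
    car = sum(r["final"] == r["context"] for r in records) / n     # adopted context
    return oar, car

records = [
    {"parametric": "A", "context": "B", "final": "A"},
    {"parametric": "A", "context": "B", "final": "B"},
    {"parametric": "C", "context": "D", "final": "C"},
    {"parametric": "E", "context": "F", "final": "E"},
]
oar, car = conflict_rates(records)
```

Under this reading, an OAR well above CAR—as MMKC-Bench reports—quantifies exactly the parametric-knowledge bias described above.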

Detection of knowledge conflict is feasible (coarse/fine accuracy ~70–79%), but models achieve higher detection accuracy on non-conflict samples, indicating a bias toward missing subtle tensions. Model scale correlates with greater parametric reliance and only marginal improvements in external reasoning.

Efforts to update LMM knowledge, particularly for time-sensitive facts, utilize parameter-modifying (FT-LLM, FT-VIS, MEND) and parameter-preserving (SERAC, IKE) methods: single-instance updates reach near-perfect accuracy, but lifelong sequential editing induces catastrophic forgetting in parameter-based approaches, with only memory-based solutions retaining longevity (Jiang et al., 22 Oct 2025).

5. Explainability, Internal Concept Representation, and Interpretability

Recent advances expose the internal structure and semantic basis of LMM representations:

  • Concept-Based Explainability: Dictionary-learning approaches (via semi-NMF with ℓ1 sparsity) recover a small set of multimodal concepts—basis vectors in token representation space—faithfully grounded in both prototypical image clusters and decoded text tokens. These concepts provide a disentangled, interpretable account of LMM decision-making, vastly more compact than neuron-level attribution (Parekh et al., 2024).
  • Function Vector Modularity: For spatial and relational tasks, function vectors formed from key attention heads can be directly manipulated (injected, composed) to control and enhance relational reasoning—enabling analogy, zero-shot transfer, and modularity without modifying backbone weights (Fu et al., 2 Oct 2025).
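The dictionary-learning idea above can be sketched as a semi-NMF: factor a token-representation matrix X into an unconstrained concept basis F and sparse nonnegative activations G. This is a simplified alternating scheme (exact least squares for F, one proximal-gradient step for G), not the optimization used in the cited work.

```python
import numpy as np

def semi_nmf(X, k, lam=0.01, iters=50, seed=0):
    # Semi-NMF with l1 sparsity: X (d x n) ~ F (d x k) @ G (k x n), G >= 0.
    # F holds concept basis vectors; G holds sparse nonnegative activations.
    rng = np.random.default_rng(seed)
    G = np.abs(rng.standard_normal((k, X.shape[1])))
    for _ in range(iters):
        # F step: exact least squares (F is unconstrained).
        F = X @ G.T @ np.linalg.pinv(G @ G.T)
        # G step: one proximal-gradient step on 0.5||X - FG||^2 + lam*||G||_1, G >= 0.
        L = np.linalg.norm(F.T @ F, 2) + 1e-12   # Lipschitz constant of the smooth part
        grad = F.T @ (F @ G - X)
        G = np.maximum(G - (grad + lam) / L, 0.0)
    return F, G

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 50))   # stand-in for token representations
F, G = semi_nmf(X, k=4)
```

Each column of F is a candidate "concept" direction; a token's row in G says which few concepts it activates, which is what makes the decomposition more compact than neuron-level attribution.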

This internal modularity enables plug-and-play adaptation, precise debugging, and new theoretical understanding of LMM compositionality.

6. Limitations, Failure Modes, and Open Problems

Comprehensive error analyses across benchmarks highlight deficiencies and persistent obstacles:

  • Spatial Reasoning and Visual Grounding: Models display marked drops for fine-grained spatial distinctions, text grounding, and multi-view contrast reasoning. Instruction-following for structured outputs, such as bounding box tuples, remains brittle (Chen et al., 2024, Li et al., 7 Jun 2025).
  • Multilingual and Cross-Lingual Reasoning: Open and even closed LMMs show strong language preferences, with significant degradation when text must be read and reasoned across scripts (e.g., images in Chinese, prompts in German) (Wang et al., 2024).
  • Mathematical and Symbolic Errors: LMMs misread axes, misinterpret symbols, or focus on a single image in multi-image settings, with pronounced accuracy degradation at higher school levels or in geometry-heavy contexts (Li et al., 7 Jun 2025, Liu et al., 2024).
  • Prompt Sensitivity and Visual Referring: The effectiveness of visual pointers or overlay prompts varies significantly by model—some gain >7% in accuracy, while others drop by up to 17.5% (e.g., GPT-4V under partial intervention)—indicating that user interface and prompt engineering remain unsolved research areas (Li et al., 2023).
  • Continual and Lifelong Updating: Static model architectures are ill-equipped for cumulative fact updates, with catastrophic forgetting as a central barrier. Adaptive, memory-augmented architectures and retrieval-based mechanisms are proposed remedies (Jiang et al., 22 Oct 2025, He et al., 2023).
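The brittleness of structured outputs such as bounding-box tuples is one reason evaluation pipelines parse model text defensively. A tolerant parser along these lines (illustrative, not taken from any cited benchmark) accepts both JSON-style and loose tuple forms and rejects degenerate boxes instead of crashing:

```python
import re

def parse_bbox(text):
    # Find the first [x1, y1, x2, y2] or (x1, y1, x2, y2) in free-form model output.
    # Returns None on missing or degenerate boxes rather than raising.
    m = re.search(r"[\[\(]\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*,\s*"
                  r"(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*[\]\)]", text)
    if not m:
        return None
    x1, y1, x2, y2 = map(float, m.groups())
    if x2 <= x1 or y2 <= y1:          # reject zero- or negative-area boxes
        return None
    return [x1, y1, x2, y2]
```

Scoring only what can be parsed, rather than failing the sample outright, separates format errors from genuine grounding errors in the analyses above.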

7. Future Research Directions

Current frontiers identified in the literature include:

  • Enhancing LMM sensitivity to external evidence via architectural, pretraining, and training paradigm changes—targeting real-world, multi-context, and time-varying knowledge integration (2505.19509, Jiang et al., 22 Oct 2025).
  • Development of more robust, scalable, and efficient multimodal pretraining protocols, especially for low-resource languages, efficient continual adaptation, and dynamic fusion mechanisms (Lupascu et al., 8 Feb 2025).
  • Expansion into high-stakes domains: wireless AI-native networks, real-time autonomous driving, embodied agents—with an emphasis on robust perception, planning, and safe adaptation (Xu et al., 2024, Zhang et al., 13 Jan 2026, Yang et al., 23 Oct 2025).
  • New explainability frameworks blending concept-based decomposition, causal mediation, and modular function vectors to provide human-interpretable, actionable control over model behavior (Parekh et al., 2024, Fu et al., 2 Oct 2025).
  • Benchmark extensions covering naturalistic multimodal conflicts, dynamic multimodal knowledge graphs, video and audio modalities, and global benchmarks for multilingual, multimodal, multisensory understanding (2505.19509, Wang et al., 2024).

Accelerated progress in these areas will be essential for advancing LMMs from powerful multimodal transformers to robust, trustworthy, and explainable AI systems across diverse real-world tasks.
