Video Multimodal LLMs

Updated 7 February 2026
  • Video Multimodal LLMs are architectures that combine language models with video, audio, and sensor inputs to enable spatiotemporal and multimodal reasoning.
  • They employ methods such as two-stage adapters, joint end-to-end training, spatio-temporal queries, and token compression to process complex sequential data efficiently.
  • These models excel in tasks like video captioning, grounding, and real-time streaming while addressing challenges in scalability and hallucination mitigation.

Video Multimodal LLMs (VLLMs) are architectures that extend LLMs to process and reason over video content, often incorporating complementary modalities such as audio, object tracks, and sometimes external sensor data. The integration of LLMs with video understanding represents a convergence between natural language processing, computer vision, and multimodal representation learning, addressing the highly structured spatiotemporal nature of video.

1. Foundational Architectures and Cross-Modal Fusion

VLLMs differ from image-based multimodal LLMs in that they must model the temporal dimension in addition to spatial and other modality (e.g., auditory) information. Early and representative VLLMs typically rely on established visual encoders (frozen or trained end-to-end) that extract per-frame features, which are then aligned or adapted to the LLM token embedding space. Prominent architectures fall broadly into:

  • Two-Stage/Plug-and-Play Adapters: Models such as Video-LLaMA (Zhang et al., 2023) and RED-VILLM (Huang et al., 2024) use frozen image/audio encoders (ViTs, CNNs, CLIP, ImageBind) and introduce lightweight temporal modules (e.g., Q-Formers, temporal adapters) that pool and aggregate temporal features from frames or audio segments. The outputs are linearly projected and provided as a soft prompt or prefix to a frozen LLM (e.g., LLaMA, Vicuna) for downstream text generation.
  • Joint End-to-End Models: More tightly coupled systems like ResNetVLLM (Khalil et al., 20 Apr 2025) train a ResNet visual backbone from scratch; its output features are concatenated at the token level with textual prompts and processed by a shared transformer with optional cross-attention between visual and text tokens. This enables end-to-end learning of visual and linguistic features for seamless cross-modal fusion.
  • Unified Representations: Video-LLaVA (Lin et al., 2023) introduces "alignment before projection," unifying both image and video features into a shared text-feature space via pre-aligned encoders (e.g., LanguageBind), then projecting into the LLM embedding space, substantially improving performance and convergence in joint image/video training.
  • Spatio-Temporal Query Architectures: Models such as SpaceVLLM (Wang et al., 18 Mar 2025) use interleaved spatio-temporal queries—specialized tokens distributed amid per-frame patch embeddings—plus a query-guided spatial decoder that allows explicit mapping between queries and coordinates/bounding boxes at each frame, enabling explicit spatio-temporal grounding.
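The adapter-style designs above share a common core: a small set of learnable queries cross-attends over per-frame features, and the pooled result is projected into the LLM embedding space as a soft prompt. A minimal numpy sketch of that Q-Former-style aggregation (single-head attention and all dimensions are illustrative simplifications, not any paper's exact configuration):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_query_prefix(frame_feats, queries, w_proj):
    """Aggregate per-frame features with a fixed set of learnable
    temporal queries (single-head cross-attention), then linearly
    project the result into the LLM embedding space as a soft prompt.

    frame_feats: (T, D_v) features from a frozen per-frame encoder
    queries:     (Q, D_v) learnable temporal queries
    w_proj:      (D_v, D_llm) projection into the LLM token space
    returns:     (Q, D_llm) soft-prompt tokens for a frozen LLM
    """
    attn = softmax(queries @ frame_feats.T / np.sqrt(frame_feats.shape[1]))
    pooled = attn @ frame_feats        # (Q, D_v) temporally pooled features
    return pooled @ w_proj             # (Q, D_llm) soft prompt / prefix

rng = np.random.default_rng(0)
prefix = temporal_query_prefix(
    rng.standard_normal((16, 64)),     # 16 frames, 64-dim toy features
    rng.standard_normal((4, 64)),      # 4 temporal queries
    rng.standard_normal((64, 32)),     # project to a 32-dim toy LLM space
)
print(prefix.shape)  # (4, 32)
```

Because only the queries and the projection are trained, the frozen encoder and frozen LLM never see gradient updates, which is what makes the two-stage recipe cheap.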

2. Temporal Reasoning and Video Token Representation

Capturing temporal correlations is essential for video understanding. Several architectural innovations have been introduced:

  • Temporal Pooling and Query Mechanisms: Video Q-Formers (Zhang et al., 2023) cross-attend to per-frame features using a fixed set of learnable temporal queries, aggregating temporally distributed salient elements across the video. RED-VILLM (Huang et al., 2024) computes both spatial and temporal pooled representations, passing these through plug-and-play temporal modules to adapt off-the-shelf image LLMs to the video domain.
  • Transformer-Based Approaches: Certain models implement full-sequence transformers over concatenated visual and textual tokens, using position embeddings to encode frame/time index, implicitly capturing temporal dependencies through self-attention (Khalil et al., 20 Apr 2025, Chen et al., 2024).
  • Memory-Based Streaming and Selection: VideoStreaming (Qian et al., 2024) employs a memory-propagated streaming encoder, segmenting long videos into clips, encoding each with context from the previous, and maintaining a fixed-sized memory. At inference, an adaptive Gumbel-TopK selection identifies the question-relevant memory slices, which are concatenated with the user prompt for autoregressive answer generation.
  • Token Compression and Efficiency: DyCoke (Tao et al., 2024) introduces a training-free dynamic compression pipeline that reduces temporal redundancy via windowed merging and prunes spatially redundant tokens in the decoder key-value caches, achieving substantial speedup and memory reduction without impacting accuracy.
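The temporal-redundancy idea behind such compression can be sketched in a few lines. The following is a simplified, training-free pruning pass in the spirit of DyCoke's windowed merging stage (an assumed simplification, not the paper's exact algorithm): within each window, later frames' tokens are dropped when they are nearly identical to the window's anchor frame.

```python
import numpy as np

def prune_temporal_redundancy(tokens, window=4, sim_thresh=0.9):
    """Training-free temporal pruning sketch: within each window of
    frames, keep the first (anchor) frame's tokens; drop a later
    frame's token when its cosine similarity to the same-position
    anchor token exceeds sim_thresh.

    tokens: (T, N, D) -- T frames, N visual tokens per frame
    returns: (M, D) surviving tokens, M <= T * N
    """
    T, N, D = tokens.shape
    unit = tokens / (np.linalg.norm(tokens, axis=-1, keepdims=True) + 1e-8)
    kept = []
    for start in range(0, T, window):
        anchor = unit[start]                      # (N, D) anchor frame
        kept.append(tokens[start])                # anchors always survive
        for t in range(start + 1, min(start + window, T)):
            sim = (unit[t] * anchor).sum(-1)      # per-token cosine sim
            kept.append(tokens[t][sim < sim_thresh])
    return np.concatenate(kept, axis=0)

rng = np.random.default_rng(0)
# A nearly static clip: 8 frames of 8 tokens that barely change.
static = np.tile(rng.standard_normal((1, 8, 16)), (8, 1, 1))
out = prune_temporal_redundancy(static + 1e-4 * rng.standard_normal((8, 8, 16)))
print(out.shape[0], "of", 8 * 8, "tokens kept")
```

On static content almost everything outside the anchor frames is redundant and gets pruned, which is exactly the regime where video token compression pays off most.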

3. Multimodal Integration: Audio, Gaze, and Sensor Streams

While early systems focused on RGB frames alone, leading VLLMs now handle additional modalities:

  • Audio-Visual Integration: Audio-Visual LLM (Shu et al., 2023) and Video-LLaMA (Zhang et al., 2023) both leverage modality-specific encoders (e.g., CLAP for audio; CLIP for frames) plus cross-modal projection and gating. These models support visual-only, audio-only, or joint AV inference by applying explicit gating tokens that activate the necessary encoders and mask irrelevant channels during both training and inference. Modality-augmented training regimens yield strong performance on both video QA and audio captioning tasks.
  • Sensor and Egocentric Inputs: GazeLLM (Rekimoto, 31 Mar 2025) incorporates synchronized egocentric video–gaze data, mimicking foveal vision via eye-tracker-guided cropping. Processing high-res crops near the gaze point and discarding periphery achieves competitive comprehension at one-tenth the pixel/memory budget, pointing to an efficient path for scaling VLLMs to long or high-resolution inputs.
  • Object Trajectories and Scene Graphs: MVU (Ranasinghe et al., 2024) extracts object-centric modalities—object presence, spatial location, and motion trajectories—using vision tools like OWL-ViT and converts them into natural-language tokens for prompt-based fusion, thus providing the pretrained VLM with modular world knowledge for robust long-video reasoning.
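The gaze-guided efficiency gain is easy to see concretely. Below is an illustrative sketch of foveal cropping (not GazeLLM's exact pipeline; the crop size and patch arithmetic are assumptions for illustration): a fixed window is cut around the gaze point, and the token count drops quadratically with the crop side length.

```python
import numpy as np

def foveal_crop(frame, gaze_xy, crop=224):
    """Crop a fixed-size foveal window centred on the gaze point,
    clamped so the window stays inside the frame bounds.

    frame:   (H, W, 3) image array
    gaze_xy: (x, y) gaze coordinates in pixels
    """
    H, W = frame.shape[:2]
    x = int(np.clip(gaze_xy[0] - crop // 2, 0, W - crop))
    y = int(np.clip(gaze_xy[1] - crop // 2, 0, H - crop))
    return frame[y:y + crop, x:x + crop]

frame = np.zeros((720, 1280, 3), dtype=np.uint8)
patch = foveal_crop(frame, gaze_xy=(1200, 30))   # gaze near a corner
print(patch.shape)                               # (224, 224, 3)

# Token budget with 16x16 patches: 224x224 yields 196 tokens per frame
# versus 3600 for the full 720p frame -- roughly an 18x reduction.
print((720 // 16) * (1280 // 16) / ((224 // 16) ** 2))
```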

4. Training Protocols, Datasets, and Instruction Tuning

Progress in VLLMs has been catalyzed by careful design of instruction-tuning corpora, cross-modal alignment strategies, and efficient transfer from vision or image LLMs:

  • Pretraining Regimes: Most models pretrain their visual encoders and LLMs on separate large-scale datasets: e.g., WebVid-2M for video–text, CC595k for images, and multi-turn dialogue datasets from MiniGPT-4, LLaVA, or in-house instruction data (Zhang et al., 2023, Li et al., 2023, Lv et al., 2024).
  • Instruction Tuning: Multi-stage protocols first pretrain on raw caption data, then fine-tune cross-modal adapters on high-quality dialogue or instruction pairs. RED-VILLM (Huang et al., 2024) demonstrates that fine-tuning <1% of parameters (θ_t, θ_z) on only ~100K video-text instructions yields SOTA performance, exploiting the frozen alignment of the backbone image-LLM.
  • Temporal and Spatial Supervision: For temporally and spatially localized tasks, datasets such as Charades-STA (temporal intervals), RefCOCO+/HCSTVG (object boxes), and the large synthetic Uni-STG (Wang et al., 18 Mar 2025), provide explicit span or bounding box supervision. Joint optimization over language and spatial losses enables direct temporal or spatio-temporal grounding.
  • Streaming and Dialogue: LIVE (Learning-In-Video-Stream) (Chen et al., 2024) converts offline temporally-annotated datasets into streaming dialogue format for supervision. Training losses couple next-token LM objectives with explicit EOS prediction at non-response frames to support long-context streaming inference.
| Model | Frozen Backbones | Temporal Module | Multimodal Capability | Key Training Corpus |
| --- | --- | --- | --- | --- |
| Video-LLaMA | ViT-G/14, ImageBind | Q-Formers | Audio-Visual | WebVid-2M, MiniGPT-4 instruct |
| RED-VILLM | ViT/CLIP, LLaVA | Plug-in Temporal MLP | Visual | ActivityNet-QA instruct |
| ResNetVLLM | ResNet (learned) | Transformer | Visual | Video-ChatGPT-100K |
| SpaceVLLM | SigLIP, Qwen2 | Spatio-Temporal Query | Spatio-Temporal | Unified Spatio-Temporal Grounding (Uni-STG) |
| Audio-Visual LLM | CLIP, CLAP/HTSAT | Linear + Gating | Audio-Visual | WavCaps, WebVid-2M, VGGSound |
| VideoStreaming | CLIP+Phi-2 | Streaming Encoder | Visual | MovieNet-QA, NExT-QA |
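The streaming-dialogue supervision described for LIVE can be sketched as a single loss: next-token cross-entropy on response positions, plus an explicit EOS target at frames where the model should stay silent. This is an assumed minimal form for illustration, not the paper's exact implementation:

```python
import numpy as np

def live_streaming_loss(logits, targets, is_response, eos_id):
    """Sketch of a LIVE-style streaming objective: standard next-token
    cross-entropy at response positions, with an explicit EOS target at
    frames where the model should remain silent.

    logits:      (T, V) per-step vocabulary logits
    targets:     (T,) ground-truth next-token ids (used at response steps)
    is_response: (T,) bool mask, True where text output is supervised
    eos_id:      id of the EOS / "no response yet" token
    """
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    y = np.where(is_response, targets, eos_id)   # silent frames -> EOS
    return -logp[np.arange(len(y)), y].mean()

rng = np.random.default_rng(0)
logits = rng.standard_normal((6, 10))            # 6 steps, toy vocab of 10
targets = np.array([3, 5, 0, 0, 7, 2])
is_response = np.array([True, True, False, False, True, True])
loss = live_streaming_loss(logits, targets, is_response, eos_id=9)
print(round(float(loss), 3))
```

Training the model to emit EOS at non-response frames is what lets it decide, frame by frame, when to speak during live inference.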

5. Task Coverage, Capabilities, and Performance

VLLM research now covers a broadening range of tasks beyond classical video QA:

  • Dense Video Captioning and Summarization: VideoNarrator (Wu et al., 22 Jul 2025) generates temporally aligned, high-quality dense captions for segments, using only off-the-shelf VLLMs as modular generators, context enrichers, and verifiers, reducing hallucinations and supporting downstream summarization and retrieval applications.
  • Spatio-Temporal Grounding: SpaceVLLM (Wang et al., 18 Mar 2025) achieves SOTA on benchmarks requiring both interval and spatial object localization (e.g., HCSTVG, Charades-STA), enabled by its spatio-temporal queries and query-guided spatial decoders.
  • Temporal Grounding and Activity Localization: Two-stage schemes (Song, 2024) using image-based LLMs for per-frame description followed by text-LM temporal reasoning often outperform monolithic video-LLMs even at similar parameter scale.
  • Audio-Visual Reasoning: Audio-Visual LLM (Shu et al., 2023) and Video-LLaMA (Zhang et al., 2023) exhibit strong performance on video QA and audio captioning, showing the importance of modality-augmented training and explicit modality selection.
  • Long-Video and Streaming Understanding: VideoLLM-online (Chen et al., 2024) implements streaming dialogue over videos exceeding thousands of frames, with real-time inference and explicit temporal alignment via streaming EOS loss, outperforming prior clip-based systems on streaming benchmarks.
  • Hallucination Diagnosis and Decoding: Compositional hallucinations—errors emerging from entangled spatiotemporal concepts—are systematically analyzed via OmniVCHall and mitigated by TriCD (Xing et al., 31 Jan 2026), which enhances robustness to both isolated and compositional hallucinations via reinforcement-learned, contrastive triple-pathway decoding.
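The intuition behind hallucination-suppressing decoders can be illustrated with the standard contrastive-decoding adjustment (shown here only as a generic illustration; TriCD's actual reinforcement-learned triple-pathway scheme is more involved and is not reproduced here). Tokens that remain likely even when the video evidence is degraded are down-weighted as ungrounded language priors:

```python
import numpy as np

def contrastive_logits(logits_full, logits_degraded, alpha=0.5):
    """Generic contrastive-decoding adjustment: amplify the full-input
    pathway and subtract the degraded (e.g., video-ablated) pathway,
    penalising tokens the model would predict without visual evidence."""
    return (1 + alpha) * logits_full - alpha * logits_degraded

full = np.array([0.0, 2.0, 1.9])       # slight preference for token 1
degraded = np.array([0.0, 2.5, 0.0])   # token 1 scores high with no video:
                                       # it is a pure language prior
adjusted = contrastive_logits(full, degraded)
print(int(full.argmax()), int(adjusted.argmax()))  # 1 2
```

After the adjustment the visually grounded token (2) wins over the prior-driven token (1), even though the raw logits narrowly preferred the latter.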

6. Efficiency, Scalability, and Memory Optimization

Scalability remains a central challenge because transformer self-attention scales quadratically with sequence length:

  • Token Pruning and Compression: DyCoke (Tao et al., 2024) fuses window-based temporal merging with dynamic KV cache reduction, yielding up to 1.5× latency improvement and 1.4× lower memory while retaining or improving task accuracy, demonstrating that effective pruning does not necessarily degrade model performance.
  • Streaming Memory Mechanisms: Methods such as memory-propagated streaming encoders (VideoStreaming (Qian et al., 2024)) enable streaming understanding of arbitrarily long videos by propagating and distilling condensed memory representations, achieving efficient per-question inference without full re-encoding.
  • Gaze-Guided Cropping: GazeLLM (Rekimoto, 31 Mar 2025) achieves major token savings and quadratic computational reductions by cropping only the gaze-relevant foveal region, empirically matched in semantic coverage to full-frame processing on natural instruction tasks.
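To see why token pruning pays off in memory terms, consider a back-of-envelope KV-cache estimate. The model shape (32 layers, 32 heads of dimension 128, a common 7B-class configuration) and the 70% pruning rate below are hypothetical illustrations, not figures from the cited papers:

```python
def kv_cache_bytes(n_tokens, n_layers, n_heads, head_dim, bytes_per=2):
    """Back-of-envelope fp16 KV-cache footprint: two tensors (K and V)
    per layer, each of shape n_tokens x n_heads x head_dim."""
    return 2 * n_layers * n_tokens * n_heads * head_dim * bytes_per

n_visual = 8 * 576                    # e.g., 8 frames at 576 tokens each
full = kv_cache_bytes(n_visual, n_layers=32, n_heads=32, head_dim=128)
pruned = kv_cache_bytes(int(n_visual * 0.3), 32, 32, 128)
print(f"{full / 2**20:.0f} MiB -> {pruned / 2**20:.0f} MiB")  # 2304 MiB -> 691 MiB
```

Even before counting the quadratic attention-compute savings, dropping 70% of visual tokens shrinks the decoder's cache by the same factor, which is the dominant memory cost at long context lengths.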

7. Limitations, Open Problems, and Future Directions

Several persistent limitations and future opportunities are noted:

  • Challenges in handling long-duration videos remain, especially with limited context windows and the computational expense of dense frame sampling (Li et al., 2023, Zhang et al., 2023). Memory-efficient attention, sparse transformers, and hierarchical token selection are under active investigation (Qian et al., 2024, Tao et al., 2024).
  • Robust spatio-temporal grounding, motion tracking, and causal/event understanding in unconstrained video remain open problems, with performance lags on datasets requiring explanation or compositional reasoning (Wang et al., 18 Mar 2025, Xing et al., 31 Jan 2026).
  • Multimodal fusion at arbitrary granularity (object, region, trajectory), efficient audio-visual co-representation, and scaling to additional sensor modalities (infrastructure video, egocentric, action streams) are actively researched (Shu et al., 2023, Ranasinghe et al., 2024).
  • Mitigating hallucinations, especially compositional errors, is a major focus, with plug-and-play contrastive decoding and reinforcement learning-based calibration emerging as promising directions (Xing et al., 31 Jan 2026).
  • Future work is expected to exploit unified backbone encoders, plug-in or memory-augmented temporal modules, and systematic self-supervised pretraining, facilitated by large-scale, diverse, and high-quality instruction and grounding datasets (Wang et al., 18 Mar 2025).

In summary, modern VLLMs integrate vision, audio, and text at scale through unified representational spaces, adapter-augmented cross-modal fusion, and instruction optimization, achieving strong performance on a growing spectrum of multimodal video understanding tasks while addressing efficiency, scalability, and new failure modes (Zhang et al., 2023, Huang et al., 2024, Khalil et al., 20 Apr 2025, Wang et al., 18 Mar 2025, Shu et al., 2023, Tao et al., 2024, Qian et al., 2024, Ranasinghe et al., 2024, Lin et al., 2023, Wu et al., 22 Jul 2025).
