
Video Foundation Models Overview

Updated 26 December 2025
  • VidFMs are fixed-weight neural networks pre-trained on diverse video data to generate transferable embeddings for multiple downstream tasks.
  • They leverage architectures—image-based, video-based, and multimodal—with pretraining methods like masked video modeling and contrastive alignment.
  • Scaling strategies, low-cost adaptation techniques, and rigorous benchmarking enable practical deployment in video classification, retrieval, and captioning.

A Video Foundation Model (VidFM) is a fixed-weight neural network trained on large-scale, diverse video (and often multimodal) data to produce general-purpose video embeddings suitable for a wide range of downstream video analysis and understanding tasks—classification, retrieval, localization, captioning, and more. VidFMs have emerged as the video-domain analogue of the foundation models that have transformed natural language processing and image understanding, with the goal of providing robust, transferable video representations that require minimal task-specific adaptation (Madan et al., 2024, Wang et al., 2024, Lee et al., 2024).

1. Definitions and Taxonomy of Video Foundation Models

A core property of a VidFM is that, given an input video sequence $X = \{I_t\}_{t=1}^L$, the model $f_\theta$ outputs a general embedding $\mathbf{z} = f_\theta(X) \in \mathbb{R}^d$, to be consumed by lightweight task heads or decoders. VidFMs are distinguished from task-specialized video models by scale, diversity of pretraining objectives/data, and their capacity to serve a multiplicity of downstream use cases without model weights or architecture changes (Lee et al., 2024, Madan et al., 2024, Yuan et al., 2023, Li et al., 2024).
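The frozen-backbone interface above can be made concrete with a minimal sketch: a stand-in encoder (the function names, dimension d=768, and the random projection are illustrative, not any particular model's design) maps a video tensor to a single embedding that a lightweight linear head then consumes.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_theta(video: np.ndarray) -> np.ndarray:
    """Stand-in for a frozen VidFM encoder f_theta: maps a video of shape
    (L, H, W, C) to a single d-dimensional embedding z. A real VidFM would
    run a pretrained spatiotemporal transformer here."""
    d = 768
    pooled = video.mean(axis=(0, 1, 2))            # global pool over frames/pixels -> (C,)
    W = rng.standard_normal((pooled.shape[0], d))  # frozen projection (illustrative)
    return pooled @ W                              # embedding z of shape (d,)

def linear_head(z: np.ndarray, W_head: np.ndarray) -> np.ndarray:
    """Lightweight task head: only this part changes per downstream task."""
    return z @ W_head                              # (num_classes,) logits

video = rng.random((16, 32, 32, 3))                # L=16 frames of 32x32 RGB
z = f_theta(video)
logits = linear_head(z, rng.standard_normal((768, 10)))
print(z.shape, logits.shape)                       # (768,) (10,)
```

The key point the sketch illustrates is the division of labor: the encoder weights never change across tasks, while each task supplies only its own small head.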

VidFMs are typically classified along three axes:

  • Image-based VidFMs: Adapt large frozen image–text models (e.g., CLIP, DINOv2) for video by inflating input, adding light spatiotemporal adapters or pooling (Li et al., 2023). Temporal reasoning is often shallow or implicit.
  • Video-based VidFMs: Employ native spatiotemporal architectures (e.g., Video ViTs, UniformerV2) with explicit temporal modeling and self- or contrastive supervision directly on videos (VideoMAE, InternVideo) (Wang et al., 2022, Wang et al., 2024).
  • Universal/Multimodal FMs: Jointly learn over video, image, text, and audio streams, using multi-encoder, joint-encoder, or mix-encoder architectures combined with masking, contrastive, and next-token objectives (InternVideo2, VATT, VALOR) (Wang et al., 2024, Madan et al., 2024, Lee et al., 2024).

2. Core Architectures and Pretraining Paradigms

Architecture choice in VidFMs is intimately linked with pretraining objectives and scale. The predominant design space encompasses the following:

2.1 Transformer Backbones
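VidFM backbones are predominantly ViT-style transformers whose front end tokenizes a video into spatiotemporal patches ("tubelets") before self-attention. A minimal sketch of this tubelet-embedding step (all shapes, the tubelet size t x p x p, and the random projection are illustrative assumptions, not a specific model's configuration):

```python
import numpy as np

def tubelet_embed(video, t=2, p=16, d=768, seed=0):
    """Split a video (L, H, W, C) into non-overlapping t x p x p tubelets
    and linearly project each to a d-dim token, as in a video ViT front end.
    The projection matrix would be learned in practice; here it is random."""
    L, H, W, C = video.shape
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((t * p * p * C, d)) * 0.02
    tokens = []
    for ti in range(0, L - t + 1, t):          # step over time
        for hi in range(0, H - p + 1, p):      # step over height
            for wi in range(0, W - p + 1, p):  # step over width
                tub = video[ti:ti + t, hi:hi + p, wi:wi + p, :].reshape(-1)
                tokens.append(tub @ proj)
    return np.stack(tokens)                    # (num_tokens, d)

video = np.zeros((4, 64, 64, 3))               # L=4 frames of 64x64 RGB
tokens = tubelet_embed(video)
print(tokens.shape)                            # (2 * 4 * 4, 768) = (32, 768)
```

The resulting token sequence is what the transformer layers (and the pretraining objectives in the next subsection) operate on; token count grows linearly with clip length, which is why the token-dropping strategies discussed later matter for cost.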

2.2 Pretraining Objectives

VidFMs are typically pretrained by optimizing combinations of:

Table 1: Canonical Pretraining Objectives in VidFMs

| Objective                  | Model Examples         | Role in Pretraining             |
|----------------------------|------------------------|---------------------------------|
| Masked Video Modeling      | VideoMAE, InternVideo2 | Spatiotemporal feature learning |
| Video–Text Contrastive     | CLIP, InternVideo, UMT | Semantic alignment, retrieval   |
| Feature/Token Distillation | UMT, InternVideo2      | Efficient training & transfer   |
| Masked Language Modeling   | InternVideo2, UMT      | Multimodal fusion               |
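Of the objectives in Table 1, video–text contrastive alignment is the one most directly responsible for retrieval ability. A minimal numpy sketch of the symmetric InfoNCE loss it optimizes (batch size, embedding dimension, and temperature are illustrative; real training uses learned encoders and much larger batches):

```python
import numpy as np

def clip_style_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired video/text embeddings:
    matched pairs sit on the diagonal of the similarity matrix and are
    pulled together while mismatched pairs are pushed apart."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature            # (B, B) cosine similarities
    labels = np.arange(len(v))                # pair i matches caption i

    def ce(lg):                               # row-wise cross-entropy
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return 0.5 * (ce(logits) + ce(logits.T))  # video->text and text->video

rng = np.random.default_rng(0)
v, t = rng.standard_normal((4, 16)), rng.standard_normal((4, 16))
loss = clip_style_loss(v, t)
print(loss > 0)                               # True: untrained embeddings incur loss
```

Masked video modeling, by contrast, is a reconstruction objective on the visual tokens alone, which is why models combining both (UMT, InternVideo2) appear in multiple rows of the table.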

3. Training and Adaptation Strategies

3.1 Post-Pretraining and Adaptation

Newer work demonstrates efficient ways to "harvest" VidFMs from image models by patch dropping (e.g., drop 90% of video tokens) and masked language modeling during short post-pretraining (Li et al., 2023). This approach dramatically speeds up training (e.g., <24h on 8 A100 GPUs for WebVid-10M scale) with minimal quality drop compared to from-scratch video pretraining.
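The patch-dropping idea is simple to state in code: keep a small random subset of the video tokens before they enter the (expensive, quadratic-cost) attention layers. A minimal sketch, with illustrative shapes and a 90% drop rate as in the text:

```python
import numpy as np

def drop_tokens(tokens, keep_ratio=0.1, seed=0):
    """Randomly keep only keep_ratio of the video tokens (keep 10% = drop 90%),
    drastically cutting attention cost during post-pretraining."""
    n = tokens.shape[0]
    k = max(1, int(n * keep_ratio))
    idx = np.sort(np.random.default_rng(seed).permutation(n)[:k])
    return tokens[idx], idx                    # surviving tokens + their positions

tokens = np.arange(200 * 768, dtype=float).reshape(200, 768)  # 200 video tokens
kept, idx = drop_tokens(tokens, keep_ratio=0.1)
print(kept.shape)                              # (20, 768)
```

Since self-attention cost scales quadratically in token count, keeping 10% of tokens cuts attention FLOPs by roughly 100x, which is what makes the sub-24-hour post-pretraining budgets cited above plausible.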

Low-rank adapters (LoRA, task-specific poolers), lightweight attention heads, and staged unfreezing enable efficient domain adaptation and specialization (Barreto et al., 23 Oct 2025, Yuan et al., 2023). Recent benchmarks (VideoEval, VideoGLUE) show that light adapters or attentive probes allow nearly finetuning-level accuracy at orders-of-magnitude lower cost (Li et al., 2024, Yuan et al., 2023).
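A minimal sketch of the LoRA idea referenced above: a frozen weight matrix is augmented with a trainable low-rank update, so adaptation touches only r * (d_in + d_out) parameters instead of d_in * d_out (dimensions, rank, and scaling here are illustrative defaults, not any specific recipe):

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update A @ B (rank r << d).
    Only A and B would receive gradients during adaptation."""
    def __init__(self, W, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                            # frozen (d_in, d_out)
        self.A = rng.standard_normal((W.shape[0], r)) * 0.01  # trainable down-projection
        self.B = np.zeros((r, W.shape[1]))                    # trainable up-projection, init 0
        self.scale = alpha / r

    def __call__(self, x):
        return x @ self.W + self.scale * (x @ self.A @ self.B)

W = np.random.default_rng(1).standard_normal((768, 768))
layer = LoRALinear(W)
x = np.ones((2, 768))
# With B initialized to zero, the adapted layer starts exactly equal to the
# frozen layer, so adaptation begins from the pretrained behavior.
print(np.allclose(layer(x), x @ W))            # True
```

The zero-initialized up-projection is the standard trick that lets adaptation start from the pretrained model's exact behavior and drift away only as needed.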

3.2 Cross-Modal and Multimodal Scaling

Models such as InternVideo2 and Universal FMs integrate video, image, text, and audio at scale by progressive training: initial video-only training, followed by multimodal contrastive/matching and finally next-token LLM objectives (Wang et al., 2024, Madan et al., 2024). Semantic temporal segmentation, caption synthesis, and LLM fusion (e.g., VideoBLIP) enable sophisticated video-centric dialogue, instruction following, and long-form temporal reasoning (Wang et al., 2024).

3.3 Parameter and Data Scaling

Scaling backbone size (e.g., 1 billion to 6 billion params), batch, and data (e.g., 100+ million video–caption pairs) is empirically linked with improved generalization and benchmark performance (Wang et al., 2024, Madan et al., 2024). Semantic masking and sophisticated shot-segmentation are used to maximize information content per sample.

4. Evaluation Protocols and Downstream Benchmarking

4.1 Task Spectrum and Metrics

VidFMs are evaluated across more than a dozen canonical tasks, commonly partitioned as:

  • Video Content Understanding: Action recognition (Top-1 accuracy on Kinetics), retrieval (R@1 on MSR-VTT, DiDeMo, VATEX), spatiotemporal and temporal action localization (mAP, IoU), open-ended event understanding.
  • Descriptive Understanding: Video question answering (VQA accuracy), captioning (CIDEr, METEOR), multi-modal sentiment analysis.
  • Video Generation: Fréchet Video Distance (FVD), CLIP similarity, Inception Score (IS) for text-to-video and inpainting/outpainting.
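Among the metrics listed above, retrieval Recall@K is simple enough to state exactly. A minimal sketch, assuming the standard evaluation convention that query video i's matching caption sits at index i (the toy similarity matrix is fabricated for illustration):

```python
import numpy as np

def recall_at_k(sim, k=1):
    """sim[i, j] = similarity of query video i to caption j; the ground-truth
    caption for query i is assumed to be at index i. Returns the fraction of
    queries whose true caption ranks in the top k."""
    ranks = np.argsort(-sim, axis=1)           # captions sorted by descending similarity
    hits = (ranks[:, :k] == np.arange(len(sim))[:, None]).any(axis=1)
    return hits.mean()

sim = np.array([[0.9, 0.1, 0.0],
                [0.2, 0.8, 0.3],
                [0.5, 0.1, 0.4]])              # query 2 ranks its own caption 2nd
r1, r2 = recall_at_k(sim, k=1), recall_at_k(sim, k=2)
print(round(r1, 3), r2)                        # 0.667 1.0
```

The other retrieval numbers reported on MSR-VTT, DiDeMo, and VATEX follow the same scheme with K = 1, 5, 10.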

Recent comprehensive evaluations (VideoGLUE, VideoEval) stress the need for fair comparison by standardizing inputs (frame count N and sampling stride), model scale, and pretraining data (Li et al., 2024, Lee et al., 2024).

4.2 Generalization and Robustness

Benchmarks demonstrate that scaling up labeled or weakly-labeled video data does not always increase generalization, and that FMs trained for action recognition may not excel on content moderation, emotion, or quality assessment (Li et al., 2024). Hybrid pretraining paradigms (masked video modeling plus contrastive learning; e.g., UMT-stage2, InternVideo2-stage2) can improve general adaptation but sometimes weaken vision-centric embeddings.

Attention is paid to evaluating both appearance and motion understanding, with frame sampling protocols and linear probing as primary tools for holistic comparison (Lee et al., 2024). For fine-grained human activity recognition under domain and viewpoint shift, image-based FMs with attention-based temporal fusion can match or outperform video-based FMs (Ponbagavathi et al., 2024).

Table 2: Linear-Probing Accuracy (%) for Representative VidFMs on Standard Action Recognition Benchmarks (ViT-L, N=16 frames, stride s=4)

| Model         | K400 | MiT  | SSv2 | DV48 | Avg. |
|---------------|------|------|------|------|------|
| TWLV-I        | 80.2 | 36.3 | 46.4 | 32.8 | 48.6 |
| InternVideo2  | 81.8 | 40.5 | 40.1 | 23.7 | 45.8 |
| UMT (stage 2) | 76.6 | 35.7 | 31.0 | 21.4 | 40.9 |
| V-JEPA        | 70.8 | 28.8 | 52.5 | 22.8 | 44.1 |
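The linear-probing protocol behind numbers like those in Table 2 can be sketched in closed form: the backbone stays frozen and only a linear classifier is fit on its embeddings. The ridge-regression variant below (the toy two-cluster "embeddings", dimensions, and regularizer are illustrative; real protocols typically fit a logistic classifier on features from the actual benchmark) captures the idea:

```python
import numpy as np

def linear_probe(train_z, train_y, test_z, num_classes, l2=1e-3):
    """Closed-form ridge-regression probe on frozen embeddings: the backbone
    is fixed, and only the linear map W from embedding to class scores is fit."""
    Y = np.eye(num_classes)[train_y]                       # one-hot targets
    d = train_z.shape[1]
    W = np.linalg.solve(train_z.T @ train_z + l2 * np.eye(d), train_z.T @ Y)
    return (test_z @ W).argmax(axis=1)                     # predicted classes

rng = np.random.default_rng(0)
# Two well-separated synthetic "embedding" clusters standing in for frozen features.
z0 = rng.normal(-1.0, 0.1, (50, 8))
z1 = rng.normal(+1.0, 0.1, (50, 8))
train_z = np.vstack([z0, z1])
train_y = np.array([0] * 50 + [1] * 50)
preds = linear_probe(train_z, train_y, train_z, num_classes=2)
print((preds == train_y).mean())                           # 1.0 on this separable toy data
```

Because only W is fit, probe accuracy isolates representation quality from adaptation capacity, which is exactly why it is used for the frozen-feature comparisons above.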

5. Insights, Limitations, and Future Directions

Recent studies show surprising performance trends:

  • Image-based VidFMs (e.g., UMT, CLIP-based adaptations): Efficient adaptation of image–text FMs outperforms many native video models on classic discriminative tasks, attributed to massive pretraining data scale and semantic alignment from language supervision (Madan et al., 2024, Li et al., 2023).
  • Video-native VidFMs: Better for motion-rich and temporal localization tasks, maintaining higher performance under frozen-weight or lightweight adaptation (Yuan et al., 2023, Lee et al., 2024).
  • Multimodal VidFMs: UFMs such as InternVideo2, when trained with staged video/vision/audio/text objectives, achieve state-of-the-art results across retrieval, long-form dialogue, and audio-visual tasks (Wang et al., 2024, Madan et al., 2024).

However, downstream task specialists continue to outperform VidFMs on certain real-world tasks; trade-offs exist between adaptation flexibility and ultimate accuracy (Yuan et al., 2023). The field faces several open challenges:

  • Robust temporal and causal reasoning, especially for long-range, multi-action, or cross-domain video sequences (Wang et al., 2022, Wang et al., 2024).
  • Efficient scaling and adaptation to novel domains with limited labels (object-centric, industrial, medical, or safety-sensitive domains) (Barreto et al., 23 Oct 2025, Li et al., 2024).
  • 3D spatial awareness and generalization to viewpoint, motion, and cross-modal 3D tasks (Huang et al., 23 Dec 2025).
  • Ethical deployment for manipulation, deepfake detection, and content moderation, requiring robust, explainable, and safe VidFMs (Madan et al., 2024).

Research trends include multimodal instruction tuning, diffusion-based and latent-feature generation models for world modeling (Boduljak et al., 12 Dec 2025), feature-based video forecasting, progression to larger-scale multimodal datasets, and generalization-oriented architectures and probe protocols.

6. Practical Considerations and Recommendations

  • Efficient Post-Pretraining: Patch-dropping and text-masking approaches allow harvesting of strong VidFMs directly from image FMs with orders-of-magnitude less compute, facilitating broader access and sustainability (Li et al., 2023).
  • Adaptation Protocols: Prefer attentive probes or adapters over end-to-end finetuning for most practical applications; this enables cost-effective benchmarking and deployment (Yuan et al., 2023, Li et al., 2024).
  • Benchmarking: Evaluate VidFMs on a wide spectrum of tasks—including those emphasizing adaptation, generalization, and non-action-centric content—to avoid overfitting research to saturated leaderboards (Li et al., 2024).
  • Hybrid Objectives: Combine masked modeling with contrastive and generative pretraining whenever possible to jointly optimize spatial, temporal, and semantic alignment (Madan et al., 2024, Wang et al., 2024).
  • 3D Probing and World Models: Model-agnostic 3D probing demonstrates that VidFM features encode substantial 3D awareness even without explicit 3D pretraining, offering new direction for low-data and robotics tasks (Huang et al., 23 Dec 2025).
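The attentive probes recommended above sit between linear probing and full finetuning: a small learned query attends over the frozen token features and pools them before classification. A minimal single-query sketch (all shapes and the random parameters are illustrative; in practice the query and projections are trained while the tokens stay frozen):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attentive_probe(tokens, q, Wk, Wv):
    """One learned query q attends over frozen token features and pools them;
    the pooled vector is then fed to a small classifier head."""
    K, V = tokens @ Wk, tokens @ Wv                   # (n, d) keys and values
    attn = softmax(q @ K.T / np.sqrt(K.shape[1]))     # (n,) attention weights
    return attn @ V                                   # pooled (d,) representation

rng = np.random.default_rng(0)
tokens = rng.standard_normal((196, 64))               # frozen VidFM token features
q = rng.standard_normal(64)                           # trainable query
Wk = rng.standard_normal((64, 64)) * 0.1              # trainable key projection
Wv = rng.standard_normal((64, 64)) * 0.1              # trainable value projection
pooled = attentive_probe(tokens, q, Wk, Wv)
print(pooled.shape)                                   # (64,)
```

Unlike mean pooling or a plain linear probe, the learned query can emphasize the tokens most relevant to the task, which is why attentive probes recover much of the finetuning gap at a fraction of the cost.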

The continued evolution of Video Foundation Models is characterized by scaling unified, multimodal, and efficiently adapted neural architectures for video understanding. By combining large-scale pretraining (masked modeling, contrastive alignment, and generative methods) with judicious adaptation and rigorous evaluation, next-generation VidFMs are expected to further bridge the gap between perception, language, audio, and temporally coherent reasoning across the dynamic visual world (Wang et al., 2024, Madan et al., 2024, Lee et al., 2024, Wang et al., 2022).
