InternVideo-Next Video Foundation Model

Updated 8 December 2025
  • InternVideo-Next is a general video foundation model that integrates semantic abstraction with detailed reconstruction using an Encoder–Predictor–Decoder framework.
  • It employs a two-stage pretraining protocol combining semantic-guided diffusion reconstruction in Stage 1 and frozen-latent prediction in Stage 2 to address limits of prior approaches.
  • The model achieves state-of-the-art results across action recognition, depth estimation, tracking, and zero-shot retrieval benchmarks using only public unlabeled videos.

InternVideo-Next is a general video foundation model trained entirely without video–text supervision, advancing video representation learning by integrating semantic abstraction with detail preservation. It addresses the limitations of prior masked video modeling (MVM) approaches and video–text pretraining by introducing architectural innovations and a two-stage pretraining protocol. InternVideo-Next leverages an Encoder–Predictor–Decoder (EPD) abstraction and a diffusion-based Stage 1 with semantic priors, followed by a frozen-latent prediction Stage 2. The model achieves state-of-the-art performance across action recognition, depth estimation, tracking, zero-shot retrieval, and multimodal evaluation benchmarks, all using only public unlabeled videos (Wang et al., 1 Dec 2025).

1. Motivation and Problem Formulation

Large-scale video–text pretraining approaches (e.g., VideoCLIP) achieve strong results, particularly in high-level semantics, but are constrained by noisy and synthetically generated captions with limited semantic breadth. These approaches often neglect implicit world knowledge, such as object motion, 3D structures, and physical cues. Conversely, self-supervised masked video modeling (MVM) directly exploits spatiotemporal information, but prior architectural choices introduce critical trade-offs:

  • Pixel-level reconstruction: MAE-style methods enforce fine detail fidelity at the cost of semantic abstraction and suffer from slow convergence.
  • Latent space prediction: JEPA-style approaches converge rapidly and foster semantic abstraction but encourage shortcut learning, yielding diminished local detail.

InternVideo-Next posits that these limitations stem from the entanglement of encoder and decoder architectures, leading to misalignment between low-level detail and semantic abstraction. By decomposing the architecture and adopting a novel two-stage pretraining, the model achieves disentangled, semantically meaningful, and detail-rich video representations that outperform existing self-supervised and video–text models on a diverse set of benchmarks.

2. Encoder–Predictor–Decoder (EPD) Framework

InternVideo-Next formalizes MVM architectures as a three-part pipeline:

  • Encoder (E): Processes visible spatiotemporal patches from the input video and outputs latent representations.
  • Predictor (P): Consumes encoder outputs for visible regions and predicts target latents for masked regions, functioning as a latent world model.
  • Decoder (D): Maps predicted latents either to pixel space (Stage 1) or latent teacher targets (Stage 2), bridging generative detail with semantic abstraction.

The flow can be summarized as:

$$\text{Input video } X \;\Rightarrow\; \text{Mask patches} \;\Rightarrow\; E(X_{\mathrm{vis}}) = \{z_i\} \;\Rightarrow\; P(\{z_i\}) = \{\hat z_j\} \;\Rightarrow\; D(\{\hat z_j\}) \rightarrow \begin{cases} \text{pixels} & \text{(Stage 1)} \\ \text{latent targets} & \text{(Stage 2)} \end{cases}$$
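
The EPD pipeline can be sketched with stand-in linear maps (a toy illustration only: the real E, P, and D are a ViT, a transformer, and a diffusion MLP; here shapes and the 80% mask ratio follow the text, everything else is a placeholder):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: N patches total, d-dim latents, 16x16 RGB patches.
N, d = 196, 64
patch_dim = 3 * 16 * 16
mask_ratio = 0.8

W_e = rng.normal(size=(patch_dim, d)) / np.sqrt(patch_dim)
W_p = rng.normal(size=(d, d)) / np.sqrt(d)
W_d = rng.normal(size=(d, patch_dim)) / np.sqrt(d)

def encoder(x_vis):
    # Stand-in for the ViT encoder E: a fixed linear map.
    return x_vis @ W_e

def predictor(z_vis, n_masked):
    # Stand-in for the predictor P: pools visible latents and emits
    # one predicted latent per masked patch.
    ctx = z_vis.mean(axis=0, keepdims=True)
    return np.repeat(ctx, n_masked, axis=0) @ W_p

def decoder(z_hat):
    # Stand-in for the decoder D: maps predicted latents to targets
    # (pixels in Stage 1, frozen teacher latents in Stage 2).
    return z_hat @ W_d

x = rng.normal(size=(N, patch_dim))       # flattened video patches
perm = rng.permutation(N)
n_vis = int(N * (1 - mask_ratio))
vis_idx, mask_idx = perm[:n_vis], perm[n_vis:]

z_vis = encoder(x[vis_idx])               # E(X_vis) = {z_i}
z_hat = predictor(z_vis, len(mask_idx))   # P({z_i}) = {ẑ_j}
out = decoder(z_hat)                      # D({ẑ_j})
print(out.shape)                          # (157, 768)
```

With 196 patches and 80% masking, 39 patches remain visible and the decoder reconstructs targets for the other 157.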

Key objective functions include:

  • Diffusion reconstruction loss (Stage 1):

$$L_{\mathrm{diff}} = \mathbb{E}_{t, \epsilon} \bigl\| \epsilon - \epsilon_\theta(x_t, t, z_{\mathrm{pred}}) \bigr\|^2$$

where $x_t$ is a noised patch, $z_{\mathrm{pred}}$ is the predicted latent, and $\epsilon_\theta$ is an MLP denoiser.

  • Semantic alignment loss (Stage 1):

$$L_{\mathrm{sem}} = -\cos\bigl(E(X_{\mathrm{vis}}),\ \mathrm{SigLIP}(X)\bigr)$$

  • Latent prediction loss (Stage 2):

$$L_{\mathrm{pred}} = \| z^\ast_t - \hat{z}_t \|^2$$

where $z^\ast_t$ is the frozen Stage 1 output.
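
On toy arrays, the three objectives reduce to a few lines (a sketch, not the actual implementation: the denoiser here is a dummy that returns zeros rather than the paper's MLP, and the noising step is a generic interpolation rather than the cosine schedule):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# L_diff: the denoiser eps_theta predicts the noise added to a patch,
# conditioned on timestep t and the predicted latent z_pred.
x0 = rng.normal(size=d)
eps = rng.normal(size=d)
t = 0.3
x_t = np.sqrt(1 - t) * x0 + np.sqrt(t) * eps   # toy noising step
z_pred = rng.normal(size=d)
eps_theta = np.zeros(d)                         # dummy denoiser output
L_diff = np.sum((eps - eps_theta) ** 2)

# L_sem: negative cosine between encoder output and a frozen teacher embedding.
z_enc, z_teacher = rng.normal(size=d), rng.normal(size=d)
L_sem = -cosine(z_enc, z_teacher)

# L_pred (Stage 2): squared error against the frozen Stage 1 latent z*.
z_star, z_hat = rng.normal(size=d), rng.normal(size=d)
L_pred = np.sum((z_star - z_hat) ** 2)
```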

3. Two-Stage Pretraining Protocol

InternVideo-Next employs a novel two-stage pretraining scheme, each stage addressing distinct representation challenges:

Stage 1: Semantic-Guided Diffusion Reconstruction

  • Objective: Build a latent space that combines low-level fidelity (geometry, texture) with semantic abstraction (objects, actions).
  • Protocols and components:
    • 80% patch masking, scheduled to prioritize semantically salient regions (top-k SigLIP attention).
    • Encoder: ViT-B/16 or similar architectures.
    • Predictor: Lightweight transformer, initialized from ModernBert-Large (last 5 layers) to increase semantic priors.
    • Decoder: Per-patch MLP, configured with 6 residual blocks and width 1536, trained with 1000 diffusion steps (cosine noise).
    • Semantic guidance via frozen SigLIP2-1B teacher.
  • Loss function:

$$L_{\mathrm{S1}} = L_{\mathrm{diff}} + \lambda_{\mathrm{sem}} L_{\mathrm{sem}}$$

with $\lambda_{\mathrm{sem}} = 1$.
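
The semantic-aware masking schedule can be sketched under one reading of "prioritize semantically salient regions (top-k SigLIP attention)": mask the top-k highest-attention patches first and fill the remaining budget randomly. The split between salient and random masking below is an assumption, as the exact schedule is not specified in this summary:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 196                          # 14x14 patches per frame
mask_ratio = 0.8
n_mask = int(N * mask_ratio)     # 156 masked patches

# Stand-in for per-patch SigLIP attention scores (higher = more salient).
attn = rng.random(N)

# Mask the top-k most salient patches first, fill the rest randomly.
k = n_mask // 2                  # assumed salient/random split
salient = np.argsort(attn)[::-1][:k]
rest = np.setdiff1d(np.arange(N), salient)
extra = rng.choice(rest, size=n_mask - k, replace=False)
mask_idx = np.concatenate([salient, extra])
vis_idx = np.setdiff1d(np.arange(N), mask_idx)
```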

Stage 2: Semantically Coherent Latent Prediction

  • Objective: Acquire world dynamics (motion, causality, 3D) by predicting latents corresponding to occluded or future patches in the Stage 1-aligned space, mitigating shortcut learning.
  • Protocols and components:

    • Both student and (frozen) teacher initialized from Stage 1.
    • Multi-block spatiotemporal masking increases prediction difficulty.
    • Loss function:

    $$L_{\mathrm{S2}} = \| z^\ast - \hat{z} \|^2$$

    • Stage 2 skips pixel reconstruction and unmasked-token alignment to focus on temporal modeling.
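
The frozen-teacher setup can be illustrated with a toy linear encoder: both copies start from the same Stage 1 weights, so the Stage 2 loss is exactly zero at initialization and grows only as the student departs from the frozen teacher:

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 49, 16

# Both student and teacher are initialized from (toy) Stage 1 weights;
# the teacher copy is frozen while the student is trained.
W_stage1 = rng.normal(size=(d, d)) / np.sqrt(d)
W_teacher = W_stage1.copy()          # frozen
W_student = W_stage1.copy()          # trainable

z_in = rng.normal(size=(N, d))
mask = rng.random(N) < 0.8           # multi-block masking stand-in

z_star = z_in[mask] @ W_teacher      # frozen targets z*
z_hat = z_in[mask] @ W_student       # student predictions ẑ

# L_S2 = ||z* - ẑ||^2, computed only on masked tokens
L_s2 = np.sum((z_star - z_hat) ** 2)
print(L_s2)                          # 0.0 at initialization (identical weights)
```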

4. Implementation and Pretraining Regime

  • Encoder (E): ViT-B/16 or ViT-L/16 (14×14 patches, 12 blocks, 8–32 frames input).
  • Predictor (P): Last 5 layers of ModernBert-Large, depth 6, hidden width 768.
  • Decoder (D): Diffusion-based MLP with 6 residual blocks, width 1536.
  • Masking: Stage 1—80% with semantic-aware prioritization; Stage 2—multi-block, ≈80%.
  • Training Corpus: "K-Mash" dataset with 1.1M public videos (K400, K600, K700, SSv2, ActivityNet, HACS, MiT; deduplicated).
  • Compute: Stage 1—50 epochs, 64×A100, batch 2048, 16 × 224² frames; Stage 2—100 epochs, same configuration, 32 × 224² frames.
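
A quick back-of-envelope on the token counts these settings imply (assuming per-frame 16×16 patchification of 224² frames; any temporal tubelet grouping, which this summary does not specify, would reduce the totals):

```python
# Token counts for the Stage 1 (16-frame) and Stage 2 (32-frame) configs.
patch = 16
res = 224
per_frame = (res // patch) ** 2      # 14 * 14 = 196 patches per frame

for frames in (16, 32):
    total = frames * per_frame       # tokens per clip before masking
    visible = int(total * 0.2)       # ~80% masking leaves ~20% visible
    print(frames, total, visible)    # 16 3136 627, then 32 6272 1254
```

Even the visible subset of a 32-frame clip is over a thousand tokens, which is why the heavy masking ratio matters for encoder throughput.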

5. Empirical Results and Benchmark Performance

InternVideo-Next sets or surpasses state-of-the-art results across a spectrum of supervised, semi-supervised, and zero-shot settings:

| Task | Prior Comparison | InternVideo-Next (Main Metrics) |
|---|---|---|
| Action Recognition (K400/SSv2/COIN) | Prev. SOTA: 86.0 / 65.9 / 90.1 | 88.4% / 73.0% / 93.6% |
| Monocular Depth (ScanNet/KITTI) | VDA Head: 9.2/92.2 & 6.7/94.6 | Improves ARel/δ₁ over baselines |
| Object Tracking (Waymo) | V-JEPA2 L: 68.9% | 72.4% mean IoU |
| Video Prediction (EK100) | V-JEPA2 L: 57.8/53.8/32.7 | 58.9 / 56.4 / 34.0 (V@5/N@5/A@5) |
| Zero-Shot Action Recognition | Prev. SOTA (K400): 70.7 | 72.1 |
| Zero-Shot Text–Video Retrieval | MSR@1: 42.1 | 43.2; competitive or improved across sets |
| Chat Probing Benchmarks | MVBench: 50.6; Dream1k: 29.8 | Outperforms InternVideo2 L |

All evaluations use frozen encoder probing unless specified.

6. Ablation Studies and Analysis

Systematic ablations demonstrate the contribution of each component:

  • Stage 1: Progressively adding semantic alignment, diffusion decoder, and text-decoder initialization each brings substantial performance gains. The full configuration achieves 75.8% (K400) and 36.9% (SSv2) with ViT-B frozen probing.
  • Predictor Depth & Initialization: Best results are obtained using the last 5 layers of ModernBert-Large as the predictor (75.8%/36.9%). Larger depths and full ViT predictors do not improve over this configuration.
  • Decoder Capacity: The diffusion MLP decoder (6 blocks, width 1536) outperforms linear and shallow MLP decoders.
  • Semantic Teacher Selection: SigLIP2-based alignment yields the best outcomes compared to DinoV2 and CLIP-ViT teachers.
  • Stage 2 Variants: Freezing the teacher at Stage 1 consistently improves temporal and dynamic modeling over alternatives (momentum, zero-init, or SigLIP2 target).
  • Masking and Frame Count: Sensorial masking and higher frame counts lead to the highest reported recognition scores: 78.1% (K400) and 59.4% (SSv2) with Stage 2.

7. Contributions and Implications

InternVideo-Next introduces architectural disentanglement via the EPD framework and demonstrates that semantically guided diffusion reconstruction, followed by frozen-latent prediction, can achieve representations excelling at both low-level detail and high-level semantics. This approach outperforms both existing masked modeling and video–text pretraining on varied tasks without any explicit video–text supervision. The findings indicate a scalable, annotation-free path forward for general video foundation models in scenarios where textual supervision is inadequate or unavailable (Wang et al., 1 Dec 2025).
