
MultiVENT 2.0: Multimodal Video Retrieval Benchmark

Updated 4 February 2026
  • MultiVENT 2.0 is a large-scale, multilingual, multimodal benchmark that advances event-centric video retrieval by integrating visual, audio, textual, and embedded text signals.
  • It comprises over 218,000 videos and 3,906 queries, simulating real-world challenges across diverse news events and varying video styles.
  • Evaluation metrics and baseline comparisons reveal significant modality gaps, underscoring the need for novel fusion strategies and modality-specific optimizations.

The MultiVENT 2.0 benchmark is a large-scale, multilingual, multimodal resource for event-centric video retrieval, specifically designed to challenge and advance the capabilities of retrieval systems that must integrate visual, auditory, textual, and embedded text signals at scale. Developed to address limitations of prior datasets—such as language paucity, lack of event focus, and heavy reliance on single modalities or metadata—MultiVENT 2.0 establishes a demanding testbed that simulates the complexity and heterogeneity of real-world news and event video collections (Kriz et al., 2024, Wan et al., 6 Jun 2025, Zhan et al., 11 Jun 2025).

1. Dataset Composition and Multimodal Structure

MultiVENT 2.0 consists of over 218,000 videos, split into roughly 108,500 for training and 109,800 for evaluation. The videos originate from two primary sources: (1) all videos from the original MultiVENT 1.0 (2,396 clips covering major news events, reserved for testing), and (2) a carefully subsampled set drawn from InternVid (a corpus of over 7M YouTube videos, capped at 40K per language), after deduplication and filtering to restrict length (≤5 minutes) and eliminate overlap. Event categories are broad, including natural disasters, elections, protests, social gatherings, science and technology, and other salient real-world occurrences.

The corpus is robustly multilingual, encompassing six languages (Arabic, Chinese, English, Korean, Russian, Spanish), with queries in the test set probing cross-lingual generalization and a small “Unknown” category for long-tail coverage. Each video is enriched with four modalities:

  • Visual: Ten (sometimes twenty-four) keyframes per video extracted via scene-change detection.
  • Audio: Complete audio waveform, ASR transcripts via Whisper.
  • Embedded text: OCR outputs from a hybrid transformer + CTC pipeline.
  • Text metadata: Titles, uploader descriptions, and machine-generated platform captions.

This multimodal richness reflects the diversity of broadcast, edited, and raw user-generated content, with styles ranging from professional, high-definition news to "True-Raw" handheld cell phone footage (Kriz et al., 2024, Zhan et al., 11 Jun 2025).

2. Query Design and Relevance Judgments

Query construction in MultiVENT 2.0 is event-focused and modality-targeted. Across 3,906 queries, flavors include:

  • Base Event: High-level event summary (e.g., “2022 Lotus Garden China Telecom Building fire”).
  • Description: Queries that rely solely on human-written YouTube metadata.
  • Speech: Queries answerable by spoken content (ASR).
  • Embedded Text: Queries requiring on-screen text (OCR).
  • Specific: Focused, sub-event queries for MultiVENT 1.0 clips (e.g., “Who responded to the disaster?”).

Queries are curated to force reliance on a diverse combination of modalities, with certain queries only answerable via subspan granularity (e.g., a visual or speech snippet).

Relevance judgments are obtained through a semi-automatic annotation pipeline. Judges label video-query pairs, and additional gold-standard assessments are added for high-recall candidates. Monotonicity rules (Base vs. Specific queries) induce "silver" labels, culminating in over 16,000 judged pairs. The protocol ensures that ≈39% of top-10 results in the test split have explicit human judgments (Kriz et al., 2024).

3. Task Formulation, Evaluation Protocols, and Metrics

The formal task is to rank all videos v \in V for each query q \in Q, maximizing the ranks of the relevant videos R(q) \subseteq V. The key metrics include:

  • Recall@K (R@K): Fraction of queries for which at least one relevant result appears in the top K.
  • Precision@K: Average fraction of relevant items among the top K results.
  • Mean Average Precision (mAP): Mean over queries of the average of precisions computed at each relevant rank position.
  • Normalized Discounted Cumulative Gain (nDCG@K): Incorporates graded relevance, with \mathrm{rel}_i \in \{0, 1, 3\} for not/somewhat/very relevant.
  • Reciprocal Rank (RR): Average inverse rank of the highest-ranked relevant item per query.
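These metrics can be sketched in a few lines of Python. The implementations below follow the standard formulations (log2 rank discounting for nDCG, relevance grades in {0, 1, 3}); they are illustrative, not the benchmark's official scoring code.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """0/1 indicator per query; R@K is the mean of this over all queries."""
    return 1.0 if any(d in relevant_ids for d in ranked_ids[:k]) else 0.0

def ndcg_at_k(ranked_ids, graded_rel, k):
    """graded_rel maps doc id -> relevance grade in {0, 1, 3}."""
    dcg = sum(graded_rel.get(d, 0) / math.log2(i + 2)
              for i, d in enumerate(ranked_ids[:k]))
    ideal = sorted(graded_rel.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def reciprocal_rank(ranked_ids, relevant_ids):
    """Inverse rank of the first relevant item; MRR averages this."""
    for i, d in enumerate(ranked_ids):
        if d in relevant_ids:
            return 1.0 / (i + 1)
    return 0.0
```

Averaging these per-query values over the query set yields the corpus-level numbers reported in the benchmark tables.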

Two test settings are defined: “Test-noDesc” (no metadata available at retrieval) and “Test-Desc” (metadata allowed), emulating practical scenarios where metadata may be incomplete or restricted.

Queries may refer to entire videos or to specific subspans, increasing the challenge and necessitating temporal and multimodal localization abilities in retrieval models (Kriz et al., 2024, Zhan et al., 11 Jun 2025, Wan et al., 6 Jun 2025).

4. Baseline Retrieval Methods and Benchmark Performance

A panel of pre-trained vision-language models (VLMs) and modality-specific pipelines is evaluated under the “Test-noDesc” regime. Notably, strong VLMs such as InternVideo2.0, VAST, VALOR, and LanguageBind suffer significant performance drops compared to benchmarks like MSR-VTT (e.g., LanguageBind achieves R@10 = 0.355 on MultiVENT 2.0 vs. R@10 > 0.80 on MSR-VTT).

Pipeline baselines leveraging mCLIP embeddings reveal non-trivial modality gaps:

| Model | Modality | R@10 | nDCG@10 |
|---|---|---|---|
| Vision (10 keyframes) | Visual only | 0.333 | 0.303 |
| OCR → mCLIP | Embedded text only | 0.227 | — |
| Speech → mCLIP | ASR only | 0.290 | — |
| Description → mCLIP | Metadata only | 0.293 | — |
| LanguageBind (all modalities) | Multimodal (average) | 0.355 | 0.324 |

Performance degrades relative to standard video retrieval datasets, demonstrating the increased difficulty of event-centric, multilingual, untrimmed video (Kriz et al., 2024, Zhan et al., 11 Jun 2025, Wan et al., 6 Jun 2025).

Breakdowns reveal:

  • Language sensitivity: Visual features are effective for English/Arabic but underperform for Korean/Chinese; OCR and speech partially recover for other languages.
  • Video style: Speech excels in professional news; vision is more stable across video types; raw clips are considerably more challenging.
  • Query type: Vision performs best for “Base Event” queries, while OCR is critical for “Embedded Text” queries.

5. Advances in Multimodal Retrieval: Synthetic Expansion and Model Design

MultiVENT 2.0++ extends the benchmark with synthetic, modality-targeted queries generated for 91K previously unannotated videos using prompted LLMs (Gemma-3 27B). Synthetic queries target audio, OCR, or metadata as their primary modality, facilitating modality-aware retriever training. The resulting set encompasses 371,644 query-video pairs (367,644 for training, 4,000 for validation, 1,504 human-annotated for test) (Wan et al., 6 Jun 2025).

Conventional fusion strategies—such as averaging similarity scores across modalities for multi-stream retrievers—are found to degrade performance due to signal noise from irrelevant modalities. Instead, approaches such as CLaMR introduce late-interaction, modality-wise scoring, and contrastive objectives that explicitly encourage the retriever to select the most informative modality for each query.

Modality-wise late interaction, defined as \mathrm{LI}_{\mathrm{mw}}(q, d) = \max_{m \in \mathcal{M}} \sum_{i=1}^{N_q} \max_{j} \langle E_q^{(i)}, E_{d,m}^{(j)} \rangle, is empirically superior to simple fusion. CLaMR achieves substantial performance gains: on MultiVENT 2.0++, nDCG@10 rises by more than 25.6 points over the best single-modality retriever and 35.4 points over the best aggregation baseline (Wan et al., 6 Jun 2025).
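The scoring rule above can be sketched as ColBERT-style MaxSim applied per modality, keeping only the best-scoring modality. The sketch below assumes pre-computed token embeddings as NumPy arrays; names and shapes are illustrative, not CLaMR's actual implementation.

```python
import numpy as np

def li_modality_wise(q_tokens, doc_streams):
    """
    q_tokens: (N_q, d) query token embeddings.
    doc_streams: dict mapping modality name -> (N_m, d) document token
    embeddings for that modality (e.g. frames, ASR tokens, OCR tokens).

    Implements LI_mw(q, d) = max_m sum_i max_j <E_q^(i), E_{d,m}^(j)>:
    each modality is scored by summing, over query tokens, the maximum
    inner product against that modality's tokens; the final score keeps
    only the single best modality rather than averaging across streams.
    """
    scores = {}
    for m, d_tokens in doc_streams.items():
        sim = q_tokens @ d_tokens.T          # (N_q, N_m) inner products
        scores[m] = sim.max(axis=1).sum()    # MaxSim per query token, summed
    return max(scores.values())
```

Taking the max over modalities (instead of an average) is what lets a noisy, irrelevant stream be ignored rather than dragging the score down.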

6. Unified Models and MAGMaR Shared Task Results

OmniEmbed (Tevatron 2.0 backbone) exemplifies a unified retrieval model architecture adapted for MultiVENT 2.0. It features:

  • Modality-specific encoders (Qwen2.5-Omni for text, ViT-style Vision-Thinker, audio convolutional-transformer).
  • Per-modality embeddings (e_i^{\mathrm{t}}, e_i^{\mathrm{v}}, e_i^{\mathrm{a}} \in \mathbb{R}^{768}) projected to a shared 512-dimensional embedding space.
  • At inference, average (or sum) fusion produces a single video-level embedding e_i^{\mathrm{vid}}.
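The fusion step above can be sketched as follows. The shared linear projection `proj` is a hypothetical stand-in for the model's learned per-modality projections, and the final unit-normalization reflects common practice for cosine-similarity retrieval rather than a detail stated in the text.

```python
import numpy as np

def fuse_video_embedding(e_t, e_v, e_a, proj):
    """
    e_t, e_v, e_a: per-modality embeddings, each (768,) as in the text.
    proj: (768, 512) projection into the shared retrieval space
          (a single shared matrix here for simplicity; the model may use
          one projection per modality).

    Projects each stream, then averages into one video-level embedding
    e^vid and unit-normalizes it so inner products act as cosine scores.
    """
    streams = [e @ proj for e in (e_t, e_v, e_a)]  # three (512,) vectors
    e_vid = np.mean(streams, axis=0)               # average fusion
    return e_vid / np.linalg.norm(e_vid)
```

Sum fusion differs from average fusion only by a constant factor, which the normalization removes, so the two are equivalent under cosine scoring.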

Finetuning uses a combination of InfoNCE contrastive loss and triplet hinge losses for intra- and cross-modal consistency. Hard negatives (triplet components) are mined using DRAMA-1B. Evaluation on 2,549 queries across 109,800 test videos demonstrates:

  • In-domain multimodal fusion (text+ASR+video+audio) achieves the highest overall nDCG@10 of 0.753, outperforming text-only variants (0.710–0.734) and prior multimodal baselines (e.g., the official LanguageBind baseline at 0.324).
  • Non-text modalities alone (video+audio) can reach parity with text streams.
  • Robustness to modality mismatch between training and test is observed (Zhan et al., 11 Jun 2025).
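The finetuning objectives described above (InfoNCE plus triplet hinge losses) can be sketched in scalar form over precomputed similarities; the temperature and margin values below are illustrative defaults, not the paper's reported hyperparameters.

```python
import math

def info_nce(sim_pos, sim_negs, tau=0.05):
    """
    InfoNCE over one positive similarity and a list of negative
    similarities, computed in a numerically stable way (max-shifted
    log-sum-exp). tau is a hypothetical temperature.
    """
    logits = [sim_pos / tau] + [s / tau for s in sim_negs]
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_denom)

def triplet_hinge(sim_pos, sim_neg, margin=0.2):
    """
    Hinge loss pushing the positive similarity above a mined hard
    negative by at least `margin` (margin value is illustrative).
    """
    return max(0.0, margin - sim_pos + sim_neg)
```

In training, `sim_neg` would come from hard negatives such as those mined with DRAMA-1B, and the two terms would be combined into a weighted sum.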

7. Analysis, Limitations, and Future Directions

Empirical analysis highlights:

  • No single modality suffices for comprehensive retrieval; fusion is essential.
  • Zero-shot transfer of existing VLMs underperforms dramatically, indicating the limits of single-modality or naively aggregated models.
  • Cross-modal corroboration is necessary, but overly strict modality-targeted training can reduce generalization—balanced, context-aware objectives yield best results.
  • Substantial remaining gaps exist for raw, noisy, low-resource, and cross-lingual scenarios.

Future work advocated in the literature includes:

  • Enhanced temporal modeling: capturing event progression and alignment across frames.
  • Finer-grained audio-text alignment (e.g., phoneme embeddings, forced alignment).
  • Improved hard negative generation, incorporating multimodal cues.
  • Extension to larger corpora and underrepresented languages.
  • Methods for partial and subclip-level relevance, addressing the challenge of queries anchored to specific moments within longer videos (Kriz et al., 2024, Zhan et al., 11 Jun 2025, Wan et al., 6 Jun 2025).

MultiVENT 2.0 thus establishes a challenging, richly annotated benchmark that catalyzes progress toward retrieval systems capable of robust, multimodal, and multilingual event understanding in real-world media environments.
