
WebVid-CoVR: Composed Video Retrieval Dataset

Updated 30 January 2026
  • WebVid-CoVR is a large-scale benchmark dataset for composed video retrieval, structured as triplets with a reference video, modification text, and a target video.
  • It leverages millions of video-caption pairs enhanced with LLM-generated and densely annotated modification texts to enable fine-grained, context-aware retrieval.
  • Experiments demonstrate that models trained on WebVid-CoVR outperform previous methods in both video and zero-shot image retrieval tasks, driving progress in compositional reasoning.

WebVid-CoVR is a large-scale benchmark dataset developed specifically for the composed video retrieval (CoVR) task, a paradigm in computer vision in which a system must retrieve a target video given a reference video and an associated natural-language modification text detailing the required compositional changes. The WebVid-CoVR dataset, derived from the WebVid10M and WebVid2M collections, enables quantitative evaluation and training of retrieval models that integrate visual semantics with human-like modification instructions. Extensions such as Dense-WebVid-CoVR further augment this resource with fine-grained, richly annotated modification texts, enabling research into models with enhanced contextual fidelity and compositional reasoning (Thawakar et al., 2024, Thawakar et al., 19 Aug 2025, Ventura et al., 2023).

1. Formal Structure and Definition

Each sample in WebVid-CoVR is represented as a triplet $S = (v_{\mathrm{base}}, t_{\mathrm{mod}}, v_{\mathrm{target}}) \in \mathcal{V} \times \mathcal{T} \times \mathcal{V}$, where:

  • $v_{\mathrm{base}}$: Reference/query video serving as the retrieval anchor,
  • $t_{\mathrm{mod}}$: Free-form natural-language text describing the required transformation from the reference to the target video,
  • $v_{\mathrm{target}}$: Target video expected to reflect the changes described in $t_{\mathrm{mod}}$.

This structure directly reflects the core CoVR task: mapping an input pair (video, modification text) to a corresponding output video exhibiting the compositional changes specified.
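The triplet structure can be sketched as a simple container; the field names and video paths below are hypothetical illustrations, not the dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CoVRTriplet:
    """One WebVid-CoVR sample: (reference video, modification text, target video)."""
    v_base: str    # path or ID of the reference/query video (hypothetical)
    t_mod: str     # free-form natural-language modification instruction
    v_target: str  # path or ID of the target video (hypothetical)

sample = CoVRTriplet(
    v_base="videos/000123.mp4",
    t_mod="Change the white tulip to yellow.",
    v_target="videos/004567.mp4",
)
```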

2. Dataset Construction, Annotation Protocols, and Splits

The collection of WebVid-CoVR leverages large-scale mining and LLM-driven text generation, further extended in Dense-WebVid-CoVR for fine-grained annotation:

  • Base Corpus: WebVid2M (Ventura et al., 2023) and WebVid10M supply millions of video-caption pairs.
  • Sampling: Caption pairs differing by a single token are identified via CLIP embeddings and FAISS nearest-neighbor search; video pairs are selected based on CLIP-visual similarity of the middle frames.
  • Modification Text Generation: LLMs (LLaMA-7B, GPT-3/3.5, MiniGPT-4, Gemini-Pro, GPT-4o) are prompted to produce human-like modification instructions.
  • Post-processing: Outputs are filtered for grammaticality, semantic correctness, and contextual consistency. In manual evaluation and test sets, human annotators vet triplet validity and select best modification texts.
  • Scale and Splits: Standard WebVid-CoVR contains ~1.65M triplets from 131K videos (average duration ≈ 16.8 s), with ∼467K unique modification texts (mean length 4.8 words) (Ventura et al., 2023, Thawakar et al., 2024). Dense-WebVid-CoVR extends this with 1.6M triplets annotated for dense modification texts (mean length ≈31.2 words) and video descriptions averaging 81.3 words (Thawakar et al., 19 Aug 2025). Official validation and test splits comprise 7K and ~3K manually verified triplets, respectively.
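The caption-pairing step above can be illustrated with a minimal sketch: given caption embeddings, find each caption's nearest neighbor by cosine similarity. Random vectors stand in for CLIP embeddings here, and brute-force NumPy search stands in for FAISS; this is an assumption-laden toy, not the papers' pipeline code.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 64))                    # stand-in for CLIP caption embeddings
emb /= np.linalg.norm(emb, axis=1, keepdims=True)   # L2-normalize rows

sim = emb @ emb.T                                   # cosine similarity matrix
np.fill_diagonal(sim, -np.inf)                      # exclude trivial self-matches
nearest = sim.argmax(axis=1)                        # nearest caption for each caption

# In the real pipeline, candidate pairs whose captions differ by a single
# token are kept, and their videos are further screened by CLIP-visual
# similarity of the middle frames before an LLM writes the modification text.
```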

3. Linguistic Content and Modification Text Diversity

Modification texts within WebVid-CoVR are characterized by the following:

  • Format: Un-templated, concise, human-like English (“Make the sky orange instead of blue.”, “Change the white tulip to yellow.”).
  • Vocabulary: ~15K unique tokens in training, after BPE tokenization, lowercasing, and non-ASCII filtration.
  • Categories (synthetic train set):
    • Color changes: ~35%
    • Object addition/removal: ~25%
    • Action/scene changes: ~20%
    • Others (e.g. lighting, orientation): ~20%
  • Dense Annotation: Dense-WebVid-CoVR modification texts are roughly 7× longer than those in WebVid-CoVR, emphasize compositional, contextual, and temporally precise information, and are validated with a seven-step annotation protocol including visual checks and cosine-similarity screening (Thawakar et al., 19 Aug 2025).

The variety and linguistic richness of modification text enable learning of discriminative multimodal embeddings tuned to nuanced user queries.

4. Multimodal Integration and Retrieval Objectives

WebVid-CoVR supports joint modeling of visual and textual cues:

  • Modeling: Query (video, text) pairs are encoded via BLIP-2 with cross-attention. Detailed video descriptions, generated via MiniGPT-4 or Gemini-Pro, are concatenated or fused with modification text for grounding.
  • Video Representation: Each video is represented by 15 uniformly sampled frames; frame features are weighted and aggregated for robust retrieval (Thawakar et al., 2024, Ventura et al., 2023).
  • Retrieval Criterion: Target video maximizes similarity to the joint (query, modification) embedding under the scoring function:

$v^* = \underset{v \in \mathcal{V}}{\arg\max}\; \mathrm{sim}(f(v_{\mathrm{base}}, t_{\mathrm{mod}}), g(v))$

  • Training Objectives: Hard-negative InfoNCE contrastive loss (HN-NCE), optionally symmetric across query/target, and triplet ranking loss formulations are employed.
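The retrieval criterion and contrastive objective can be sketched with toy embeddings. Here `retrieve` implements the argmax scoring rule above, and `info_nce` is a plain InfoNCE loss; the encoders `f` and `g` (BLIP-2-based in the papers) are replaced by random vectors, and the hard-negative weighting of HN-NCE is omitted for brevity.

```python
import numpy as np

def info_nce(query_emb, target_emb, temperature=0.07):
    """Plain InfoNCE over a batch; matching (query, target) pairs lie on the diagonal."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    t = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    logits = (q @ t.T) / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def retrieve(query_emb, gallery_emb):
    """v* = argmax over gallery of cosine sim(f(v_base, t_mod), g(v))."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    return int((g @ q).argmax())

rng = np.random.default_rng(1)
queries = rng.normal(size=(8, 32))   # stand-in for fused (video, text) query embeddings
targets = rng.normal(size=(8, 32))   # stand-in for target-video embeddings
loss = info_nce(queries, targets)
best = retrieve(queries[0], targets)
```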

5. Benchmark Metrics and Ablation Results

Performance is reported using recall@K (R@1, R@5, ...), the fraction of queries where the target appears within the top-K retrieved videos.
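Recall@K can be computed directly from a query-gallery similarity matrix; a minimal sketch with made-up numbers:

```python
import numpy as np

def recall_at_k(sim, target_idx, k):
    """Fraction of queries whose target appears among the top-k ranked gallery items.

    sim: (num_queries, gallery_size) similarity matrix
    target_idx: index of the correct gallery item for each query
    """
    topk = np.argsort(-sim, axis=1)[:, :k]
    hits = (topk == np.asarray(target_idx)[:, None]).any(axis=1)
    return float(hits.mean())

sim = np.array([[0.9, 0.1, 0.3],
                [0.2, 0.8, 0.5],
                [0.4, 0.7, 0.1]])
targets = [0, 1, 2]
r1 = recall_at_k(sim, targets, k=1)  # 2/3: queries 0 and 1 rank their target first
```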

| Model/Setting                        | R@1          | R@5  | R@10 | R@50 |
|--------------------------------------|--------------|------|------|------|
| Zero-shot CLIP/BLIP                  | ~44–45 (avg) | –    | –    | –    |
| CoVR-BLIP (HN-NCE, CA, N=15)         | 53.1         | 79.9 | 86.9 | 97.7 |
| Dense-WebVid-CoVR (Visual+Text+CA)   | 71.3         | 89.1 | 94.6 | 98.9 |

Models trained on WebVid-CoVR yield strong transfer to zero-shot composed image retrieval (CoIR), outperforming prior approaches on CIRR (R@1 from 35.4 to 38.5) and FashionIQ (R@10 from 24.7 to 27.7) (Ventura et al., 2023, Thawakar et al., 2024).

Ablation studies show that:

  • Data scale increases recall (scaling seed videos from 500K to 2.5M raises CIRR R@1 by 6 points).
  • Transitioning from rule-based to LLM-generated modification texts improves retrieval performance by 8–11 points.
  • Dense modification texts, as in Dense-WebVid-CoVR, further enhance fine-grained retrieval (+2.4 points R@1).

6. Applications, Limitations, and Future Prospects

WebVid-CoVR and its dense variant support research into vision-language retrieval, context enrichment, and compositional modeling, with dual applicability to video (CoVR) and image (CoIR) retrieval tasks (Thawakar et al., 2024). Models trained on these datasets generalize from video to image retrieval benchmarks in zero-shot fashion, suggesting broad utility.

Key limitations include residual noise in automatically generated modification texts (2–3% for Dense-WebVid-CoVR), incomplete noise-free annotation at scale, and the absence of formal action class labeling. Prospective directions include support for multilingual modification text, scaling to longer clips via hierarchical indexing, and user-interactive feedback mechanisms (Thawakar et al., 19 Aug 2025).

A plausible implication is that unified fusion strategies (joint video+description followed by cross-attention with modification text) yield more coherent mappings and enhanced retrieval, as compared to pairwise or late fusion schemes. Rich, descriptive modification texts are critical for fine-grained compositional video search.

7. Context and Benchmark Role

WebVid-CoVR represents the largest natural-domain composed retrieval dataset for video (Ventura et al., 2023). By enabling automatic triplet construction at scale and supporting rigorous manual evaluation protocols, it provides the foundational benchmark for recent advances in context-aware, discriminative video retrieval models. With the introduction of Dense-WebVid-CoVR establishing a new state of the art in Recall@1, the benchmark is poised to catalyze further development of models tackling complex spatio-temporal and semantic compositionality in large video databases (Thawakar et al., 19 Aug 2025).

References: (Thawakar et al., 2024, Ventura et al., 2023, Thawakar et al., 19 Aug 2025)
