
Composed Video Retrieval (CoVR)

Updated 30 January 2026
  • CoVR is a specialized vision-language retrieval task combining reference videos with modification texts to locate matching target videos.
  • It leverages methods like frozen encoders, multi-modal fusion, and contrastive learning to achieve precise spatial and temporal alignment.
  • Benchmark datasets such as WebVid-CoVR and Dense-WebVid-CoVR demonstrate significant gains with advanced pooling and cross-attention strategies.

Composed Video Retrieval (CoVR) is a specialized vision-language retrieval task that seeks to identify, within a large-scale video database, the target video whose content best matches a multi-modal composition: the visual content of a reference query video and a natural language modification describing the intended change. CoVR has evolved rapidly in both methodology and benchmarking, and now encompasses challenges spanning fine-grained spatial alignment, temporal reasoning, dense captioning, modality fusion, and practical scalability.

1. Formal Task Definition and Problem Setup

CoVR is formally specified as follows: given a reference video q and a modification text t, the system searches a gallery V = {v_i} for a target video v* that conforms both to the content of q and to the change specified by t. The canonical objective is to learn a pair of encoders, a composed query encoder f and a video encoder g, such that

f(q, t) \approx g(v^*)

when v^* is the desired video. Retrieval is performed by scoring candidates:

v^* = \arg\max_{v \in V} \mathrm{sim}(f(q, t), g(v))

where sim(·, ·) is typically cosine similarity.

Variants extend the query to include a detailed language description d of q, yielding the enriched multi-modal composition f(q, d, t) (Thawakar et al., 2024). Datasets used include WebVid-CoVR, where each triplet (q, t, v*) captures a distinct composition (Ventura et al., 2023), and Dense-WebVid-CoVR, which employs human-verified, lengthy modification texts and descriptions for fine-grained semantic modeling (Thawakar et al., 19 Aug 2025).
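Under these definitions, retrieval reduces to a nearest-neighbor search over precomputed gallery embeddings. The sketch below shows only the scoring step with NumPy, using random illustrative vectors in place of real encoder outputs:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between query vector a and each row of gallery matrix b."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return b @ a

def retrieve(query_emb, gallery_embs, k=5):
    """Return indices of the top-k gallery videos, i.e. argmax of sim(f(q,t), g(v))."""
    scores = cosine_sim(query_emb, gallery_embs)
    return np.argsort(-scores)[:k]

# Toy example: 4 gallery videos with 3-dim embeddings; the composed query
# is placed near gallery video 2, so it should rank first.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(4, 3))
query = gallery[2] + 0.01 * rng.normal(size=3)
print(retrieve(query, gallery, k=2))
```

In practice the gallery embeddings g(v) are indexed once offline, and only f(q, t) is computed per query.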

2. Model Architectures and Fusion Strategies

Several architectural principles underlie CoVR frameworks:

  • Frozen Visual and Text Encoders: Common backbones are ViT-L (visual) and BERT-base (text). Frames are sampled from each video (typically 15 for WebVid-CoVR).
  • Multi-modal Fusion Modules:
    • Pairwise Fusion: Earlier work fuses visual embeddings g(q) and text t via cross-attention (Ventura et al., 2023).
    • Three-way Fusion: Recent frameworks sum embeddings from f(q, t), f(q, d), and f(e(d), t), with e the text encoder (Thawakar et al., 2024).
    • Unified Cross-Attention Encoder: A single transformer block fuses (q, d, t) via cross-attention, outperforming pairwise fusion strategies (Thawakar et al., 19 Aug 2025).
    • Hierarchical Alignment: Holistic and atomistic components capture global and fine-grained cross-modal interactions; Q-Former and uncertainty modeling resolve pronoun references and small-object ambiguities (Chen et al., 2 Dec 2025).
    • LoRA-Augmented MLLM: Shared multimodal LLM backbone with low-rank adaptation supports corpus-level, moment-level, and composed queries in a unified space (Halbe et al., 17 Jan 2026).
    • Multi-stage Cross-Attention: X-Aligner progressively fuses caption, visual, and text editing signals, maintaining pretrained VLM representations (Zheng et al., 23 Jan 2026).
    • PREGEN Pooling: Extraction and pooling of hidden states across all VLM layers enables compact compositional embedding, surpassing previous state-of-the-art (Serussi et al., 20 Jan 2026).
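To make the fusion idea concrete, the sketch below implements a single-head cross-attention layer in NumPy, where modification-text tokens attend over concatenated video-frame and description tokens. All shapes, token counts, and weights here are invented for illustration and are not taken from any of the cited systems:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, Wq, Wk, Wv):
    """Single-head cross-attention: query tokens attend over context tokens."""
    Q, K, V = queries @ Wq, context @ Wk, context @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d = 8                                   # toy embedding width
video_tok = rng.normal(size=(15, d))    # e.g. 15 sampled frame embeddings
desc_tok = rng.normal(size=(6, d))      # description (d) token embeddings
mod_tok = rng.normal(size=(4, d))       # modification-text (t) token embeddings

# Unified fusion: t attends over the joint (video, description) context.
context = np.concatenate([video_tok, desc_tok], axis=0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
fused = cross_attention(mod_tok, context, Wq, Wk, Wv)
composed = fused.mean(axis=0)           # pooled composed-query embedding
print(composed.shape)
```

A real unified encoder stacks several such blocks with learned projections, feed-forward layers, and normalization; the point here is only the routing of (q, d, t) tokens through one shared attention module.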

3. Training Objectives, Embedding Alignment, and Loss Functions

Most CoVR systems employ contrastive learning over triplets.

  • Hard-Negative InfoNCE Loss: Batches are used to generate contrastive pairs, with scores S_{i,j} computed for each query-target embedding pair. The loss is typically:

\mathcal{L}_{\mathrm{HN}} = -\sum_{i \in \mathcal{B}} \log\left(\frac{\exp(S_{i,i}/\tau)}{\exp(S_{i,i}/\tau) + \sum_{j \neq i} w_{i,j} \exp(S_{i,j}/\tau)}\right)

where w_{i,j} denotes the hard-negative weights and \tau is the temperature (Thawakar et al., 2024).

  • Multi-target Contrastive Loss: Embeddings are aligned to three databases (vision-only, text-only, and vision-text) for enhanced discrimination; learned loss weights (≈0.83, 0.08, 0.07) balance the three alignment terms (Thawakar et al., 2024).
  • Generalized Contrastive Learning (GCL): Unified loss formulation across all modality pairs within a batch; image, text, and fused modalities are simultaneously optimized, reducing modality gaps (Lee et al., 30 Sep 2025).
  • Hierarchical Alignment and Regularization: Holistic-to-atomistic similarity distributions are regularized by KL divergence to ensure semantic coherence (Chen et al., 2 Dec 2025).
  • PREGEN Pooling: Layerwise hidden states from frozen VLM, aggregated using a lightweight transformer encoder and MLP, form highly semantic representations for retrieval (Serussi et al., 20 Jan 2026).
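The hard-negative InfoNCE loss above can be written out term by term. The NumPy sketch below mirrors the formula directly; the similarity matrix and the uniform negative weights are synthetic placeholders, not values from any cited system:

```python
import numpy as np

def hard_negative_infonce(S, w, tau=0.07):
    """L_HN = -sum_i log( exp(S_ii/tau) /
                          (exp(S_ii/tau) + sum_{j != i} w_ij * exp(S_ij/tau)) )."""
    B = S.shape[0]
    loss = 0.0
    for i in range(B):
        pos = np.exp(S[i, i] / tau)
        neg = sum(w[i, j] * np.exp(S[i, j] / tau) for j in range(B) if j != i)
        loss -= np.log(pos / (pos + neg))
    return loss

# Synthetic batch: targets are noisy copies of their queries, so the
# diagonal of S (positive pairs) dominates.
rng = np.random.default_rng(0)
B = 4
q = rng.normal(size=(B, 16))
v = q + 0.1 * rng.normal(size=(B, 16))
q /= np.linalg.norm(q, axis=1, keepdims=True)
v /= np.linalg.norm(v, axis=1, keepdims=True)
S = q @ v.T                   # cosine similarity matrix
w = np.ones((B, B))           # uniform negative weights as a placeholder
print(hard_negative_infonce(S, w))
```

In practice w_{i,j} is larger for semantically close (hard) negatives, which sharpens the gradient signal compared with uniform weighting.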

4. Benchmark Datasets and Evaluation Protocols

Key CoVR datasets and their protocols include:

Dataset           | # Triplets | Description Type         | Modification Length | Target / Evaluation
WebVid-CoVR       | 1.6M       | Short captions           | 4.8 words (avg.)    | Manual curation; R@K
Dense-WebVid-CoVR | 1.6M       | Dense descriptions (~81 words) | ~31 words     | Human-verified; R@K
EgoCVR            | 2,295      | Egocentric action videos | —                   | ~1.2 GT/query; temporal; Recall@K (global/local)
TF-CoVR           | 180K       | Gymnastics/diving        | 2–19 words          | Multi-GT; mAP@K

Recall@K is the primary metric for rank-based evaluation. TF-CoVR employs mean Average Precision at cutoff K (mAP@50), favoring robust multi-target retrieval (Gupta et al., 5 Jun 2025).
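The two protocols can be sketched as follows. Note that the normalization of AP@K varies across papers; dividing by min(#ground truths, K) is one common convention and may differ in detail from the TF-CoVR implementation:

```python
def recall_at_k(ranked_ids, gt_id, k):
    """1 if the single ground-truth target appears in the top-k, else 0."""
    return int(gt_id in ranked_ids[:k])

def average_precision_at_k(ranked_ids, gt_ids, k):
    """AP@K for multi-target retrieval; mAP@K averages this over queries."""
    hits, score = 0, 0.0
    for rank, vid in enumerate(ranked_ids[:k], start=1):
        if vid in gt_ids:
            hits += 1
            score += hits / rank      # precision at each hit position
    return score / min(len(gt_ids), k) if gt_ids else 0.0

ranked = [7, 3, 9, 1, 4]              # toy ranked list of video ids
print(recall_at_k(ranked, gt_id=9, k=3))          # 1: target at rank 3 <= K
print(average_precision_at_k(ranked, {3, 4}, 5))  # (1/2 + 2/5) / 2 = 0.45
```

Recall@K only asks whether one target appears in the top K, whereas AP@K also rewards ranking all targets early, which is why TF-CoVR's multi-ground-truth setting favors it.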

5. Empirical Results and Ablation Insights

Recent advances yield significant improvements in R@1 across benchmarks:

Model / Approach                      | Dataset           | R@1 (%)    | Reference
Baseline CoVR-BLIP                    | WebVid-CoVR       | 53.13      | (Ventura et al., 2023)
Enriched Context + Multi-target       | WebVid-CoVR       | 60.12      | (Thawakar et al., 2024)
Dense Description + Unified CA Fusion | Dense-WebVid-CoVR | 71.26      | (Thawakar et al., 19 Aug 2025)
X-Aligner (BLIP-2 variant)            | WebVid-CoVR-Test  | 63.93      | (Zheng et al., 23 Jan 2026)
HUD (Holistic/Atomistic)              | WebVid-CoVR       | 63.38      | (Chen et al., 2 Dec 2025)
PREGEN (Qwen2.5-VL 7B)                | WebVid-CoVR       | 99.73      | (Serussi et al., 20 Jan 2026)
VIRTUE-Embed 7B                       | WebVid-CoVR       | 55.49 (ZS) | (Halbe et al., 17 Jan 2026)
GCL (VISTA backbone)                  | CoVR benchmark    | 37.52 (ZS) | (Lee et al., 30 Sep 2025)

Ablation studies show that combining visual, text, and description signals raises recall from ~27–41% (single modality) to above 60%, and dense, human-generated descriptions yield further gains. Unified cross-attention delivers +3.4% over pairwise fusion (Thawakar et al., 19 Aug 2025). PREGEN's layerwise pooling over all VLM layers achieves a recall nearly at the theoretical maximum for curated data (Serussi et al., 20 Jan 2026).

6. Challenges, Extensions, and Open Problems

Multiple axes of complexity drive current research:

  • Fine-grained Compositionality: Subtle actions, spatial region selection, temporal order, and pronoun reference require hierarchical fusion, uncertainty modeling, and cross-modal interaction modules (Chen et al., 2 Dec 2025).
  • Dense Captioning and Modification: Longer, multi-sentence modifications and detailed video descriptions are necessary for fine semantic control (Thawakar et al., 19 Aug 2025).
  • Temporal Reasoning: Benchmarks such as EgoCVR and TF-CoVR emphasize retrieving segments based on subtle action, duration, and event changes. Motion-sensitive video encoders and action-class pretraining are critical (Gupta et al., 5 Jun 2025, Hummel et al., 2024).
  • Zero-shot and Cross-domain Generalization: Transfer to image retrieval (CoIR), multi-modal retrieval, and textual-only or frame-only queries demonstrates flexibility. Strategies include synthetic triplet generation, pseudo-labeling, and knowledge integration (Zhang et al., 3 Mar 2025).
  • Scalability and Training Efficiency: Frameworks using frozen LLM backbones, LoRA, and lightweight adapters substantially reduce training cost without sacrificing accuracy (Halbe et al., 17 Jan 2026, Serussi et al., 20 Jan 2026).
  • Annotation and Data Quality: Auto-generated triplets are noisy (~22% discarded in WebVid-CoVR), requiring filtering and high-quality captioning tools. Dense-WebVid-CoVR addresses this via human verification (Thawakar et al., 19 Aug 2025).

7. Outlook and Future Directions

Current trends point to several research frontiers:

  • End-to-end video LLMs for compositional editing and retrieval, capturing interaction between audio, text, and vision (Zheng et al., 23 Jan 2026);
  • Interactive CoVR systems capable of multi-turn refinement and iterative query enhancement;
  • Joint retrieval and moment localization, grounding composed queries in both retrieval ranking and exact segment boundaries;
  • Expansion of fine-grained CoVR beyond appearance-centric or egocentric domains into high-motion, multi-agent domains;
  • Unified representations spanning images, videos, and text using generalized contrastive objectives and cross-modal learning (Lee et al., 30 Sep 2025).

Recent empirical findings confirm state-of-the-art performance from PREGEN (Serussi et al., 20 Jan 2026), extensive gain from enriched context and discriminative alignment (Thawakar et al., 2024), robust temporal handling via TF-CoVR-Base (Gupta et al., 5 Jun 2025), and new scaling pathways from LoRA-based models (Halbe et al., 17 Jan 2026). CoVR continues to serve as the core methodological bridge uniting structured compositional search with scalable multimedia understanding in modern video retrieval systems.
