Video-ColBERT: Fine-Grained Text-to-Video Retrieval
- The paper presents Video-ColBERT, a bi-encoder model employing dual-level token matching and dual sigmoid loss to boost text-to-video retrieval performance.
- It integrates query and visual expansion tokens with a MeanMaxSim operator to perform efficient, context-aware similarity computations at both frame and video levels.
- Experimental results show significant improvements in Recall@1 and indexing efficiency over traditional methods on standard video retrieval benchmarks.
Video-ColBERT is a bi-encoder, late-interaction framework for the text-to-video retrieval (T2VR) task, which adapts the ColBERT model’s fine-grained token-level matching paradigm from the text and image retrieval domains to the domain of video. Video-ColBERT employs independent encoding for the query and video, followed by efficient and contextualized token-wise similarity computation using a MeanMaxSim (MMS) operator at both spatial (frame-level) and spatio-temporal (video-level) resolutions. The model integrates learned soft query and visual expansion tokens and utilizes a dual sigmoid-based training objective, enabling significantly improved retrieval accuracy and storage/indexing efficiency compared to traditional bi-encoder and cross-encoder methods (Reddy et al., 24 Mar 2025).
1. Architecture and Core Mechanisms
Video-ColBERT’s architecture comprises three principal components:
- Fine-grained spatial and temporal token-wise interaction, facilitated by the MMS operator at both frame and contextualized-video levels.
- Query and visual expansions: query-side augmentation with soft pad tokens and video-side expansion tokens fed into the temporal transformer.
- Dual sigmoid-based loss applied to both frame-level and video-level representations for complementary supervision.
Text queries are tokenized into tokens q_1, …, q_M, which are encoded using a text transformer (e.g., CLIP’s text tower) to yield query embeddings Q = {q_1, …, q_M}. Videos are sampled into T frames, each processed by an image encoder (e.g., CLIP’s ViT), producing frame [CLS] embeddings F = {f_1, …, f_T}. These frame embeddings are passed through a lightweight, 4-layer temporal transformer to obtain temporally contextualized video embeddings V = {v_1, …, v_T}.
The core similarity computation leverages two MMS scores:

MMS_F(Q, F) = (1/M) Σ_j max_i sim(q_j, f_i)
MMS_V(Q, V) = (1/M) Σ_j max_i sim(q_j, v_i)

with sim(·,·) denoting the dot product between normalized embeddings. The final similarity is S = MMS_F + MMS_V. This dual-level interaction allows the model to localize static cues via MMS_F and dynamic, cross-frame context via MMS_V.
2. Query and Visual Expansions
Query expansion in Video-ColBERT follows the ColBERT practice of padding each query with additional tokens (e.g., to length 32 for CLIP; 64 for SigLIP). These pad tokens are included in transformer self-attention, allowing the model to learn "soft search terms" that augment the original query semantics. At inference, pad token embeddings contribute to the MMS computation, providing adaptive query enrichment.
On the video side, learnable visual expansion tokens are prepended to the input of the temporal transformer. After temporal encoding, the output embeddings for these expansion tokens are included in the set used in , functioning as dynamic, localized prototypical representations.
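As a framework-free sketch of the expansion mechanics (toy 2-D embeddings and illustrative names only; the actual model uses learned, contextualized vectors and a temporal transformer):

```python
QUERY_LEN = 8      # padded query length (32 for CLIP, 64 for SigLIP in the paper)
NUM_VIS_EXP = 2    # number of learnable visual expansion tokens

def pad_query(query_tokens, pad_embedding):
    """Pad the query with learned 'soft search term' embeddings up to QUERY_LEN."""
    padded = [list(t) for t in query_tokens]
    while len(padded) < QUERY_LEN:
        padded.append(list(pad_embedding))
    return padded

def expand_video(frame_tokens, expansion_tokens):
    """Prepend learnable expansion tokens to the frame sequence before the
    temporal transformer; their outputs later join the video-level MMS pool."""
    return [list(t) for t in expansion_tokens] + [list(f) for f in frame_tokens]

# Two real query tokens padded to a fixed length; four frames plus two expansion tokens.
query = pad_query([[1.0, 0.0], [0.5, 0.5]], pad_embedding=[0.1, 0.1])
video_in = expand_video([[0.0, 1.0]] * 4, expansion_tokens=[[0.2, 0.2], [0.3, 0.3]])
```

Both the pad-token and expansion-token embeddings participate in self-attention during encoding, so they pick up context rather than acting as inert placeholders.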
Illustrative pseudocode:
```
V_emb   = TemporalTransformer(F_emb)                   # contextualized video tokens
score_F = (1/M) * sum_j max_i (Q_emb[j] · F_emb[i])    # frame-level MeanMaxSim
score_V = (1/M) * sum_j max_i (Q_emb[j] · V_emb[i])    # video-level MeanMaxSim
return score_F + score_V
```
3. Dual Sigmoid Loss and Training Dynamics
Video-ColBERT uses a dual sigmoid-based pairwise classification loss (in the style of SigLIP), as opposed to bi-directional InfoNCE or softmax-based alternatives. For a batch of size B, a loss of the form

L = -(1/B) Σ_i Σ_j log σ(z_ij · (t · S_ij + b))

is applied separately to the frame-level scores (S_ij = MMS_F) and the video-level scores (S_ij = MMS_V), with z_ij = 1 for matching pairs and z_ij = -1 for non-matching pairs; t and b are a learned logit scale and bias, respectively. This formulation encourages the frame- and video-level branches to be separately discriminative and stable, while circumventing the need for global normalization or large batches of hard negatives.
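A minimal pure-Python sketch of one such sigmoid loss term (applied identically to the frame-level and video-level similarity matrices); the toy similarity values, scale t, and bias b below are illustrative, not the paper's trained values:

```python
import math

def sigmoid_pairwise_loss(sim, t=10.0, b=-5.0):
    """SigLIP-style pairwise loss: every (i, j) text-video pair is an
    independent binary classification, positive on the diagonal (z = +1),
    negative off it (z = -1). t and b play the role of the learned
    logit scale and bias."""
    B = len(sim)
    total = 0.0
    for i in range(B):
        for j in range(B):
            z = 1.0 if i == j else -1.0
            logit = z * (t * sim[i][j] + b)
            total += math.log(1.0 / (1.0 + math.exp(-logit)))  # log sigmoid
    return -total / B

# Toy 2x2 similarity matrix: diagonal entries are the matching pairs.
sim = [[0.9, 0.1],
       [0.2, 0.8]]
loss = sigmoid_pairwise_loss(sim)
```

Because each pair contributes an independent logistic term, no batch-wide softmax normalization is needed, which is what makes the objective robust to small batch sizes.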
Ablations indicate that the dual sigmoid loss yields the best retrieval accuracy (Recall@1), outperforming both InfoNCE and combined sigmoid approaches.
4. Experimental Results and Comparative Performance
Video-ColBERT has been evaluated on standard sentence-to-video and paragraph-to-video retrieval benchmarks, using variants of the CLIP (ViT-B/32, ViT-B/16) and SigLIP backbones, with consistently strong results.
| Model | MSR-VTT R@1 | MSVD R@1 | DiDeMo R@1 | VATEX R@1 | ActivityNet R@1 |
|---|---|---|---|---|---|
| CLIP4Clip-meanP (baseline) | 43.1 | 46.2 | — | — | — |
| DRL (prev. best) | 47.4 | 48.3 | — | — | — |
| Video-ColBERT-CLIP-B/32 | 48.1 | 46.0 | 48.2 | — | — |
| Video-ColBERT-CLIP-B/16 | 51.0 | — | 51.9 | — | — |
| Video-ColBERT-SigLIP-B/16 | 51.5 | — | — | — | — |
Additional metrics (Recall@5, Recall@10, nDCG@10) and results on VATEX and ActivityNet further indicate consistent gains. On MSR-VTT, Video-ColBERT achieves 51.0 Recall@1 with CLIP-B/16 and 51.5 Recall@1 with SigLIP-B/16.
Ablation studies reveal that combining the frame-level (MMS_F) and video-level (MMS_V) scores gives the highest performance—e.g., on MSR-VTT (CLIP-B/32), MMS_F alone yields 44.3 R@1, MMS_V alone 47.0, but both combined yield 48.1. Use of mean pooling alone is less effective (42.1). These results support the claim that fine-grained, dual-level late interaction is key to performance (Reddy et al., 24 Mar 2025).
5. Computational Efficiency and Practical Considerations
Video-ColBERT maintains efficient retrieval via independent encoding towers and dot-product-based late interaction, avoiding cross-attention at inference. Indexing a video (forward pass and storage for both feature levels) takes approximately 9.64 ms (CLIP-B/32), and query latency for computing both MMS_F and MMS_V is ~11.2 ms on an A5000 GPU. Storage requirements increase due to the need for two token collections per video; however, standard compression methods (e.g., Product Quantization) can be used to manage scale.
The model’s design allows for scalable nearest-neighbor search with tools such as FAISS, IVF, or HNSW, since all interactions are based on dot-products. This offers a significant efficiency advantage over cross-encoder approaches, especially for large-scale video databases.
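Because scoring reduces to dot products over precomputed token sets, retrieval can be done brute-force for small corpora (or via an ANN index such as FAISS at scale). A framework-free sketch with toy embeddings (values are illustrative only):

```python
def mms(query_tokens, visual_tokens):
    """MeanMaxSim: mean over query tokens of the max dot product
    against a visual token set (frame-level or video-level)."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, v) for v in visual_tokens)
               for q in query_tokens) / len(query_tokens)

def score(query, frame_tokens, video_tokens):
    """Final similarity: MMS over frame tokens plus MMS over video tokens."""
    return mms(query, frame_tokens) + mms(query, video_tokens)

# Toy corpus: each video stores a frame-level and a video-level token set.
corpus = [
    ([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]]),      # video 0
    ([[0.0, -1.0], [-1.0, 0.0]], [[-0.9, 0.0], [0.0, -0.9]]),  # video 1
]
query = [[1.0, 0.0]]
ranked = sorted(range(len(corpus)), key=lambda i: -score(query, *corpus[i]))
```

The inner `max` is exactly the operation ANN libraries accelerate, which is why the whole pipeline remains index-friendly.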
6. Implementation Details and Hyperparameters
- Backbones: CLIP ViT-B/32, ViT-B/16, SigLIP ViT-B/16.
- Temporal Transformer: 4 layers, hidden dimension matching the backbone (e.g., 512 or 768).
- Expansion Tokens: 2 visual expansion tokens; query padding to 32 (CLIP) or 64 (SigLIP).
- Optimizer: Adam (β1=0.9, β2=0.98, ε=1e-6), weight decay=0.01, gradient clipping=1.0.
- Learning Rates: 1e-7 (base encoders), 1e-4 (temporal transformer), 10% linear warmup.
- Frame Sampling: 12 frames for MSR-VTT, MSVD, VATEX; 64 for DiDeMo, ActivityNet; resize to 224×224, no center-crop.
- Batch/Epochs: E.g., CLIP-B/32: batch=256, epochs=5 (10 for VATEX); DiDeMo, ActivityNet: batch=64, epochs=20.
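For reference, the hyperparameters above can be collected into a single configuration sketch (field names are hypothetical and chosen for readability, not taken from any released code; values reflect the CLIP-B/32 / MSR-VTT setting listed above):

```python
# Illustrative training config for the CLIP-B/32 variant on MSR-VTT.
config = {
    "backbone": "CLIP-ViT-B/32",
    "temporal_layers": 4,
    "visual_expansion_tokens": 2,
    "query_pad_length": 32,            # 64 for SigLIP
    "optimizer": {"name": "Adam", "betas": (0.9, 0.98), "eps": 1e-6,
                  "weight_decay": 0.01, "grad_clip": 1.0},
    "lr_base_encoders": 1e-7,
    "lr_temporal_transformer": 1e-4,
    "warmup": "10% linear",
    "num_frames": 12,                  # 64 for DiDeMo / ActivityNet
    "frame_size": 224,                 # resized, no center-crop
    "batch_size": 256,                 # 64 for DiDeMo / ActivityNet
    "epochs": 5,                       # 10 for VATEX; 20 for DiDeMo / ActivityNet
}
```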
7. Conclusions and Prospective Directions
Video-ColBERT achieves state-of-the-art performance in text-to-video retrieval benchmarks using a modest computational budget. Its principal contributions are:
- The hierarchical use of late interaction—both spatial (static) and temporal (contextualized)—which enables the model to capture complementary visual/textual cues.
- Learned query and video expansion strategies, enabling flexible and adaptive retrieval behavior.
- A dual sigmoid loss regime, which empirically provides stable, discriminative training signals for both representation branches.
Potential directions for advancement include compressing the dual-level index (e.g., via PQ/quantization), developing richer temporal modeling architectures (such as hierarchical or memory-augmented transformers), introducing cross-modal reranking on top of Video-ColBERT, and expanding to open-domain or zero- and few-shot video retrieval scenarios without additional finetuning (Reddy et al., 24 Mar 2025).