OV-STVG: Open-Vocabulary Video Grounding

Updated 5 February 2026

OV-STVG is defined as localizing both when and where events occur in video streams using unconstrained natural language queries.
It employs advanced cross-modal architectures that combine video and text encoders with transformer-based fusion for joint spatial and temporal reasoning.
Empirical evaluations reveal notable gains in m_vIoU and pointing game accuracy, demonstrating effective open-vocabulary generalization across diverse benchmarks.

Open-Vocabulary Spatio-Temporal Video Grounding (OV-STVG), as formalized in recent literature, is the task of localizing both when and where entities or events occur in video streams in response to unconstrained natural language queries. This setting generalizes traditional spatio-temporal grounding by discarding reliance on closed vocabularies and category-specific detectors, enabling models to interpret and localize novel concepts or relational cues at inference through robust cross-modal reasoning (Wasim et al., 2023).

1. Task Definition and Evaluation Protocols

OV-STVG accepts as input an untrimmed video $V\in\mathbb{R}^{T\times H\times W\times C}$ and a free-form text query $q$, with the objective of outputting (i) a temporal segment $(t_s, t_e)$ and (ii) a tube, i.e., a sequence of bounding boxes $\{B^t = (x^t, y^t, w^t, h^t)\}_{t=t_s}^{t_e}$, localizing the referent described by $q$ across time.

Formally, the mapping is:

$$f_\theta: (V, q) \longmapsto \bigl(t_s, t_e, \{B^t\}_{t=t_s}^{t_e}\bigr)$$

Evaluation is performed via mean video-IoU (m_vIoU), calculated as:

$$\mathrm{m\_vIoU} = \frac{1}{N}\sum_{i=1}^N \frac{1}{t_e^i-t_s^i+1} \sum_{t=t_s^i}^{t_e^i}\mathrm{IoU}(\hat B^t_i, B^t_i)$$

and, when applicable, pointing-game accuracy for single-frame grounding:

$$\mathrm{Acc} = \frac{1}{N} \sum_{i=1}^N \mathbf{1}\bigl(\mathrm{dist}(\hat p_i,p_i)\le\delta\bigr)$$

where $N$ is the number of test queries, $\hat B^t_i$ and $B^t_i$ are the predicted and ground-truth boxes for query $i$ at frame $t$, $\hat p_i$ and $p_i$ are the predicted and ground-truth points, and $\delta$ is a distance threshold (Wasim et al., 2023, Wang et al., 18 Mar 2025).
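The two metrics above can be computed directly from per-frame boxes and predicted points. The following is a minimal sketch rather than a reference implementation: it assumes boxes are stored as $(x, y, w, h)$ with $(x, y)$ the top-left corner, that tubes are dictionaries keyed by frame index, that frames with no predicted box contribute an IoU of 0, and that $\mathrm{dist}$ is Euclidean distance. All function names are illustrative.

```python
import math
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h); (x, y) assumed top-left
Point = Tuple[float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def video_iou(pred: Dict[int, Box], gt: Dict[int, Box]) -> float:
    """Per-frame IoU averaged over the ground-truth frames t_s..t_e.

    Assumption: a ground-truth frame with no predicted box scores 0.
    """
    if not gt:
        return 0.0
    total = sum(box_iou(pred[t], gt[t]) for t in gt if t in pred)
    return total / len(gt)


def mean_video_iou(samples: Sequence[Tuple[Dict[int, Box], Dict[int, Box]]]) -> float:
    """m_vIoU: video IoU averaged over N (prediction, ground truth) pairs."""
    return sum(video_iou(p, g) for p, g in samples) / len(samples)


def pointing_accuracy(pred: List[Point], gt: List[Point], delta: float) -> float:
    """Fraction of predicted points within distance delta of the target."""
    hits = sum(
        1 for (px, py), (gx, gy) in zip(pred, gt)
        if math.hypot(px - gx, py - gy) <= delta
    )
    return hits / len(gt)


if __name__ == "__main__":
    gt_tube = {0: (0, 0, 2, 2), 1: (0, 0, 2, 2)}
    pred_tube = {0: (0, 0, 2, 2), 1: (1, 0, 2, 2)}
    print(mean_video_iou([(pred_tube, gt_tube)]))
    print(pointing_accuracy([(0, 0), (3, 4)], [(0, 0), (0, 0)], delta=1.0))
```

In the toy example the first frame matches exactly and the second overlaps by one third, so the video IoU is the average of 1 and 1/3.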

2. Model Architectures for OV-STVG

Recent OV-STVG approaches integrate video encoders, text encoders, spatio-temporal fusion, and cross-modality attention in various configurations:

Video-GroundingDINO (Wasim et al., 2023): Composed of a frozen Swin-Transformer vision encoder and BERT text encoder, fused by multi-scale deformable DETR layers from Grounding DINO. Temporal and spatial fusion is achieved via stacked multi-head self-attention (MHSA), followed by cross-modal attention blocks. The model selects language-guided visual queries, which propagate through a spatio-temporal transformer decoder to jointly regress bounding boxes and temporal intervals via separate prediction heads.
SpaceVLLM (Wang et al., 18 Mar 2025): Utilizes interleaved spatio-temporal-aware queries inserted between visual tokens. A SigLIP visual encoder and a BPE-tokenized LLM process both modalities, fused by cross-attention blocks to compute joint frame-level and query-level representations. A query-guided space decoder regresses box coordinates. Its design supports arbitrary queries with no fixed vocabulary: spatial and temporal localization are supervised by multi-task objectives on a large-scale synthetic corpus (Uni-STG), covering temporal-only, spatial-only, and spatio-temporal tasks.
Chain-of-Thought Bounding Box Generation (Gu et al., 26 Nov 2025): Multimodal LLMs are prompted to generate <time>, <think_bbox>, and <pred_bbox> blocks, enabling explicit intermediate reasoning about possible object locations in each frame before producing final tubes. Reinforcement fine-tuning aligns autoregressive generation with localization rewards tailored to format, consistency, temporal, and spatial accuracy.
Detector-Empowered Video LLM (DEViL) (Gao et al., 7 Dec 2025): Couples a multi-modal LLM with an open-vocabulary detector via a reference-semantic token (RST), which serves as a differentiable bridge for propagating referential semantics. The detector, guided by RST-projected embeddings, produces temporally associated tubes via memory-based matching and tube-mined temporal regularization (TTReg).

These models advance beyond closed-vocabulary limitations, enabling localization and tracking of entities and actions referenced in free-form text.
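For the chain-of-thought approach, the generated text has to be converted back into a temporal segment and a tube before it can be scored. The sketch below shows one way such a response might be parsed. The serialization is an assumption made for illustration only: it presumes closing tags, a comma-separated start and end inside `<time>`, and one `frame: x, y, w, h` line per box inside `<pred_bbox>`. The actual format used by Gu et al. is not specified here, and `parse_response` is a hypothetical helper.

```python
import re
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h)

# Hypothetical layout: <time>start, end</time> and one
# "frame: x, y, w, h" line per box inside <pred_bbox>...</pred_bbox>.
TAG = re.compile(r"<(time|think_bbox|pred_bbox)>(.*?)</\1>", re.DOTALL)
NUM = re.compile(r"\d+(?:\.\d+)?")


def parse_response(text: str) -> Tuple[Tuple[float, float], Dict[int, Box]]:
    """Extract (t_s, t_e) and the predicted tube from a tagged response.

    The <think_bbox> block is intermediate reasoning and is ignored here.
    Raises ValueError when the temporal segment is missing or malformed.
    """
    blocks = {name: body for name, body in TAG.findall(text)}

    times = [float(v) for v in NUM.findall(blocks.get("time", ""))]
    if len(times) != 2 or times[0] > times[1]:
        raise ValueError("expected '<time>start, end</time>' with start <= end")

    tube: Dict[int, Box] = {}
    for line in blocks.get("pred_bbox", "").strip().splitlines():
        values = NUM.findall(line)
        if len(values) != 5:  # skip lines that are not 'frame: x, y, w, h'
            continue
        frame = int(float(values[0]))
        x, y, w, h = (float(v) for v in values[1:])
        tube[frame] = (x, y, w, h)
    return (times[0], times[1]), tube


if __name__ == "__main__":
    response = (
        "<time>3.0, 7.5</time>\n"
        "<think_bbox>the person stands left of the door</think_bbox>\n"
        "<pred_bbox>\n12: 10, 20, 30, 40\n13: 11, 20, 30, 40\n</pred_bbox>"
    )
    print(parse_response(response))
```

A format reward of the kind mentioned above could, under the same assumptions, be as simple as checking whether this parser succeeds on a sampled response.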

3. Training Strategies and Supervisory Signals

Foundational OV-STVG models employ various training regimes:

Fine-tuning: Adapter layers or fusion modules are fine-tuned atop frozen representations, as in Video-GroundingDINO. Losses include weighted $L_1$ and GIoU for spatial grounding, and KL-divergence for temporal interval prediction (Wasim et al., 2023).
Multi-task curriculum: SpaceVLLM jointly supervises temporal, spatial, and video QA via mixed datasets and synthetic tubes grounded by a combination of LLM analysis and open-vocabulary detectors (Wang et al., 18 Mar 2025).
Reinforcement learning: Chain-of-thought models (e.g. STVG-o1) optimize generation trajectories by policy gradient on geometry-aware rewards, improving region-word alignment and stepwise object reasoning (Gu et al., 26 Nov 2025).
End-to-end tube regularization: DEViL applies temporal regularization losses to ensure cross-frame feature and geometric stability in tube predictions (Gao et al., 7 Dec 2025).
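The spatial and temporal losses named in the fine-tuning regime can be sketched as follows. This is an illustrative pure-Python version, not the implementation of any cited model: boxes are assumed to be $(x, y, w, h)$ with a top-left origin, the loss weights `w_l1` and `w_giou` are placeholder values, and the KL term assumes both arguments are discrete distributions over frames (for example, a smoothed target over the start frame versus the predicted distribution).

```python
import math
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h); (x, y) assumed top-left


def to_corners(b: Box) -> Tuple[float, float, float, float]:
    x, y, w, h = b
    return x, y, x + w, y + h


def giou(a: Box, b: Box) -> float:
    """Generalized IoU: IoU minus the fraction of the enclosing box
    that is not covered by the union. Ranges over (-1, 1]."""
    ax1, ay1, ax2, ay2 = to_corners(a)
    bx1, by1, bx2, by2 = to_corners(b)
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    if union <= 0 or hull <= 0:
        return 0.0
    return inter / union - (hull - union) / hull


def box_loss(pred: Box, gt: Box, w_l1: float = 5.0, w_giou: float = 2.0) -> float:
    """Weighted L1 + GIoU loss for one box pair (weights are placeholders)."""
    l1 = sum(abs(p - g) for p, g in zip(pred, gt))
    return w_l1 * l1 + w_giou * (1.0 - giou(pred, gt))


def kl_divergence(target: Sequence[float], pred: Sequence[float],
                  eps: float = 1e-8) -> float:
    """KL(target || pred) between two discrete distributions over frames."""
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(target, pred) if t > 0
    )


if __name__ == "__main__":
    print(box_loss((0.1, 0.1, 0.4, 0.4), (0.1, 0.1, 0.4, 0.4)))
    print(giou((0, 0, 1, 1), (2, 0, 1, 1)))
    print(kl_divergence([0.0, 1.0, 0.0], [0.2, 0.6, 0.2]))
```

Unlike plain IoU, the GIoU term still provides a useful signal when the predicted and ground-truth boxes do not overlap, which is why it is commonly paired with $L_1$ in DETR-style detectors.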

Synthetic corpus construction and self-supervised multimodal alignment, as in GroundingYouTube (Chen et al., 2023), further enable open-vocabulary generalization from noisy ASR-aligned subtitles, even with minimal human annotation.

4. Datasets and Benchmarking

Key spatio-temporal video grounding datasets include:

VidSTG (Zhang et al., 2020): 99,943 multi-form sentences (declarative/interrogative) over 6,924 videos, leveraging object-relation triplets; supports evaluation on both named and unknown objects.
HC-STVG (V1, V2): Human-centric grounding tasks.
YouCook-Interactions, Charades-STA, RefCOCO family: Datasets for spatial, temporal, and joint spatio-temporal grounding tasks.
Uni-STG (SpaceVLLM) (Wang et al., 18 Mar 2025): Unified corpus of temporal, spatial, and joint grounding, synthesized with query-object analysis and open-vocabulary spatial annotation.
GroundingYouTube (Chen et al., 2023): Densely annotated untrimmed cooking videos, facilitating zero-shot “what,” “when,” and “where” evaluation with open textual queries.

Metrics vary by task: m_vIoU, vIoU@0.3/0.5 thresholds, temporal IoU (tIoU), region similarity ($\mathcal{J}$), contour accuracy ($\mathcal{F}$), recall@IoU, and pointing game accuracy.
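Temporal IoU and the thresholded variants (such as vIoU at 0.3 or 0.5, or recall at a given IoU) follow directly from per-sample scores. The sketch below assumes temporal segments are continuous `(start, end)` intervals in seconds and that a sample counts as a hit when its score is strictly greater than the threshold; some benchmarks use inclusive frame indices or `>=` instead, so both choices are assumptions.

```python
from typing import Sequence, Tuple

Segment = Tuple[float, float]  # (start, end), assumed continuous, in seconds


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Overlap of two time intervals divided by the span of their union."""
    ps, pe = pred
    gs, ge = gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def recall_at(scores: Sequence[float], threshold: float) -> float:
    """Fraction of samples whose score (tIoU or vIoU) exceeds the threshold."""
    return sum(1 for s in scores if s > threshold) / len(scores)


if __name__ == "__main__":
    print(temporal_iou((2.0, 6.0), (4.0, 8.0)))
    per_sample_viou = [0.2, 0.35, 0.6]
    print(recall_at(per_sample_viou, 0.3), recall_at(per_sample_viou, 0.5))
```

Reporting both the mean and the thresholded values is useful because a mean can hide a bimodal split between well-localized and entirely missed queries.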

5. Open-Vocabulary Mechanisms and Generalization

OV-STVG methods leverage compositional, free-form queries and alignment strategies:

Subword tokenization (e.g., BPE) or word-level embeddings (e.g., word2vec) allow models to accept arbitrary language at inference.
Cross-modal attention aligns phrase semantics (attributes, actions, spatial cues) with visual tokens.
Prompting strategies decompose queries into referent and action sub-queries, enabling attribute- and action-guided highlighting (Yang et al., 18 Sep 2025).
Specialized tokens (e.g., grounding tokens, RSTs, <SEG> markers) act as soft or explicit spatial prompts, activating appropriate reasoning or decoding modules during inference.
Model transfer: Training on one dataset (e.g., VidSTG) allows zero-shot generalization to others with unseen objects or instructions, demonstrating robust open-vocabulary performance (Gu et al., 26 Nov 2025).

6. Empirical Results and Performance Analysis

State-of-the-art OV-STVG methods surpass prior baselines on multiple metrics and diverse settings:

Video-GroundingDINO: m_vIoU=27.46 on HC-STVG v1 (zero-shot) vs. 22.58 for best prior; 57.73% on YouCook-Interactions pointing game (Wasim et al., 2023).
SpaceVLLM: m_vIoU=39.3 on HC-STVG v1, 27.4 on VidSTG declarative, outperforming previous models on 11 benchmarks (Wang et al., 18 Mar 2025).
STVG-o1: m_tIoU=60.3%, m_vIoU=44.1% on HC-STVG v1, matching or exceeding domain-specific and LLM-based SOTAs (Gu et al., 26 Nov 2025).
DEViL: m_tIoU=54.7/58.0 and m_vIoU=36.2/36.5 on HC-STVG v1/v2 (zero-shot), up to 32.0/27.7 on VidSTG declarative/interrogative (Gao et al., 7 Dec 2025).

Ablations demonstrate:

Query and decoder interleaving, dual cross-attention, and chain-of-thought prompt structures contribute significant gains.
Tube-mined regularization and memory association mitigate error drift across long sequences.

7. Future Directions and Limitations

Contemporary OV-STVG models exhibit strong but non-exhaustive open-vocabulary capabilities:

Most work addresses single-target queries; multi-entity, relational, and crowded-scenario grounding remains a challenge (Gao et al., 7 Dec 2025).
Segment-wise and pixel-wise (mask) grounding are less explored compared to bounding-box trajectories.
Very long videos and fine-grained events are bottlenecked by fixed query counts, sparse frame sampling, and computational overhead.
Expanding beyond fixed predicate vocabularies and integrating audio or external temporal logic are active areas of research (Zhang et al., 2020, Wang et al., 18 Mar 2025).

Methods continue to evolve towards end-to-end, efficient, and fully open-vocabulary video understanding, with increasing modularity to accommodate new tasks and modalities. Researchers are pursuing improved temporal reasoning, adaptive frame selection, tube-level decoding for coherence, and further leveraging self-supervised or synthesized corpora for domain-agnostic zero-shot generalization.