Generalized Visual Grounding (GVG)

Updated 25 January 2026

GVG is a task that localizes variable numbers of regions in visual data based on natural language queries, accommodating multi-target and no-target scenarios.
Recent methodologies leverage transformer decoders, set-prediction losses, and multimodal architectures to integrate detection, segmentation, and reasoning.
Benchmarking protocols like precision@F1 and no-target accuracy, along with datasets such as gRefCOCO and Ref-ZOM, drive evaluation in GVG research.

Generalized Visual Grounding (GVG) is the overarching task of localizing a variable number of regions—potentially zero, one, or multiple—in images or videos based on a natural language expression. Unlike classical visual grounding, which assumes exactly one referent per query, GVG enables broader, more realistic understanding of language-vision referential acts, covering multi-target, no-target, open-vocabulary, multimodal, and reasoning-driven scenarios. Advances in GVG span architectures for single- and multi-image input, 3D scenes, video streams, and more, with a rigorous benchmarking and methodological ecosystem establishing GVG as a central multimodal vision-language task.

1. Formalization and Motivation

Let $\mathcal{I}\in\mathbb{R}^{3\times H\times W}$ denote an input image (or video frame) and $\mathcal{T}$ a tokenized query of length $L_t$ . A GVG model $\mathcal{M}_g$ predicts a set of bounding boxes

$\hat{\mathbf B}\;=\;\bigl\{\;\hat{\mathscr B}_i\bigr\}_{i=1}^k\;=\;\mathcal M_g(\mathcal I,\mathcal T)$

with each $\hat{\mathscr B}_i=(\hat x_i,\hat y_i,\hat w_i,\hat h_i)$ , where $k\in\{0,1,2,\dots\}$ is unconstrained. In contrast, classical referring expression comprehension (REC) restricts $k=1$ .

The motivation for GVG is to accommodate the full range of referential phenomena encountered in natural language and real-world tasks, including:

Single-object grounding: standard REC/phrase grounding;
Multi-object grounding: e.g., “the two men on the left” yielding $k>1$ ;
No-object/negative grounding: e.g., “the unicorn in the scene” when no unicorn is present, requiring $k=0$ .

This shift from fixed $k=1$ to flexible $k\geq0$ is driven by use-cases such as surveillance, human–robot interaction, instructional dialog, and open-domain visual search, where expressions may correspond to multiple or no regions (Xiao et al., 2024).

2. Evaluation Protocols and Benchmarks

GVG employs evaluation protocols explicitly designed for variable $k$ . Two primary metrics are established:

Precision@( $F_1$ =1, IoU $\geq$ 0.5): For a prediction to count as correct, the set of predicted boxes must exactly match the ground truth, measured by bipartite IoU matching and $F_1$ score of 1.0 (i.e., $\mathrm{TP}$ , $\mathrm{FP}$ , $\mathrm{FN}$ all properly accounted for at the instance level).
No-target accuracy (N-acc): The fraction of true no-target queries for which no box is predicted.

Datasets specifically designed for GVG include:

gRefCOCO: 278,232 queries over 19,994 COCO images; mixture of single-target, multi-target, and no-target cases.
Ref-ZOM: 90,199 queries over 55,078 images, with detailed control over target cardinality.
D³: 422 challenging high-res multi-target and no-target queries (Xiao et al., 2024).

For videos, the GVG dataset (distinct from gRefCOCO) targets spatio-temporal localization of multiple tubes per query, with challenging distractor and multi-segment cases (Feng et al., 2021).

3. Core Methodologies and Model Design

Set-Prediction Transformers and Query-Based Decoders

GVG architectures must produce a flexible number of outputs. Accordingly, recent models adopt Transformer decoders with multiple queries, Hungarian (bipartite) matching, and set-prediction losses:

InstanceVG: Implements generalized referring expression comprehension (GREC), segmentation (GRES), and instance-aware prediction by tying each decoder query to a point, box, and mask, enforcing 1:1:1 consistency via Hungarian matching and multi-task loss (Dai et al., 17 Sep 2025).
MDETR/UNINEXT/SimVG: Extended to handle $k\geq0$ by using multiple prediction queries, set-based L1+GIoU loss, and a confidence threshold to control output count (Xiao et al., 2024).

Multimodal LLMs (MLLMs) and Chain-of-Thought

VGR (Visual Grounded Reasoning) introduces an MLLM with chain-of-thought reasoning that issues explicit <sot>[x1,y1,x2,y2]<eot> replay signals to extract and attend to fine-grained visual regions as the model generates intermediate reasoning steps. This design integrates region selection into the CoT process and yields state-of-the-art results on tasks requiring comprehensive image detail understanding, while being highly token-efficient (Wang et al., 13 Jun 2025).

Zero-/Few-Shot and Open-Vocabulary Grounding

GroundVLP: Achieves zero-shot GVG by fusing GradCAM heatmaps from vision-language pre-training (VLP) with open-vocabulary detector proposals, using only generic cross-modal supervision and without any task-specific finetuning (Shen et al., 2023).
IMAGE: Introduces adaptive masking (Gaussian radiation modeling on feature maps) into the vision backbone, enhancing generalization and zero/few-shot transfer without scaling data size or architecture (Jia et al., 2024).
Context Disentangling/Prototype Inheriting: TransCP disentangles referent/context features and uses a prototype bank for transfer to novel categories, fusing modalities via Hadamard product (Tang et al., 2023).

Multi-image, Spatiotemporal, and 3D GVG

Generalized Multi-image Grounding: GeM-VG formalizes groundings across image groups, introducing a taxonomy of retrieval, association, and reasoning types (semantic, temporal, spatial), hybrid RL-SFT optimization (for both direct and CoT outputs), and a large dataset, MG-Data-240K. This supports k-variable per image, per group, and across cross-image relations (Zheng et al., 8 Jan 2026).
Video GVG: DSTG constructs decoupled spatial and temporal graphs per video, with contrastive routing to address appearance/motion distractors, and is evaluated on general video datasets with multi-object and multi-clip expressions (Feng et al., 2021).
3D and Multimodal Input: PanoGrounder bridges 2D VLMs and 3D scenes with panoramic renderings, semantic/geometric feature injection, and multi-view fusion, supporting robust GVG under varied viewpoints and language (Jung et al., 24 Dec 2025). RGBT-Ground expands datasets to RGB-TIR pairs, addressing robustness under night, small-object, or adverse weather settings (Zhao et al., 31 Dec 2025).

4. Joint Learning, Consistency, and Instance Awareness

Advanced GVG models increasingly unify multiple grounding tasks (detection, segmentation), modalities (RGB/TIR, text), and instance-level outputs:

InstanceVG: Imposes joint supervision and tight 1:1:1 point–box–mask alignment, outperforms prior methods on GREC/GRES and related benchmarks, and supports multi- and non-target expressions (Dai et al., 17 Sep 2025).
Cross-task regularization and negative-query suppression further improve the model’s ability to balance background/foreground, suppress spurious outputs, and yield consistent multi-granularity predictions.

5. Performance, Challenges, and Benchmarks

Model performance in GVG, as of 2025–2026, demonstrates a clear gap to single-object grounding:

Top multi-target precision@( $F_1$ =1, IoU $\geq$ 0.5) is 40–60% on gRefCOCO/Ref-ZOM, while classical REC can achieve 90%+.
No-target accuracy (N-acc) now routinely exceeds 80% for best models.
InstanceVG and GeM-VG are current leaders in segmentation, multi-image, and multi-object metrics (Dai et al., 17 Sep 2025, Zheng et al., 8 Jan 2026).

However, challenges persist. GVG datasets remain modest in scale; evaluation protocols may require further refinement (e.g., varying k-recall, error analysis beyond all-or-none F1); and many algorithmic advances are incremental extensions of single-object architectures. Streaming, long-horizon video, and open-world adaptation are still open problems (Xiao et al., 2024).

6. Future Directions and Open Problems

The field is expected to evolve in several directions:

Larger, high-quality datasets covering diverse target counts, modalities, and negative cases, with rigorous “no-leak” splits and standardized protocols.
Unified models for robust, token-efficient, and truly universal GVG, tightly integrating detection, segmentation, spatiotemporal tracking, and language reasoning as first-class objectives.
Algorithmic advances in instance-aware learning, set prediction, proposal-free architectures, and dynamic output control.
Evaluation metrics reflecting practical needs, with error diagnosis for both multi-target and false positive/negative cases.
New paradigms in self-supervised region-level or masking-based pre-training adapted for k-variable grounding and sparse supervision.

The synthesis of insights from rich pre-trained multimodal models, scene- and instance-level joint learning, and open-world transfer will define the next generation of GVG research (Xiao et al., 2024, Dai et al., 17 Sep 2025, Zheng et al., 8 Jan 2026).