
Generalized Visual Grounding (GVG)

Updated 25 January 2026
  • GVG is a task that localizes variable numbers of regions in visual data based on natural language queries, accommodating multi-target and no-target scenarios.
  • Recent methodologies leverage transformer decoders, set-prediction losses, and multimodal architectures to integrate detection, segmentation, and reasoning.
  • Benchmarking protocols such as Precision@($F_1{=}1$) and no-target accuracy, along with datasets such as gRefCOCO and Ref-ZOM, drive evaluation in GVG research.

Generalized Visual Grounding (GVG) is the overarching task of localizing a variable number of regions—potentially zero, one, or multiple—in images or videos based on a natural language expression. Unlike classical visual grounding, which assumes exactly one referent per query, GVG enables broader, more realistic understanding of language-vision referential acts, covering multi-target, no-target, open-vocabulary, multimodal, and reasoning-driven scenarios. Advances in GVG span architectures for single- and multi-image input, 3D scenes, video streams, and more, with a rigorous benchmarking and methodological ecosystem establishing GVG as a central multimodal vision-language task.

1. Formalization and Motivation

Let $\mathcal{I}\in\mathbb{R}^{3\times H\times W}$ denote an input image (or video frame) and $\mathcal{T}$ a tokenized query of length $L_t$. A GVG model $\mathcal{M}_g$ predicts a set of bounding boxes

$$\hat{\mathbf{B}} \;=\; \bigl\{\hat{B}_i\bigr\}_{i=1}^{k} \;=\; \mathcal{M}_g(\mathcal{I},\mathcal{T})$$

with each $\hat{B}_i=(\hat{x}_i,\hat{y}_i,\hat{w}_i,\hat{h}_i)$, where $k\in\{0,1,2,\dots\}$ is unconstrained. In contrast, classical referring expression comprehension (REC) restricts $k=1$.
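To make the variable-cardinality interface concrete, here is a minimal Python sketch of the contrast between GVG and classical REC; the queries, coordinates, and `answers` lookup are invented for illustration, not drawn from any dataset.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h)

# Hypothetical ground truth for three query types on one image:
# multi-target, single-target, and no-target. Under GVG, all three
# are valid outputs of the same model M_g; classical REC admits
# only the len == 1 case.
answers: Dict[str, List[Box]] = {
    "the two men on the left": [(10, 40, 50, 120), (70, 42, 48, 118)],
    "the red umbrella": [(200, 30, 60, 60)],
    "the unicorn in the scene": [],
}

def cardinality(query: str) -> int:
    """k is unconstrained: 0, 1, or more boxes per query."""
    return len(answers[query])
```

The key point is that the output type is a *set* of boxes whose size the model itself decides, rather than a single box forced by the task definition.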

The motivation for GVG is to accommodate the full range of referential phenomena encountered in natural language and real-world tasks, including:

  • Single-object grounding: standard REC/phrase grounding;
  • Multi-object grounding: e.g., “the two men on the left” yielding $k>1$;
  • No-object/negative grounding: e.g., “the unicorn in the scene” when no unicorn is present, requiring $k=0$.

This shift from fixed $k=1$ to flexible $k\geq 0$ is driven by use-cases such as surveillance, human–robot interaction, instructional dialog, and open-domain visual search, where expressions may correspond to multiple or no regions (Xiao et al., 2024).

2. Evaluation Protocols and Benchmarks

GVG employs evaluation protocols explicitly designed for variable $k$. Two primary metrics are established:

  • Precision@($F_1{=}1$, IoU $\geq 0.5$): A prediction counts as correct only if the predicted box set exactly matches the ground truth under bipartite IoU matching at threshold 0.5, i.e., the instance-level $F_1$ score is 1.0 (every ground-truth box matched, with no false positives or false negatives).
  • No-target accuracy (N-acc): The fraction of true no-target queries for which no box is predicted.
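The two metrics can be sketched in a few lines of Python. This is a toy implementation under simplifying assumptions: boxes are `(x, y, w, h)` tuples, and the bipartite matching is brute-forced over permutations (fine at toy scale; real evaluators use the Hungarian algorithm).

```python
from itertools import permutations
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h)

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def f1_is_one(preds: List[Box], gts: List[Box], thr: float = 0.5) -> bool:
    """F1 == 1 iff every GT matches a distinct prediction at IoU >= thr
    with no predictions left over (no FP, no FN)."""
    if len(preds) != len(gts):
        return False   # any unmatched pred (FP) or GT (FN) breaks F1 = 1
    if not gts:
        return True    # no-target query answered with no boxes
    return any(all(iou(p, g) >= thr for p, g in zip(perm, gts))
               for perm in permutations(preds))

def n_acc(pred_sets: List[List[Box]], is_no_target: List[bool]) -> float:
    """Fraction of true no-target queries for which no box was predicted."""
    hits = [not p for p, nt in zip(pred_sets, is_no_target) if nt]
    return sum(hits) / len(hits) if hits else 0.0
```

Note the all-or-none character of Precision@($F_1{=}1$): a single extra or missing box fails the whole query, which is why multi-target scores lag single-target ones so sharply.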

Datasets specifically designed for GVG include:

  • gRefCOCO: 278,232 queries over 19,994 COCO images; mixture of single-target, multi-target, and no-target cases.
  • Ref-ZOM: 90,199 queries over 55,078 images, with detailed control over target cardinality.
  • D³: 422 challenging high-res multi-target and no-target queries (Xiao et al., 2024).

For videos, the GVG dataset (distinct from gRefCOCO) targets spatio-temporal localization of multiple tubes per query, with challenging distractor and multi-segment cases (Feng et al., 2021).

3. Core Methodologies and Model Design

Set-Prediction Transformers and Query-Based Decoders

GVG architectures must produce a flexible number of outputs. Accordingly, recent models adopt Transformer decoders with multiple learnable queries, Hungarian (bipartite) matching between predictions and ground truth, and set-prediction losses.
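The matching step can be illustrated with a small sketch: $N$ decoder queries are assigned to the $k$ ground-truth boxes ($k$ may be 0), and unassigned queries take the "no object" label. Cost weights, box format, and the brute-force search below are illustrative assumptions, not any specific paper's recipe; production systems use the Hungarian algorithm on a cost matrix instead.

```python
from itertools import permutations
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # normalized (cx, cy, w, h)

def l1(a: Box, b: Box) -> float:
    """L1 distance between two boxes in normalized coordinates."""
    return sum(abs(u - v) for u, v in zip(a, b))

def match_queries(pred_boxes: List[Box], pred_conf: List[float],
                  gt_boxes: List[Box], box_weight: float = 5.0
                  ) -> List[Optional[int]]:
    """Assign each of the N decoder queries to at most one of the k
    ground-truth boxes (k may be 0; assumes k <= N). Returns, per query,
    the matched GT index or None ('no object'). Matching cost trades off
    box regression error against predicted confidence, DETR-style."""
    n, k = len(pred_boxes), len(gt_boxes)
    best_cost, best = float("inf"), [None] * n
    for qs in permutations(range(n), k):  # which query takes each GT
        cost = sum(box_weight * l1(pred_boxes[q], gt_boxes[g]) - pred_conf[q]
                   for g, q in enumerate(qs))
        if cost < best_cost:
            assign: List[Optional[int]] = [None] * n
            for g, q in enumerate(qs):
                assign[q] = g
            best_cost, best = cost, assign
    return best
```

Because unmatched queries are trained toward "no object", the same fixed-size decoder naturally emits zero, one, or many boxes at inference time.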

Multimodal LLMs (MLLMs) and Chain-of-Thought

VGR (Visual Grounded Reasoning) introduces an MLLM with chain-of-thought reasoning that issues explicit <sot>[x1,y1,x2,y2]<eot> replay signals to extract and attend to fine-grained visual regions as the model generates intermediate reasoning steps. This design integrates region selection into the CoT process and yields state-of-the-art results on tasks requiring comprehensive image detail understanding, while being highly token-efficient (Wang et al., 13 Jun 2025).

Zero-/Few-Shot and Open-Vocabulary Grounding

  • GroundVLP: Achieves zero-shot GVG by fusing GradCAM heatmaps from vision-language pre-training (VLP) with open-vocabulary detector proposals, using only generic cross-modal supervision and without any task-specific finetuning (Shen et al., 2023).
  • IMAGE: Introduces adaptive masking (Gaussian radiation modeling on feature maps) into the vision backbone, enhancing generalization and zero/few-shot transfer without scaling data size or architecture (Jia et al., 2024).
  • Context Disentangling/Prototype Inheriting: TransCP disentangles referent/context features and uses a prototype bank for transfer to novel categories, fusing modalities via Hadamard product (Tang et al., 2023).
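The GroundVLP-style fusion of a cross-modal heatmap with detector proposals can be sketched as follows. This is a toy version under stated assumptions: `heat` stands in for a GradCAM map from a vision-language model, proposals are boxes in heatmap coordinates, and the mean-activation scoring and fixed threshold are illustrative simplifications of the actual method.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h) in heatmap coordinates

def proposal_score(heat: List[List[float]], box: Box) -> float:
    """Average cross-modal activation inside a proposal; real systems
    work at feature resolution and normalize per image."""
    x, y, w, h = box
    vals = [heat[r][c] for r in range(y, y + h) for c in range(x, x + w)]
    return sum(vals) / len(vals) if vals else 0.0

def select(heat: List[List[float]], proposals: List[Box],
           thr: float = 0.5) -> List[Box]:
    """Keep every detector proposal whose heatmap support clears `thr`;
    zero, one, or many boxes may survive, matching GVG's variable k."""
    return [b for b in proposals if proposal_score(heat, b) >= thr]
```

Thresholding rather than arg-maxing is what lets a zero-shot pipeline return empty or multi-box answers without any task-specific finetuning.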

Multi-image, Spatiotemporal, and 3D GVG

  • Generalized Multi-image Grounding: GeM-VG formalizes grounding across image groups, introducing a taxonomy of retrieval, association, and reasoning types (semantic, temporal, spatial), hybrid RL-SFT optimization (for both direct and CoT outputs), and a large dataset, MG-Data-240K. This supports a variable number of targets per image, per group, and across cross-image relations (Zheng et al., 8 Jan 2026).
  • Video GVG: DSTG constructs decoupled spatial and temporal graphs per video, with contrastive routing to address appearance/motion distractors, and is evaluated on general video datasets with multi-object and multi-clip expressions (Feng et al., 2021).
  • 3D and Multimodal Input: PanoGrounder bridges 2D VLMs and 3D scenes with panoramic renderings, semantic/geometric feature injection, and multi-view fusion, supporting robust GVG under varied viewpoints and language (Jung et al., 24 Dec 2025). RGBT-Ground expands datasets to RGB-TIR pairs, addressing robustness under night, small-object, or adverse weather settings (Zhao et al., 31 Dec 2025).

4. Joint Learning, Consistency, and Instance Awareness

Advanced GVG models increasingly unify multiple grounding tasks (detection, segmentation), modalities (RGB/TIR, text), and instance-level outputs:

  • InstanceVG: Imposes joint supervision and tight 1:1:1 point–box–mask alignment, outperforms prior methods on GREC/GRES and related benchmarks, and supports multi- and non-target expressions (Dai et al., 17 Sep 2025).
  • Cross-task regularization and negative-query suppression further improve the model’s ability to balance background/foreground, suppress spurious outputs, and yield consistent multi-granularity predictions.
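The 1:1:1 point–box–mask alignment can be illustrated with a toy consistency check: an instance is well-formed when its reference point lies on the mask, the point falls inside the box, and the box tightly encloses the mask. The representations below (pixel-set masks, integer boxes) are illustrative assumptions, not InstanceVG's implementation.

```python
from typing import Set, Tuple

Point = Tuple[int, int]
Box = Tuple[int, int, int, int]  # (x, y, w, h)

def tight_bbox(mask: Set[Point]) -> Box:
    """Axis-aligned bounding box of a pixel mask."""
    xs = [p[0] for p in mask]
    ys = [p[1] for p in mask]
    return (min(xs), min(ys),
            max(xs) - min(xs) + 1, max(ys) - min(ys) + 1)

def aligned(point: Point, box: Box, mask: Set[Point]) -> bool:
    """One instance is consistent when its point is inside its box,
    the point lies on the mask, and the box is the mask's tight bbox."""
    x, y, w, h = box
    in_box = x <= point[0] < x + w and y <= point[1] < y + h
    return in_box and point in mask and tight_bbox(mask) == box
```

In training, such constraints become consistency losses rather than hard checks, but the invariant being encouraged is the same.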

5. Performance, Challenges, and Benchmarks

Model performance in GVG, as of 2025–2026, still shows a clear gap relative to single-object grounding:

  • Top multi-target Precision@($F_1{=}1$, IoU $\geq 0.5$) is 40–60% on gRefCOCO/Ref-ZOM, while classical REC can achieve 90%+.
  • No-target accuracy (N-acc) now routinely exceeds 80% for best models.
  • InstanceVG and GeM-VG are current leaders in segmentation, multi-image, and multi-object metrics (Dai et al., 17 Sep 2025, Zheng et al., 8 Jan 2026).

However, challenges persist. GVG datasets remain modest in scale; evaluation protocols may require further refinement (e.g., varying k-recall, error analysis beyond all-or-none F1); and many algorithmic advances are incremental extensions of single-object architectures. Streaming, long-horizon video, and open-world adaptation are still open problems (Xiao et al., 2024).

6. Future Directions and Open Problems

The field is expected to evolve in several directions:

  • Larger, high-quality datasets covering diverse target counts, modalities, and negative cases, with rigorous “no-leak” splits and standardized protocols.
  • Unified models for robust, token-efficient, and truly universal GVG, tightly integrating detection, segmentation, spatiotemporal tracking, and language reasoning as first-class objectives.
  • Algorithmic advances in instance-aware learning, set prediction, proposal-free architectures, and dynamic output control.
  • Evaluation metrics reflecting practical needs, with error diagnosis for both multi-target and false positive/negative cases.
  • New paradigms in self-supervised region-level or masking-based pre-training adapted for k-variable grounding and sparse supervision.

The synthesis of insights from rich pre-trained multimodal models, scene- and instance-level joint learning, and open-world transfer will define the next generation of GVG research (Xiao et al., 2024, Dai et al., 17 Sep 2025, Zheng et al., 8 Jan 2026).
