
Generalized Visual Grounding (GREC)

Updated 29 January 2026
  • GREC is the task of localizing any number of objects in visual scenes using free-form expressions, covering single-target, multi-target, and no-target scenarios.
  • It extends classic referring expression comprehension by introducing variable cardinality, open-vocabulary understanding, and exact set-level matching in evaluation.
  • Key innovations include multi-modal model architectures, count-aware mechanisms, and dialogue as well as multi-image strategies to enhance grounding accuracy and applicability.

Generalized Visual Grounding (GREC) is the task of localizing an arbitrary number of objects in complex visual scenes according to a free-form expression, encompassing single-target, multi-target, and no-target referents. This setting generalizes classical Referring Expression Comprehension (REC), which is restricted to one target per query, by introducing variable cardinality, open-vocabulary comprehension, and robust absent-target handling. GREC is formally defined as a mapping from images and textual expressions to sets of bounding boxes, where the number of outputs is unknown and may be zero; evaluation focuses on exact matching of predicted sets to ground-truth, with metrics tailored to account for the challenges of multi-object and no-object cases (He et al., 2023, Ding et al., 8 Jan 2026, Xiao et al., 2024).

1. Formal Definition, Taxonomy, and Evaluation

GREC generalizes the classic REC as follows:

  • Inputs: One or more images $V = \{V_i\}_{i=1}^m$, each $V_i \in \mathbb{R}^{H \times W \times 3}$; a natural-language query $T$.
  • Outputs: A set of $n$ grounded instances $O = \{(b_k, i_k)\}_{k=1}^n = \mathcal{M}(V, T)$; each $b_k = (x_1, y_1, x_2, y_2)$ is a bounding box, and $i_k \in \{1, \dots, m\}$ denotes the source image.
  • Generalization: $m = 1$ recovers single-image REC; $n > 1$ allows multi-object; $n = 0$ supports absent-target/no-target cases (Zheng et al., 8 Jan 2026).
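This interface can be sketched in code; the names below (`GroundedInstance`, `no_op_model`) are illustrative, not taken from any cited implementation:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class GroundedInstance:
    """One element (b_k, i_k) of the predicted set O."""
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixel coordinates
    image_index: int                        # i_k in {1, ..., m}: which input image

def no_op_model(images: list, query: str) -> List[GroundedInstance]:
    # A GREC model maps (V, T) to a possibly empty list of instances;
    # an empty list encodes the no-target case (n = 0). This trivial
    # baseline always predicts "no target".
    return []
```

An empty return value is a legal output here, which is exactly what distinguishes GREC from classic REC, where every query must yield one box.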

Taxonomies organize tasks by cross-image cue reliance (referential, semantic association, temporal association, spatial association, reasoning) and image relationships (same scene, multi-view, multi-frame, disjoint semantic composition) (Zheng et al., 8 Jan 2026).

Evaluation Metrics:

  • Precision@(F1=1, IoU ≥ 0.5): Fraction of samples where the predicted box set matches the ground truth perfectly (every matched pair has $\mathrm{IoU} \ge 0.5$ and the set sizes agree) (Ding et al., 8 Jan 2026).
  • No-target accuracy (N-acc): Fraction of absent-target samples with zero predictions.
  • Traditional AP@0.5–0.95: Used for comparison, but not ideal for GREC because set-level matching is crucial (He et al., 2023, Ding et al., 8 Jan 2026).
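The per-sample F1=1 criterion can be sketched as follows, using a simple greedy one-to-one matching (official benchmark code may use optimal assignment instead; this is a simplified illustration):

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def f1_exact_match(pred, gt, thr=0.5):
    """True iff the predicted set matches the ground truth perfectly
    (F1 = 1 at IoU >= thr). pred, gt: lists of boxes."""
    if len(pred) != len(gt):
        return False          # set sizes must agree
    if not gt:
        return True           # no-target sample: correct iff zero predictions
    used = set()
    for p in pred:            # greedily match each prediction to an unused GT box
        best, best_iou = None, thr
        for j, g in enumerate(gt):
            if j in used:
                continue
            v = iou(p, g)
            if v >= best_iou:
                best, best_iou = j, v
        if best is None:
            return False      # some prediction has no qualifying match
        used.add(best)
    return True
```

Pr@(F1=1) is then the fraction of samples for which this returns True; N-acc is the same quantity restricted to samples with empty ground truth.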

2. Datasets for GREC and Multi-image Grounding

The principal annotated resource for GREC is gRefCOCO, which extends RefCOCO with multi- and no-target expressions:

Dataset  | #Images | #Expressions | Single | Multi | None
gRefCOCO | 19,994  | ≈259,859     | ~135k  | ~90k  | ~35k

gRefCOCO features strict validator-rewrite for ambiguous or ungroundable expressions; annotation covers free-form, multi-target, compound, and "no referent" queries (He et al., 2023, Ding et al., 8 Jan 2026).

For multi-image settings, MG-Data-240K (Zheng et al., 8 Jan 2026) aggregates:

Task type             | Samples | Source Datasets
Referential Retrieval | 97k     | D³, COCO
Semantic Association  | 77k     | COCO
Spatial Association   | 20k     | Ego-Exo4D, MVTrack
Temporal Association  | 46k     | STAR

Other major datasets: Ref-ZOM (static, multi/no referent), D³ (described object detection, multi-modal), Flickr30k Entities (phrase grounding) (Xiao et al., 2024).

3. Model Architectures and Design Innovations

Anchor-based Decoders

DETR- or MDETR-style models instantiate $N$ learnable queries, each capable of outputting a box or nothing. The number of targets is inferred by thresholding query confidences (Wang et al., 2 Jan 2025, Xiao et al., 2024).
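Confidence-based target-count inference from query outputs can be sketched as below; the threshold value is a hypothetical choice, not one reported by the cited papers:

```python
import numpy as np

def select_targets(boxes: np.ndarray, scores: np.ndarray, thresh: float = 0.7):
    """Keep queries whose confidence exceeds the threshold.
    boxes: (N, 4) array of per-query boxes; scores: (N,) confidences.
    An empty result encodes the no-target case."""
    keep = scores >= thresh
    return boxes[keep]
```

The returned set's cardinality is the predicted target count, so the same mechanism covers single-, multi-, and no-target queries.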

Count-aware Heads: Dynamic counter heads (e.g., Adaptive Grounding Counter in HieA2G) cluster and select proposals by predicted count, supporting flexible multi/none output (Wang et al., 2 Jan 2025).

Proposal-driven / Hybrid Designs

PropVG (Dai et al., 5 Sep 2025) integrates end-to-end object proposal generation (via deformable DETR decoders) with multi-granularity contrastive referential scoring at both sentence and word levels, plus global segmentation–driven absent-target discrimination. Its architecture:

  • Proposal branch: $N$ candidate boxes scored for referentiality
  • Multi-granularity discrimination via sentence/word contrastive losses and semantic/object cross-attention
  • Existence head for "no referent" detection via fused scores (referentiality $\times$ mask $\times$ existence) (Dai et al., 5 Sep 2025)
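A minimal sketch of the fused no-referent decision, assuming a simple multiplicative fusion and a hypothetical threshold (the exact fusion in PropVG may differ):

```python
def no_referent(ref_scores, mask_scores, exist_score, thresh=0.5):
    """Declare "no referent" when every proposal's fused score
    (referentiality * mask * existence) falls below a threshold.
    ref_scores, mask_scores: per-proposal scores; exist_score: global scalar."""
    fused = [r * m * exist_score for r, m in zip(ref_scores, mask_scores)]
    return max(fused, default=0.0) < thresh
```

The key idea is that the absent-target decision combines per-proposal evidence with a global existence signal, rather than relying on any single head.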

InstanceVG (Dai et al., 17 Sep 2025) further unifies instance-level joint prediction and segmentation, enforcing box-mask-point consistency per instance query.

Multi-modal and Multi-image LLMs

GeM-VG (Zheng et al., 8 Jan 2026) utilizes a Qwen2-VL vision encoder and LLM decoder to ingest multi-image sets and yield arbitrarily many region predictions per query, blending chain-of-thought reasoning and direct answers. It employs GRPO-style RL finetuning with multi-dimensional rule-based rewards (format, image, precision, recall).
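A rule-based reward over these four dimensions can be sketched as follows; the equal weights and the exact-match treatment of boxes are illustrative assumptions, not GeM-VG's actual reward:

```python
def grounding_reward(pred, gt, format_ok: bool, image_ok: bool,
                     w=(0.25, 0.25, 0.25, 0.25)):
    """Combine format, image, precision, and recall terms into one scalar.
    pred, gt: sets of (image_id, box_tuple) pairs; format_ok / image_ok:
    whether the output syntax and image indices were valid."""
    tp = len(pred & gt)                                   # exact-match true positives
    precision = tp / len(pred) if pred else float(len(gt) == 0)
    recall = tp / len(gt) if gt else float(len(pred) == 0)
    return w[0] * format_ok + w[1] * image_ok + w[2] * precision + w[3] * recall
```

Splitting the reward into format and accuracy terms lets RL finetuning penalize malformed outputs separately from grounding errors.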

VGR (Wang et al., 13 Jun 2025) applies order-autoregressive generation for bounding box selection in the reasoning chain, replaying region tokens for interpretable multimodal deduction.

Hierarchical, Latent, and Segmentation-based Designs

Hierarchical alignment (word/phrase/text-to-object) is employed by HieA2G for robust multi-modal fusion; Latent-VG generates multiple "latent" textual representations via subject distributor and concept injector mechanisms to capture distinct attribute bindings and enhance multi/no-target generalization (Yu et al., 7 Aug 2025, Wang et al., 2 Jan 2025).

4. Training Objectives, Losses, and Optimization Strategies

Supervised Objectives are built from combinations of:

Counting Losses: Auxiliary contrastive counting draws features together for matching cardinalities and apart for differing cardinalities (Wang et al., 2 Jan 2025).
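One possible hinge-style formulation of this count-contrastive idea (a sketch, not the cited paper's exact loss):

```python
import numpy as np

def count_contrastive_loss(features: np.ndarray, counts, margin: float = 1.0):
    """Attract feature pairs with equal target counts, repel pairs with
    different counts by at least `margin`.
    features: (B, D) array; counts: length-B target cardinalities."""
    loss, pairs = 0.0, 0
    B = len(features)
    for i in range(B):
        for j in range(i + 1, B):
            d = float(np.linalg.norm(features[i] - features[j]))
            if counts[i] == counts[j]:
                loss += d ** 2                        # attract: same cardinality
            else:
                loss += max(0.0, margin - d) ** 2     # repel: different cardinality
            pairs += 1
    return loss / max(pairs, 1)
```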

Multi-stage Finetuning: For multi-image RL finetuning, GeM-VG introduces hybrid strategies blending chain-of-thought and direct answers, where RL rewards are modulated for both output syntax and recall/precision (Zheng et al., 8 Jan 2026).

Existence heads and consistency constraints—as in InstanceVG—explicitly supervise no-target discrimination and enforce instance-matching between boxes and masks (Dai et al., 17 Sep 2025).

5. Quantitative Benchmarks and Comparative Results

State-of-the-art GREC/GRES detection and segmentation results (gRefCOCO splits, IoU ≥ 0.5, Pr@F1 / N-acc):

Method        | Val Pr@F1 | Val N-acc | TestA Pr@F1 | TestA N-acc | TestB Pr@F1 | TestB N-acc
MCN           | 28.0      | 30.6      | 32.3        | 32.0        | 26.8        | 30.3
VLT           | 36.6      | 35.2      | 40.2        | 34.1        | 30.2        | 32.5
MDETR         | 42.7      | 36.3      | 50.0        | 34.5        | 36.5        | 31.0
UNINEXT       | 58.2      | 50.6      | 46.4        | 49.3        | 42.9        | 48.2
SimVG         | 62.1      | 54.7      | 64.6        | 57.2        | 54.8        | 57.2
PropVG        | 72.2      | 72.8      | 68.8        | 69.9        | 59.0        | 65.0
InstanceVG    | 73.5      | 72.8      | 70.2        | 71.1        | 60.8        | 65.2
ReLA (Swin-B) | 61.9      | 56.4      | —           | —           | —           | —
HieA2G_R101   | 67.8      | 60.3      | 66.0        | 60.1        | 56.5        | 56.0

Multi-image and multi-instance LLMs (GeM-VG) advance the state of the art by +2.0–9.7 pp over previous bests on dedicated multi-image grounding benchmarks (Zheng et al., 8 Jan 2026).

Zero-shot solutions (GroundVLP) yield absolute gains of +18–28 pp vs. prior zero-shot SOTA, rivaling supervised methods on REC and phrase grounding (Shen et al., 2023).

6. Extensions: Dialogue, Multi-image, Multi-task, and Chain-of-Thought Reasoning

Dialogue-based GREC requires coreference resolution over multiple turns. A three-tier data synthesis framework (template, LLM-generated, dialogue-coreferent) demonstrably bridges train–test domain shift, with mean F1 gains >20 pp on synthetic and real dialogue testbeds (Shao et al., 2 Dec 2025).

Multi-image GREC generalizes the standard task to sets of images, requiring models to aggregate cross-image cues, association reasoning (semantic, spatial, temporal), and explicit multi-image output handling. RL-finetuned LLMs with token-level output formatting outperform single-image counterparts (Zheng et al., 8 Jan 2026).

Multi-task learning (GREC+GRES): Joint and consistency constraints across detection and segmentation improve overall accuracy, with instance-aware masking, point-guided assignment, and cross-modal alignment yielding additional gains (Dai et al., 17 Sep 2025, Ding et al., 8 Jan 2026).

Chain-of-thought integration: Recent advances blend explicit, interpretable reasoning traces with region grounding, where visual replay tokens and multimodal attention progressively locate and justify the referent set (Wang et al., 13 Jun 2025). RL with grounding-aware rewards further boosts performance for complex, reasoning-heavy queries (Zheng et al., 8 Jan 2026).

7. Challenges, Limitations, and Future Directions

  • No-target discrimination: N-acc remains lower than single/multi-target accuracy; the best models reach ~75%.
  • Complex relationships: Handling exception, numerosity, possession, and compound expressions is challenging.
  • Data scale and diversity: Available GREC datasets remain orders of magnitude smaller than counterparts in detection and captioning; large-scale, linguistically rich web benchmarks are urgently needed (Xiao et al., 2024).
  • Metric robustness: Strict F1=1 matching is widely adopted, but soft-F1 and set-level relaxation may better reflect practical utility.
  • Representation design: Object-centric proposals allow symbolic reasoning; pixel-level patch models increase granularity but add ambiguity; hybrids are underexplored (Pantazopoulos et al., 12 Sep 2025).
  • Model generalizability: Universal GREC for static, dynamic (video), multi-view, and multimodal applications remains an open frontier.
  • Efficient RL and interpretability: Balancing reasoning patterns and direct perception is computationally expensive; architectural innovations for scalable hybrid finetuning are needed (Zheng et al., 8 Jan 2026).

The field continues to evolve towards jointly grounded multi-modal models, universal benchmarks, and compositional understanding spanning images, video, and dialogue. Further research is required to optimize large-scale data synthesis, multi-modal chain-of-thought mechanisms, and adaptive architecture designs for robust generalized visual grounding.
