
Generalized Visual Grounding (GREC)

Updated 29 January 2026
  • GREC is the task of localizing any number of objects in visual scenes using free-form expressions, covering single-target, multi-target, and no-target scenarios.
  • It extends classic referring expression comprehension by introducing variable cardinality, open-vocabulary understanding, and exact set-level matching in evaluation.
  • Key innovations include multi-modal model architectures, count-aware mechanisms, and dialogue as well as multi-image strategies to enhance grounding accuracy and applicability.

Generalized Visual Grounding (GREC) is the task of localizing an arbitrary number of objects in complex visual scenes according to a free-form expression, encompassing single-target, multi-target, and no-target referents. This setting generalizes classical Referring Expression Comprehension (REC), which is restricted to one target per query, by introducing variable cardinality, open-vocabulary comprehension, and robust absent-target handling. GREC is formally defined as a mapping from images and textual expressions to sets of bounding boxes, where the number of outputs is unknown and may be zero; evaluation focuses on exact matching of predicted sets to ground-truth, with metrics tailored to account for the challenges of multi-object and no-object cases (He et al., 2023, Ding et al., 8 Jan 2026, Xiao et al., 2024).

1. Formal Definition, Taxonomy, and Evaluation

GREC generalizes the classic REC as follows:

  • Inputs: One or more images $V = \{V_i\}_{i=1}^m$, each $V_i \in \mathbb{R}^{H \times W \times 3}$; a natural-language query $T$.
  • Outputs: A set of $n$ grounded instances $O = \{(b_k, i_k)\}_{k=1}^n = \mathcal{M}(V, T)$; each $b_k = (x_1, y_1, x_2, y_2)$ is a bounding box, and $i_k \in \{1, \dots, m\}$ denotes the source image.
  • Generalization: $m = 1$ recovers single-image REC; $n > 1$ allows multi-object; $n = 0$ supports absent-target/no-target cases (Zheng et al., 8 Jan 2026).
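This interface can be sketched in code; the names below (`GroundedInstance`, `no_op_model`) are illustrative, not taken from any cited implementation:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class GroundedInstance:
    """One element (b_k, i_k) of the predicted set O."""
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixel coordinates
    image_index: int                        # i_k in {1, ..., m}: which input image

def no_op_model(images: list, query: str) -> List[GroundedInstance]:
    # A GREC model maps (V, T) to a possibly empty list of instances;
    # an empty list encodes the no-target case (n = 0). This trivial
    # baseline always predicts "no target".
    return []
```

An empty return value is a legal output here, which is exactly what distinguishes GREC from classic REC, where every query must yield one box.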

Taxonomies organize tasks by cross-image cue reliance (referential, semantic association, temporal association, spatial association, reasoning) and image relationships (same scene, multi-view, multi-frame, disjoint semantic composition) (Zheng et al., 8 Jan 2026).

Evaluation Metrics:

  • Precision@(F1=1, IoU ≥ 0.5): Fraction of samples where the predicted box set matches the ground truth perfectly (every matched pair has $\mathrm{IoU} \ge 0.5$ and the set sizes agree) (Ding et al., 8 Jan 2026).
  • No-target accuracy (N-acc): Fraction of absent-target samples with zero predictions.
  • Traditional AP@0.5–0.95: Used for comparison, but not ideal for GREC because set-level matching is crucial (He et al., 2023, Ding et al., 8 Jan 2026).
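The per-sample F1=1 criterion can be sketched as follows, using a simple greedy one-to-one matching (official benchmark code may use optimal assignment instead; this is a simplified illustration):

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def f1_exact_match(pred, gt, thr=0.5):
    """True iff the predicted set matches the ground truth perfectly
    (F1 = 1 at IoU >= thr). pred, gt: lists of boxes."""
    if len(pred) != len(gt):
        return False          # set sizes must agree
    if not gt:
        return True           # no-target sample: correct iff zero predictions
    used = set()
    for p in pred:            # greedily match each prediction to an unused GT box
        best, best_iou = None, thr
        for j, g in enumerate(gt):
            if j in used:
                continue
            v = iou(p, g)
            if v >= best_iou:
                best, best_iou = j, v
        if best is None:
            return False      # some prediction has no qualifying match
        used.add(best)
    return True
```

Pr@(F1=1) is then the fraction of samples for which this returns True; N-acc is the same quantity restricted to samples with empty ground truth.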

2. Datasets for GREC and Multi-image Grounding

The principal annotated resource for GREC is gRefCOCO, which extends RefCOCO with multi- and no-target expressions:

Dataset  | #Images | #Expressions | Single | Multi | None
gRefCOCO | 19,994  | ≈259,859     | ~135k  | ~90k  | ~35k

gRefCOCO features strict validator-rewrite for ambiguous or ungroundable expressions; annotation covers free-form, multi-target, compound, and "no referent" queries (He et al., 2023, Ding et al., 8 Jan 2026).

For multi-image settings, MG-Data-240K (Zheng et al., 8 Jan 2026) aggregates:

Task type             | Samples | Source Datasets
Referential Retrieval | 97k     | D³, COCO
Semantic Association  | 77k     | COCO
Spatial Association   | 20k     | Ego-Exo4D, MVTrack
Temporal Association  | 46k     | STAR

Other major datasets: Ref-ZOM (static, multi/no referent), D³ (described object detection, multi-modal), Flickr30k Entities (phrase grounding) (Xiao et al., 2024).

3. Model Architectures and Design Innovations

Anchor-based Decoders

DETR- or MDETR-style models instantiate $N$ learnable queries, each capable of outputting a box or nothing. The number of targets is inferred by thresholding query confidences (Wang et al., 2 Jan 2025, Xiao et al., 2024).
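Confidence-based target-count inference from query outputs can be sketched as below; the threshold value is a hypothetical choice, not one reported by the cited papers:

```python
import numpy as np

def select_targets(boxes: np.ndarray, scores: np.ndarray, thresh: float = 0.7):
    """Keep queries whose confidence exceeds the threshold.
    boxes: (N, 4) array of per-query boxes; scores: (N,) confidences.
    An empty result encodes the no-target case."""
    keep = scores >= thresh
    return boxes[keep]
```

The returned set's cardinality is the predicted target count, so the same mechanism covers single-, multi-, and no-target queries.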

Count-aware Heads: Dynamic counter heads (e.g., Adaptive Grounding Counter in HieA2G) cluster and select proposals by predicted count, supporting flexible multi/none output (Wang et al., 2 Jan 2025).

Proposal-driven / Hybrid Designs

PropVG (Dai et al., 5 Sep 2025) integrates end-to-end object proposal generation (via deformable DETR decoders) with multi-granularity contrastive referential scoring at both sentence and word levels, plus global segmentation–driven absent-target discrimination. Its architecture:

  • Proposal branch: $N$ candidate boxes scored for referentiality
  • Multi-granularity discrimination via sentence/word contrastive losses and semantic/object cross-attention
  • Existence head for "no referent" detection via fused scores (referentiality $\times$ mask $\times$ existence) (Dai et al., 5 Sep 2025)
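A minimal sketch of the fused no-referent decision, assuming a simple multiplicative fusion and a hypothetical threshold (the exact fusion in PropVG may differ):

```python
def no_referent(ref_scores, mask_scores, exist_score, thresh=0.5):
    """Declare "no referent" when every proposal's fused score
    (referentiality * mask * existence) falls below a threshold.
    ref_scores, mask_scores: per-proposal scores; exist_score: global scalar."""
    fused = [r * m * exist_score for r, m in zip(ref_scores, mask_scores)]
    return max(fused, default=0.0) < thresh
```

The key idea is that the absent-target decision combines per-proposal evidence with a global existence signal, rather than relying on any single head.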

InstanceVG (Dai et al., 17 Sep 2025) further unifies instance-level joint prediction and segmentation, enforcing box-mask-point consistency per instance query.

Multi-modal and Multi-image LLMs

GeM-VG (Zheng et al., 8 Jan 2026) utilizes a Qwen2-VL vision encoder and LLM decoder to ingest multi-image sets and yield arbitrarily many region predictions per query, blending chain-of-thought reasoning and direct answers. It employs GRPO-style RL finetuning with multi-dimensional rule-based rewards (format, image, precision, recall).
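A rule-based reward over these four dimensions can be sketched as follows; the equal weights and the exact-match treatment of boxes are illustrative assumptions, not GeM-VG's actual reward:

```python
def grounding_reward(pred, gt, format_ok: bool, image_ok: bool,
                     w=(0.25, 0.25, 0.25, 0.25)):
    """Combine format, image, precision, and recall terms into one scalar.
    pred, gt: sets of (image_id, box_tuple) pairs; format_ok / image_ok:
    whether the output syntax and image indices were valid."""
    tp = len(pred & gt)                                   # exact-match true positives
    precision = tp / len(pred) if pred else float(len(gt) == 0)
    recall = tp / len(gt) if gt else float(len(pred) == 0)
    return w[0] * format_ok + w[1] * image_ok + w[2] * precision + w[3] * recall
```

Splitting the reward into format and accuracy terms lets RL finetuning penalize malformed outputs separately from grounding errors.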

VGR (Wang et al., 13 Jun 2025) applies order-autoregressive generation for bounding box selection in the reasoning chain, replaying region tokens for interpretable multimodal deduction.

Hierarchical, Latent, and Segmentation-based Designs

Hierarchical alignment (word/phrase/text-to-object) is employed by HieA2G for robust multi-modal fusion; Latent-VG generates multiple "latent" textual representations via subject distributor and concept injector mechanisms to capture distinct attribute bindings and enhance multi/no-target generalization (Yu et al., 7 Aug 2025, Wang et al., 2 Jan 2025).

4. Training Objectives, Losses, and Optimization Strategies

Supervised Objectives are built from combinations of:

Counting Losses: Auxiliary contrastive counting draws features together for matching cardinalities and apart for differing cardinalities (Wang et al., 2 Jan 2025).
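One possible hinge-style formulation of this count-contrastive idea (a sketch, not the cited paper's exact loss):

```python
import numpy as np

def count_contrastive_loss(features: np.ndarray, counts, margin: float = 1.0):
    """Attract feature pairs with equal target counts, repel pairs with
    different counts by at least `margin`.
    features: (B, D) array; counts: length-B target cardinalities."""
    loss, pairs = 0.0, 0
    B = len(features)
    for i in range(B):
        for j in range(i + 1, B):
            d = float(np.linalg.norm(features[i] - features[j]))
            if counts[i] == counts[j]:
                loss += d ** 2                        # attract: same cardinality
            else:
                loss += max(0.0, margin - d) ** 2     # repel: different cardinality
            pairs += 1
    return loss / max(pairs, 1)
```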

Multi-stage Finetuning: For multi-image RL finetuning, GeM-VG introduces hybrid strategies blending chain-of-thought and direct answers, where RL rewards are modulated for both output syntax and recall/precision (Zheng et al., 8 Jan 2026).

Existence heads and consistency constraints—as in InstanceVG—explicitly supervise no-target discrimination and enforce instance-matching between boxes and masks (Dai et al., 17 Sep 2025).

5. Quantitative Benchmarks and Comparative Results

State-of-the-art GREC/GRES detection and segmentation results (gRefCOCO splits, IoU ≥ 0.5, Pr@F1 / N-acc):

Method        | Val Pr@F1 | Val N-acc | TestA Pr@F1 | TestA N-acc | TestB Pr@F1 | TestB N-acc
MCN           | 28.0      | 30.6      | 32.3        | 32.0        | 26.8        | 30.3
VLT           | 36.6      | 35.2      | 40.2        | 34.1        | 30.2        | 32.5
MDETR         | 42.7      | 36.3      | 50.0        | 34.5        | 36.5        | 31.0
UNINEXT       | 58.2      | 50.6      | 46.4        | 49.3        | 42.9        | 48.2
SimVG         | 62.1      | 54.7      | 64.6        | 57.2        | 54.8        | 57.2
PropVG        | 72.2      | 72.8      | 68.8        | 69.9        | 59.0        | 65.0
InstanceVG    | 73.5      | 72.8      | 70.2        | 71.1        | 60.8        | 65.2
ReLA (Swin-B) | 61.9      | 56.4      | —           | —           | —           | —
HieA2G_R101   | 67.8      | 60.3      | 66.0        | 60.1        | 56.5        | 56.0

Multi-image and multi-instance LLMs (GeM-VG) advance the state of the art by +2.0–9.7 pp over previous bests on dedicated multi-image grounding benchmarks (Zheng et al., 8 Jan 2026).

Zero-shot solutions (GroundVLP) yield absolute gains of +18–28 pp vs. prior zero-shot SOTA, rivaling supervised methods on REC and phrase grounding (Shen et al., 2023).

6. Extensions: Dialogue, Multi-image, Multi-task, and Chain-of-Thought Reasoning

Dialogue-based GREC requires coreference resolution over multiple turns. A three-tier data synthesis framework (template, LLM-generated, dialogue-coreferent) demonstrably bridges train–test domain shift, with mean F1 gains >20 pp on synthetic and real dialogue testbeds (Shao et al., 2 Dec 2025).

Multi-image GREC generalizes the standard task to sets of images, requiring models to aggregate cross-image cues, association reasoning (semantic, spatial, temporal), and explicit multi-image output handling. RL-finetuned LLMs with token-level output formatting outperform single-image counterparts (Zheng et al., 8 Jan 2026).

Multi-task learning (GREC+GRES): Joint and consistency constraints across detection and segmentation improve overall accuracy, with instance-aware masking, point-guided assignment, and cross-modal alignment yielding additional gains (Dai et al., 17 Sep 2025, Ding et al., 8 Jan 2026).

Chain-of-thought integration: Recent advances blend explicit, interpretable reasoning traces with region grounding, where visual replay tokens and multimodal attention progressively locate and justify the referent set (Wang et al., 13 Jun 2025). RL with grounding-aware rewards further boosts performance for complex, reasoning-heavy queries (Zheng et al., 8 Jan 2026).

7. Challenges, Limitations, and Future Directions

  • No-target discrimination: N-acc remains lower than single/multi-target accuracy; the best models reach ~75%.
  • Complex relationships: Handling exception, numerosity, possession, and compound expressions is challenging.
  • Data scale and diversity: Available GREC datasets remain orders of magnitude smaller than counterparts in detection and captioning; large-scale, linguistically rich web benchmarks are urgently needed (Xiao et al., 2024).
  • Metric robustness: Strict F1=1 matching is widely adopted, but soft-F1 and set-level relaxation may better reflect practical utility.
  • Representation design: Object-centric proposals allow symbolic reasoning; pixel-level patch models increase granularity but add ambiguity; hybrids are underexplored (Pantazopoulos et al., 12 Sep 2025).
  • Model generalizability: Universal GREC for static, dynamic (video), multi-view, and multimodal applications remains an open frontier.
  • Efficient RL and interpretability: Balancing reasoning patterns and direct perception is computationally expensive; architectural innovations for scalable hybrid finetuning are needed (Zheng et al., 8 Jan 2026).

The field continues to evolve towards jointly grounded multi-modal models, universal benchmarks, and compositional understanding spanning images, video, and dialogue. Further research is required to optimize large-scale data synthesis, multi-modal chain-of-thought mechanisms, and adaptive architecture designs for robust generalized visual grounding.
