Generalized Referring Expression Comprehension
- GREC is defined as grounding referring expressions to zero, one, or multiple objects using set-valued mappings.
- Models use specialized multi-modal techniques, including hierarchical alignment and count heads, to resolve ambiguities in multi-target and no-target queries.
- Benchmark datasets like gRefCOCO and evaluation metrics (e.g., Precision, N-acc) demonstrate both model advances and persistent challenges in achieving accurate detection.
Generalized Referring Expression Comprehension (GREC) denotes the family of vision-language grounding tasks where a model receives a natural-language referring expression and must identify every corresponding object (zero, one, or many) in a visual scene. GREC extends the classical Referring Expression Comprehension (REC) paradigm—traditionally constrained to single-target expressions—by supporting multi-target (e.g., “the three red apples”) and no-target (e.g., “no blue car here”) queries. This generalization captures the open-world and compositional demands of real-world applications, where referring language is unconstrained and ambiguous in target count. The GREC task is tightly linked with its direct sibling, Generalized Referring Expression Segmentation (GRES), as well as group-wise, dialogue-based, instance-retrieval, and weakly supervised versions, all of which leverage or extend the foundational GREC definition.
1. Formal Problem Definition and Core Distinctions
Generalized Referring Expression Comprehension (GREC) is defined as follows: Given an input image $I$ and a natural-language expression $E$, the model must predict a set of bounding boxes $B = \{b_1, \dots, b_k\}$, $k \geq 0$, each corresponding to an object referred to by $E$. The cardinality of $B$ is determined by the semantics of $E$ and may be zero (“no-target” case), one (classical case), or more than one (“multi-target”) (He et al., 2023, Ding et al., 8 Jan 2026, Dai et al., 17 Sep 2025, Wang et al., 2 Jan 2025).
This framework removes the one-to-one mapping assumption (one expression $\to$ one object) of classical REC, subsuming cases where $|B| \neq 1$. Formally, a set-valued mapping is required: $f: (I, E) \mapsto \hat{B}$, where $\hat{B}$ approximates the true target set $B^{*}$ under an intersection-over-union (IoU) criterion.
No-target expressions demand that the model reject all possible region proposals—an unaddressed challenge in classical REC. Multi-target expressions introduce combinatorial complexity, requiring joint grounding and (often) object counting (He et al., 2023, Ding et al., 8 Jan 2026).
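The set-valued success criterion above can be sketched as a per-sample check: a prediction counts as correct only if a one-to-one matching with IoU ≥ 0.5 covers every ground-truth box with no extra predictions, and an empty prediction set is required for no-target expressions. The helper names below are illustrative, not from any benchmark toolkit.

```python
from itertools import permutations

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def sample_correct(pred, gt, thr=0.5):
    """True iff pred and gt admit a perfect one-to-one IoU >= thr matching.

    Covers the no-target case: both sets empty -> correct."""
    if len(pred) != len(gt):   # a perfect match requires equal cardinality
        return False
    if not gt:                 # no-target: the empty prediction set succeeds
        return True
    # Exact search over assignments (fine for the small k typical of GREC)
    return any(all(iou(p, g) >= thr for p, g in zip(perm, gt))
               for perm in permutations(pred))
```

Note that the cardinality check comes first: a model that finds all targets but also emits one spurious box still fails the sample, which is what makes the multi-target setting strictly harder than per-box detection.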
2. Benchmark Datasets and Evaluation Protocols
Datasets
- gRefCOCO: The first large-scale GREC benchmark, constructed by extending RefCOCO with multi-target and no-target queries for each image. It comprises 19,994 images, 259,859 expressions (≈50% single-target, 35% multi-target, 13% no-target), and 61,316 distinct annotated instances (Ding et al., 8 Jan 2026, He et al., 2023).
- GRD: Targets group-wise GRES, providing exhaustive mask annotations for both positive and hard negative examples across related image sets (Wu et al., 2023).
- REIRCOCO: Gallery-level expressions for instance retrieval, linking fine-grained queries to object-level groundings across a large image corpus (Hao et al., 23 Jun 2025).
Evaluation Metrics
GREC tasks use set- and sample-level matching metrics:
- Precision@(F₁=1, IoU≥0.5): The fraction of samples for which all and only the correct instances are detected with IoU ≥ 0.5, i.e., a perfect per-sample F₁ score. For no-target expressions, the output must be the empty set.
- No-target Accuracy (N-acc): Percentage of no-target expressions where the model predicts zero boxes.
- Counting/Recall (e.g., T-acc): The proportion of expressions for which at least one correct box (for multi-target cases) or the correct number of bounding boxes is predicted.
- Recall@K and BoxRecall@K(τ): Used in gallery/instance-retrieval GREC (e.g., REIR), combining retrieval and localization (Hao et al., 23 Jun 2025).
- Perceptual & reasoning breakdowns: Specialized benchmark suites (e.g., RefBench-PRO) decompose accuracy into perception (attribute, position, interaction) and reasoning (relation, commonsense) as well as rejection metrics for none-of-the-above cases (Gao et al., 6 Dec 2025).
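Once a per-sample correctness decision exists, the headline numbers reduce to simple averages over disjoint subsets. The aggregation below is an illustrative sketch, not an official benchmark script; the input format is an assumption.

```python
def grec_metrics(samples):
    """Aggregate per-sample results into Pr@F1 and N-acc.

    `samples` is a list of (is_correct, is_no_target) pairs, where
    is_correct follows the per-sample matching criterion and
    is_no_target flags expressions with an empty ground-truth set.
    (Illustrative aggregation, not a benchmark's official code.)"""
    total = len(samples)
    no_target = [c for c, nt in samples if nt]
    pr_f1 = sum(c for c, _ in samples) / total if total else 0.0
    n_acc = sum(no_target) / len(no_target) if no_target else 0.0
    return pr_f1, n_acc
```

This makes the distinction between the two metrics explicit: Pr@F₁ averages over all samples, while N-acc conditions on the no-target subset only, which is why a model can post a strong Pr@F₁ while still hallucinating boxes on most zero-referent expressions.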
3. Model Architectures and Methodological Advances
Classical Extensions
Initial approaches adapt single-target REC models (e.g., MCN, VLT, MDETR, UNINEXT) to multi-target outputs via multi-label classification, confidence thresholding, or set-prediction losses, but fall short in zero-target detection and precise handling of multi-target semantics (He et al., 2023, Ding et al., 8 Jan 2026).
Explicit Generalized GREC Designs
- ReLA: Partitions images into soft regions using cross-attention, models region–region and region–language dependencies, and predicts the number of targets via a count head. This avoids ad-hoc thresholding and enhances multi-target detection (Ding et al., 8 Jan 2026).
- HieA2G: Incorporates hierarchical cross-modal alignment at word-object, phrase-object, and text-image levels, coupled with an Adaptive Grounding Counter (AGC) that dynamically predicts output cardinality, further reinforced by a supervised contrastive loss that clusters count-equivalent samples (Wang et al., 2 Jan 2025).
- InstanceVG: A single network for joint GREC and GRES, instantiating instance-aware queries with spatial priors, a deformable Transformer decoder, and a perception head enforcing box–mask consistency via Hungarian matching. This framework achieves SOTA on ten benchmarks and demonstrates the benefits of unified training for segmentation and detection tasks (Dai et al., 17 Sep 2025).
Weakly and Zero-Shot Supervision
- LIHE: Proposes a two-stage approach—VLM-based referential decoupling of an expression into sub-phrases, then hybrid Euclidean–hyperbolic similarity (HEMix) based anchor matching to ground each sub-phrase, all under weak supervision with only a binary indicator of validity at train time (Shi et al., 15 Nov 2025).
Zero-shot
- Prompt-based VLM Pipelines: Solutions for the GCAIAC Challenge utilize CLIP and SAM, employing multi-granularity visual prompts (coarse bounding-box overlays, fine-grained SAM-based masks), textual prompt denoising, and permutation-based joint assignment via the Hungarian algorithm. This achieves substantial improvements in zero-shot GREC (Huang et al., 2024).
Instance Retrieval
- CLARE: Bridges the gallery-level retrieval and instance-level grounding gap by learning joint embeddings for images, object candidates, and queries via a dual-stream network with a Mix of Relation Experts (MORE) module. This outperforms stage-wise REC+retrieval pipelines in both accuracy and scalability (Hao et al., 23 Jun 2025).
4. Principal Experimental Findings
Performance Trends
- Classic REC models, when extended to GREC settings, see a dramatic drop in precision—from ≈85–90% (single-target, RefCOCO) to 30–50% on multi-/no-target gRefCOCO splits (He et al., 2023, Ding et al., 8 Jan 2026).
- Explicit region-aware, multi-modal, and count-aware architectures (ReLA, HieA2G, InstanceVG) raise GREC precision to 60–73% on gRefCOCO (val) (Ding et al., 8 Jan 2026, Wang et al., 2 Jan 2025, Dai et al., 17 Sep 2025).
- No-target detection remains a major bottleneck—state-of-the-art models achieve N-acc of ≈56–73% but still miss a significant fraction of zero-referent cases (Ding et al., 8 Jan 2026, Wang et al., 2 Jan 2025, Dai et al., 17 Sep 2025, Gao et al., 6 Dec 2025, Shi et al., 15 Nov 2025).
- In weakly-supervised GREC (WGREC), integrating hyperbolic representations for hierarchy and instance-splitting improves both detection and differentiation between semantically similar referents (Shi et al., 15 Nov 2025).
Ablative Insights
- Joint training on GREC and GRES (e.g., InstanceVG) gives consistent boosts to both detection and pixel-level segmentation, highlighting the streamlining effect of consistent multi-granularity predictions (Dai et al., 17 Sep 2025).
- Hierarchical alignment modules substantially improve grounding of multi-clause, complex, or ambiguous queries by resolving linguistic ambiguity across multiple semantic levels (Wang et al., 2 Jan 2025).
- Counting modules (e.g., AGC, explicit count heads) are crucial—simple threshold-based selection of outputs yields inferior results relative to models that explicitly predict output cardinality (Wang et al., 2 Jan 2025, Ding et al., 8 Jan 2026).
5. Advanced Evaluation and Taxonomy
GREC has been further dissected along compositional axes:
- Perceptual and Reasoning Benchmarks: RefBench-PRO decomposes test queries by visual (attribute, position, interaction) and reasoning (relation, commonsense, rejection) difficulty, revealing aggregate model performance gaps—particularly in rejection and high-level reasoning settings (e.g., accuracy on 'Relation' and 'Commonsense' ≈6% below perception; rejection accuracy often at random levels) (Gao et al., 6 Dec 2025).
- Spatial Language and Compositionality: Explicit reasoning over spatial categories and negation demonstrates that large VLMs and specialized REC models exhibit marked performance drop when facing directional, chained, or negated queries—pointing to persistent constraints in compositional reasoning (Tumu et al., 4 Feb 2025).
| Model | Pr@F₁ (gRefCOCO val) | N-acc (%) | Notable Feature |
|---|---|---|---|
| MCN† | 28.0 | 30.6 | Classic, YOLO-based REC (He et al., 2023) |
| UNINEXT† | 58.2 | 50.6 | DETR-like extension |
| ReLA (Swin-B) | 61.9 | 56.4 | Region–region/language attn |
| HieA2G | 67.8 | 60.3 | Hierarchical alignment, AGC |
| InstanceVG | 73.5 | 72.8 | Joint GREC/GRES, instance-aware |
| LIHE (WS) | 39.6 | 67.5 | Weak supervision, hyperbolic |
6. Extensions: Group-wise, Dialogue-based, and Segmentation GREC
- Group-wise GREC/GRES: GRSer on GRD demonstrates that leveraging intra-group visual "experts" and explicit heatmap reliability ranking increases performance on both segmentation and negative rejection (Wu et al., 2023).
- Dialogue-grounded GREC: Multi-turn and coreferential queries require specialized data synthesis and model adaptation. Three tiers of synthetic data (template-based, GPT-prompted, and dialogue-grounded) enable significant gains under distribution shift, although coreference beyond two dialogue turns remains a challenge (Shao et al., 2 Dec 2025).
- GREC–GRES Joint Modeling: Recent models increasingly unify detection and segmentation heads with instance-aware joint loss, supporting true pixel-level grounding for multiple or zero referents (Dai et al., 17 Sep 2025, Ding et al., 8 Jan 2026, Li et al., 2024).
7. Open Problems and Future Research Directions
Current limitations and research frontiers include:
- No-target and negative case detection: Models still hallucinate objects or fail to abstain from incorrect box outputs, especially in deceptive negative cases or when compositional reasoning is required (Ding et al., 8 Jan 2026, Gao et al., 6 Dec 2025, Shi et al., 15 Nov 2025).
- Complex relational and ordinal language: Counting, exclusion (“everyone except …”), and nested relationships remain error-prone across architectures (Ding et al., 8 Jan 2026, Shao et al., 2 Dec 2025, Tumu et al., 4 Feb 2025).
- Scaling to open-world and cross-domain expressions: Dataset and model generalization beyond COCO-centric domains and to multilingual settings is only partially addressed (Ding et al., 8 Jan 2026, Wang et al., 2 Jan 2025).
- Fine-grained spatial and commonsense reasoning: Directionality, dynamic relations, negation, and task composition continue to limit model performance; enhanced compositional pipelines and negative/counterfactual sampling are recommended (Tumu et al., 4 Feb 2025, Gao et al., 6 Dec 2025).
- Weak and zero-shot slots: Further study of VLM decoupling, adaptive prompt learning, and pseudo-label distillation (e.g., LIHE as teacher) is needed to close the performance gap under low/no supervision (Shi et al., 15 Nov 2025, Huang et al., 2024).
Future research is expected to advance multi-modal compositional networks, negative sample mining, hybrid geometric representations, and integration of world knowledge via LLMs to address the current limits of Generalized Referring Expression Comprehension.
References
- (He et al., 2023): "GREC: Generalized Referring Expression Comprehension"
- (Ding et al., 8 Jan 2026): "GREx: Generalized Referring Expression Segmentation, Comprehension, and Generation"
- (Dai et al., 17 Sep 2025): "Improving Generalized Visual Grounding with Instance-aware Joint Learning"
- (Wang et al., 2 Jan 2025): "Hierarchical Alignment-enhanced Adaptive Grounding Network for Generalized Referring Expression Comprehension"
- (Shi et al., 15 Nov 2025): "LIHE: Linguistic Instance-Split Hyperbolic-Euclidean Framework for Generalized Weakly-Supervised Referring Expression Comprehension"
- (Gao et al., 6 Dec 2025): "RefBench-PRO: Perceptual and Reasoning Oriented Benchmark for Referring Expression Comprehension"
- (Huang et al., 2024): "The Solution for the 5th GCAIAC Zero-shot Referring Expression Comprehension Challenge"
- (Hao et al., 23 Jun 2025): "Referring Expression Instance Retrieval and A Strong End-to-End Baseline"
- (Wu et al., 2023): "Advancing Referring Expression Segmentation Beyond Single Image"
- (Shao et al., 2 Dec 2025): "Making Dialogue Grounding Data Rich: A Three-Tier Data Synthesis Framework for Generalized Referring Expression Comprehension"
- (Li et al., 2024): "Bring Adaptive Binding Prototypes to Generalized Referring Expression Segmentation"
- (Tumu et al., 4 Feb 2025): "Exploring Spatial Language Grounding Through Referring Expressions"