Referring Expression Segmentation (RES) Overview
- Referring Expression Segmentation (RES) is the task of generating pixel-level masks for image regions described by natural language, enabling open-vocabulary and multi-instance segmentation.
- RES models integrate vision encoders and language transformers using cross-modal attention and prototype binding to precisely match spatial regions with linguistic cues.
- Advances in RES include diverse supervision regimes—from fully to weakly and omni-supervised learning—and extend to complex domains like 3D point clouds and fine-grained part segmentation.
Referring Expression Segmentation (RES) is the task of producing pixel-level segmentation masks for regions of an image that correspond to a given natural language expression. Unlike standard semantic segmentation, which assigns each pixel a label from a fixed class set, RES operates in an open-vocabulary regime and directly grounds free-form linguistic queries (e.g., “the man in the blue shirt next to the car”) to precise image regions. RES research has evolved from single-instance, fully-supervised settings to semi- and weakly-supervised protocols, open-vocabulary and multi-target generalizations, and even 3D point cloud domains. This article synthesizes technical developments, datasets, model architectures, supervision protocols, applications, and current challenges in RES and its generalizations.
1. Task Formulation and Variants
RES is formally defined as follows: given an image $I \in \mathbb{R}^{H \times W \times 3}$ and a referring expression $T$, predict a binary mask $M \in \{0,1\}^{H \times W}$ such that the pixels with $M_{ij} = 1$ correspond to the region of $I$ described by $T$ (Wang et al., 2023, Liu et al., 2023, Wang et al., 6 Aug 2025). The classic setting assumes exactly one object is referred to by $T$. Recent work generalizes this to arbitrary cardinality (multi-object or part-level RES, zero-target/no-object cases) (Liu et al., 2023, Li et al., 2024, Ding et al., 8 Jan 2026).
Generalized RES (GRES) defines the output for each sample as a mask $M$ (the union of all regions referred to by $T$) and a flag $E \in \{0,1\}$ indicating the no-target case (Ding et al., 8 Jan 2026). Multi-granularity RES (MRES) further admits queries referring to parts, mixtures, or multi-instance targets, requiring models to recognize and segment object-level, part-level, and composite regions (Wang et al., 2023, Liu et al., 2 Apr 2025).
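To make the GRES target format concrete, the following is a minimal NumPy sketch of how a ground-truth pair $(M, E)$ can be assembled from per-instance masks. Function and variable names are illustrative assumptions, not drawn from any cited paper:

```python
import numpy as np

def build_gres_target(instance_masks: list[np.ndarray], h: int, w: int):
    """Assemble a GRES ground-truth pair (M, E) from per-instance masks.

    instance_masks: binary (h, w) arrays, one per referred instance;
    an empty list encodes a no-target expression.
    """
    if len(instance_masks) == 0:
        # Zero-target case: empty mask, no-target flag E = 1.
        return np.zeros((h, w), dtype=np.uint8), 1
    # Single- or multi-target case: M is the union of all referred regions.
    union = np.zeros((h, w), dtype=np.uint8)
    for m in instance_masks:
        union |= m.astype(np.uint8)
    return union, 0
```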
2. Datasets and Benchmarks
RES has a rich suite of datasets:
- Classic datasets: RefCOCO, RefCOCO+, RefCOCOg (MS-COCO images, single-object per query, tens of thousands of human-written expressions with pixel-level masks) (Liu et al., 2023, Wang et al., 2023, Liu et al., 2 Apr 2025).
- Generalized/multi-target datasets: gRefCOCO (multi-object, no-object cases annotated; ~278k expressions across 19,994 images; covers single/multi/no-target) (Ding et al., 8 Jan 2026, Liu et al., 2023, Li et al., 2024).
- Multi-granularity/part-level: RefCOCOm (manual part annotations for evaluation; 70k part-expressions, 391 part categories) (Wang et al., 2023, Liu et al., 2 Apr 2025). MRES-32M (auto-generated, 32.2M instances, 1M images, covers object and part-level masks/captions, scalable training corpus) (Wang et al., 2023, Liu et al., 2 Apr 2025).
- 3D point cloud: ScanRefer, Multi3DRefer, Multi3DRes (expressions and instance masks for 3D scenes; supports 3D-GRES, including multi-object, zero-object referring) (Wu et al., 2024, Chen et al., 9 Jan 2025).
- Open-vocabulary/part/function: ABO-Image-ARES, ReasonSeg (queries involving part, function, type, spatial properties—challenge for zero-shot and LLM-based models) (Wang et al., 3 May 2025).
3. Model Architectures and Methodologies
Modern RES architectures are principally multimodal, fusing vision and language representations:
- Vision encoder: Typically Swin Transformer or ViT for images (Li et al., 2024, Wang et al., 6 Aug 2025).
- Language encoder: BERT or CLIP-style text Transformer (Liu et al., 2 Apr 2025, Wang et al., 2023).
- Fusion schemes: Cross-attention between region/patch tokens and text tokens, sometimes with iterative multi-scale or region-based interaction (Liu et al., 2023, Li et al., 2024).
- Prototype and region query-based decoders: Adaptive binding prototypes (MABP) bind spatial queries to local regions, enabling multi-instance and part-level segmentation while dispersing the global matching burden across queries (Li et al., 2024). Region-aware attention modules (ReLA) dynamically form regional filters and model region-region and region-language dependencies with soft assignment and attention (Ding et al., 8 Jan 2026, Liu et al., 2023). A generic fusion-and-decode sketch follows this list.
- Multimodal LLMs (MLLMs): Recent RES approaches employ large vision-language models for high-level semantics, combined with semantic-detail fusion (DSFF) for robust mask prediction with lightweight decoders (Wang et al., 6 Aug 2025). Chain-of-thought attribute prompting with LLMs scaffolds open-vocabulary zero-shot segmentation (RESAnything) without fine-tuning, integrating textual reasoning about style, function, or part properties (Wang et al., 3 May 2025).
- 3D extension: RES architectures extend to 3D via point cloud encoding, multimodal query initialization (superpoints, sparse queries), cross-modal attention, and multi-object decoupling losses (Wu et al., 2024, Chen et al., 9 Jan 2025). Rule-guided spatial awareness networks use dependency-tree rules to guide weak supervision for 3D spatial entity modeling (Wu et al., 2024).
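The following is a minimal PyTorch sketch of the generic fusion-and-decode pattern referenced above: visual patch tokens attend to text tokens, then learned region queries decode per-query masks and a GRES no-target flag. Module names, dimensions, and the single-attention-layer structure are illustrative assumptions; actual systems such as MABP and ReLA are substantially more elaborate.

```python
import torch
import torch.nn as nn

class CrossModalMaskHead(nn.Module):
    """Generic fusion-and-decode pattern: patch tokens attend to text
    tokens, then learned region queries produce per-query mask logits."""

    def __init__(self, dim: int = 256, num_queries: int = 8, heads: int = 8):
        super().__init__()
        self.fuse = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.decode = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.no_target = nn.Linear(dim, 1)  # GRES no-target flag E

    def forward(self, patches: torch.Tensor, text: torch.Tensor):
        # patches: (B, HW, C) visual tokens; text: (B, L, C) word tokens.
        fused, _ = self.fuse(patches, text, text)       # vision attends to language
        q = self.queries.unsqueeze(0).expand(patches.size(0), -1, -1)
        q, _ = self.decode(q, fused, fused)             # queries attend to fused map
        masks = torch.einsum("bqc,bpc->bqp", q, fused)  # per-query logits over patches
        flag = self.no_target(q.mean(dim=1))            # (B, 1) no-target logit
        return masks, flag
```

The per-query logits can then be reshaped to (B, Q, H, W) and aggregated, e.g., by taking the union over matched queries, to produce the final GRES mask.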
4. Supervision Regimes: Full, Semi, Weak, and Omni-Supervised
RES has evolved past fully-supervised learning with dense mask annotations:
- Semi-supervised learning: Teacher-student schemes use a small labeled set and a large unlabeled corpus, enforcing pseudo-label consistency and refining pseudo-masks with models such as SAM (Segment Anything Model) for precise boundaries (Yang et al., 2024, Zang et al., 2024). Specialized data augmentations, text perturbation, and confidence-guided training yield strong improvements (e.g., SemiRES achieves +18.64% IoU over the supervised baseline with 1% of labels) (Yang et al., 2024, Zang et al., 2024). A minimal pseudo-label filtering sketch follows this list.
- Weakly-supervised learning: Training uses click-, point-, or box-level annotations instead of full masks, with weak-label selection or active pseudo-mask refinement (APLR) for mask generation from weak signals (Huang et al., 2023, Nag et al., 2024). SafaRi introduces cross-modal fusion with attention consistency, bootstrapping robust segmentation from partial masks/boxes via spatially aware pseudo-label filtering (Nag et al., 2024).
- Omni-supervised learning: Omni-RES unifies labeled, weakly labeled, and unlabeled data, employing teacher-student learning with mask-refinement criteria (e.g., grounding points/boxes serve as trustworthiness checks for pseudo-masks) (Huang et al., 2023). This regime can leverage large-scale external weakly annotated corpora (e.g., Visual Genome) to reach fully-supervised performance with only 10% dense labels (Huang et al., 2023).
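As a minimal sketch of the teacher-student pattern shared by these regimes (the confidence heuristic, threshold, and EMA rate are illustrative assumptions, not values from the cited papers):

```python
import torch

@torch.no_grad()
def select_pseudo_masks(teacher, images, texts, conf_thresh: float = 0.9):
    """Keep only high-confidence teacher predictions as pseudo-labels.

    Assumes teacher(images, texts) returns (B, H, W) mask logits.
    """
    logits = teacher(images, texts)
    probs = torch.sigmoid(logits)
    masks = (probs > 0.5).float()
    # Mean confidence of each pixel's chosen side as a trust score.
    conf = torch.where(masks.bool(), probs, 1 - probs).mean(dim=(1, 2))
    keep = conf > conf_thresh
    return masks[keep], keep

@torch.no_grad()
def ema_update(teacher, student, decay: float = 0.999):
    """Teacher weights track the student via exponential moving average."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(decay).add_(s, alpha=1 - decay)
```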
5. Evaluation Metrics and Results
Standard benchmarks and metrics:
- Intersection over Union (IoU): Standard per-instance overlap, reported as mean (mIoU), overall (oIoU), cumulative over all pixels (cIoU), or generalized to handle multi-target/zero-target samples (gIoU) (Ding et al., 8 Jan 2026, Liu et al., 2023, Wang et al., 2023, Wang et al., 6 Aug 2025, Li et al., 2024). Reference implementations are sketched after this list.
- Precision@X: Fraction of predictions achieving IoU above the threshold $X$ (Liu et al., 2023).
- No-target and target accuracy (N-acc, T-acc): Measures correct absent/present predictions in generalized settings (Ding et al., 8 Jan 2026).
- Qualitative assessments: Fine-grained boundary, part-level detail, functional/implicit referencing (Wang et al., 3 May 2025, Wang et al., 2023).
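A reference sketch of the IoU-family metrics in NumPy follows. The no-target convention here (score 1 for a correctly empty prediction, 0 otherwise) follows common GRES practice, but exact conventions vary by benchmark:

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Per-sample intersection over union for binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union) if union > 0 else 0.0

def evaluate(preds, gts):
    """preds/gts: lists of binary masks; an all-zero gt encodes no-target.

    mIoU averages per-sample IoU over targeted samples; oIoU/cIoU pools
    intersections and unions over all pixels; gIoU additionally scores
    no-target samples as 1 when the prediction is also empty, else 0.
    """
    per_sample, g_scores, inter_sum, union_sum = [], [], 0, 0
    for p, g in zip(preds, gts):
        inter_sum += np.logical_and(p, g).sum()
        union_sum += np.logical_or(p, g).sum()
        if g.sum() == 0:  # no-target sample
            g_scores.append(1.0 if p.sum() == 0 else 0.0)
        else:
            s = iou(p, g)
            per_sample.append(s)
            g_scores.append(s)
    return {"mIoU": float(np.mean(per_sample)) if per_sample else 0.0,
            "oIoU": float(inter_sum / max(union_sum, 1)),
            "gIoU": float(np.mean(g_scores))}
```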
Critical model results:
- GRES/MRES: MABP sets state-of-the-art on gRefCOCO (val cIoU/gIoU: 65.7/68.9%; testB: 62.8/64.0%), outperforming ReLA and prior bests (Li et al., 2024). UniRES++—a unified MLLM architecture—achieves +8–10% part-level mIoU on RefCOCOm and +6–10% gains on gRefCOCO’s cIoU/gIoU (Liu et al., 2 Apr 2025).
- Classic RES: UniRES++ reaches 80.8/82.4/77.0% mIoU on RefCOCO val/testA/testB, surpassing prior SOTA (Liu et al., 2 Apr 2025). MLLMSeg delivers 78.9% cIoU on RefCOCO with only 34M decoder parameters, matching heavy SAM-based approaches (Wang et al., 6 Aug 2025).
- 3D-GRES: MDIN attains 47.5% mIoU and 44.7% Acc@0.5 on Multi3DRes, +4.5 to +8.8 points over earlier methods (Wu et al., 2024).
- Semi/Omni-supervised: SemiRES, Omni-RES, and SafaRi consistently outperform supervised-only baselines at low annotation rates; e.g., Omni-Box with 10% labels matches 100% supervision (Huang et al., 2023, Yang et al., 2024, Nag et al., 2024).
- Zero-shot/LLM-based: RESAnything achieves gIoU/cIoU of 78.2/72.4% on ABO-Image-ARES and 74.6/72.5% on ReasonSeg, far exceeding zero-shot baselines and challenging fine-tuned specialists (Wang et al., 3 May 2025).
6. Technical Innovations and Trends
Technical advances driving the field include:
- Prototype and region query binding: Adaptive prototypes for region-specific matching enable flexible instance/part assignment, reducing pressure on encoder modules in multi-target/part settings (Li et al., 2024).
- Attribute prompting and chain-of-thought reasoning: Scaffolds LLMs to produce rich attribute-based region descriptions, improving zero-shot grounding of part-level and implicit queries (Wang et al., 3 May 2025); an illustrative prompt template follows this list.
- Multi-task and collaboration: Joint training of RES with referring expression comprehension (REC) and generation (REG) achieves mutual benefits; disambiguation supervision and pseudo-expression generation scale up training data (Huang et al., 2022, Luo et al., 2020).
- Region-region, region-language explicit modeling: RIA and RLA attention mechanisms facilitate complex relational reasoning and robust multi-instance segmentation (Ding et al., 8 Jan 2026, Liu et al., 2023).
- Robustness: MBA (Multimodal Bidirectional Attack) exposes adversarial vulnerabilities unique to RES’s multimodal structure, requiring new defenses (Chen et al., 19 Jun 2025).
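As an illustration of attribute-based chain-of-thought prompting, the template below shows the general shape of such a prompt. The wording is hypothetical, not the prompt used by RESAnything:

```python
# Illustrative chain-of-thought attribute prompt for scoring candidate
# regions against a referring expression; wording is hypothetical.
ATTRIBUTE_PROMPT = """You are grounding the expression: "{expr}".
Step 1: List the attributes it implies (object class, part, color,
        material, function, spatial relations).
Step 2: For the candidate region described as: "{region_caption}",
        state which attributes match and which do not.
Step 3: Output a match score in [0, 1] on the final line."""

def build_prompt(expr: str, region_caption: str) -> str:
    """Fill the template for one (expression, candidate region) pair."""
    return ATTRIBUTE_PROMPT.format(expr=expr, region_caption=region_caption)
```

In a zero-shot pipeline of this kind, class-agnostic proposals (e.g., from SAM) are captioned, each caption is scored against the expression with such a prompt, and the mask of the best-scoring proposal is returned.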
7. Challenges, Limitations, and Future Directions
Open issues and research frontiers include:
- Fine-grained part-level segmentation: Even SOTA MRES models (UniRES) underperform (<20% mIoU) on the smallest and most challenging part queries (Wang et al., 2023).
- Weakly supervised segment quality: Pseudo-label quality, especially from weak signals (points/boxes), remains critical; active refinement and selection techniques are required (Huang et al., 2023).
- Spatial reasoning and coreference: 3D and multi-image scenarios necessitate entity-level spatial tracking, dependency-tree parsing, and explicit spatial losses for genuine disambiguation (Wu et al., 2024, Wu et al., 2023).
- Open-vocabulary/zero-shot coverage: LLM pipelines with attribute prompts are promising but limited by upstream proposal quality (e.g., SAM missing thin or occluded parts) (Wang et al., 3 May 2025).
- Scalability and annotation: Leveraging web-scale weakly annotated vision-language corpora, efficient bootstrapping, and multi-task collaborative learning is a major direction (Huang et al., 2023, Nag et al., 2024).
- Adversarial robustness: MBA’s findings suggest RES models require cross-modal adversarial defenses and prompt-randomization, not image-only hardening (Chen et al., 19 Jun 2025).
- Applications: Beyond traditional image segmentation, RES methods are being extended to video (temporal consistency), open-world visual reasoning, and embodied scene understanding for robotics (Nag et al., 2024, Chen et al., 9 Jan 2025).
In sum, RES and its generalizations have advanced from basic object-level pixel masking to complex, scalable, weakly-supervised, multi-instance, and open-vocabulary segmentation across 2D and 3D domains. Future progress will depend on principled multimodal fusion, spatially and semantically explicit architectures, learning from weak and large-scale annotations, and robustly grounding fine-grained, relational, and implicit queries (Ding et al., 8 Jan 2026, Li et al., 2024, Wang et al., 2023, Liu et al., 2 Apr 2025, Wang et al., 6 Aug 2025, Wu et al., 2024).