
Referring Expression Segmentation (RES) Overview

Updated 15 January 2026
  • Referring Expression Segmentation (RES) is the task of generating pixel-level masks for image regions described by natural language, enabling open-vocabulary and multi-instance segmentation.
  • RES models integrate vision encoders and language transformers using cross-modal attention and prototype binding to precisely match spatial regions with linguistic cues.
  • Advances in RES include diverse supervision regimes—from fully to weakly and omni-supervised learning—and extend to complex domains like 3D point clouds and fine-grained part segmentation.

Referring Expression Segmentation (RES) is the task of producing pixel-level segmentation masks for regions of an image that correspond to a given natural language expression. Unlike standard semantic segmentation (assigning class labels to pixels), RES operates in an open-vocabulary regime and directly grounds free-form linguistic queries (e.g., “the man in the blue shirt next to the car”) to precise image regions. RES research has evolved from single-instance, fully-supervised settings to semi- and weakly-supervised protocols, open-vocabulary and multi-target generalizations, and even 3D point cloud domains. This article synthesizes technical developments, datasets, model architectures, supervision protocols, applications, and current challenges in RES and its generalizations.

1. Task Formulation and Variants

RES is formally defined as follows: given an image $I \in \mathbb{R}^{H \times W \times 3}$ and a referring expression $T$, predict a binary mask $M \in \{0,1\}^{H \times W}$ such that pixels with $M(x, y) = 1$ correspond to the region of $I$ described by $T$ (Wang et al., 2023, Liu et al., 2023, Wang et al., 6 Aug 2025). The classic setting assumes exactly one object is referred to by $T$. Recent work generalizes this to arbitrary cardinality (multi-object or part-level RES, zero-target/no-object cases) (Liu et al., 2023, Li et al., 2024, Ding et al., 8 Jan 2026).

Generalized RES (GRES) defines the output for each sample as a mask $M_{GT}$ (the union of all regions referred to) and a flag $E_{GT} \in \{0,1\}$ (no-target indicator) (Ding et al., 8 Jan 2026). Multi-granularity RES (MRES) further admits queries referring to parts, mixtures, or multi-instance targets, requiring models to recognize and segment object-level, part-level, and composite regions (Wang et al., 2023, Liu et al., 2 Apr 2025).
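The GRES target format above (union mask plus no-target flag) can be sketched in a few lines. This is a minimal illustration of the data structure, not any paper's actual code; the function name is ours.

```python
import numpy as np

def build_gres_target(instance_masks, h, w):
    """Build a Generalized RES target: the union mask M_GT of all
    referred instances plus the no-target flag E_GT.

    instance_masks: list of (H, W) binary arrays; an empty list is
    the zero-target case, where E_GT = 1 and the mask is empty.
    """
    if len(instance_masks) == 0:
        return np.zeros((h, w), dtype=np.uint8), 1
    union = np.zeros((h, w), dtype=np.uint8)
    for m in instance_masks:
        union = np.maximum(union, m.astype(np.uint8))  # pixel-wise OR
    return union, 0

# Two referred instances merge into one union mask (E_GT = 0).
a = np.zeros((4, 4), dtype=np.uint8); a[0, 0] = 1
b = np.zeros((4, 4), dtype=np.uint8); b[3, 3] = 1
mask, no_target = build_gres_target([a, b], 4, 4)
```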

2. Datasets and Benchmarks

RES has a rich suite of datasets: the classic RefCOCO benchmarks (val/testA/testB splits) for single-target RES, gRefCOCO for generalized multi- and zero-target RES, RefCOCOm for multi-granularity part-level queries, Multi3DRes for 3D point clouds, and ABO-Image-ARES and ReasonSeg for attribute- and reasoning-driven evaluation. Visual Genome additionally serves as a large, weakly annotated external corpus.

3. Model Architectures and Methodologies

Modern RES architectures are principally multimodal, fusing vision and language representations through cross-modal attention, region queries, and prototype binding; recent designs pair multimodal LLMs with lightweight mask decoders (e.g., MLLMSeg's 34M-parameter decoder).
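The cross-modal attention at the heart of these fusion modules can be sketched as a single-head attention where spatial vision features query the language tokens. This is a toy NumPy illustration of the mechanism, not a specific model's implementation.

```python
import numpy as np

def cross_modal_attention(vision_feats, text_feats):
    """Single-head cross-attention: each spatial vision feature (query)
    attends over language token features (keys/values), yielding
    language-conditioned pixel features.
    vision_feats: (P, d) flattened spatial features; text_feats: (T, d)."""
    d = vision_feats.shape[-1]
    scores = vision_feats @ text_feats.T / np.sqrt(d)  # (P, T) logits
    scores -= scores.max(axis=-1, keepdims=True)       # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ text_feats                           # (P, d) fused features

rng = np.random.default_rng(0)
fused = cross_modal_attention(rng.normal(size=(16, 8)),  # 16 "pixels"
                              rng.normal(size=(5, 8)))   # 5 word tokens
```

In full models this block is stacked, multi-headed, and often bidirectional (text also attends to vision), but the grounding signal is the same: attention weights align spatial regions with linguistic cues.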

4. Supervision Regimes: Full, Semi, Weak, and Omni-Supervised

RES has evolved beyond fully-supervised learning with dense mask annotations:

  • Semi-supervised learning: Teacher–student schemes utilize a small labeled set and a large unlabeled corpus, employing techniques such as pseudo-label consistency, refined via models like SAM (Segment Anything Model) for precise boundaries (Yang et al., 2024, Zang et al., 2024). Specialized data augmentations, text perturbation, and confidence-guided training yield strong improvements (e.g., SemiRES achieves +18.64% IoU over the supervised baseline with only 1% of labels) (Yang et al., 2024, Zang et al., 2024).
  • Weakly-supervised learning: Training uses click-, point-, or box-level annotations instead of full masks, with weak-label selection or active pseudo-mask refinement (APLR) for mask generation from weak signals (Huang et al., 2023, Nag et al., 2024). SafaRi introduces cross-modal fusion with attention consistency, bootstrapping robust segmentation from partial masks/boxes via spatially aware pseudo-label filtering (Nag et al., 2024).
  • Omni-supervised learning: Omni-RES unifies labeled, weakly labeled, and unlabeled data, employing teacher-student learning and mask refinement criteria (e.g., grounding points/boxes serve as trustworthiness checks for pseudo-masks) (Huang et al., 2023). This mode allows leveraging large-scale external weakly annotated corpora (Visual Genome) to reach fully-supervised performance with 10% dense labels (Huang et al., 2023).
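The confidence-guided pseudo-label filtering common to the teacher-student schemes above can be sketched as follows. The function name and the 0.9 threshold are illustrative assumptions, not values from any cited paper.

```python
import numpy as np

def select_pseudo_masks(prob_maps, threshold=0.9):
    """Keep a teacher's predicted mask only if its mean foreground
    confidence clears a threshold, so the student trains on trustworthy
    pseudo-labels. prob_maps: (N, H, W) teacher probabilities.
    Returns binarized masks and a boolean keep-flag per sample."""
    binary = (prob_maps > 0.5).astype(np.uint8)
    keep = []
    for p, m in zip(prob_maps, binary):
        fg = p[m == 1]
        conf = fg.mean() if fg.size else 0.0  # empty prediction: reject
        keep.append(conf >= threshold)
    return binary, np.array(keep)

# A confident prediction survives; an uncertain one is filtered out.
probs = np.stack([np.full((4, 4), 0.95), np.full((4, 4), 0.6)])
masks, keep = select_pseudo_masks(probs)
```

Real systems add further criteria on top of raw confidence (boundary refinement with SAM, spatial agreement with grounding boxes/points, attention consistency), but thresholded selection of this kind is the basic gate.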

5. Evaluation Metrics and Results

Standard benchmarks include RefCOCO (val/testA/testB), gRefCOCO, RefCOCOm, Multi3DRes, ABO-Image-ARES, and ReasonSeg; the principal metrics are cumulative IoU (cIoU), generalized IoU (gIoU), mean IoU (mIoU), and precision/accuracy at fixed IoU thresholds.
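The IoU-based metrics reported throughout the results below can be sketched as follows. This is a minimal illustration, not any benchmark's official evaluation code; the empty-vs-empty convention for no-target samples is a common choice, stated here as an assumption.

```python
import numpy as np

def ciou_giou(preds, gts):
    """Cumulative IoU (cIoU) pools intersection and union pixels over the
    whole evaluation set; generalized IoU (gIoU) averages per-image IoU.
    An empty-prediction/empty-target pair counts as IoU 1.0, the usual
    convention for correctly handled no-target samples."""
    inter_sum, union_sum, per_image = 0, 0, []
    for p, g in zip(preds, gts):
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        inter_sum += inter
        union_sum += union
        per_image.append(1.0 if union == 0 else inter / union)
    ciou = inter_sum / union_sum if union_sum else 1.0
    return ciou, float(np.mean(per_image))

# One image: prediction covers everything, target is the diagonal.
p = np.ones((2, 2), dtype=bool)
g = np.eye(2, dtype=bool)
ciou, giou = ciou_giou([p], [g])  # intersection 2, union 4
```

cIoU rewards large objects (big unions dominate the pooled ratio), while gIoU weights every sample equally, which is why GRES papers report both.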

Critical model results:

  • GRES/MRES: MABP sets state-of-the-art on gRefCOCO (val cIoU/gIoU: 65.7/68.9%; testB: 62.8/64.0%), outperforming ReLA and prior bests (Li et al., 2024). UniRES++—a unified MLLM architecture—achieves +8–10% part-level mIoU on RefCOCOm and +6–10% gains on gRefCOCO’s cIoU/gIoU (Liu et al., 2 Apr 2025).
  • Classic RES: UniRES++ reaches 80.8/82.4/77.0% mIoU on RefCOCO val/testA/testB, surpassing prior SOTA (Liu et al., 2 Apr 2025). MLLMSeg delivers 78.9% cIoU on RefCOCO with only 34M decoder parameters, matching heavy SAM-based approaches (Wang et al., 6 Aug 2025).
  • 3D-GRES: MDIN attains 47.5% mIoU and 44.7% thresholded accuracy (Acc@IoU) on Multi3DRes, +4.5 to +8.8 points over earlier methods (Wu et al., 2024).
  • Semi/Omni-supervised: SemiRES, Omni-RES, and SafaRi consistently outperform fully supervised training at matched low annotation budgets; e.g., Omni-Box (10% labels) matches 100% supervision (Huang et al., 2023, Yang et al., 2024, Nag et al., 2024).
  • Zero-shot/LLM-based: RESAnything achieves gIoU/cIoU of 78.2/72.4% on ABO-Image-ARES and 74.6/72.5% on ReasonSeg, far exceeding zero-shot baselines and challenging fine-tuned specialists (Wang et al., 3 May 2025).

6. Technical Advances

Technical advances driving the field include:

  • Prototype and region query binding: Adaptive prototypes for region-specific matching enable flexible instance/part assignment, reducing pressure on encoder modules in multi-target/part settings (Li et al., 2024).
  • Attribute prompting and chain-of-thought reasoning: These scaffold LLMs to produce rich attribute-based region descriptions, improving zero-shot grounding of part-level and implicit queries (Wang et al., 3 May 2025).
  • Multi-task and collaboration: Joint training of RES with referring expression comprehension (REC) and generation (REG) achieves mutual benefits; disambiguation supervision and pseudo-expression generation scale up training data (Huang et al., 2022, Luo et al., 2020).
  • Explicit region-region and region-language modeling: RIA and RLA attention mechanisms facilitate complex relational reasoning and robust multi-instance segmentation (Ding et al., 8 Jan 2026, Liu et al., 2023).
  • Robustness: MBA (Multimodal Bidirectional Attack) exposes adversarial vulnerabilities unique to RES’s multimodal structure, requiring new defenses (Chen et al., 19 Jun 2025).
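The prototype binding idea in the list above can be illustrated as nearest-prototype assignment of region queries by cosine similarity. This is a toy sketch of the matching step; the function name and shapes are ours, not a specific paper's API.

```python
import numpy as np

def bind_regions_to_prototypes(region_feats, prototypes):
    """Assign each region query to its most similar adaptive prototype
    by cosine similarity, the kind of region-specific matching used for
    flexible instance/part assignment in multi-target settings.
    region_feats: (R, d); prototypes: (K, d). Returns (R,) indices."""
    r = region_feats / np.linalg.norm(region_feats, axis=-1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=-1, keepdims=True)
    sim = r @ p.T                  # (R, K) cosine similarities
    return sim.argmax(axis=-1)     # best prototype per region

regions = np.array([[1.0, 0.0], [0.0, 2.0]])
protos  = np.array([[0.0, 1.0], [1.0, 0.1]])
assignments = bind_regions_to_prototypes(regions, protos)
```

Because matching happens against a small set of adaptive prototypes rather than inside the encoder, the encoder is relieved of memorizing per-instance distinctions, which is the pressure-reduction effect noted above.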

7. Challenges, Limitations, and Future Directions

Open issues and research frontiers include:

  • Fine-grained part-level segmentation: Even SOTA MRES models (e.g., UniRES) underperform (<20% mIoU) on the smallest and most challenging part queries (Wang et al., 2023).
  • Weakly supervised segment quality: Pseudo-label quality, especially from weak signals (points/boxes), remains critical; active refinement and selection techniques are required (Huang et al., 2023).
  • Spatial reasoning and coreference: 3D and multi-image scenarios necessitate entity-level spatial tracking, dependency-tree parsing, and explicit spatial losses for genuine disambiguation (Wu et al., 2024, Wu et al., 2023).
  • Open-vocabulary/zero-shot coverage: LLM piping with attribute prompts is promising but limited by upstream proposal quality (e.g., SAM missing thin or occluded parts) (Wang et al., 3 May 2025).
  • Scalability and annotation: Leveraging web-scale weakly annotated vision-language corpora, efficient bootstrapping, and multi-task collaborative learning is a major direction (Huang et al., 2023, Nag et al., 2024).
  • Adversarial robustness: MBA’s findings suggest RES models require cross-modal adversarial defenses and prompt-randomization, not image-only hardening (Chen et al., 19 Jun 2025).
  • Applications: Beyond traditional image segmentation, RES methods are being extended to video (temporal consistency), open-world visual reasoning, and embodied scene understanding for robotics (Nag et al., 2024, Chen et al., 9 Jan 2025).

In sum, RES and its generalizations have advanced from basic object-level pixel masking to complex, scalable, weakly-supervised, multi-instance, and open-vocabulary segmentation across 2D and 3D domains. Future progress will depend on principled multimodal fusion, spatially and semantically explicit architectures, learning from weak and large-scale annotations, and robustly grounding fine-grained, relational, and implicit queries (Ding et al., 8 Jan 2026, Li et al., 2024, Wang et al., 2023, Liu et al., 2 Apr 2025, Wang et al., 6 Aug 2025, Wu et al., 2024).
