Papers
Topics
Authors
Recent
Search
2000 character limit reached

Grounded SAM: Open-Vocab Vision Pipeline

Updated 19 January 2026
  • Grounded SAM is an integrated vision pipeline combining open-vocabulary detection (Grounding DINO) with promptable segmentation (SAM) to enable flexible image analysis.
  • It achieves state-of-the-art zero-shot segmentation performance and modular plug-and-play capabilities for applications like automated annotation, image editing, and 3D human motion analysis.
  • The framework supports customizable task pipelines without monolithic retraining, though challenges include detection bias and computational latency.

Grounded SAM is an integrated vision pipeline that enables open-vocabulary detection and segmentation of image regions conditioned on arbitrary text inputs. By assembling Grounding DINO—a text-driven, open-set object detector—and the Segment Anything Model (SAM)—a promptable segmentation backbone—Grounded SAM achieves diverse visual tasks without monolithic retraining, providing modularity and extensibility for applications such as automated annotation, controllable image editing, and promptable 3D human motion analysis. This architecture shows state-of-the-art zero-shot performance on segmentation benchmarks, highlighting its utility as a plug-and-play assembly for open-world visual modeling (Ren et al., 2024).

1. Component Architecture and Workflow

Grounded SAM operates as a composition rather than a single end-to-end network. The main components are:

  • Grounding DINO (Liu et al., 2023): An open-set object detector that accepts an image IRH×W×3I\in\mathbb{R}^{H\times W \times 3} and a text prompt TT and outputs NN bounding boxes B={(bi,si)}i=1N\mathcal{B} = \{(b_i, s_i)\}_{i=1}^N where bib_i describes spatial coordinates and sis_i is a grounding score.
  • Segment Anything Model (SAM) (Kirillov et al., 2023): For each detected box bib_i, SAM predicts a binary mask Mi{0,1}H×WM_i \in \{0,1\}^{H \times W}, with the box serving as the prompt.

The process is illustrated as follows:

NN7

Pseudocode:

NN8

This architecture supports modular plug-ins and expert components for extended pipelines (e.g., BLIP, RAM, Stable-Diffusion, OSX).

2. Mathematical Formulation

The detection and segmentation stages are mathematically defined as follows:

Open-set Detection (Grounding DINO):

  • Text embedding: Etext(T)RdE_{\text{text}}(T) \in \mathbb{R}^d
  • Visual features: Eimg(I)RH×W×CE_{\text{img}}(I) \in \mathbb{R}^{H'\times W'\times C}
  • Learned object queries TT0 attend to TT1.
  • For each query, classification score TT2 and bounding box TT3.
  • Grounding score:

TT4

where TT5 denotes sigmoid.

  • Training loss:

TT6

Promptable Segmentation (SAM):

  • Segmentation mask predicted for each box:

TT7

  • Mask loss:

TT8

Overall Objective: No joint fine-tuning is performed; both models are used off-the-shelf. For end-to-end training, the objective becomes:

TT9

3. Task Pipelines and Representative Use Cases

Grounded SAM serves as a foundation for a variety of task-specific pipelines:

3.1 Automatic Dense Annotation

  • BLIP-Grounded-SAM: BLIP captioner produces captions from NN0; each noun phrase in caption NN1 prompts GroundedSAM, yielding NN2 for object instance annotation.
  • RAM-Grounded-SAM: Recognize Anything Model (RAM) generates tags NN3; each tag serves as a prompt for instance mask generation.

Example:

Input Output
I: cow/rainbow Box: (100, 30, 320, 260) Mask: cow

3.2 Controllable Image Editing

User supplies prompt NN4 (e.g. “remove dog”). GroundedSAM produces relevant mask NN5, fed into Stable Diffusion inpainting for content manipulation.

3.3 Promptable 3D Human Motion Analysis

Prompt NN6 (“person in pink shirt”) yields instance mask and box, cropped to run OSX mesh recovery for obtaining parametric 3D body meshes.

4. Performance on Open-Vocabulary Benchmarks

On the SegInW zero-shot segmentation benchmark spanning 25 datasets, Grounded SAM demonstrates strong results:

Method mean SGinW
X-Decoder-L 32.2
OpenSeeD-L 36.7
ODISE-L 38.7
SAN-CLIP-ViT-L 41.4
UNINEXT-Huge 42.1
Grounded-HQ-SAM (Base+DINO + HQ-SAM) 49.6
Grounded-SAM (Base+DINO + SAM-Huge) 48.7

Grounded SAM (DINO-Base + SAM-Huge) achieves mean AP 48.7, a significant improvement (+6 points) over the next best competitor. Integration of SAM-HQ mask backbone further raises performance to 49.6.

5. Modularity, Extensibility, and Limitations

Strengths:

  • Open vocabulary: Directly localizes arbitrary text queries ("Gazania linearis," "Zale Horrida") without fine-tuning.
  • Modularity: Each step (detection, segmentation, downstream application) is independently inspectable.
  • Plug-and-play: Easily replace or augment components (e.g., FastSAM for speed, LLM controllers for prompt routing).

Limitations:

  • Detection bias and coverage depend on pre-trained detectors; exotic or diminutive entities may elude detection.
  • Inference latency arises from cascading two large backbones; lighter alternatives (FastSAM, MobileSAM) offer remediation.

A plausible implication is that modularity enables rapid assembly of tailored pipelines while maintaining interpretability of intermediate outputs, although limitations in coverage and computational cost persist.

6. Prospective Directions

Proposed future extensions include:

  • End-to-end fine-tuning: To improve alignment across detection and segmentation stages.
  • Vision–language large model controllers: Automating expert selection and pipeline routing.
  • Video and 3D extension: Incorporation of transformer-tracking (e.g., DEVA), NeRFs, or depth estimation modules for temporal and spatial modeling.

These directions suggest that the framework can be adapted for more complex modalities and enhanced automation, leveraging its inherent modular architecture.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (3)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to GroundedSAM.