Grounded SAM: Open-Vocab Vision Pipeline

Updated 19 January 2026

Grounded SAM is an integrated vision pipeline combining open-vocabulary detection (Grounding DINO) with promptable segmentation (SAM) to enable flexible image analysis.
It achieves state-of-the-art zero-shot segmentation performance and modular plug-and-play capabilities for applications like automated annotation, image editing, and 3D human motion analysis.
The framework supports customizable task pipelines without monolithic retraining, though challenges include detection bias and computational latency.

Grounded SAM is an integrated vision pipeline that enables open-vocabulary detection and segmentation of image regions conditioned on arbitrary text inputs. By assembling Grounding DINO—a text-driven, open-set object detector—and the Segment Anything Model (SAM)—a promptable segmentation backbone—Grounded SAM achieves diverse visual tasks without monolithic retraining, providing modularity and extensibility for applications such as automated annotation, controllable image editing, and promptable 3D human motion analysis. This architecture shows state-of-the-art zero-shot performance on segmentation benchmarks, highlighting its utility as a plug-and-play assembly for open-world visual modeling (Ren et al., 2024).

1. Component Architecture and Workflow

Grounded SAM operates as a composition rather than a single end-to-end network. The main components are:

Grounding DINO (Liu et al., 2023): An open-set object detector that accepts an image $I\in\mathbb{R}^{H\times W \times 3}$ and a text prompt $T$ and outputs $N$ bounding boxes $\mathcal{B} = \{(b_i, s_i)\}_{i=1}^N$ where $b_i$ describes spatial coordinates and $s_i$ is a grounding score.
Segment Anything Model (SAM) (Kirillov et al., 2023): For each detected box $b_i$ , SAM predicts a binary mask $M_i \in \{0,1\}^{H \times W}$ , with the box serving as the prompt.

The process is illustrated as follows:

$N$ 7

Pseudocode:

$N$ 8

This architecture supports modular plug-ins and expert components for extended pipelines (e.g., BLIP, RAM, Stable-Diffusion, OSX).

2. Mathematical Formulation

The detection and segmentation stages are mathematically defined as follows:

Open-set Detection (Grounding DINO):

Text embedding: $E_{\text{text}}(T) \in \mathbb{R}^d$
Visual features: $E_{\text{img}}(I) \in \mathbb{R}^{H'\times W'\times C}$
Learned object queries $T$ 0 attend to $T$ 1.
For each query, classification score $T$ 2 and bounding box $T$ 3.
Grounding score:

$T$ 4

where $T$ 5 denotes sigmoid.

Training loss:

$T$ 6

Promptable Segmentation (SAM):

Segmentation mask predicted for each box:

$T$ 7

Mask loss:

$T$ 8

Overall Objective: No joint fine-tuning is performed; both models are used off-the-shelf. For end-to-end training, the objective becomes:

$T$ 9

3. Task Pipelines and Representative Use Cases

Grounded SAM serves as a foundation for a variety of task-specific pipelines:

3.1 Automatic Dense Annotation

BLIP-Grounded-SAM: BLIP captioner produces captions from $N$ 0; each noun phrase in caption $N$ 1 prompts GroundedSAM, yielding $N$ 2 for object instance annotation.
RAM-Grounded-SAM: Recognize Anything Model (RAM) generates tags $N$ 3; each tag serves as a prompt for instance mask generation.

Example:

Input	Output
I: cow/rainbow	Box: (100, 30, 320, 260) Mask: cow

3.2 Controllable Image Editing

User supplies prompt $N$ 4 (e.g. “remove dog”). GroundedSAM produces relevant mask $N$ 5, fed into Stable Diffusion inpainting for content manipulation.

3.3 Promptable 3D Human Motion Analysis

Prompt $N$ 6 (“person in pink shirt”) yields instance mask and box, cropped to run OSX mesh recovery for obtaining parametric 3D body meshes.

4. Performance on Open-Vocabulary Benchmarks

On the SegInW zero-shot segmentation benchmark spanning 25 datasets, Grounded SAM demonstrates strong results:

Method	mean SGinW
X-Decoder-L	32.2
OpenSeeD-L	36.7
ODISE-L	38.7
SAN-CLIP-ViT-L	41.4
UNINEXT-Huge	42.1
Grounded-HQ-SAM (Base+DINO + HQ-SAM)	49.6
Grounded-SAM (Base+DINO + SAM-Huge)	48.7

Grounded SAM (DINO-Base + SAM-Huge) achieves mean AP 48.7, a significant improvement (+6 points) over the next best competitor. Integration of SAM-HQ mask backbone further raises performance to 49.6.

5. Modularity, Extensibility, and Limitations

Strengths:

Open vocabulary: Directly localizes arbitrary text queries ("Gazania linearis," "Zale Horrida") without fine-tuning.
Modularity: Each step (detection, segmentation, downstream application) is independently inspectable.
Plug-and-play: Easily replace or augment components (e.g., FastSAM for speed, LLM controllers for prompt routing).

Limitations:

Detection bias and coverage depend on pre-trained detectors; exotic or diminutive entities may elude detection.
Inference latency arises from cascading two large backbones; lighter alternatives (FastSAM, MobileSAM) offer remediation.

A plausible implication is that modularity enables rapid assembly of tailored pipelines while maintaining interpretability of intermediate outputs, although limitations in coverage and computational cost persist.

6. Prospective Directions

Proposed future extensions include:

End-to-end fine-tuning: To improve alignment across detection and segmentation stages.
Vision–language large model controllers: Automating expert selection and pipeline routing.
Video and 3D extension: Incorporation of transformer-tracking (e.g., DEVA), NeRFs, or depth estimation modules for temporal and spatial modeling.

These directions suggest that the framework can be adapted for more complex modalities and enhanced automation, leveraging its inherent modular architecture.

Markdown Report Issue Upgrade to Chat

References (3)

Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks (2024)

Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection (2023)

Segment Anything (2023)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to GroundedSAM.