Segment Anything 3 (SAM3): Open-Vocabulary Segmentation
- The paper introduces SAM3, a foundation vision model that employs promptable concept segmentation (PCS) to unify image and video analysis, with large gains on the cgF1 metric over prior open-vocabulary systems.
- SAM3 is a unified vision and text transformer model that uses noun phrases and image exemplars to enable flexible, open-ended instance segmentation across diverse domains.
- SAM3’s scalable data engine and modular adapters facilitate efficient adaptation for low-level vision, 3D perception, and instruction-driven segmentation tasks while decoupling recognition from localization.
Segment Anything 3 (SAM3) is a foundation vision model that unifies open-vocabulary instance segmentation in both images and videos through the concept of promptable concept segmentation (PCS). It represents a major architectural shift in the SAM family, introducing concept-level language prompts (“noun phrases,” image exemplars, or combinations) to provide flexible, open-ended segmentation capabilities far beyond the closed-category, spatially grounded segmentation found in SAM1 and SAM2. SAM3 further expands its reach via robust presence detection, decoupled recognition and localization, scalable data annotation methodologies, and modular adaptation strategies. The model sets new state-of-the-art performance on PCS benchmarks and lays the groundwork for practical adaptation to low-level vision, 3D perception, and rich language-driven instruction following.
1. Unified Architecture for Open-Vocabulary Segmentation
SAM3 utilizes a single Vision + Text transformer backbone, termed the Perception Encoder, which jointly embeds images and text prompts into a fused token space. The image encoder is a hierarchical ViT-style model with both local (windowed) and global attention. The text encoder—context length 32, causal—undergoes contrastive pretraining on 5.4 billion image-text pairs, ensuring robust alignment in the shared representation space. Prompt encoding flexibly supports short noun phrases (NPs), image exemplars, or combinations via parallel embedding and cross-attention into the backbone.
The detection stage employs a DETR-based fusion encoder and transformer decoder, incorporating learned query tokens (object queries and a presence token) to predict boxes, segmentation masks, and category-agnostic semantic maps. The presence head estimates the likelihood of a queried concept’s presence in the image, decoupling recognition (“what”) from localization (“where”) and reducing both false positives and negatives in open-vocabulary settings (Carion et al., 20 Nov 2025).
Promptable concept segmentation (PCS) takes such a prompt and returns foreground masks for every instance matching the "concept," segmenting across thousands of open-world object categories without prior knowledge of a closed vocabulary.
In video, the architecture instantiates a memory-based tracker. Per-object “masklets” are propagated through the video by a lightweight transformer with temporal disambiguation strategies, periodic re-detection, presence estimation, and duplicate suppression. The backbone, detector, and tracker directly share weights, preventing task interference and ensuring consistency across frame sequences.
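One component of the video pipeline above, duplicate suppression across per-object masklets, can be illustrated with a toy greedy scheme. This is a hedged sketch, not SAM3's actual implementation: masks are modeled as sets of pixel coordinates, and `suppress_duplicates` is a hypothetical helper that keeps the highest-scoring mask among heavily overlapping candidates.

```python
def mask_iou(a, b):
    """IoU between two masks, modeled here as sets of pixel coordinates."""
    inter = len(a & b)
    union = len(a | b)
    return inter / union if union else 0.0

def suppress_duplicates(masklets, iou_thresh=0.8):
    """Greedy NMS over (score, mask) pairs: keep the highest-scoring mask
    of each group of near-duplicates, drop the rest."""
    kept = []
    for score, mask in sorted(masklets, key=lambda m: -m[0]):
        if all(mask_iou(mask, k_mask) < iou_thresh for _, k_mask in kept):
            kept.append((score, mask))
    return kept
```

The same greedy pattern generalizes to tracked masklets by computing IoU over the frames two masklets share.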
2. Prompting and Presence Decoupling
SAM3 substantially expands the granularity and expressiveness of prompts. A "concept prompt" consists of:
- Noun Phrase (NP): Simple text, e.g., “yellow school bus” or “striped cat.”
- Image Exemplar: Region-of-interest box with positive/negative label, often paired for interactive refinement.
- Prompt Combinations: Simultaneous use of NP and exemplars for nuanced specification.
Encoded prompt tokens condition the Perception Encoder through cross-attention in the fusion module and transformer decoder, with subsequent heads making segmentation, detection, and semantic predictions aligned to the conditioned concept space.
The presence head is a learned presence token included in every decoder call. For an input prompt p, it outputs a global presence score s_pres, the probability that the prompted concept appears anywhere in the image. Object queries independently produce localization-only scores s_i. The final confidence for each mask combines these multiplicatively as score_i = s_pres · s_i. Presence is supervised by binary cross-entropy against image-level labels, with query scores supervised (via Hungarian assignment) only when the concept is present.
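A minimal sketch of this decoupled scoring, assuming the final score is the product of a sigmoid presence probability and each query's sigmoid localization score (our reading of the text, not the released code):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def combined_scores(presence_logit, query_logits):
    """Gate per-query localization scores by the image-level presence score.
    If the presence head says the concept is absent, every mask's
    confidence is driven toward zero regardless of localization quality."""
    s_pres = sigmoid(presence_logit)
    return [s_pres * sigmoid(q) for q in query_logits]
```

A confident localization under a near-zero presence score yields a near-zero final confidence, which is exactly the false-positive suppression the presence head is designed for.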
This decoupling sharply improves negative suppression: ablation shows cgF1 (classification-gated F1) increases by +1.5 (from 50.7→52.2) and image-level MCC by +0.05 when the presence head is activated (Carion et al., 20 Nov 2025).
3. Scalable Data Engine and Training Pipeline
SAM3’s breakthroughs are supported by a multi-phase data engine engineered to produce massive, diverse, and high-quality prompt–mask pairs with hard negatives.
- Human-verified annotation: BLIP-2 and OWLv2 produce pseudo-labels, with human Mask and Exhaustivity Verification (MV/EV) on a 4.3M image–NP core set.
- AI verifiers: Llama 3.2 fine-tuned as MV/EV classifiers enable large-scale automated label validation and rapid expansion to 122M additional pairs.
- Domain expansion: Fifteen target domains (medical, microscopy, art, robotics, driving, food, etc.) yield 19.5M more pairs via LLM/ontology-guided NP mining and hard negative adversaries.
- Synthetic images: Generation of 1.7B NP pairs and 1.4B masks via auto-verified, programmatic synthesis.
- Videos: LLM and human judges curate 52.5K videos, 134K video–NPs, and 467K masklets, supporting spatiotemporal tracking.
Training proceeds through PE pretraining, DETR-based detector pretraining (with box, mask, presence, and classification losses), followed by detector finetuning (interactivity, presence), and then frozen-backbone video tracker training.
Data augmentation is extensive and includes random crop/resizing, flip, negative sampling, and mosaic augmentation to boost open-vocabulary precision. Hard-negative mining dramatically increases image-level locality (IL) and cgF1; e.g., expanding from 0→30 hard negatives adds IL +0.24 (0.44→0.68), cgF1 +14.7 (Carion et al., 20 Nov 2025).
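The hard-negative mining step above can be sketched as follows. This is an illustrative scheme under stated assumptions, not the paper's pipeline: `sample_hard_negatives` is a hypothetical helper, and "hard" negatives are approximated as ontology neighbors of the image's positive noun phrases, backfilled with random vocabulary terms.

```python
import random

def sample_hard_negatives(positive_nps, ontology_neighbors, vocab, k=30, seed=0):
    """Pick k negative noun phrases for an image, preferring ontology
    neighbors of the positives (hard negatives) over random vocab (easy)."""
    rng = random.Random(seed)
    pos = set(positive_nps)
    hard = [n for p in positive_nps
            for n in ontology_neighbors.get(p, []) if n not in pos]
    easy = [v for v in vocab if v not in pos and v not in hard]
    picks = hard[:k]
    if len(picks) < k:
        picks += rng.sample(easy, min(k - len(picks), len(easy)))
    return picks
```

Training the presence head against such negatives is what drives the reported IL and cgF1 gains as the negative count grows from 0 to 30.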
4. Performance, Evaluation Metrics, and Empirical Results
SAM3 sets new state of the art in promptable concept segmentation across extensive benchmarks:
- Image PCS (SA-Co “Gold” split): cgF1 = 54.1 (OWLv2: 17.3, GroundingDINO: 3.3, Gemini 2.5: 13.0)
- Image PCS (LVIS zero-shot): mask AP = 48.5 vs. 53.0 for OWLv2, despite no base training on LVIS (Carion et al., 20 Nov 2025)
- Video PCS (SA-V test): cgF1 = 30.3, pHOTA = 58.0, significantly surpassing GLEE, LLMDet+Tracker, and detector-only variants.
Metrics include:
- cgF1: Classification-gated F1, integrating micro-F1 on masks with MCC for image-level presence.
- pHOTA: phrase-based Higher Order Tracking Accuracy, combining association and detection accuracy over (video, NP) pairs.
- TETA: Track Every Thing Accuracy for multi-object tracking.
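The cgF1 metric above can be made concrete with a toy computation. This sketch assumes cgF1 is the product of mask-level micro-F1 and image-level MCC, which is one plausible reading of "integrating" in the definition; the benchmark's exact formula may differ.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for image-level presence decisions."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def micro_f1(tp, fp, fn):
    """Micro-F1 over matched predicted/ground-truth masks."""
    return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

def cg_f1(mask_tp, mask_fp, mask_fn, img_tp, img_tn, img_fp, img_fn):
    # Illustrative composition: gate mask-level F1 by presence MCC, so a
    # model that cannot tell when a concept is absent scores poorly even
    # with good masks on positive images.
    return micro_f1(mask_tp, mask_fp, mask_fn) * mcc(img_tp, img_tn, img_fp, img_fn)
```

The gating explains why hard-negative training moves cgF1 so sharply: improving presence MCC scales the whole score.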
Video tracking is robust under complex, crowded conditions, but the original simultaneous multi-target memory update (group-level coupled gating) can cause identity drift when objects are occluded or disappear. SAM3-DMS (Decoupled Memory Selection) replaces the group-averaged gate with per-object gating: each object's memory bank is updated only when that object's own mask confidence passes the gate, independent of the other targets. This yields monotonic improvements as target density increases, virtually eliminating "polluted blank mask" drifts and ID switches at no additional computational cost (Shen et al., 14 Jan 2026).
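The difference between coupled and decoupled gating can be sketched in a few lines. This is a schematic contrast, not the SAM3-DMS implementation; memory banks are modeled as plain lists and `thresh` is a hypothetical gate value.

```python
def update_memories_group(banks, confidences, frame_feat, thresh=0.5):
    """Coupled gating: every object's bank updates iff the *group mean*
    confidence passes the gate, so a weak target's memory can be polluted
    by its confident neighbors."""
    if sum(confidences) / len(confidences) >= thresh:
        for bank in banks:
            bank.append(frame_feat)

def update_memories_per_object(banks, confidences, frame_feat, thresh=0.5):
    """Decoupled gating: each object's bank updates on its *own*
    confidence, so an occluded or vanished target simply skips the update."""
    for bank, conf in zip(banks, confidences):
        if conf >= thresh:
            bank.append(frame_feat)
```

Under coupled gating, one occluded target (low confidence) still absorbs the frame whenever the rest of the group is confident, which is the "polluted blank mask" failure mode; per-object gating avoids it with the same number of operations.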
5. Adaptation, Lightweight Distillation, and Low-Level Vision
SAM3’s architectural modularity allows for efficient downstream adaptations:
- SAM3-Adapter: Frozen-backbone adapters injected at multiple encoder stages, tuned on domain-specific or low-level tasks with only a few million parameters. Strong state-of-the-art results are obtained in medical, camouflaged, and shadow segmentation, and polyp/cell segmentation, with <5% parameter overhead (Chen et al., 24 Nov 2025).
- SAM3-UNet: Lightweight parameter-efficient adapters and a U-Net-style 4-level decoder attached to the frozen vision backbone, requiring <6GB GPU memory at a batch size of 12. It obtains IoU/F1 improvements of 2–8 points over SAM2-UNet and is efficient for mirror and salient object detection (Xiong et al., 1 Dec 2025).
- EfficientSAM3: Progressive Hierarchical Distillation (PHD) produces a suite of student models (RepViT, TinyViT, EfficientViT) spanning 0.7–21 M parameters, distilling PCS, presence, and memory-based tracking for on-device applications. The three-stage schedule—encoder distillation, memory distillation with Perceiver, end-to-end fine-tuning—transfers spatiotemporal and concept-level performance (Zeng et al., 19 Nov 2025).
Table 1. SAM3 Student Model Zoo (Zeng et al., 19 Nov 2025)
| Model Name | Backbone | Parameters |
|---|---|---|
| ES-RV-S | RepViT-M0.9 | 5.1 M |
| ES-TV-S | TinyViT-5M | 5.4 M |
| ES-EV-L | EfficientViT-B2 | 15.0 M |
Adapters inject task-specific information via MLPs or low-rank projections; almost all backbone parameters remain frozen, supporting sample-efficient fine-tuning.
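The low-rank projection variant can be sketched as a LoRA-style residual update. This is a generic illustration under our own naming, not the SAM3-Adapter code: `down` and `up` are the only trainable matrices, and zero-initializing `up` makes the adapter an identity at the start of fine-tuning, leaving the frozen backbone's behavior untouched.

```python
import numpy as np

def adapter(x, down, up, scale=1.0):
    """LoRA-style residual low-rank adapter: x + scale * (x @ down) @ up.
    Only `down` (d x r) and `up` (r x d) are trained; the backbone
    features x pass through unchanged plus a rank-r correction."""
    return x + scale * (x @ down) @ up

# Toy dimensions: feature dim 8, bottleneck rank 2 (hypothetical values).
rng = np.random.default_rng(0)
down = rng.standard_normal((8, 2)) * 0.02  # small random down-projection
up = np.zeros((2, 8))                      # zero-init up-projection
x = rng.standard_normal((4, 8))            # 4 tokens from the frozen encoder
y = adapter(x, down, up)
```

With rank r much smaller than the feature dimension, the trainable parameter count is 2·d·r per injection site, which is how such schemes stay under a few percent of backbone size.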
6. Language, Instruction, and Complex Reasoning
SAM3 natively handles open-vocabulary NP prompts but, as deployed, struggles with complex referring expressions involving attributes, relations, or reasoning. The SAM3-I enhancement introduces a cascaded, instruction-aware adaptation pipeline:
- S-Adapter: Handles simple instructions (NP + attribute/relations) using down/upsampling blocks and multi-head self-attention.
- C-Adapter: Added for complex instructions (multi-hop, functional, reasoning, “the player sliding on the grass to tackle the ball”).
- Training employs a curriculum aligned to instruction complexity, with uncertainty-aware hard-region supervision and KL-based alignment losses enforcing semantic consistency between instruction types.
On PACO-LVIS-Instruct, SAM3-I exactly preserves original concept-level performance (48.9 gIoU/54.1 P@50), but achieves 54.0 gIoU/59.6 P@50 on simple and 51.0 gIoU/56.4 P@50 on complex instructions. This yields absolute gains of +12.4/+14.0 and +2.8/+4.1, respectively, over agent-based reranking pipelines, with no need for external LLMs (Li et al., 4 Dec 2025).
Limitations include a focus on N-to-1 instance grounding; future work targets group reasoning and 1-to-N scaling, as well as Mixture-of-Experts adaptation to further improve instruction following.
7. Extensions in Specialized Domains and 3D Perception
SAM3’s tri-headed, promptable design has been adapted for the remote sensing setting via SegEarth-OV3, which fuses semantic and instance head predictions and applies presence-based filtering, boosting mean IoU from 40.7% (CorrCLIP) to 53.4% without any retraining (Li et al., 9 Dec 2025). This approach achieves state-of-the-art zero-shot geospatial segmentation across 17 benchmarks.
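The fusion-plus-filtering idea above can be sketched per concept. This is a toy under stated assumptions, not SegEarth-OV3's method: masks are sets of pixel coordinates, `fuse_predictions` is a hypothetical helper, and filtering simply discards a concept whose presence score falls below a threshold before merging heads.

```python
def fuse_predictions(semantic_map, instance_masks, presence_score, tau=0.5):
    """Presence-filtered fusion for one queried concept: if the presence
    score is below tau, suppress the concept entirely; otherwise union the
    instance-head masks with the semantic-head map."""
    if presence_score < tau:
        return set()
    fused = set(semantic_map)
    for mask in instance_masks:
        fused |= mask
    return fused
```

Because no retraining is involved, such filtering only re-weights which head outputs survive, which is consistent with the zero-shot nature of the reported mIoU gain.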
In 3D and surgical applications, SAM3 extends to monocular depth prediction (via a lightweight “depth head”) and 3D point cloud reconstruction. On SCARED, SAM3D achieves lower RMSE/Abs Rel than supervised clinical models, though inference speed is low (∼0.16 FPS) and language prompting remains challenging due to medical domain gap (Dong et al., 8 Dec 2025).
8. Limitations and Future Directions
Despite significant advances, open challenges remain:
- Finer-grained, domain-specialized concepts (e.g., rare aircraft, medical conditions) often require either synthetic data or minimal fine-tuning.
- The instruction-following pipeline is not fully end-to-end for group scenarios; current architectures focus on N-to-1 settings.
- In multi-target video, even per-object memory gating does not eliminate ID confusion in highly ambiguous or rapid motion scenes.
- For real-time and on-device use, further efficiency gains are necessary; student models via PHD distillation offer promising trade-offs, yet do not reach the full performance envelope of the backbone.
- Natural language instruction following and broader zero-shot reasoning are limited by the instruction corpora and current fixed branch heuristics in cascaded adapters.
Future research directions include adaptive mixture-of-experts routing for instruction layers, unified memory paradigms for video, larger-scale instructional data (analogous to SA-1B), and expansion to full instruction-driven foundation models (Li et al., 4 Dec 2025, Carion et al., 20 Nov 2025).
Key References:
- "SAM 3: Segment Anything with Concepts" (Carion et al., 20 Nov 2025)
- "SAM3-I: Segment Anything with Instructions" (Li et al., 4 Dec 2025)
- "SAM3-DMS: Decoupled Memory Selection for Multi-target Video Segmentation of SAM3" (Shen et al., 14 Jan 2026)
- "SAM3-Adapter: Efficient Adaptation of Segment Anything 3..." (Chen et al., 24 Nov 2025)
- "EfficientSAM3: Progressive Hierarchical Distillation..." (Zeng et al., 19 Nov 2025)
- "SegEarth-OV3: Exploring SAM 3 for Open-Vocabulary Semantic Segmentation..." (Li et al., 9 Dec 2025)
- "More than Segmentation: Benchmarking SAM 3 for Segmentation, 3D Perception..." (Dong et al., 8 Dec 2025)
- "SAM3-UNet: Simplified Adaptation of Segment Anything Model 3" (Xiong et al., 1 Dec 2025)