
SAM 3: Open-Vocabulary Concept Segmentation

Updated 16 February 2026
  • SAM 3 is an open-vocabulary segmentation model that introduces promptable concept segmentation (PCS) for effective zero-shot instance and semantic segmentation.
  • It employs a unified transformer-based architecture with a shared backbone, DETR-style detection, and memory-driven tracking to fuse recognition with localization.
  • It demonstrates state-of-the-art performance on general, medical, geospatial, and instruction-following benchmarks through extensive data curation and efficient adaptations.

Segment Anything Model 3 (SAM 3) is an open-vocabulary, prompt-driven segmentation system capable of detecting, segmenting, and tracking arbitrary concepts in both images and videos based on textual or visual prompts. Developed to unify recognition and segmentation across diverse data modalities, SAM 3 supersedes previous generations by introducing Promptable Concept Segmentation (PCS), a new architectural paradigm for large-scale, zero-shot instance and semantic segmentation. PCS enables segmentation of all instances matching a natural-language concept or visual exemplar, with the model trained on a broad data engine encompassing millions of concepts and hard negatives. The architecture includes a shared transformer backbone, DETR-style object detector, memory-driven tracker, and a decoupled presence head to distinguish between recognition and localization. SAM 3 is released alongside the SA-Co benchmark and demonstrates state-of-the-art performance across general, medical, geospatial, and instruction-following benchmarks.

1. Concept, Problem Setting, and Motivation

SAM 3 extends the Segment Anything paradigm by moving from point, box, or mask prompts (SAM/SAM2) to open-vocabulary “concept prompts”—short noun phrases (e.g., "yellow school bus"), image exemplars, or both. This shift enables the Promptable Concept Segmentation (PCS) setting: given an image or video and a prompt, the model segments and tracks all object instances matching the concept, optionally over time. The underlying motivation is to support one-shot or zero-shot generalization to an essentially unbounded set of categories, moving beyond fixed, closed vocabularies or geometric-only prompting (Carion et al., 20 Nov 2025).

This capability is instantiated in domains ranging from natural scenes (SA-Co, COCO, LVIS) to specialized environments such as remote sensing (where object categories are sparse but vocabulary size is large) (Li et al., 9 Dec 2025), instruction-following with arbitrarily complex queries (Li et al., 4 Dec 2025), medical image segmentation (Chen et al., 24 Nov 2025), and 3D perception for robotics (Dong et al., 8 Dec 2025).

2. Architecture: Unified Promptable Concept Segmentation System

2.1 Shared Backbone and Prompt Fusion

The core of SAM 3 is a single, deep Vision Transformer-based backbone ("Perception Encoder," typically ~450M parameters) trained contrastively on 5.4 billion image–text pairs (Carion et al., 20 Nov 2025). Visual input (image or video frame) is embedded into a dense token map, while prompts—for PCS—are simultaneously encoded via a CLIP-style text encoder and an ROI/exemplar transformer (for cropped image prompts), then concatenated as prompt tokens.

A fusion encoder cross-attends these prompt tokens with the image feature map, enabling the DETR-style instance decoder to carry prompt-conditioned semantics into detection and mask prediction (Carion et al., 20 Nov 2025, Zeng et al., 19 Nov 2025).
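As a rough sketch, this prompt-conditioning step can be pictured as a single cross-attention pass in which image tokens query prompt tokens. The single-head form, residual fusion, and all shapes below are illustrative simplifications, not the released architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_prompt_tokens(image_tokens, prompt_tokens):
    """One cross-attention step: image tokens (queries) attend to prompt
    tokens (keys/values), so every spatial location carries
    prompt-conditioned semantics into the downstream instance decoder."""
    d = image_tokens.shape[-1]
    scores = image_tokens @ prompt_tokens.T / np.sqrt(d)   # (HW, P)
    attn = softmax(scores, axis=-1)
    return image_tokens + attn @ prompt_tokens             # residual fusion

rng = np.random.default_rng(0)
img = rng.normal(size=(64, 32))    # 64 spatial tokens, dim 32
prm = rng.normal(size=(3, 32))     # text + exemplar prompt tokens
fused = fuse_prompt_tokens(img, prm)
print(fused.shape)                 # (64, 32)
```

The fused tokens keep the image's spatial layout, which is what lets the DETR-style decoder localize instances of the prompted concept.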

2.2 Detection and Mask Heads

Prompt-conditioned features flow into three key, decoupled heads (Li et al., 9 Dec 2025):

  • Presence Head: Predicts a global presence score $S_{\rm pres} \in [0, 1]$, answering whether the concept is present anywhere in the image (Li et al., 9 Dec 2025). This decouples recognition ("is it there?") from localization ("where is it?"), mitigating false positives in open-vocabulary regimes with many absent classes (Zeng et al., 19 Nov 2025).
  • Semantic Segmentation Head: Dense per-pixel classifier $P_{\rm sem}(h, w) \in [0, 1]$ for each location, excelling at amorphous "stuff" regions such as land or roads.
  • DETR-Style Instance Head: Query-driven decoder outputs $N$ instance masks $\{P_{\rm inst}^{(k)}, s_{\rm conf}^{(k)}\}_{k=1}^{N}$, specializing in "thing" objects with clear spatial boundaries (Li et al., 9 Dec 2025).
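The three heads above can be sketched as follows; the global pooling, weight shapes, and sigmoid parameterizations are illustrative assumptions, not the actual head designs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def run_heads(feat, w_pres, w_sem, queries):
    """feat: (H, W, D) prompt-conditioned features.
    Returns the three decoupled outputs: a global presence score,
    a dense semantic map, and N query-driven instance masks."""
    pooled = feat.mean(axis=(0, 1))                            # global pooling
    s_pres = sigmoid(pooled @ w_pres)                          # scalar in [0, 1]
    p_sem = sigmoid(feat @ w_sem)                              # (H, W) dense map
    p_inst = sigmoid(np.einsum('hwd,nd->nhw', feat, queries))  # (N, H, W) masks
    return s_pres, p_sem, p_inst

rng = np.random.default_rng(1)
feat = rng.normal(size=(8, 8, 32))
s_pres, p_sem, p_inst = run_heads(feat, rng.normal(size=32),
                                  rng.normal(size=32), rng.normal(size=(4, 32)))
print(p_sem.shape, p_inst.shape)   # (8, 8) (4, 8, 8)
```

The key structural point survives the simplification: the presence score depends on pooled global features, not on any particular mask.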

2.3 Video Tracker and Memory

For video PCS, SAM 3 incorporates a memory-driven tracker. For each track/object, a memory bank aggregates embeddings and masklets over time (Zeng et al., 19 Nov 2025), supporting identity-aware tracking and temporal coherence.
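A toy version of such a per-object memory bank, with cosine-similarity matching standing in for the learned association mechanism; the threshold and bank length are made-up values:

```python
import numpy as np

class TrackMemory:
    """Minimal per-object memory bank: stores recent embeddings per track
    and matches each new detection to a track by cosine similarity."""
    def __init__(self, sim_threshold=0.5, max_len=8):
        self.banks = {}                  # track_id -> list of embeddings
        self.sim_threshold = sim_threshold
        self.max_len = max_len
        self._next_id = 0

    def _similarity(self, emb, track_id):
        bank = np.stack(self.banks[track_id])
        bank = bank / np.linalg.norm(bank, axis=1, keepdims=True)
        e = emb / np.linalg.norm(emb)
        return float((bank @ e).mean())

    def assign(self, emb):
        """Return the best-matching track id, or start a new track."""
        best, best_sim = None, self.sim_threshold
        for tid in self.banks:
            s = self._similarity(emb, tid)
            if s > best_sim:
                best, best_sim = tid, s
        if best is None:
            best = self._next_id
            self._next_id += 1
            self.banks[best] = []
        self.banks[best].append(emb)
        self.banks[best] = self.banks[best][-self.max_len:]   # bounded memory
        return best

mem = TrackMemory()
emb = np.ones(8)
print(mem.assign(emb), mem.assign(emb))   # 0 0  (same object, same track)
```

The bounded bank is the essential ingredient for temporal coherence: identity is carried by accumulated appearance evidence rather than per-frame detection alone.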

2.4 Training Losses

Training is performed over detection/classification (box, mask, presence, cross-entropy, and dice/focal losses), plus propagation and matching losses for tracking (Carion et al., 20 Nov 2025, Zeng et al., 19 Nov 2025). Additional objectives are included for instruction-following (KL divergence, hard-region supervision (Li et al., 4 Dec 2025)) and auxiliary domains such as depth estimation in surgical robotics (Dong et al., 8 Dec 2025).
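A minimal sketch of the mask-loss component, assuming a standard dice + focal formulation; the weights `w_dice` and `w_focal` and the focal hyperparameters are hypothetical, not the paper's values:

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """Overlap-based loss: 0 for a perfect mask, 1 for no overlap."""
    inter = (pred * target).sum()
    return 1.0 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)

def focal_loss(pred, target, alpha=0.25, gamma=2.0, eps=1e-6):
    """Focal BCE: down-weights easy pixels via the (1 - p_t)^gamma factor."""
    p = np.clip(pred, eps, 1 - eps)
    pt = np.where(target == 1, p, 1 - p)
    a = np.where(target == 1, alpha, 1 - alpha)
    return float((-a * (1 - pt) ** gamma * np.log(pt)).mean())

def mask_loss(pred, target, w_dice=1.0, w_focal=20.0):
    """Weighted combination, as is common in DETR-style mask training."""
    return w_dice * dice_loss(pred, target) + w_focal * focal_loss(pred, target)

target = np.zeros((16, 16)); target[4:12, 4:12] = 1.0
print(round(mask_loss(target, target), 4))   # → 0.0
```

Dice handles the foreground/background imbalance at the mask level, while the focal term sharpens per-pixel boundaries; the combination is standard for this class of detector.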

3. Training Data, Benchmarks, and Evaluation

3.1 SA-Co Dataset and Data Engine

The training infrastructure behind SAM 3 is the SA-Co dataset and data engine (Carion et al., 20 Nov 2025, Zeng et al., 19 Nov 2025). SA-Co encompasses:

  • 5.2M high-quality images, 4M unique noun phrases, 52M instance masks.
  • 52.5K videos, 467K object tracks ("masklets").
  • Extensive hard negatives: ~88% of prompt–image pairs are verified negatives.
  • Multi-phase, human+AI annotation, exhaustively correcting and validating mask–prompt pairs, with domain expansion across 15 media types.

Additional data sources include synthetic and external datasets (1.7B image–NP pairs) and domain-specific conversions (e.g., for open-vocabulary semantic segmentation (Li et al., 9 Dec 2025), PACO-LVIS-Instruct (Li et al., 4 Dec 2025)).

3.2 Benchmark Performance

SAM 3 demonstrates leading performance:

  • General Scene Segmentation: On SA-Co (Gold), LVIS, and COCO, zero-shot cgF₁ reaches 54.1 (vs. 24.6 prior, human upper bound 72.8), with LVIS mask mAP 37.2 (Carion et al., 20 Nov 2025).
  • Video PCS: SA-V: 30.3 IoU vs. prior 0.1–2.3; YT-1B: 50.8 (Carion et al., 20 Nov 2025).
  • Remote Sensing: On 8 multi-class benchmarks, mean IoU = 53.4% versus prior 40.7%; achieves 86.9% foreground IoU (WHU-Aerial, buildings; prior 49.2) (Li et al., 9 Dec 2025).
  • Medical/Low-SNR: Camouflage detection (CHAMELEON): $S_\alpha = 0.944$, $F_\beta = 0.908$, MAE = 0.016, exceeding previous SAM-based and UNet baselines (Chen et al., 24 Nov 2025).
  • Instruction Segmentation: SAM3-I matches base SAM 3 on noun phrases and exceeds agentic pipelines on complex instructions: concept/simple/complex gIoU=48.9/54.0/51.0 (Li et al., 4 Dec 2025).
  • 3D Perception (Surgical Robotics): Zero-shot depth on SCARED: Abs Rel=0.072, outperforming supervised Endo3R (0.124) (Dong et al., 8 Dec 2025).

4. Specialized Adaptations and Extensions

4.1 Efficient, Domain-Adapted, and Parameter-Efficient Variants

  • SAM3-Adapter: Injects small, stage-wise learned bottlenecks into the (frozen) transformer backbone, adding ∼2M parameters. Demonstrates strong gains and negligible overhead in camouflage detection, shadow detection (ISTD, BER=1.14), polyp segmentation (Kvasir-SEG, mDice=0.906), and cell segmentation (NeurIPS 2022 Cell Segmentation Challenge, F1=0.7525) (Chen et al., 24 Nov 2025).
  • SAM3-UNet: Combines the frozen PE backbone, parameter-efficient pre-attention adapters, and a lightweight U-Net style decoder (6.3M trainable parameters out of 446M total), achieving state-of-the-art in mirror detection and salient object detection with low GPU memory costs (<6GB at batch 12, 336x336) (Xiong et al., 1 Dec 2025).
  • EfficientSAM3: Uses Progressive Hierarchical Distillation from SAM3 to compact student architectures (TinyViT, RepViT) for on-device concept segmentation and video tracking (Zeng et al., 19 Nov 2025).

4.2 Instruction-Following (SAM3-I)

SAM3-I extends the model to handle arbitrary natural-language instructions using cascaded adapters (S-Adapter and C-Adapter) in the text encoder, and alignment losses to enforce mask coherence across simple and complex instructions. Trained on the PACO-LVIS-Instruct set (45K images, 210K masks, 840K instructions), SAM3-I achieves strong instruction generalization with no loss in base PCS (Li et al., 4 Dec 2025).

4.3 Remote Sensing and Geospatial Segmentation

In remote sensing OVSS, SAM 3’s fusion of semantic and instance heads with presence-score gating is essential. The mask fusion strategy aggregates instance masks and combines them with semantic masks via per-pixel maximum, then gates the result with the presence probability. This dual pathway resolves the challenge of segmenting dense clusters of small "things" and amorphous "stuff" in large-patch geospatial imagery (Li et al., 9 Dec 2025).

Table: Representative Benchmark Results

| Domain | Task / Metric | SAM 3 / SAM3 extension | Best prior SAMx / model |
|---|---|---|---|
| SA-Co (Gold) | Image PCS, cgF₁ | 54.1 | 24.6 (OWLv2*) |
| Remote sensing | Multi-class mIoU (%) | 53.4 (SegEarth-OV3) | 40.7 (CorrCLIP) |
| Medical | Polyp mDice (Kvasir-SEG) | 0.906 (SAM3-Adapter) | 0.873 (SAM2-Adapter) |
| Instruction | Complex gIoU | 51.0 (SAM3-I) | 48.2 (SAM3 + Agent, 8B) |
| Video PCS (SA-V test) | IoU (%) | 30.3 | 2.3 (LLMDet + SAM2) |
| Saliency (DUTS-TE) | $S_\alpha$ / $E_\phi$ / MAE | 0.936 / 0.964 / 0.019 (SAM3-UNet) | 0.934 / 0.959 / 0.020 |

All numbers are verbatim from cited papers. For domain-specific methods, refer to: (Carion et al., 20 Nov 2025, Li et al., 9 Dec 2025, Chen et al., 24 Nov 2025, Li et al., 4 Dec 2025, Xiong et al., 1 Dec 2025).

5. Algorithmic Insights and Component Analysis

5.1 Decoupled Recognition and Localization

The presence head (producing $S_{\rm pres}$) is central to PCS: it enables the model to predict the occurrence of a concept independently from the location of any instance, which is critical in open-vocabulary and multi-head settings to suppress spurious detections or hallucinations. Empirical ablations show presence gating improves cgF₁ by up to 1.5 points on SA-Co (Carion et al., 20 Nov 2025).
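Presence gating can be illustrated in a few lines; the multiplicative gating and the 0.5 threshold are assumptions for illustration:

```python
import numpy as np

def gate_detections(inst_scores, s_pres, threshold=0.5):
    """Scale per-instance confidences by the global presence score:
    if recognition says the concept is absent, all candidate
    localizations are suppressed regardless of their raw scores."""
    gated = np.asarray(inst_scores) * s_pres
    return gated, gated >= threshold

scores = [0.9, 0.7, 0.6]
gated, keep = gate_detections(scores, s_pres=0.1)   # concept judged absent
print(keep.tolist())   # [False, False, False]
```

Without the gate, the 0.9-confidence candidate would survive thresholding even for an absent concept, which is exactly the open-vocabulary failure mode the decoupling targets.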

5.2 Fusion of Semantic and Instance Predictions

In modalities such as remote sensing, neither the semantic head nor the instance head alone suffices: semantic heads capture large, homogeneous regions; instance heads separate clustered "things" but fragment amorphous "stuff." The fusion approach takes $P_{\rm fused}(h, w) = \max(P_{\rm sem}(h, w), P_{\rm inst\_agg}(h, w))$, where $P_{\rm inst\_agg}$ is the confidence-weighted aggregation over instance masks, and is essential for comprehensive coverage (Li et al., 9 Dec 2025).
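The fusion rule, sketched in NumPy; here "confidence-weighted aggregation" is read as a per-pixel max over confidence-scaled instance masks, which is one plausible interpretation rather than the paper's exact definition:

```python
import numpy as np

def fuse_masks(p_sem, inst_masks, inst_conf, s_pres):
    """P_fused(h, w) = s_pres * max(P_sem, P_inst_agg).
    p_sem:      (H, W) semantic probabilities
    inst_masks: (N, H, W) per-instance mask probabilities
    inst_conf:  (N,) per-instance confidences"""
    if len(inst_masks):
        # confidence-scaled instance masks, aggregated per pixel
        p_inst_agg = (inst_conf[:, None, None] * inst_masks).max(axis=0)
    else:
        p_inst_agg = np.zeros_like(p_sem)
    return s_pres * np.maximum(p_sem, p_inst_agg)

rng = np.random.default_rng(0)
p_sem = rng.random((8, 8))
inst_masks = rng.random((3, 8, 8))
inst_conf = np.array([0.9, 0.5, 0.2])
fused = fuse_masks(p_sem, inst_masks, inst_conf, s_pres=1.0)
print(fused.shape)   # (8, 8)
```

The per-pixel max lets each head cover the other's blind spot: "stuff" pixels come from the semantic map, densely packed "things" from the instance masks, and the presence factor zeroes everything when the concept is absent.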

5.3 Adapters and Efficient Fine-Tuning

Adapters—bottlenecked, residual modules injected at key points in the frozen backbone—enable efficient domain adaptation without loss of generalization. Most variants use linear down-projection, non-linearity, and up-projection; the weight count is kept below 2–6M, with negligible computational overhead (Chen et al., 24 Nov 2025, Xiong et al., 1 Dec 2025).
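A minimal adapter of this shape; the zero-initialized up-projection, which makes the module start as an identity, is a common design choice assumed here rather than taken from the cited papers:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class Adapter:
    """Bottleneck residual adapter: down-project, nonlinearity,
    up-project, then add back to the frozen feature stream.
    Only w_down and w_up would be trained; the backbone stays frozen."""
    def __init__(self, dim, bottleneck, rng):
        self.w_down = rng.normal(scale=0.02, size=(dim, bottleneck))
        self.w_up = np.zeros((bottleneck, dim))   # zero-init: identity at start

    def __call__(self, x):
        return x + relu(x @ self.w_down) @ self.w_up

rng = np.random.default_rng(0)
adapter = Adapter(dim=256, bottleneck=16, rng=rng)
x = rng.normal(size=(10, 256))
print(np.allclose(adapter(x), x))   # True (zero-init up-projection)
```

With dim=1024 and bottleneck=64, one such module costs about 131K parameters, so a handful per stage stays comfortably within the 2–6M budget cited above.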

5.4 Data Engine and Negative Mining

The data engine’s exhaustive curation pipeline, including hard-negative mining, synthetic negatives, and multiple rounds of automated/human verification, is fundamental to open-vocabulary robustness and recognition performance in the presence of confounders (Carion et al., 20 Nov 2025).
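A toy sketch of hard-negative pair construction; the real engine relies on AI + human verification of absence rather than random sampling, and `neg_ratio=8` is chosen only to roughly mirror the reported ~88% negative share:

```python
import random

def build_prompt_pairs(image_phrases, vocabulary, neg_ratio=8, seed=0):
    """For each image, pair its annotated phrases (positives, label 1)
    with phrases verified absent from it (hard negatives, label 0)."""
    rng = random.Random(seed)
    pairs = []
    for image_id, positives in image_phrases.items():
        for phrase in positives:
            pairs.append((image_id, phrase, 1))
        absent = [p for p in vocabulary if p not in positives]
        k = min(neg_ratio * len(positives), len(absent))
        for phrase in rng.sample(absent, k):
            pairs.append((image_id, phrase, 0))
    return pairs

pairs = build_prompt_pairs({"img0": ["dog"]}, ["dog", "cat", "bus", "tree"],
                           neg_ratio=2)
print(sum(1 for _, _, y in pairs if y == 0))   # 2 hard negatives
```

Training the presence head against a large verified-absent set is what teaches the model to say "not here" for confusable concepts instead of hallucinating masks.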

6. Limitations and Directions for Future Research

Known limitations include:

  • Natural-language interface restrictiveness: The base SAM 3 processes only short noun phrases. Handling full referring expressions or complex instructions requires MLLM-based agent pipelines; extensions such as SAM3-I partially address this, but larger instruction corpora are needed for foundation-scale instruction segmentation (Li et al., 4 Dec 2025).
  • Domain Adaptation: Zero-shot segmentation in niche modalities (thermal, pathology, surgical scenes) often suffers due to vocabulary or visual gap; domain adaptation or further data curation is required (Dong et al., 8 Dec 2025).
  • Inference Scalability: Inference cost scales with the number of object instances; near real-time inference is feasible for 5–10 objects on 2–4 modern GPUs, but on-device deployment remains out of reach (Zeng et al., 19 Nov 2025).
  • Video Memory and Disambiguation: The current memory tracker operates per-concept without global object-level sharing; future work may seek efficient cross-object/disambiguated memory (Carion et al., 20 Nov 2025).
  • 3D Perception: Although superior in monocular depth (Abs Rel 0.072, SCARED), SAM 3D’s current FPS (0.16) is insufficient for real-time, and failure cases abound under adverse surgical conditions (Dong et al., 8 Dec 2025).

Emerging research focuses on instruction-following with complex semantics (SAM3-I), video tracking with global memory, improved efficiency (EfficientSAM3), and domain scaling for medical, geospatial, and robotic perception tasks.

7. Open-Source Release and Community Adoption

SAM 3 is released under the SAM license through https://github.com/facebookresearch/sam3 with full models, Docker images, and web demos, accompanied by the SA-Co benchmark and comprehensive open-vocabulary evaluation scripts (Carion et al., 20 Nov 2025). Downstream codebases and extensions—including SegEarth-OV3 (Li et al., 9 Dec 2025), SAM3-I (Li et al., 4 Dec 2025), SAM3-Adapter (Chen et al., 24 Nov 2025), SAM3-UNet (Xiong et al., 1 Dec 2025), and domain-specific fine-tuning workflows—are available for community research and practical deployment.


SAM 3’s development marks a substantial milestone in prompt-driven, open-vocabulary segmentation, combining large-scale text–image pretraining, unified detection/tracking, explicit recognition–localization decoupling, and extensibility across domains and task types. Its architecture, data curation pipeline, and suite of adaptations serve as a foundation for ongoing progress in generalizable, semantics-grounded visual segmentation.
