SAM 3: Unified Visual Segmentation & Tracking
- SAM 3 is an open-vocabulary visual segmentation model that integrates text and image prompts to detect, segment, and track objects in images and videos.
- It features a unified image–video backbone and transformer-based prompt encoding, fusing prompt and vision features through cross-attention for robust instance detection.
- The model achieves significant performance gains using scalable data curation, hard negative mining, and adapter modules for downstream task adaptation.
Segment Anything Model 3 (SAM 3) is an open-vocabulary visual segmentation foundation model enabling detection, segmentation, and tracking of all instances of a “concept” in images and videos, where the concept is specified by a short noun-phrase, an image exemplar, or both. SAM 3 advances promptable segmentation with unified vision-language architectures, scaled data curation, new evaluation protocols, and extensive downstream adaptation (Carion et al., 20 Nov 2025).
1. Promptable Concept Segmentation: Task Definition and Input Modalities
SAM 3 introduces Promptable Concept Segmentation (PCS), a task designed as follows: For a given image or video and a “concept prompt” (short NP, region exemplar, or both), the system yields (a) instance masks for all matching objects, (b) persistent identities per instance in video, (c) semantic foreground/background predictions, and (d) presence/absence signals at the text level.
Prompt encoding involves two distinct modalities:
- Text-based prompts: Tokenized and embedded via a transformer text encoder, jointly aligned with the image encoder.
- Image exemplar prompts: Regions of interest (boxes plus label flags) are pooled from vision features, augmented with positional and label embeddings, and transformed through a secondary transformer block. The union forms “prompt tokens” that condition vision features by cross-attention in a fusion encoder.
During inference, the fusion encoder incorporates prompt tokens, conditioning detection and segmentation queries that localize and mask all instances matching the specified concept.
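The prompt-conditioning step above can be sketched as a single cross-attention update in which image tokens (queries) attend to the concatenated text and exemplar prompt tokens (keys/values). This is a minimal numpy illustration, not the model's actual fusion encoder: the single-head design, residual update, and all dimensions are assumptions for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(image_tokens, prompt_tokens):
    """Condition image tokens on prompt tokens (queries=image, keys/values=prompt)."""
    d = image_tokens.shape[-1]
    scores = image_tokens @ prompt_tokens.T / np.sqrt(d)   # (N_img, N_prompt)
    attn = softmax(scores, axis=-1)
    return image_tokens + attn @ prompt_tokens             # residual update

rng = np.random.default_rng(0)
text_tokens = rng.normal(size=(4, 32))       # embedded noun-phrase tokens (toy)
exemplar_tokens = rng.normal(size=(2, 32))   # pooled box features + label flags (toy)
prompt_tokens = np.concatenate([text_tokens, exemplar_tokens], axis=0)

image_tokens = rng.normal(size=(16, 32))
fused = cross_attend(image_tokens, prompt_tokens)
print(fused.shape)  # (16, 32)
```

The key point is that the union of text and exemplar tokens forms one prompt set, so either modality (or both) can condition the same vision features.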
2. Model Architecture: Unified Image–Video Backbone and Detector Modules
SAM 3 comprises a single Perception Encoder (PE) backbone—approximately 450M parameters, employing windowed/global attention and RoPE—enabling sharing between the image-level detector and video tracker. The parallel text encoder (∼300M parameters) is contrastively trained on 5.4B image–text pairs for embedding alignment.
- Image Detector: Adopts the DETR paradigm—unconditioned image tokens proceed to a fusion encoder (6 transformer layers attending to prompt tokens), then a 6-layer decoder equipped with 200 learned object queries. Each query predicts a bounding box (refined iteratively), a mask (MaskFormer-style), and a classification score.
- Memory-Based Video Tracker: Combines mask propagation (as in SAM 2), a prompt encoder, and a memory transformer facilitating self- and cross-attention between current frame and prior masklet features. Masklets (per-object instance masks across frames) are matched by IoU for identity assignment, with ambiguous or low-confidence tracking instances dynamically suppressed or re-prompted.
- Presence Head: A distinct “presence token” is introduced, supervised with a binary cross-entropy loss for concept presence. Object queries receive classification loss only on positive instances. The final score for each query is the product of its local match score and the global presence score: s_i = s_i^match · s^presence.
Ablation studies indicate the presence head improves PCS performance by +1.5 pp and MCC by +0.05.
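The tracker's IoU-based identity assignment can be sketched as a greedy matching of current-frame detections to prior masklets. The greedy strategy and the 0.5 threshold below are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def mask_iou(a, b):
    """IoU between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def match_masklets(prev_masks, curr_masks, thresh=0.5):
    """Return {curr_idx: prev_idx}; unmatched detections would start new masklets."""
    pairs = sorted(
        ((mask_iou(p, c), i, j) for i, p in enumerate(prev_masks)
                                for j, c in enumerate(curr_masks)),
        reverse=True)
    used_prev, used_curr, assign = set(), set(), {}
    for iou, i, j in pairs:                      # highest-IoU pairs first
        if iou < thresh or i in used_prev or j in used_curr:
            continue
        assign[j] = i
        used_prev.add(i); used_curr.add(j)
    return assign

# Toy frames: object 0 shifts slightly between frames; object 1 disappears.
m = np.zeros((2, 8, 8), dtype=bool); m[0, 1:5, 1:5] = True; m[1, 6:8, 6:8] = True
c = np.zeros((1, 8, 8), dtype=bool); c[0, 2:6, 1:5] = True
print(match_masklets(m, c))  # {0: 0}
```

Detections left unmatched (like object 1 here) correspond to the ambiguous or low-confidence cases that SAM 3 dynamically suppresses or re-prompts.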
3. Data Engine and Scaling: Concept Diversity, Verification, and Negative Mining
SAM 3’s performance is contingent on a scalable data engine yielding high-quality datasets:
- Data Composition: 4M unique concept labels and 52M annotated masks across 15 image domains and 52K videos; 1.4B synthetic masks are generated to boost coverage.
- Curation Pipeline: Four iterative phases combine mask proposal (OWLv2 + SAM 2), exhaustive human and AI verification (using Llama 3.2), ontology-guided expansion, hard negative mining (adversarial distractors derived from Wikidata-based relationships and LLM generation), and extension to videos with shot-based scene filtering.
- Verification: AI verifiers for mask quality and exhaustivity attain parity with humans, shifting annotator focus to edge cases.
- Negative Mining: Up to 30 hard negatives per image are mined, leading to marked increases in image-level MCC (from 0.44 to 0.68).
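The image-level metric cited above is the Matthews correlation coefficient over binary concept-presence decisions. A reference implementation for intuition; the confusion-matrix counts below are illustrative, not paper data:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# Without hard negatives, a model tends to over-predict "present" (many FPs)...
print(round(mcc(tp=80, tn=40, fp=60, fn=20), 2))  # 0.22
# ...mining hard negatives cuts false positives, lifting MCC.
print(round(mcc(tp=80, tn=85, fp=15, fn=20), 2))  # 0.65
```

MCC is well suited here because presence/absence labels are imbalanced: unlike accuracy, it only rewards a model that gets both positives and negatives right.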
4. Training Protocols and Fine-Tuning
SAM 3 is trained via staged curricula:
- Stage 1: Perception Encoder pre-trained contrastively on 5.4B image–text pairs.
- Stage 2: Detector pre-training uses large segmentation datasets (human and pseudo labels), with box and gIoU losses, and focal/dice mask losses. Reciprocally scheduled learning rates are applied with AdamW.
- Stage 3: Fine-tune on SA-Co (Segment Anything with Concepts) high-quality images with interactive PVS/PCS steps, introducing the presence head.
- Stage 4: Video tracker training (frozen backbone) on VOS data and SA-Co videos, optimizing mask, IoU, and occlusion losses.
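The focal and dice mask losses named in Stage 2 are standard; a compact numpy sketch for reference (the gamma/alpha values are common defaults, not confirmed from the paper):

```python
import numpy as np

def focal_loss(probs, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss per pixel, averaged; down-weights easy pixels."""
    p_t = np.where(targets == 1, probs, 1 - probs)
    a_t = np.where(targets == 1, alpha, 1 - alpha)
    return float(np.mean(-a_t * (1 - p_t) ** gamma
                         * np.log(np.clip(p_t, 1e-8, 1.0))))

def dice_loss(probs, targets, eps=1.0):
    """1 - Dice coefficient between a soft prediction and a binary target."""
    inter = (probs * targets).sum()
    return float(1 - (2 * inter + eps) / (probs.sum() + targets.sum() + eps))

target = np.zeros((8, 8)); target[2:6, 2:6] = 1.0
good = np.clip(target * 0.9 + 0.05, 0, 1)   # confident, mostly correct prediction
bad = np.full((8, 8), 0.5)                  # uninformative prediction
assert focal_loss(good, target) < focal_loss(bad, target)
assert dice_loss(good, target) < dice_loss(bad, target)
```

Focal loss handles the foreground/background pixel imbalance, while dice loss directly optimizes region overlap; combining them is the usual practice in DETR-style mask heads.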
5. Quantitative and Qualitative Performance
SAM 3 exceeds prior systems across PCS tasks and standard benchmarks:
- Image PCS (SA-Co benchmark): zero-shot mask AP = 54.1/55.7; OWLv2’s analogous scores are 17.3/16.9.
- Closed-set (LVIS): mask AP = 48.5; previous best = 38.5.
- Video PCS: 30.3 on SA-Co videos, with pHOTA = 58.0 (human reference: 70.5). On YTVIS21, OVIS, LVVIS, and BURST, mAP/HOTA values outperform prior models.
- VOS/PVS: +6.5 pp improvement in J on MOSEv2 and +1.0 pp in interactive image mIoU over SAM 2.
- Counting: MAE 0.12/0.21 and accuracy 93.8%/86.2% on CountBench/PixMo-Count.
- Ablations: data scaling contributes +14.6 pp on SA-Co images and +9.1 pp from synthetic pseudo-labels; enhanced verification (EV) adds +7.2 pp; consistent gains are also observed from domain adaptation.
- Qualitative Examples: PCS reliably segments “striped cats,” “thin chrome poles,” “thick yellow poles”; exemplar prompts generalize from a single instance to group detections; video PCS robustly tracks objects under occlusion.
6. Downstream Adaptation and Extended Capabilities
SAM 3 forms the backbone for extensions targeting low-level segmentation challenges and complex instruction alignment:
- SAM3-Adapter (Chen et al., 24 Nov 2025): Adapter modules at each encoder stage enable rapid task-specific adaptation (camouflaged-object, shadow, and medical segmentation). Adapters inject prompt vectors via MLPs after self-attention, trained with segmentation and regularization losses. This approach sets new state-of-the-art results, e.g., S-measures of 0.944, 0.972, and 0.908 on camouflaged-object benchmarks including CHAMELEON, and, for polyp segmentation, mDice = 0.906 and mIoU = 0.842, with notably sharp boundaries and clean mask predictions.
- SAM3-I (Li et al., 4 Dec 2025): Cascaded adapters integrated in SAM3’s text encoder and segmentation head facilitate instruction-level reasoning for referring/semi-reasoning expressions. The taxonomy addresses atomic NP (“concept”), simple attribute and spatial relations (“simple”), and multi-hop/affordance-based instructions (“complex”). Training employs KL alignments and uncertainty-aware cross-entropy. On PACO-LVIS-Instruct, SAM3-I achieves notable improvements: simple-instruction gIoU +12.4 pts over agentic baselines; complex-instruction performance matches large agents.
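The adapter idea above can be sketched as a small bottleneck MLP whose output is added residually to encoder tokens after self-attention. The dimensions and down/up-projection design follow common adapter practice and are assumptions, not the exact SAM3-Adapter configuration:

```python
import numpy as np

rng = np.random.default_rng(1)

class Adapter:
    """Bottleneck MLP adapter applied to post-self-attention tokens."""
    def __init__(self, dim=64, bottleneck=8):
        # Small init so the adapted model starts near the frozen backbone.
        self.down = rng.normal(scale=0.02, size=(dim, bottleneck))
        self.up = rng.normal(scale=0.02, size=(bottleneck, dim))

    def __call__(self, tokens):
        h = np.maximum(tokens @ self.down, 0.0)  # ReLU bottleneck
        return tokens + h @ self.up              # residual injection

tokens = rng.normal(size=(16, 64))   # hypothetical post-self-attention features
adapted = Adapter()(tokens)
print(adapted.shape)                 # (16, 64)
```

Because only the tiny down/up matrices are trained while the backbone stays frozen, this is what makes the per-task adaptation "rapid": near-identity behavior at initialization, with few trainable parameters.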
7. Advanced Use Cases, Limitations, and Future Directions
SAM 3 is applied to robotic perception, content creation, annotation, and AR.
- 3D Perception (Dong et al., 8 Dec 2025): Incorporating depth estimation and 3D reconstruction (via an MLP depth head and TSDF/mesh integration) enables instrument localization in surgical scenes, though not yet in real time (0.16 FPS). Zero-shot monocular depth (SCARED: AbsRel 0.072, δ<1.25 = 0.957) and 3D IoU (EndoNeRF Cutting, 47.79%) surpass prior systems.
- Instruction-level segmentation (Li et al., 4 Dec 2025): Enables single-pass segmentation conditioned on rich instructions, with released pipelines for domain adaptation.
- Limitations and Open Problems: Language prompts remain suboptimal in highly specialized domains unless fine-tuned; video tracking is susceptible to drift under rapid motion; proprietary backbone details limit fine-grained analysis; inference cost remains high for long videos or 3D.
- Prospective Extensions: Dynamic prompting with MLLMs, native reasoning segmentation, handling longer expressions, reducing video inference cost, robust cross-domain adaptation, and sparsity-aware adapters.
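The geometric step behind the depth-to-TSDF pipeline mentioned above is back-projecting a per-pixel depth map into camera-frame 3D points with pinhole intrinsics. A minimal sketch; the intrinsic values are illustrative, not from the cited system:

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Return an (H, W, 3) array of camera-frame 3D points from a depth map."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)

depth = np.full((4, 4), 2.0)   # toy: flat surface 2 m from the camera
pts = backproject(depth, fx=100.0, fy=100.0, cx=2.0, cy=2.0)
print(pts.shape)    # (4, 4, 3)
print(pts[2, 2])    # ray through the principal point -> [0. 0. 2.]
```

Fusing such per-frame point maps into a TSDF volume is what yields the meshes used for 3D instrument localization; the 0.16 FPS figure reflects the cost of running depth, segmentation, and fusion per frame.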
SAM 3 consolidates promptable segmentation, tracking, and open-vocabulary reasoning in a single, extensible vision-LLM. The combination of concept-scale data, unified architecture, and adapter-driven specialization establishes a new paradigm for interactive and instruction-aware mask prediction in both static and dynamic visual environments (Carion et al., 20 Nov 2025, Chen et al., 24 Nov 2025, Li et al., 4 Dec 2025, Dong et al., 8 Dec 2025).