SAM 3: Unified Visual Segmentation & Tracking
- SAM 3 is an open-vocabulary visual segmentation model that integrates text and image prompts to detect, segment, and track objects in images and videos.
- It features a unified image–video backbone and transformer-based prompt encoding, fusing prompt and vision features through cross-attention for robust instance detection.
- The model achieves significant performance gains using scalable data curation, hard negative mining, and adapter modules for downstream task adaptation.
Segment Anything Model 3 (SAM 3) is an open-vocabulary visual segmentation foundation model enabling detection, segmentation, and tracking of all instances of a “concept” in images and videos, where the concept is specified by a short noun-phrase, an image exemplar, or both. SAM 3 advances promptable segmentation with unified vision-language architectures, scaled data curation, new evaluation protocols, and extensive downstream adaptation (Carion et al., 20 Nov 2025).
1. Promptable Concept Segmentation: Task Definition and Input Modalities
SAM 3 introduces Promptable Concept Segmentation (PCS), a task designed as follows: For a given image or video and a “concept prompt” (short NP, region exemplar, or both), the system yields (a) instance masks for all matching objects, (b) persistent identities per instance in video, (c) semantic foreground/background predictions, and (d) presence/absence signals at the text level.
Prompt encoding involves two distinct modalities:
- Text-based prompts: Tokenized and embedded via a transformer text encoder, jointly aligned with the image encoder.
- Image exemplar prompts: Regions of interest (boxes plus label flags) are pooled from vision features, augmented with positional and label embeddings, and transformed through a secondary transformer block. The union forms “prompt tokens” that condition vision features by cross-attention in a fusion encoder.
During inference, the fusion encoder incorporates prompt tokens, conditioning detection and segmentation queries that localize and mask all instances matching the specified concept.
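The prompt-conditioning step above can be sketched as a single cross-attention update in which image tokens (queries) attend to the concatenated text and exemplar prompt tokens (keys/values). This is a minimal numpy illustration, not the model's actual fusion encoder: the single-head design, residual update, and all dimensions are assumptions for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(image_tokens, prompt_tokens):
    """Condition image tokens on prompt tokens (queries=image, keys/values=prompt)."""
    d = image_tokens.shape[-1]
    scores = image_tokens @ prompt_tokens.T / np.sqrt(d)   # (N_img, N_prompt)
    attn = softmax(scores, axis=-1)
    return image_tokens + attn @ prompt_tokens             # residual update

rng = np.random.default_rng(0)
text_tokens = rng.normal(size=(4, 32))       # embedded noun-phrase tokens (toy)
exemplar_tokens = rng.normal(size=(2, 32))   # pooled box features + label flags (toy)
prompt_tokens = np.concatenate([text_tokens, exemplar_tokens], axis=0)

image_tokens = rng.normal(size=(16, 32))
fused = cross_attend(image_tokens, prompt_tokens)
print(fused.shape)  # (16, 32)
```

The key point is that the union of text and exemplar tokens forms one prompt set, so either modality (or both) can condition the same vision features.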
2. Model Architecture: Unified Image–Video Backbone and Detector Modules
SAM 3 comprises a single Perception Encoder (PE) backbone—approximately 450M parameters, employing windowed/global attention and RoPE—enabling sharing between the image-level detector and video tracker. The parallel text encoder (∼300M parameters) is contrastively trained on 5.4B image–text pairs for embedding alignment.
- Image Detector: Adopts the DETR paradigm—unconditioned image tokens proceed to a fusion encoder (6 transformer layers attending to prompt tokens), then a 6-layer decoder equipped with 200 learned object queries. Each query predicts a bounding box (refined iteratively), a mask (MaskFormer-style), and a classification score.
- Memory-Based Video Tracker: Combines mask propagation (as in SAM 2), a prompt encoder, and a memory transformer facilitating self- and cross-attention between current frame and prior masklet features. Masklets (per-object instance masks across frames) are matched by IoU for identity assignment, with ambiguous or low-confidence tracking instances dynamically suppressed or re-prompted.
- Presence Head: A distinct “presence token” is introduced, supervised with a binary cross-entropy loss for concept presence. Object queries receive classification loss only on positive instances. The final score for each query is the product of its local match score and the global presence score: s_i = s_i^match · s^presence.
Ablation studies indicate the presence head improves PCS performance by +1.5 pp and MCC by +0.05.
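The tracker's IoU-based identity assignment can be sketched as a greedy matching of current-frame detections to prior masklets. The greedy strategy and the 0.5 threshold below are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def mask_iou(a, b):
    """IoU between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def match_masklets(prev_masks, curr_masks, thresh=0.5):
    """Return {curr_idx: prev_idx}; unmatched detections would start new masklets."""
    pairs = sorted(
        ((mask_iou(p, c), i, j) for i, p in enumerate(prev_masks)
                                for j, c in enumerate(curr_masks)),
        reverse=True)
    used_prev, used_curr, assign = set(), set(), {}
    for iou, i, j in pairs:                      # highest-IoU pairs first
        if iou < thresh or i in used_prev or j in used_curr:
            continue
        assign[j] = i
        used_prev.add(i); used_curr.add(j)
    return assign

# Toy frames: object 0 shifts slightly between frames; object 1 disappears.
m = np.zeros((2, 8, 8), dtype=bool); m[0, 1:5, 1:5] = True; m[1, 6:8, 6:8] = True
c = np.zeros((1, 8, 8), dtype=bool); c[0, 2:6, 1:5] = True
print(match_masklets(m, c))  # {0: 0}
```

Detections left unmatched (like object 1 here) correspond to the ambiguous or low-confidence cases that SAM 3 dynamically suppresses or re-prompts.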
3. Data Engine and Scaling: Concept Diversity, Verification, and Negative Mining
SAM 3’s performance is contingent on a scalable data engine yielding high-quality datasets:
- Data Composition: 4M unique concept labels and 52M annotated masks across 15 image domains and 52K videos; 1.4B synthetic masks are generated to boost coverage.
- Curation Pipeline: Four iterative phases combine mask proposal (OWLv2 + SAM 2), exhaustive human and AI verification (using Llama 3.2), ontology-guided expansion, hard negative mining (adversarial distractors derived from Wikidata-based relationships and LLM generation), and extension to videos with shot-based scene filtering.
- Verification: AI verifiers for mask quality and exhaustivity attain parity with humans, shifting annotator focus to edge cases.
- Negative Mining: Up to 30 hard negatives per image are mined, leading to marked increases in image-level MCC (from 0.44 to 0.68).
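The image-level metric cited above is the Matthews correlation coefficient over binary concept-presence decisions. A reference implementation for intuition; the confusion-matrix counts below are illustrative, not paper data:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# Without hard negatives, a model tends to over-predict "present" (many FPs)...
print(round(mcc(tp=80, tn=40, fp=60, fn=20), 2))  # 0.22
# ...mining hard negatives cuts false positives, lifting MCC.
print(round(mcc(tp=80, tn=85, fp=15, fn=20), 2))  # 0.65
```

MCC is well suited here because presence/absence labels are imbalanced: unlike accuracy, it only rewards a model that gets both positives and negatives right.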
4. Training Protocols and Fine-Tuning
SAM 3 is trained via staged curricula:
- Stage 1: Perception Encoder pre-trained contrastively on 5.4B image–text pairs.
- Stage 2: Detector pre-training uses large segmentation datasets (human and pseudo labels), with box and gIoU losses, and focal/dice mask losses. Reciprocally scheduled learning rates are applied with AdamW.
- Stage 3: Fine-tune on SA-Co (Segment Anything with Concepts) high-quality images with interactive PVS/PCS steps, introducing the presence head.
- Stage 4: Video tracker training (frozen backbone) on VOS data and SA-Co videos, optimizing mask, IoU, and occlusion losses.
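The focal and dice mask losses named in Stage 2 are standard; a compact numpy sketch for reference (the gamma/alpha values are common defaults, not confirmed from the paper):

```python
import numpy as np

def focal_loss(probs, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss per pixel, averaged; down-weights easy pixels."""
    p_t = np.where(targets == 1, probs, 1 - probs)
    a_t = np.where(targets == 1, alpha, 1 - alpha)
    return float(np.mean(-a_t * (1 - p_t) ** gamma
                         * np.log(np.clip(p_t, 1e-8, 1.0))))

def dice_loss(probs, targets, eps=1.0):
    """1 - Dice coefficient between a soft prediction and a binary target."""
    inter = (probs * targets).sum()
    return float(1 - (2 * inter + eps) / (probs.sum() + targets.sum() + eps))

target = np.zeros((8, 8)); target[2:6, 2:6] = 1.0
good = np.clip(target * 0.9 + 0.05, 0, 1)   # confident, mostly correct prediction
bad = np.full((8, 8), 0.5)                  # uninformative prediction
assert focal_loss(good, target) < focal_loss(bad, target)
assert dice_loss(good, target) < dice_loss(bad, target)
```

Focal loss handles the foreground/background pixel imbalance, while dice loss directly optimizes region overlap; combining them is the usual practice in DETR-style mask heads.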
5. Quantitative and Qualitative Performance
SAM 3 exceeds prior systems across PCS tasks and standard benchmarks:
- Image PCS (SA-Co benchmark): zero-shot mask AP = 54.1/55.7; OWLv2’s analogous scores are 17.3/16.9.
- Closed-set (LVIS): mask AP = 48.5; previous best = 38.5.
- Video PCS: 30.3 on SA-Co videos, with pHOTA = 58.0 (human reference: 70.5). On YTVIS21, OVIS, LVVIS, and BURST, mAP/HOTA values outperform prior models.
- VOS/PVS: +6.5 pp improvement in J on MOSEv2 and +1.0 pp in interactive image mIoU over SAM 2.
- Counting: MAE 0.12/0.21 and accuracy 93.8%/86.2% on CountBench/PixMo-Count.
- Ablations: data scaling contributes +14.6 pp on SA-Co images and +9.1 pp from synthetic pseudo-labels; enhanced verification (EV) adds +7.2 pp; consistent gains are also observed from domain adaptation.
- Qualitative Examples: PCS reliably segments “striped cats,” “thin chrome poles,” “thick yellow poles”; exemplar prompts generalize from a single instance to group detections; video PCS robustly tracks objects under occlusion.
6. Downstream Adaptation and Extended Capabilities
SAM 3 forms the backbone for extensions targeting low-level segmentation challenges and complex instruction alignment:
- SAM3-Adapter (Chen et al., 24 Nov 2025): Adapter modules at each encoder stage enable rapid task-specific adaptation (camouflaged-object, shadow, and medical segmentation). Adapters inject prompt vectors via MLPs after self-attention, trained with segmentation and regularization losses. This approach sets new state-of-the-art results, e.g., S-measures of 0.944, 0.972, and 0.908 on camouflaged-object benchmarks including CHAMELEON, and, for polyp segmentation, mDice = 0.906 and mIoU = 0.842, with notably sharp boundaries and clean mask predictions.
- SAM3-I (Li et al., 4 Dec 2025): Cascaded adapters integrated in SAM3’s text encoder and segmentation head facilitate instruction-level reasoning for referring/semi-reasoning expressions. The taxonomy addresses atomic NP (“concept”), simple attribute and spatial relations (“simple”), and multi-hop/affordance-based instructions (“complex”). Training employs KL alignments and uncertainty-aware cross-entropy. On PACO-LVIS-Instruct, SAM3-I achieves notable improvements: simple-instruction gIoU +12.4 pts over agentic baselines; complex-instruction performance matches large agents.
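The adapter idea above can be sketched as a small bottleneck MLP whose output is added residually to encoder tokens after self-attention. The dimensions and down/up-projection design follow common adapter practice and are assumptions, not the exact SAM3-Adapter configuration:

```python
import numpy as np

rng = np.random.default_rng(1)

class Adapter:
    """Bottleneck MLP adapter applied to post-self-attention tokens."""
    def __init__(self, dim=64, bottleneck=8):
        # Small init so the adapted model starts near the frozen backbone.
        self.down = rng.normal(scale=0.02, size=(dim, bottleneck))
        self.up = rng.normal(scale=0.02, size=(bottleneck, dim))

    def __call__(self, tokens):
        h = np.maximum(tokens @ self.down, 0.0)  # ReLU bottleneck
        return tokens + h @ self.up              # residual injection

tokens = rng.normal(size=(16, 64))   # hypothetical post-self-attention features
adapted = Adapter()(tokens)
print(adapted.shape)                 # (16, 64)
```

Because only the tiny down/up matrices are trained while the backbone stays frozen, this is what makes the per-task adaptation "rapid": near-identity behavior at initialization, with few trainable parameters.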
7. Advanced Use Cases, Limitations, and Future Directions
SAM 3 is applied to robotic perception, content creation, annotation, and AR.
- 3D Perception (Dong et al., 8 Dec 2025): Incorporating depth estimation and 3D reconstruction (via an MLP depth head and TSDF/mesh integration) enables instrument localization in surgical scenes, though not yet in real time (0.16 FPS). Zero-shot monocular depth (SCARED: AbsRel 0.072, δ<1.25 = 0.957) and 3D IoU (EndoNeRF Cutting, 47.79%) surpass prior systems.
- Instruction-level segmentation (Li et al., 4 Dec 2025): Enables single-pass segmentation conditioned on rich instructions, with released pipelines for domain adaptation.
- Limitations and Open Problems: Language prompts remain suboptimal in highly specialized domains unless fine-tuned; video tracking is susceptible to drift under rapid motion; proprietary backbone details limit fine-grained analysis; inference cost remains high for long videos or 3D.
- Prospective Extensions: Dynamic prompting with MLLMs, native reasoning segmentation, handling longer expressions, reducing video inference cost, robust cross-domain adaptation, and sparsity-aware adapters.
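The geometric step behind the depth-to-TSDF pipeline mentioned above is back-projecting a per-pixel depth map into camera-frame 3D points with pinhole intrinsics. A minimal sketch; the intrinsic values are illustrative, not from the cited system:

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Return an (H, W, 3) array of camera-frame 3D points from a depth map."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)

depth = np.full((4, 4), 2.0)   # toy: flat surface 2 m from the camera
pts = backproject(depth, fx=100.0, fy=100.0, cx=2.0, cy=2.0)
print(pts.shape)    # (4, 4, 3)
print(pts[2, 2])    # ray through the principal point -> [0. 0. 2.]
```

Fusing such per-frame point maps into a TSDF volume is what yields the meshes used for 3D instrument localization; the 0.16 FPS figure reflects the cost of running depth, segmentation, and fusion per frame.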
SAM 3 consolidates promptable segmentation, tracking, and open-vocabulary reasoning in a single, extensible vision-LLM. The combination of concept-scale data, unified architecture, and adapter-driven specialization establishes a new paradigm for interactive and instruction-aware mask prediction in both static and dynamic visual environments (Carion et al., 20 Nov 2025, Chen et al., 24 Nov 2025, Li et al., 4 Dec 2025, Dong et al., 8 Dec 2025).