
SAM 3: Unified Visual Segmentation & Tracking

Updated 31 January 2026
  • SAM 3 is an open-vocabulary visual segmentation model that integrates text and image prompts to detect, segment, and track objects in images and videos.
  • It features a unified image-video backbone and transformer-based prompt encoding to refine instance detection with robust cross-attention mechanisms.
  • The model achieves significant performance gains using scalable data curation, hard negative mining, and adapter modules for downstream task adaptation.

Segment Anything Model 3 (SAM 3) is an open-vocabulary visual segmentation foundation model enabling detection, segmentation, and tracking of all instances of a “concept” in images and videos, where the concept is specified by a short noun-phrase, an image exemplar, or both. SAM 3 advances promptable segmentation with unified vision-language architectures, scaled data curation, new evaluation protocols, and extensive downstream adaptation (Carion et al., 20 Nov 2025).

1. Promptable Concept Segmentation: Task Definition and Input Modalities

SAM 3 introduces Promptable Concept Segmentation (PCS), defined as follows: given an image or video and a “concept prompt” (a short noun phrase, a region exemplar, or both), the system yields (a) instance masks for all matching objects, (b) persistent per-instance identities in video, (c) semantic foreground/background predictions, and (d) presence/absence signals at the text level.

Prompt encoding involves two distinct modalities:

  • Text-based prompts: Tokenized and embedded via a transformer text encoder, jointly aligned with the image encoder.
  • Image exemplar prompts: Regions of interest (boxes plus label flags) are pooled from vision features, augmented with positional and label embeddings, and transformed through a secondary transformer block. The union forms “prompt tokens” that condition vision features by cross-attention in a fusion encoder.

During inference, the fusion encoder incorporates prompt tokens, conditioning detection and segmentation queries that localize and mask all instances matching the specified concept.
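The conditioning step can be illustrated with a minimal single-head cross-attention sketch in NumPy; the token counts, feature dimension, and omitted projections here are illustrative assumptions, not SAM 3’s actual configuration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse(vision_tokens, prompt_tokens):
    """Vision tokens (queries) cross-attend to prompt tokens (keys/values),
    with a residual connection; learned projections are omitted for brevity."""
    d = vision_tokens.shape[-1]
    attn = softmax(vision_tokens @ prompt_tokens.T / np.sqrt(d))
    return vision_tokens + attn @ prompt_tokens

rng = np.random.default_rng(0)
vision = rng.standard_normal((196, 64))      # e.g. a 14x14 grid of patch tokens
text_tokens = rng.standard_normal((4, 64))   # embedded noun-phrase tokens
exemplars = rng.standard_normal((2, 64))     # pooled exemplar-region tokens
prompts = np.concatenate([text_tokens, exemplars])  # unified "prompt tokens"
fused = fuse(vision, prompts)                # same shape as the vision tokens
```

The key point is that text and exemplar tokens enter the fusion encoder through the same interface: once concatenated, the vision features are conditioned on both modalities in a single attention pass.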

2. Model Architecture: Unified Image–Video Backbone and Detector Modules

SAM 3 comprises a single Perception Encoder (PE) backbone—approximately 450M parameters, employing windowed/global attention and RoPE—enabling sharing between the image-level detector and video tracker. The parallel text encoder (∼300M parameters) is contrastively trained on 5.4B image–text pairs for embedding alignment.

  • Image Detector: Adopts the DETR paradigm—unconditioned image tokens proceed to a fusion encoder (6 transformer layers attending to prompt tokens), then a 6-layer decoder equipped with 200 learned object queries. Each query predicts a bounding box (refined iteratively), a mask (MaskFormer-style), and a classification score.
  • Memory-Based Video Tracker: Combines mask propagation (as in SAM 2), a prompt encoder, and a memory transformer facilitating self- and cross-attention between current frame and prior masklet features. Masklets (per-object instance masks across frames) are matched by IoU for identity assignment, with ambiguous or low-confidence tracking instances dynamically suppressed or re-prompted.
  • Presence Head: A distinct “presence token” q_p is introduced, supervised by a binary cross-entropy loss for concept presence. Object queries q_i receive classification loss only on positive instances. The final query score is the product of the local match score and the global presence score, formally:

p_i = p(q_i \text{ matches concept} \mid \text{concept present}) \times p(\text{concept present}).

Ablation studies indicate the presence head increases PCS F_1 by +1.5 pp and MCC by +0.05.
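The score factorization above can be sketched in a few lines; the logits and the sigmoid link are illustrative assumptions, as the paper specifies only the probabilistic factorization:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def query_score(match_logit, presence_logit):
    """Final score = p(query matches | concept present) * p(concept present)."""
    return sigmoid(match_logit) * sigmoid(presence_logit)

# A confident per-query match is suppressed when the global presence
# head judges the concept absent from the image:
confident = query_score(4.0, 4.0)    # both high -> score stays high
suppressed = query_score(4.0, -4.0)  # match high, presence low -> score near 0
```

This decoupling lets per-query classification focus on localization while the presence token handles the image-level recognition decision.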

3. Data Engine and Scaling: Concept Diversity, Verification, and Negative Mining

SAM 3’s performance is contingent on a scalable data engine yielding high-quality datasets:

  • Data Composition: 4M unique concept labels and 52M annotated masks across 15 image domains and 52K videos; 1.4B synthetic masks are generated to boost coverage.
  • Curation Pipeline: Four iterative phases combine mask proposal (OWLv2 + SAM 2), exhaustive human and AI verification (using Llama 3.2), ontology-guided expansion, hard negative mining (adversarial distractors derived from Wikidata-based relationships and LLM generation), and extension to videos with shot-based scene filtering.
  • Verification: AI verifiers for mask quality and exhaustivity attain parity with humans, shifting annotator focus to edge cases.
  • Negative Mining: Up to 30 hard negatives per image are mined, leading to marked increases in image-level MCC (from 0.44 to 0.68).
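MCC (Matthews correlation coefficient) summarizes image-level presence/absence decisions from the full binary confusion matrix; a minimal reference implementation:

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# Perfect presence/absence prediction:
perfect = mcc(tp=50, fp=0, fn=0, tn=50)  # 1.0
# Many false positives, as when a model lacks hard-negative training
# (counts here are hypothetical, for illustration only):
noisy = mcc(tp=50, fp=40, fn=0, tn=10)
```

Unlike accuracy, MCC penalizes false positives on negative (concept-absent) images even when positives dominate, which is why hard-negative mining moves it so sharply.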

4. Training Protocols and Fine-Tuning

SAM 3 is trained via staged curricula:

  • Stage 1: Perception Encoder pre-trained contrastively on 5.4B image–text pairs.
  • Stage 2: Detector pre-training on large segmentation datasets (human and pseudo labels), using L_1 and gIoU box losses plus focal/dice mask losses, optimized with AdamW under reciprocal learning-rate schedules.
  • Stage 3: Fine-tune on SA-Co (Segment Anything with Concepts) high-quality images with interactive PVS/PCS steps, introducing the presence head.
  • Stage 4: Video tracker training (frozen backbone) on VOS data and SA-Co videos, optimizing mask, IoU, and occlusion losses.
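The detector’s box objective in Stage 2 pairs an L_1 term with a generalized-IoU (gIoU) term; a minimal sketch for axis-aligned boxes follows. The loss weights are an assumption borrowed from common DETR practice, not SAM 3’s published values:

```python
def giou(b1, b2):
    """Generalized IoU for (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    ix2, iy2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    area2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    union = area1 + area2 - inter
    # Smallest enclosing (hull) box penalizes non-overlapping predictions,
    # giving a useful gradient even at zero IoU.
    hull = (max(b1[2], b2[2]) - min(b1[0], b2[0])) * \
           (max(b1[3], b2[3]) - min(b1[1], b2[1]))
    return inter / union - (hull - union) / hull

def box_loss(pred, gt, w_l1=5.0, w_giou=2.0):
    """Weighted L1 + gIoU box loss (weights assumed, DETR-style)."""
    l1 = sum(abs(p - g) for p, g in zip(pred, gt))
    return w_l1 * l1 + w_giou * (1.0 - giou(pred, gt))

zero = box_loss((0, 0, 1, 1), (0, 0, 1, 1))  # perfect prediction -> 0.0
```

The gIoU term complements L_1 because it is scale-invariant and remains informative for boxes that do not yet overlap the target.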

5. Quantitative and Qualitative Performance

SAM 3 exceeds prior systems across PCS tasks and standard benchmarks:

  • Image PCS (SA-Co dataset): Zero-shot mask AP = 54.1, F_1 = 55.7; OWLv2’s analogous scores are 17.3/16.9.
  • Closed-set (LVIS): mask AP = 48.5; previous best = 38.5.
  • Video PCS: SA-Co videos F_1 = 30.3, pHOTA = 58.0 (human: 70.5). On YTVIS21, OVIS, LVVIS, and BURST, mAP/HOTA values outperform legacy models.
  • VOS/PVS: J on MOSEv2 improves by +6.5 pp and interactive image mIoU by +1.0 pp over SAM 2.
  • Counting: MAE 0.12/0.21 and accuracy 93.8%/86.2% on CountBench/PixMo-Count.
  • Ablations: Data scaling contributes +14.6 pp (SA-Co images) and +9.1 pp (synthetic pseudo-labels); enhanced verification (EV) adds +7.2 pp; domain-adaptation gains are also observed.
  • Qualitative Examples: PCS reliably segments “striped cats,” “thin chrome poles,” “thick yellow poles”; exemplar prompts generalize from a single instance to group detections; video PCS robustly tracks objects under occlusion.

6. Downstream Adaptation and Extended Capabilities

SAM 3 forms the backbone for extensions targeting low-level segmentation challenges and complex instruction alignment:

  • SAM3-Adapter (Chen et al., 24 Nov 2025): Adapter modules at each encoder stage enable rapid task-specific adaptation (camouflaged-object, shadow, and medical segmentation). Adapters inject prompt vectors via MLPs after self-attention and are trained with segmentation and regularization losses. This yields new state-of-the-art results (e.g., S-measure S_\alpha values of 0.944, 0.972, and 0.908 on camouflaged-object benchmarks including CHAMELEON) and, for polyp segmentation, mDice = 0.906 and mIoU = 0.842, with sharp boundaries and clean mask predictions.
  • SAM3-I (Li et al., 4 Dec 2025): Cascaded adapters integrated in SAM3’s text encoder and segmentation head facilitate instruction-level reasoning for referring/semi-reasoning expressions. The taxonomy addresses atomic NP (“concept”), simple attribute and spatial relations (“simple”), and multi-hop/affordance-based instructions (“complex”). Training employs KL alignments and uncertainty-aware cross-entropy. On PACO-LVIS-Instruct, SAM3-I achieves notable improvements: simple-instruction gIoU +12.4 pts over agentic baselines; complex-instruction performance matches large agents.
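The bottleneck-adapter pattern underlying SAM3-Adapter-style modules can be sketched as follows; the dimensions, ReLU nonlinearity, and zero-initialized up-projection are conventional adapter choices assumed for illustration, not the papers’ exact configuration:

```python
import numpy as np

class BottleneckAdapter:
    """Residual MLP adapter: down-project, nonlinearity, up-project.
    With the up-projection zero-initialized, the adapter starts as an
    identity map, so the frozen backbone's behavior is preserved at the
    start of fine-tuning and only the small adapter weights are trained."""
    def __init__(self, dim, bottleneck, seed=0):
        rng = np.random.default_rng(seed)
        self.w_down = rng.standard_normal((dim, bottleneck)) * 0.02
        self.w_up = np.zeros((bottleneck, dim))

    def __call__(self, tokens):
        hidden = np.maximum(tokens @ self.w_down, 0.0)  # ReLU bottleneck
        return tokens + hidden @ self.w_up              # residual injection

adapter = BottleneckAdapter(dim=64, bottleneck=8)
x = np.random.default_rng(1).standard_normal((10, 64))
y = adapter(x)  # identical to x at initialization
```

Because only the down/up projections are trainable, per-task adaptation touches a tiny fraction of the ~450M-parameter backbone, which is what makes the rapid task-specific tuning described above practical.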

7. Advanced Use Cases, Limitations, and Future Directions

SAM 3 is applied to robotic perception, content creation, annotation, and AR.

  • 3D Perception (Dong et al., 8 Dec 2025): Incorporating depth estimation and 3D reconstruction (an MLP depth head with TSDF/mesh integration) enables instrument localization in surgical scenes, though not yet in real time (0.16 FPS). Zero-shot monocular depth (SCARED: AbsRel 0.072, δ<1.25 = 0.957) and 3D IoU (EndoNeRF Cutting, 47.79%) surpass prior systems.
  • Instruction-level segmentation (Li et al., 4 Dec 2025): Enables single-pass segmentation conditioned on rich instructions, with released pipelines for domain adaptation.
  • Limitations and Open Problems: Language prompts remain suboptimal in highly specialized domains unless fine-tuned; video tracking is susceptible to drift under rapid motion; proprietary backbone details limit fine-grained analysis; inference cost remains high for long videos or 3D.
  • Prospective Extensions: Dynamic prompting with MLLMs, native reasoning segmentation, handling longer expressions, reducing video inference cost, robust cross-domain adaptation, and sparsity-aware adapters.

SAM 3 consolidates promptable segmentation, tracking, and open-vocabulary reasoning in a single, extensible vision-language model. The combination of concept-scale data, unified architecture, and adapter-driven specialization establishes a new paradigm for interactive and instruction-aware mask prediction in both static and dynamic visual environments (Carion et al., 20 Nov 2025, Chen et al., 24 Nov 2025, Li et al., 4 Dec 2025, Dong et al., 8 Dec 2025).
