Segment Anything Model v2 (SAM-2)
- Segment Anything Model v2 (SAM-2) is a unified promptable segmentation framework that uses temporal memory and hierarchical encoding for robust image and video analysis.
- It employs a modular design featuring a hierarchical image encoder, multi-modal prompt encoder, and mask decoder with memory attention, achieving up to 130 FPS.
- SAM-2 demonstrates high accuracy across biomedical, remote sensing, and fine-grained segmentation domains, with domain-specific adaptations to address specialized challenges.
Segment Anything Model v2 (SAM-2) is a unified, promptable visual segmentation architecture designed for both still images and temporal video streams. Developed by Meta AI Research, SAM-2 builds upon the original SAM model by incorporating a temporal memory mechanism, a streamlined transformer backbone, and a general-purpose prompt encoder, enabling robust, real-time segmentation across a wide array of tasks and domains. Below is a comprehensive technical synthesis of SAM-2's design principles, architecture, quantifiable performance, domain-specific adaptations, strengths and trade-offs, and future directions, as evidenced by the 2024–2025 primary sources.
1. Architectural Innovations
SAM-2 is constructed as a modular pipeline consisting of a hierarchical image encoder, a multi-modal prompt encoder, a mask decoder augmented for memory-attention, and a streaming memory subsystem for video temporal reasoning (Ravi et al., 2024).
Image Encoder:
SAM-2 replaces the original single-scale ViT with a multi-scale hierarchical Vision Transformer (Hiera) pretrained with masked autoencoding. The encoder outputs stride-4, stride-8, stride-16, and stride-32 feature maps, allowing fine-scale structure to pass directly into the decoder's upsampling blocks while lower-resolution features are aggregated for temporal memory attention (Ravi et al., 2024, Geetha et al., 2024).
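The multi-scale contract can be illustrated with a toy encoder; the sketch below uses plain strided convolutions and made-up channel widths in place of Hiera's attention stages, so it shows only the stride-4/8/16/32 feature-pyramid shape, not the real backbone:

```python
# Toy sketch of the stride-4/8/16/32 feature pyramid (illustrative widths;
# plain strided convolutions stand in for Hiera's attention stages).
import torch
import torch.nn as nn

class ToyHierarchicalEncoder(nn.Module):
    def __init__(self, in_ch=3, widths=(96, 192, 384, 768)):
        super().__init__()
        chans = (in_ch,) + widths
        strides = (4, 2, 2, 2)  # cumulative strides: 4, 8, 16, 32
        self.stages = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(chans[i], chans[i + 1], kernel_size=2 * s - 1,
                          stride=s, padding=s - 1),
                nn.GELU(),
            )
            for i, s in enumerate(strides)
        )

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # [stride-4, stride-8, stride-16, stride-32]

feats = ToyHierarchicalEncoder()(torch.randn(1, 3, 1024, 1024))
print([tuple(f.shape) for f in feats])
# High-resolution maps feed the decoder's skip/upsampling paths;
# low-resolution maps are aggregated for temporal memory attention.
```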
Prompt Encoder:
SAM-2 processes prompts as spatial points, bounding boxes, dense mask regions, and text strings. Points and boxes are converted to tokens via learned positional embeddings, while text is encoded with a CLIP-derived transformer. Temporal prompt tokens enable correspondence across video frames. All prompt embeddings are indexed in time, permitting arbitrary assignment to any frame within a sequence (Ravi et al., 2024, Tang et al., 2024).
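A minimal sketch of point/box tokenization, assuming a random-Fourier positional encoding plus learned prompt-type embeddings (details of the real encoder, including its temporal indexing, are omitted):

```python
# Illustrative prompt tokenization: coordinates -> Fourier positional code
# plus a learned type embedding. Not the official prompt encoder.
import torch
import torch.nn as nn

class ToyPromptEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.register_buffer("freqs", torch.randn(2, dim // 2))  # random Fourier basis
        # prompt types: 0=positive click, 1=negative click,
        #               2=box top-left corner, 3=box bottom-right corner
        self.type_embed = nn.Embedding(4, dim)

    def pos_encode(self, xy):              # xy in [0, 1], shape (N, 2)
        proj = 2 * torch.pi * xy @ self.freqs
        return torch.cat([proj.sin(), proj.cos()], dim=-1)

    def forward(self, xy, types):          # types: (N,) ints in {0..3}
        return self.pos_encode(xy) + self.type_embed(types)

enc = ToyPromptEncoder()
pts = torch.tensor([[0.4, 0.5], [0.1, 0.2], [0.9, 0.8]])   # one click + one box
tokens = enc(pts, torch.tensor([0, 2, 3]))
print(tokens.shape)  # (3, 256) -- prompt tokens consumed by the mask decoder
```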
Mask Decoder and Memory Attention:
A lightweight transformer mask decoder carries skip connections from high-resolution image features and integrates outputs from a memory bank via explicit multi-head cross-attention, in which frame tokens attend over stored key/value tensors from previous frames. Rotary positional encoding strengthens spatial context. The decoder produces mask logits per prompt, a per-mask IoU score, and an occlusion head for per-frame visibility (Ravi et al., 2024, Yan et al., 2024).
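A minimal sketch of the memory cross-attention step, with assumed token counts and rotary encoding omitted; it shows only how current-frame tokens attend over flattened key/value features stored from prior frames:

```python
# Memory cross-attention sketch (assumed shapes, not the official module).
import torch
import torch.nn as nn

dim, heads = 256, 8
attn = nn.MultiheadAttention(dim, heads, batch_first=True)

frame_tokens = torch.randn(1, 32 * 32, dim)   # coarse tokens of the current frame
memory = torch.randn(1, 4 * 32 * 32, dim)     # flattened features of up to 4 stored frames

fused, _ = attn(query=frame_tokens, key=memory, value=memory)
frame_tokens = frame_tokens + fused           # residual memory conditioning
print(frame_tokens.shape)                     # torch.Size([1, 1024, 256])
```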
Streaming Memory:
The memory bank maintains a bounded set of recent encoder/decoder features per tracked object, supporting robust propagation through occlusion, scene change, and even object disappearance (Ravi et al., 2024, Geetha et al., 2024). At each new frame, the newest features are appended and the oldest evicted, forming a sliding window for memory-attention computation.
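A minimal bookkeeping sketch of such a sliding window, assuming a fixed capacity of four frames and FIFO eviction:

```python
# FIFO memory-bank sketch: a fixed-capacity window of (features, mask) pairs
# per tracked object; appending beyond capacity evicts the oldest entry.
from collections import deque
import torch

class MemoryBank:
    def __init__(self, capacity=4):
        self.frames = deque(maxlen=capacity)    # (features, mask) per stored frame

    def update(self, feat, mask):
        self.frames.append((feat, mask))        # oldest entry drops automatically

    def as_memory(self):
        # Flatten stored features into one key/value sequence for cross-attention.
        return torch.cat([f.flatten(1).T for f, _ in self.frames], dim=0)

bank = MemoryBank(capacity=4)
for t in range(6):                              # 6 frames through a window of 4
    bank.update(torch.randn(256, 32, 32), torch.zeros(1, 512, 512))
print(bank.as_memory().shape)                   # (4 * 1024, 256): last 4 frames only
```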
2. Data Engine and Training Paradigm
SAM-2 introduces an interactive annotation data engine, systematically improving model and ground-truth quality via model-in-the-loop learning (Ravi et al., 2024). The three-phase protocol leverages human annotation, semi-automatic mask propagation, and full SAM-2 in the loop for mask refinement, ultimately assembling the 50.9K-video, 35.5M-mask SA-V dataset.
- Pre-training uses SA-1B images; full training mixes image and video data in jointly optimized objectives.
- Losses include focal loss and Dice loss for mask accuracy, an IoU loss for mask ranking, and cross-entropy for the new occlusion head (a minimal sketch of the composite objective follows this list).
- Data augmentation covers geometry, color, and simulated occlusion.
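A minimal sketch of the composite training objective described above; the loss weights and the L1 form of the IoU regression are assumptions, not the published values:

```python
# Composite SAM-2-style objective: focal + Dice for masks, regression for the
# IoU-ranking head, cross-entropy for the occlusion head. Weights illustrative.
import torch
import torch.nn.functional as F

def focal_loss(logits, target, alpha=0.25, gamma=2.0):
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p_t = p * target + (1 - p) * (1 - target)
    a_t = alpha * target + (1 - alpha) * (1 - target)
    return (a_t * (1 - p_t) ** gamma * ce).mean()

def dice_loss(logits, target, eps=1.0):
    p = torch.sigmoid(logits).flatten(1)
    t = target.flatten(1)
    inter = (p * t).sum(-1)
    return (1 - (2 * inter + eps) / (p.sum(-1) + t.sum(-1) + eps)).mean()

def sam2_style_loss(mask_logits, gt_mask, pred_iou, occ_logit, visible):
    with torch.no_grad():                        # actual IoU as regression target
        pred = (mask_logits > 0).float()
        inter = (pred * gt_mask).sum((-2, -1))
        union = ((pred + gt_mask) > 0).float().sum((-2, -1)).clamp(min=1)
        iou_target = inter / union
    return (20.0 * focal_loss(mask_logits, gt_mask)          # assumed 20:1 ratio
            + dice_loss(mask_logits, gt_mask)
            + F.l1_loss(pred_iou, iou_target)                 # mask-ranking head
            + F.binary_cross_entropy_with_logits(occ_logit, visible))  # occlusion head

loss = sam2_style_loss(torch.randn(2, 256, 256),
                       torch.randint(0, 2, (2, 256, 256)).float(),
                       torch.rand(2), torch.randn(2), torch.ones(2))
print(loss.item())
```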
3. Quantitative Evaluation Across Domains
SAM-2 has been evaluated against state-of-the-art (SOTA) specialist and foundation models in generic segmentation, video object segmentation (VOS), instance-level, biomedical, and remote sensing contexts.
Video Segmentation and Mask Propagation:
- Semi-supervised VOS (first-frame ground-truth mask): J&F scores up to 91.6 on DAVIS 2017 val (Ravi et al., 2024).
- Promptable video accuracy: surpasses SAM+XMem++ and SAM+Cutie by ~7–9 J&F points across standard video benchmarks under 3-click-per-frame interaction.
Image Segmentation:
- On 37 zero-shot datasets, SAM-2 (Hiera-B+) achieves 58.9 / 81.7 mIoU (1/5 clicks) at 130 FPS, outperforming SAM-1 (ViT-H) (Ravi et al., 2024).
Class-Agnostic Instance and Fine-Grained Segmentation:
- In box-prompt mode, SAM-2 matches or exceeds SOTA on salient, camouflaged, and shadow instance segmentation: e.g., AP70 = 96.7 (ILSO), AP = 68.8 (COD10K) (Pei et al., 2024).
- For fine detail (the DIS task), F-measure gains are evident over SAM but remain below supervised SOTA, highlighting prompt- and resolution-driven limitations.
Biomedical and Medical Domains:
- SAM-2, when adapted as MedSAM-2 and BioSAM-2, achieves a +36.9-point Dice uplift over vanilla SAM-2 on the 3D multi-organ BTCV benchmark (88.6 vs. 51.6), and top Dice scores on 2D/3D organ/lesion benchmarks, surpassing even fine-tuned CNNs/Transformers without per-dataset tuning (Zhu et al., 2024, Yan et al., 2024).
- One-prompt propagation in medical workflows is enabled via self-sorting memory banks, eliminating the need for repeated user interaction (a confidence-sorted memory sketch follows this list).
- In cell tracking, zero-shot SAM-2 matches or exceeds specialist methods in linking accuracy (LNK=0.984, BIO=0.862) without dataset-specific bias (Chen et al., 12 Sep 2025).
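A minimal sketch of a confidence-sorted ("self-sorting") memory bank in this spirit; the top-k selection rule here is an assumption rather than MedSAM-2's exact scheme:

```python
# Confidence-sorted memory bank sketch: retain the k highest-confidence
# frames rather than the k most recent (selection rule is an assumption).
import heapq
import torch

class ConfidenceMemoryBank:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = []        # min-heap of (confidence, counter, features)
        self._counter = 0        # tie-breaker so heapq never compares tensors

    def update(self, feat, confidence):
        item = (float(confidence), self._counter, feat)
        self._counter += 1
        if len(self.entries) < self.capacity:
            heapq.heappush(self.entries, item)
        elif confidence > self.entries[0][0]:
            heapq.heapreplace(self.entries, item)  # evict lowest-confidence frame

bank = ConfidenceMemoryBank()
for conf in [0.9, 0.3, 0.7, 0.8, 0.95, 0.2]:
    bank.update(torch.randn(256, 32, 32), conf)
print(sorted(c for c, _, _ in bank.entries))       # kept: [0.7, 0.8, 0.9, 0.95]
```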
Remote Sensing and Vision+Language:
- RS2-SAM-2 adapts the baseline with joint vision/text encoding, bidirectional hierarchical fusion, and dense mask-prompt generation, achieving state-of-the-art mean IoU and overall IoU on referring RS image segmentation benchmarks (Rong et al., 10 Mar 2025).
- Dense prompts and text-guided boundary loss are essential for small or camouflaged object localization.
Prompt Strategy Insights:
- User bounding boxes maximize IoU (~0.79 in high-res/optimal lighting) and robustness (Rafaeli et al., 2024).
- Sparse points are sensitive to adverse conditions, but SAM-2 offers improved mask growing (ΔIoU +0.06 vs. SAM in shaded imagery).
- Automated YOLOv9 boxes provide reliable, fully automatic prompts, matching CNN performance in favorable scenarios.
4. Domain-Specific Adaptations and Limitations
SAM-2’s generalist design is subject to domain gaps when applied to specialized data such as medical images, microscopy, remote sensing, or camouflaged objects:
- Medical imaging: The natural-image pretraining causes under-segmentation of subtle anatomical structures. MedSAM-2 mitigates this with confidence-based memory filtering and prompt propagation, but further fine-tuning of encoder/decoder heads is often needed for full SOTA performance (Zhu et al., 2024, Yan et al., 2024).
- Camouflaged Object Detection: In prompt-free auto mode, SAM-2’s recall drops dramatically compared to SAM-1 (e.g., Fβw=0.184 vs. 0.606 on COD10K), due to a conservative mask-generator and high confidence thresholds (Tang et al., 2024). Promptable mode offsets this loss with explicit guidance, but general camouflage detection benefits from mask diversity and lower confidence calibration.
- Fine-grained and high-resolution detail: Default input and mask resolutions limit boundary accuracy (evidenced by Human Correction Effort on the DIS benchmark). Prompt engineering and multi-scale inputs are necessary for slender or textured object recovery (Pei et al., 2024).
5. Temporal Reasoning and Memory Attention Mechanisms
The transition to video segmentation is anchored by SAM-2’s streaming memory attention and object pointer constructs:
- Temporal memory: A sliding-window memory bank over recent frames enables mask persistence, occlusion recovery, and drift-resistant tracking.
- Progressive sifting: Intermediate representations reveal a trajectory in which raw encoder output is ambiguous, memory attention begins context-filtering, prompt cross-attention isolates the target, and the mask decoder commits to object identity (Bromley et al., 25 Feb 2025).
- Quantitative separability: At the prompt-attention and pointer stages, linearly separable embeddings (>99% frame classification accuracy) demarcate object-present versus object-absent frames even under occlusions, overlays, and interjections (a toy probe sketch follows this list).
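A toy version of such a linear probe, run on synthetic embeddings (real experiments would probe SAM-2's prompt-attention or object-pointer tokens):

```python
# Linear-probe sketch: logistic regression separating "object present"
# from "object absent" frame embeddings. Embeddings here are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
present = rng.normal(loc=+0.5, scale=1.0, size=(500, 256))  # object-present frames
absent = rng.normal(loc=-0.5, scale=1.0, size=(500, 256))   # object-absent frames
X = np.vstack([present, absent])
y = np.array([1] * 500 + [0] * 500)

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
print(f"frame classification accuracy: {probe.score(Xte, yte):.3f}")
```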
6. Scalability, Throughput, and Deployment
SAM-2 is engineered for real-time inference:
- Inference speed: Hiera-B+ backbone achieves up to 130 FPS (1024×1024), a six-fold improvement over SAM (Ravi et al., 2024, Rafaeli et al., 2024).
- Prompt efficiency: requires roughly 3× fewer interactions for video segmentation than prior approaches.
- Dataset scale: trained on the 50.9K-video, 35.5M-mask SA-V dataset in addition to SA-1B, ensuring object, scene, and context diversity.
- Open-source availability: Permissive licenses and large-scale datasets underpin reproducibility and community adoption.
7. Recommendations, Future Directions, and Open Technical Challenges
Persistent technical themes include:
- Prompt engineering: Enhanced localization via adaptive proposal modules, multi-scale prompt resolution, and domain-specific adapters are recommended for challenging instances (Pei et al., 2024, Rong et al., 10 Mar 2025).
- Memory adaptation: Confidence sorting and weighted fusion (as in MedSAM-2) unlock one-prompt segmentation, minimize user interaction, and track objects in both 2D and 3D.
- Boundary refinement: Auxiliary losses (e.g., text-guided boundary loss) and improved upsampling blocks sharpen output mask edges, critical under adverse imaging conditions (Rong et al., 10 Mar 2025).
- Domain adaptation: Fine-tuning on biomedical, remote sensing, low-SNR, or camouflaged object datasets improves recall and detail. Freezing prompt modules while adapting encoder/decoder is an effective strategy (Yan et al., 2024).
- Scalability and context: The bounded sliding window memory (L=4 frames) is insufficient for long video or volumetric contexts; future work may incorporate adaptive memory or object-graph priors (Geetha et al., 2024).
- Auto vs. promptable tradeoffs: SAM-2 sacrifices promptless mask diversity for temporal consistency and conservative masking; recalibration and multi-threshold decoding are needed to recover sensitivity for subtle detection tasks (Tang et al., 2024) (a generic multi-threshold decoding sketch follows this list).
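A generic sketch of multi-threshold decoding over raw mask logits; the thresholds and the IoU-based deduplication rule are illustrative assumptions, not SAM-2's released automatic mask generator:

```python
# Multi-threshold decoding sketch: binarize one logit map at several
# thresholds to recover recall, keeping only sufficiently distinct masks.
import numpy as np

def _iou(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def multi_threshold_masks(logits, thresholds=(0.0, -0.5, -1.0), min_area=50):
    """Decode one logit map at several thresholds; keep distinct candidates."""
    masks = []
    for t in thresholds:                          # lower thresholds = higher recall
        m = logits > t
        if m.sum() < min_area:
            continue
        # drop near-duplicates of already-kept masks (IoU > 0.9)
        if all(_iou(m, kept) <= 0.9 for kept in masks):
            masks.append(m)
    return masks

logits = np.random.default_rng(0).normal(size=(256, 256))
print(len(multi_threshold_masks(logits)), "candidate masks")
```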
SAM-2 thus combines broad zero-shot segmentation capability, efficient temporal tracking, and modular prompt handling, while ongoing research addresses its limitations in automatic discovery, fine-detail segmentation, and multi-domain adaptation. Its open-source release and documented empirical benchmarks facilitate further advancement in both generic and specialized computer vision applications.