Segment Anything with Concepts (SA-Co)
- The paper introduces a unified framework that grounds segmentation masks in open-vocabulary concept prompts, covering detection, segmentation, and video tracking.
- It details a scalable data engine that combines human verification with synthetic, curated, and web-sourced images to create a high-quality segmentation dataset.
- The SAM 3 architecture uses a dual encoder–decoder design with a shared backbone, decoupling recognition from localization to improve real-time segmentation and tracking accuracy.
Segment Anything with Concepts (SA-Co) denotes a unified approach and benchmark for large-scale, promptable concept segmentation in images and videos. The methodology centers on grounding segmentation masks directly in open-vocabulary "concept" prompts—short noun phrases or image exemplars—enabling the detection, segmentation, and, in videos, tracking of arbitrary user-specified categories not restricted to pre-defined label sets. The term encompasses both the Segment Anything Model 3 (SAM 3) and the data engine and benchmarks underpinning the comprehensive evaluation of promptable concept segmentation (PCS) (Carion et al., 20 Nov 2025).
1. Foundations of Promptable Concept Segmentation
Promptable Concept Segmentation (PCS) is formulated as follows: given an image or a short video clip and a concept prompt, which may be a short noun phrase (e.g., "striped cat"), zero or more image exemplars (positive/negative bounding boxes), or a hybrid of the two, the goal is to output a segmentation mask for every instance matching the prompt and, in the case of video, a persistent identity track. The prompt is represented as concatenated tokens, with the textual component encoded by a text encoder and the image exemplars embedded by a dedicated exemplar encoder, yielding a single prompt-token sequence (Carion et al., 20 Nov 2025).
Conditioning is achieved by cross-attending unconditioned image-patch tokens from a shared perception encoder (Vision Transformer, ViT-based) with prompt tokens using a dedicated fusion encoder. This contextualizes the backbone features for query decoders, supporting open-vocabulary instance segmentation.
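The PCS input/output contract described above can be sketched as a minimal interface. This is an illustrative sketch, not the paper's API: the class and function names (`ConceptPrompt`, `InstancePrediction`, `segment_concept`) are hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical types sketching the PCS contract; names are illustrative,
# not taken from the SAM 3 codebase.

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

@dataclass
class ConceptPrompt:
    noun_phrase: Optional[str] = None                 # e.g. "striped cat"
    positive_boxes: List[Box] = field(default_factory=list)  # exemplars to include
    negative_boxes: List[Box] = field(default_factory=list)  # exemplars to exclude

@dataclass
class InstancePrediction:
    mask: list                        # binary mask (H x W), nested list for simplicity
    score: float                      # final, presence-gated detection score
    track_id: Optional[int] = None    # persistent identity, populated for video

def segment_concept(frame, prompt: ConceptPrompt) -> List[InstancePrediction]:
    """Placeholder for the PCS forward pass: one prediction per instance
    matching the prompt; for video, each carries a persistent track id."""
    raise NotImplementedError
```

The key point the interface captures is that a single prompt can mix text and exemplar boxes, and that the output is a *set* of instances rather than one mask.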
2. Scalable Data Engine and Ontological Expansion
SA-Co addresses the data bottleneck for open-vocabulary segmentation through a four-phase, human-plus-AI data engine, spanning three major image sources (high-quality hand-curated, synthetic, external web) and a dedicated video source. The pipeline includes:
- Mining (image, NP) pairs via captioner and parser.
- Generating masks with predecessor segmentation models (e.g., SAM 2) and an open-vocab detector.
- Rigorous human Mask Verification (MV), Exhaustivity Verification (EV), and human correction for high-quality mask ground truth.
- Training LLaMA-based AI verifiers to automate mask validation processes, scaling up with adversarial hard negative prompt sampling.
- Domain expansion through the Wikidata ontology for coverage across 15 domains (e.g., robotics, art, medical).
- Video-specific phase employing masklet propagation and human correction.
The result is a dataset comprising 5.2M high-quality images (4.0M unique noun phrases, 52.3M masks), 39.4M synthetic images (1.7B image-NP pairs, 1.4B masks), 9.3M external images (136.6M image-NPs, 70.5M masks), and 52.5K videos (134K video-NP pairs, 467K masklets). This modular engine enables out-of-domain generalization and synthetic-to-real adaptation (Carion et al., 20 Nov 2025).
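The routing logic of the human-plus-AI engine can be sketched as a simple loop: propose masks, let an AI verifier pass confident cases, and fall back to human verification otherwise. Everything here is an assumption made for illustration: the function names, the confidence threshold, and the single-pass structure are not the paper's implementation.

```python
# Hypothetical sketch of the AI-plus-human verification routing described
# above. `propose_masks`, `ai_verify`, and `human_verify` are stand-ins for
# the mask-proposal models (e.g. SAM 2 + an open-vocab detector), the
# LLaMA-based AI verifiers, and the human MV/EV stages respectively.

def run_data_engine(image_np_pairs, propose_masks, ai_verify, human_verify,
                    ai_confidence_threshold=0.9):
    """Route each (image, noun-phrase) pair through mask proposal and AI
    verification; only low-confidence cases fall back to humans."""
    accepted = []
    for image, noun_phrase in image_np_pairs:
        masks = propose_masks(image, noun_phrase)
        verdict, confidence = ai_verify(image, noun_phrase, masks)
        if confidence < ai_confidence_threshold:
            verdict = human_verify(image, noun_phrase, masks)  # MV + EV
        if verdict:
            accepted.append((image, noun_phrase, masks))
    return accepted
```

The design intent this mirrors is the paper's scaling lever: AI verifiers absorb the bulk of validation so that expensive human passes concentrate on hard or ambiguous cases.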
3. Model Architecture: Unified Detection, Segmentation, and Tracking
SAM 3 implements an end-to-end dual encoder–decoder architecture with a shared PE backbone comprising vision (ViT with windowed/global attention) and text encoders (RoPE, contrastively pre-trained on 5.4B image-text pairs). Key architectural components:
- Detector:
- Inputs are encoded images and concatenated prompt tokens.
- Fusion encoder (stacked transformer layers) cross-attends from image to prompt.
- Transformer decoder uses learned object queries for instance prediction.
- Mask head (MaskFormer) and box head (MLP) yield segmentation masks and bounding boxes.
  - Semantic and presence heads: the presence head introduces a dedicated token fused with the queries at each decoder layer, producing a global presence probability that multiplies each query's localization score to yield the final detection score.
- Video Tracker:
- Masklet-based: single-frame propagation (as in SAM 2), IoU-based detection–masklet matching, confirmation delay, duplicate suppression, periodic re-prompting.
- Temporal disambiguation is applied via a track confirmation delay, duplicate-mask suppression, and identity maintenance over shots.
This design decouples recognition and localization, boosting accuracy and enabling end-to-end optimization for both images and videos with prompt-driven queries (Carion et al., 20 Nov 2025).
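The presence-head gating above reduces to simple score arithmetic: a global "is the concept in the image?" probability scales every query's "where is it?" score. The sketch below shows only that score math with numpy; the learned token fusion in SAM 3 is not reproduced, and the logit values are made up for illustration.

```python
import numpy as np

def gated_scores(localization_scores, presence_logit):
    """Multiply per-query localization scores by a global presence
    probability (sigmoid of the presence head's logit)."""
    presence_prob = 1.0 / (1.0 + np.exp(-presence_logit))
    return presence_prob * np.asarray(localization_scores)

# If the presence head is confident the concept is absent, even
# well-localized queries are suppressed; if confident it is present,
# localization scores pass through nearly unchanged.
absent = gated_scores([0.9, 0.7], presence_logit=-4.0)   # presence ~ 0.018
present = gated_scores([0.9, 0.7], presence_logit=4.0)   # presence ~ 0.982
```

This is the decoupling in miniature: the localization scores never have to encode global recognition, because the presence term handles it.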
4. Training Objectives and Optimization
Training encompasses detection, segmentation, classification, and presence tasks. The composite detector loss is a weighted sum of box regression, mask-focal, mask-dice, classification, and presence terms, each with its own weighting coefficient.
For video, an auxiliary VOS loss adds mask terms (focal + dice), an MAE term on predicted IoU, and a cross-entropy term for occlusion prediction; the total loss sums the detector and VOS objectives. Training is staged, with domain adaptation using synthetic data and a systematic progression from pre-training to large-scale fine-tuning on the SA-Co data engine outputs (Carion et al., 20 Nov 2025).
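The mask-focal and mask-dice terms named above have standard forms, sketched below in numpy. The weights, alpha/gamma values, and reductions are generic defaults, not the coefficients SAM 3 actually uses.

```python
import numpy as np

# Illustrative implementations of the mask focal and dice losses referenced
# in the composite detector loss; hyperparameters are generic defaults.

def dice_loss(pred, target, eps=1e-6):
    """Dice loss: 1 minus the soft overlap ratio of prediction and target."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def focal_loss(pred, target, alpha=0.25, gamma=2.0, eps=1e-6):
    """Focal loss: cross-entropy down-weighted on easy (high-pt) pixels."""
    pred = np.clip(pred, eps, 1.0 - eps)
    pt = np.where(target == 1, pred, 1.0 - pred)
    w = np.where(target == 1, alpha, 1.0 - alpha)
    return float((-w * (1.0 - pt) ** gamma * np.log(pt)).mean())

def mask_loss(pred, target, w_focal=1.0, w_dice=1.0):
    # Weighted sum, mirroring the composite-loss structure; the paper's
    # actual coefficients are not reproduced here.
    return w_focal * focal_loss(pred, target) + w_dice * dice_loss(pred, target)
```

Dice drives global mask overlap while focal concentrates gradient on hard pixels, which is why the two are typically combined.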
5. Inference Pipeline and Real-Time Segmentation
At inference, the system enables:
- Image PCS: A single forward pass yields up to 200 segmentation proposals, filtered by their final presence-gated scores, with computation under 30 ms for 100 objects per image.
- Video PCS: Detects new objects framewise, propagates extant masklets, matches via IoU, executes temporal disambiguation routines, and supports near real-time tracking for approximately five concurrent objects in 30-second clips.
This pipeline supports interactive segmentation at scale, with persistent identity assignment for tracked objects in videos.
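The detection-to-masklet matching step can be sketched as greedy IoU assignment over boxes. This is a simplified stand-in: SAM 3's actual matcher, thresholds, and the confirmation-delay bookkeeping may differ.

```python
# Greedy sketch of the IoU-based detection-to-masklet matching used in the
# video tracker; threshold and greedy strategy are illustrative assumptions.

def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def match_detections(masklets, detections, iou_threshold=0.5):
    """Assign each new detection to its best unmatched masklet; unmatched
    detections would spawn new tracks subject to the confirmation delay."""
    pairs, used = [], set()
    for d_idx, det in enumerate(detections):
        candidates = [(box_iou(det, m), m_idx)
                      for m_idx, m in enumerate(masklets) if m_idx not in used]
        if candidates:
            best_iou, best_idx = max(candidates)
            if best_iou >= iou_threshold:
                pairs.append((d_idx, best_idx))
                used.add(best_idx)
    return pairs
```

A production tracker would typically use optimal assignment (e.g. Hungarian matching) rather than this greedy pass, but the IoU-gated association logic is the same.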
6. Benchmarking, Empirical Results, and Ablations
SA-Co introduces new benchmarks (207K NPs, 3.3M image-NPs, 11K videos) and metrics tailored to concept segmentation:
- Classification-gated F1 (cgF1): positive-macro F1 (pmF1) gated by the image-level Matthews correlation coefficient, i.e., cgF1 = pmF1 × IL_MCC, so localization quality counts only insofar as image-level recognition is correct.
- Additional metrics: AP (box detection, 0.5–0.95 IoU), mIoU, interactive PVS (Jaccard for interactive scenarios), and video PCS metrics (cgF1, pHOTA, mAP).
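The gating is a simple product, which can be checked against the reported numbers (small rounding differences aside; the helper name is illustrative):

```python
# cgF1 gates mask-level quality (pmF1) by image-level classification
# quality (IL_MCC): cgF1 = pmF1 * IL_MCC.

def cg_f1(pm_f1, il_mcc):
    return pm_f1 * il_mcc

# Reported SAM 3 values: pmF1 = 66.1, IL_MCC = 0.82, cgF1 = 54.1;
# the product reproduces this up to rounding.
sam3 = cg_f1(pm_f1=66.1, il_mcc=0.82)   # ~54.2
```

The effect of the gate is visible in the baselines: a method with decent masks but poor image-level recognition (low IL_MCC) still ends up with a low cgF1.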
Key results (zero-shot, SA-Co Image gold-standard):
| Method | cgF1 | IL_MCC | pmF1 |
|---|---|---|---|
| OWLv2 | 17.3 | 0.46 | 36.8 |
| GroundingDino | 3.3 | 0.15 | 27.3 |
| DINO-X | 21.3 | 0.38 | 55.2 |
| Gemini2.5 | 13.0 | 0.29 | 46.1 |
| SAM 3 | 54.1 | 0.82 | 66.1 |
| Human | ~72.8 | 0.94 | 77.0 |
Box detection AP: On LVIS, SAM 3 attains a higher AP than OWLv2. On video PCS, SAM 3 reaches a cgF1 of 30.3, compared to 0.1 for OWLv2/GLEE (Carion et al., 20 Nov 2025).
Ablation studies demonstrate:
- The presence head improves F1 and IL_MCC (+1.5 F1).
- Hard negative mining increases baseline cgF1 from 28.3 (no negatives) to 43.0 (30 negatives/image).
- Training data scaling (adding synthetic and HQ) lifts cgF1 from 23.7 (external only) to 47.4 (all sources).
- Incorporating AI verifiers improves cgF1 from 54.0 (none) to 62.3, closing roughly half the gap to human performance.
- For domain adaptation, with synthetic domain data, performance approaches that of HQ-labeled data without further manual annotation.
7. Relationship to Concept-Based XAI and Concept Mask Segmentation
SA-Co builds on the progression from fixed-concept or cluster-based instance segmentation (Sun et al., 2023) and earlier work in semantic concept segmentation (Wang et al., 2018), addressing limitations of manual annotation, fixed concept sets, or indirect discovery via clustering.
- The Concept Mask approach defines segmentation through a learned concept embedding combined with attention-map-driven mask generation, capable of weakly-supervised and zero-shot generalization across 18K+ concepts (Wang et al., 2018).
- Explain Any Concept (EAC) leverages SAM to produce per-image segment candidates, introducing a per-input equivalent (PIE) scheme for efficient, Shapley-value-based, concept-level explanations in black-box model analysis. EAC reports higher insertion AUC and expert-judged interpretability than prior XAI methods (Sun et al., 2023).
- SAM 3/SA-Co extends key principles from both, supporting promptable, open-vocabulary, and cross-domain concept segmentation and tracking at scale with high empirical accuracy, human-aligned ground truth, and a scalable, modular data pipeline.
This suggests that SA-Co represents a paradigmatic shift towards explicit concept-level interaction in segmentation tasks, supporting real-world demands for both explainability and flexible, open-vocabulary understanding in vision models.