
Self-Prompted Depth-Aware SAM

Updated 9 February 2026
  • SPDA-SAM is a framework that integrates self-prompt generation with depth estimation, enabling spatially-aware segmentation across varied domains.
  • It leverages geometric, learned, and dense prompting techniques to autonomously generate segmentation cues from both RGB and depth data.
  • Modular adapters and fusion modules are incorporated into SAM’s core architecture, yielding significant performance gains in urban, natural, and medical imaging applications.

Self-prompted Depth-Aware Segment Anything Model (SPDA-SAM) refers to a family of frameworks that combine the vision transformer-based Segment Anything Model (SAM) with automatic, dense prompt generation and explicit depth information integration for instance or semantic segmentation. These methods address two key limitations of original SAM: reliance on human-provided (or external) prompts, and lack of inherent 3D spatial awareness due to the exclusive use of RGB imagery. SPDA-SAM leverages depth cues—extracted from monocular, multi-view, or volumetric data—to guide both prompt generation and feature fusion, enabling spatially-aware, fully automated segmentation across diverse domains including natural scenes, urban environments, camouflaged object detection, and medical imaging.

1. Core Architectural Principles

SPDA-SAM variants are unified by three principles:

  1. Self-prompting: All manual user prompts are replaced or supplemented by prompts generated automatically from image content. Strategies include geometric extraction from depth maps, learned prompt modules, and dense geometric prompting (detailed in Section 2).
  2. Depth Awareness: Depth estimation, either from monocular models (e.g., Depth Anything V2), multi-view stereo (MVS), or medical volume data, plays a central role. It informs:
    • Prompt generation by highlighting 3D spatial structures not obvious in RGB (e.g., depressions for sinkholes (Rafaeli et al., 2024)).
    • Feature fusion architectures that align and integrate RGB and depth cues at multiple scales (coarse-to-fine fusion modules (Shang et al., 6 Feb 2026), cross-modal graph modules (Li et al., 6 Jan 2026)).
  3. Decoupled, Modular Integration with SAM: SPDA-SAM architectures typically leave SAM’s core vision transformer largely intact (often frozen), inserting learnable, lightweight adapters (e.g., LoRA (Shang et al., 6 Feb 2026), DfusedAdapter (Xie et al., 2 Feb 2025)), prompt generators, or fusion blocks either before or within the mask decoder.
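The decoupled integration principle above can be sketched as a LoRA-style low-rank update: the pretrained weight stays frozen and only two small matrices are trained. This is a minimal, framework-free NumPy illustration; the names, shapes, and scaling are illustrative and not taken from any cited implementation:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """Frozen linear layer W augmented with a trainable low-rank delta.

    x : (d_in,)       input vector
    W : (d_out, d_in) frozen pretrained weight (e.g., a SAM attention projection)
    A : (r, d_in)     trainable down-projection, r << d_in
    B : (d_out, r)    trainable up-projection, zero-initialized
    """
    r = A.shape[0]
    return W @ x + (alpha / r) * (B @ (A @ x))

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 4, 2
W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((r, d_in))
B = np.zeros((d_out, r))          # zero init: adapter starts as a no-op
x = rng.standard_normal(d_in)

# With B = 0 the adapted layer exactly reproduces the frozen one, so
# training starts from the pretrained model's behavior.
assert np.allclose(lora_forward(x, W, A, B), W @ x)
```

Because the frozen path is untouched, only A and B (a few percent of the layer's parameters) receive gradient updates, which is what preserves SAM's open-set behavior.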

2. Self-Prompt Generation Techniques

There are two major categories of self-prompting strategies in SPDA-SAM:

  • Geometric Prompt Extraction: In terrains (e.g., sinkholes), closed depression analysis on relative depth maps is applied: a fill operation yields $d_\text{filled}$, and the difference $d_\text{filled} - d(x,y)$ highlights sunken regions. Thresholding and connected component analysis then generate bounding boxes used as SAM prompts (Rafaeli et al., 2024). Similar logic is used for 3D medical volume regions (multi-scale auxiliary masks via convolutional decoders (Xie et al., 2 Feb 2025)) and multi-view segmentation (predicted depth/disparity as prompt embeddings (Shvets et al., 2023)).
  • Learned Prompt Modules: Semantic-Spatial Self-prompt Modules (SSSPM) synthesize prompts by fusing low- and high-level features from depth-aware and RGB streams. For example, SSSPM in (Shang et al., 6 Feb 2026) aligns features from two depths of the transformer, creates attention-weighted semantic maps, and encodes both semantic (from features) and spatial (from coarse masks) cues, merging these for refined prompting.
  • Dense Geometric Prompting: DGA-Net (Li et al., 6 Jan 2026) uses the predicted dense depth as an input to the prompt encoder, with further calibration by graph-based attention over multi-scale RGB and depth features, forming continuous, high-dimensional geometric prompts for the mask decoder.
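The closed-depression pipeline from the first bullet can be sketched as follows. This is a toy approximation: a grayscale morphological closing stands in for the paper's fill operation, and the threshold, window size, and depth values are made-up illustrative choices:

```python
import numpy as np
from scipy import ndimage

def depression_prompts(depth, threshold=0.5, size=5):
    """Derive SAM-style box prompts from closed depressions in a depth map.

    A grayscale closing approximates the fill operation d_filled; the
    residual (d_filled - d) is positive inside sunken regions, so
    thresholding plus connected-component analysis yields bounding boxes.
    """
    filled = ndimage.grey_closing(depth, size=(size, size))
    depression = filled - depth            # > 0 inside depressions
    mask = depression > threshold
    labels, _ = ndimage.label(mask)
    boxes = []
    for y_sl, x_sl in ndimage.find_objects(labels):
        # (x0, y0, x1, y1) box, half-open in each axis
        boxes.append((x_sl.start, y_sl.start, x_sl.stop, y_sl.stop))
    return boxes

# Toy relative depth map: flat terrain with one small square pit.
depth = np.zeros((20, 20))
depth[8:11, 8:11] = -2.0                   # 3x3 depression
print(depression_prompts(depth))           # one box around the pit
```

Each returned box can then be fed to SAM's prompt encoder as an automatic bounding-box prompt in place of a human click.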

3. Depth-Aware Feature Fusion Architectures

Effective integration of RGB and depth modalities is accomplished through several mechanisms:

  • Coarse-to-Fine RGB-D Fusion (C2FFM): Dual-path ViT encoders process RGB and depth inputs separately, with fusion blocks inserted after patch embeddings (“coarse” for global structure) and after interleaved transformer blocks (“fine” for local variations). At each stage, features are aligned using dilated convolutions, attention mechanisms, or explicit cross-modal attention (Shang et al., 6 Feb 2026).
  • Graph-Based Cross-Modal Fusion: DGA-Net (Li et al., 6 Jan 2026) employs Cross-modal Graph Enhancement (CGE), constructing a heterogeneous graph comprising both RGB and depth nodes from multi-scale features. Graph pooling and cross-modal attention propagate semantic/geometric cues, which are then unpooled to original shapes for subsequent decoding.
  • Depth-Fused Adapters (DfusedAdapter): In medical imaging, lightweight MLP bottleneck adapters are injected into each transformer block, fusing information along the depth (slice) axis and augmenting standard 2D SAM representations with 3D context. These adapters are trained while the bulk of the SAM weights remain frozen (Xie et al., 2 Feb 2025).
  • Semantic-Structural Anchoring and Cross-Level Propagation: AGR modules (Anchor-Guided Refinement, (Li et al., 6 Jan 2026)) build global anchors by merging the deepest-level features from SAM and external RGB encoders, then use non-local connections to broadcast these anchors to shallow levels, countering information decay.
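The depth-fused bottleneck adapter idea can be sketched in a few lines. This is a hedged illustration, not the cited paper's implementation: it assumes cross-slice context is injected by mean pooling over the slice axis inside the bottleneck, and the zero-initialized up-projection makes the adapter an identity at the start of training:

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def dfused_adapter(tokens, W_down, W_up):
    """(D, N, C) -> (D, N, C) bottleneck adapter with cross-slice mixing.

    tokens : D slices, N patch tokens per slice, C channels. The 2D SAM
    block sees each slice independently; the adapter down-projects,
    shares a pooled context across the D (slice) axis, and projects back
    up through a residual connection.
    """
    h = gelu(tokens @ W_down)            # (D, N, r) bottleneck features
    ctx = h.mean(axis=0, keepdims=True)  # (1, N, r) shared 3D context
    h = h + ctx                          # inject cross-slice information
    return tokens + h @ W_up             # residual add back into the block

rng = np.random.default_rng(0)
D, N, C, r = 4, 16, 32, 8
tokens = rng.standard_normal((D, N, C))
out = dfused_adapter(tokens, rng.standard_normal((C, r)) * 0.02,
                     np.zeros((r, C)))   # zero-init up-projection
assert np.allclose(out, tokens)          # identity before any training
```

Only W_down and W_up would be trained, consistent with the frozen-backbone strategy described above.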

4. Training Objectives and Loss Formulations

SPDA-SAM training strategies are multi-objective, balancing prompt generation, depth estimation, and mask segmentation:

  • Region Proposal Loss: Standard RPN (Region Proposal Network) losses for learning bounding boxes or region masks (classification cross-entropy + smooth-L1).
  • Segmentation Losses: Pixel-wise cross-entropy and/or Dice coefficient losses are applied to both coarse and refined mask predictions. For instance, $\mathcal{L} = \mathcal{L}_\text{BCE} + \mathcal{L}_\text{Dice}$ in (Rafaeli et al., 2024), aggregated as $L = \lambda_1 L_\text{RPN} + \lambda_2 L_\text{Coarse} + \lambda_3 L_\text{Refined}$ in (Shang et al., 6 Feb 2026).
  • Multi-Task and Deep Supervision: Methods with joint depth and segmentation supervision combine smooth-L1 losses on predicted depths with segmentation losses (e.g., $L_\text{total} = \alpha L_\text{depth} + \beta L_\text{seg}$ (Shvets et al., 2023)), often with auxiliary supervision at multiple scales (Xie et al., 2 Feb 2025).
  • Parameter Updates: Most systems fine-tune only adapters or mask decoders; ViT-based SAM encoders are generally frozen or optimized with low learning rates (LoRA, (Shang et al., 6 Feb 2026)), preserving open-set generalization.
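The aggregated objective can be written out directly. This sketch assumes sigmoid-probability mask predictions and treats the RPN term as a precomputed scalar; the λ weights are illustrative defaults, not values from the cited papers:

```python
import numpy as np

def bce_loss(pred, target, eps=1e-7):
    """Pixel-wise binary cross-entropy over probability maps."""
    p = np.clip(pred, eps, 1 - eps)
    return -(target * np.log(p) + (1 - target) * np.log(1 - p)).mean()

def dice_loss(pred, target, eps=1e-7):
    """Soft Dice loss: 1 - 2|P∩T| / (|P| + |T|)."""
    inter = (pred * target).sum()
    return 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)

def total_loss(l_rpn, coarse, refined, target, lams=(1.0, 1.0, 1.0)):
    """L = lam1*L_RPN + lam2*L_Coarse + lam3*L_Refined, where each
    segmentation term is the BCE + Dice sum used in the cited works."""
    seg = lambda p: bce_loss(p, target) + dice_loss(p, target)
    l1, l2, l3 = lams
    return l1 * l_rpn + l2 * seg(coarse) + l3 * seg(refined)

# A perfect prediction drives the segmentation terms to (numerically) zero.
target = np.array([[0.0, 1.0], [1.0, 0.0]])
print(total_loss(0.0, target, target, target))
```

In practice the coarse mask comes from the self-prompt branch and the refined mask from SAM's decoder, so the weighted sum supervises both stages at once.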

5. Benchmark Results and Ablation Analyses

Table: Representative SPDA-SAM Quantitative Results

| Domain | Key Dataset | SPDA-SAM Result | Relative Gain vs. Prior |
|---|---|---|---|
| Terrestrial instance segmentation | COME15K-E (Shang et al., 6 Feb 2026) | mAP = 68.4 | +1.7 |
| Urban driving | Cityscapes (Shang et al., 6 Feb 2026) | mAP = 31.5 | +11.3 |
| Camouflaged object detection | CAMO (Li et al., 6 Jan 2026) | $S_m$ = 0.906 | +0.031 |
| Medical 3D segmentation | AMOS2022 (Xie et al., 2 Feb 2025) | Dice = 76.8 | +2.3 |
| Multi-view scene understanding | ScanNet (Shvets et al., 2023) | mIoU = 62.5 | +6.4 |
| Sinkhole segmentation | Yaen (Rafaeli et al., 2024) | IoU = 40.27 | +28.7 (vs. DEM) |

Ablation analyses in these studies consistently attribute the reported gains to the self-prompting and depth-aware fusion modules described in Sections 2 and 3.

6. Domain-Specific Extensions

  • Remote Sensing and Topography: SinkSAM applies closed depression logic to monocular depth, producing high-precision prompts for SAM that outperform photogrammetry-derived DEM baselines by a wide margin (Rafaeli et al., 2024).
  • Robotics and 3D Vision: Multi-view systems jointly estimate depth and semantic segmentation, with mutual benefit to both modalities (Shvets et al., 2023).
  • Medical Volumes: SPDA-SAM with DfusedAdapters achieves state-of-the-art 3D segmentation with <5% extra parameter count. Auxiliary masks and DT-based centroids fully automate prompt extraction (Xie et al., 2 Feb 2025).
  • Camouflaged Object Detection: Self-generated dense depth prompts and cross-modal graph enhancement address spectral camouflage by enforcing 3D spatial consistency (Li et al., 6 Jan 2026).

7. Limitations and Future Directions

SPDA-SAM’s effectiveness is bounded by the quality of monocular or multi-view depth estimation algorithms. Errors in geometric cues propagate directly to prompt generation and mask accuracy. The dual-stream or adapter strategies add computational and memory overhead due to multi-branch encoding and feature fusion operations, especially in large ViT backbones or volumetric contexts (Xie et al., 2 Feb 2025).

Potential research avenues, as highlighted in primary works, include:

  • Lightweight, end-to-end learnable depth fusion.
  • Joint RGB-depth and semantic pretraining.
  • Prompt generators leveraging text or learned semantic attributes.
  • Extension to video, multi-view, and temporal domains for robust cross-scene generalization.

SPDA-SAM represents a modular, extensible paradigm for self-prompted, depth-aware segmentation, demonstrating robust performance and cross-domain applicability by coupling automatic geometric prompting with explicit 3D feature modeling in the SAM framework (Shang et al., 6 Feb 2026, Rafaeli et al., 2024, Li et al., 6 Jan 2026, Xie et al., 2 Feb 2025, Shvets et al., 2023).
