
Tri-Plane Mamba: 3D Segmentation Adapter

Updated 7 February 2026
  • Tri-Plane Mamba is a parameter- and data-efficient adaptation module that augments SAM’s ViT encoder for enhanced 3D medical image segmentation.
  • It integrates lightweight multi-scale 3D convolutional adapters with a tri-plane state-space model to capture both local and global volumetric context with minimal computational overhead.
  • Empirical results demonstrate state-of-the-art Dice scores on CT and MRI datasets, with robust performance and rapid convergence even under extreme data scarcity.

Tri-Plane Mamba (TP-Mamba) is a parameter- and data-efficient adaptation module designed to augment the Segment Anything Model (SAM) ViT encoder for 3D medical image segmentation. By combining lightweight multi-scale 3D convolutional adapters and a tri-plane Mamba state-space model (SSM) module, TP-Mamba introduces strong volumetric context-awareness while maintaining the computational and memory efficiency required for clinical-scale volumetric imaging benchmarks. This hybrid adaptation architecture achieves state-of-the-art segmentation accuracy on challenging 3D CT and MRI datasets, even under extreme data scarcity, while preserving nearly all pre-trained 2D SAM weights and incurring only minor additional parameter and FLOP overhead (Wang et al., 2024, Shahraki et al., 31 Jan 2026).

1. Motivation and Background

Foundation models such as SAM have demonstrated strong capabilities for 2D image segmentation but are architecturally constrained to process each slice of a 3D medical volume independently, lacking inter-slice contextual modeling. A naive extension of SAM to volumetric data by replacing all 2D operations with their 3D counterparts (e.g., 3D self-attention, 3D convolutions) leads to an intractable increase in computational complexity—specifically, quadratic in the number of patches along the added depth axis, rendering direct 3D attention prohibitively expensive for practical scan sizes (Wang et al., 2024).
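The scaling gap can be made concrete with a small back-of-the-envelope calculation. The patch-grid size below is an illustrative example, not taken from the cited papers:

```python
# Full 3D self-attention costs O((D*H*W)^2) token-pair interactions,
# while a linear-time scan costs O(D*H*W). Example grid size assumed.
def attention_pairs(d, h, w):
    n = d * h * w
    return n * n  # quadratic in the total token count

def linear_tokens(d, h, w):
    return d * h * w  # one pass over the tokens

# A modest 32 x 32 x 32 patch grid already yields 32,768 tokens:
n_quadratic = attention_pairs(32, 32, 32)
n_linear = linear_tokens(32, 32, 32)
print(n_quadratic // n_linear)  # the quadratic term is 32768x larger
```

Every extra factor of two in volume depth doubles the linear cost but quadruples the attention cost, which is why direct 3D attention becomes intractable at clinical scan sizes.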

Standard 3D CNNs can encode local volumetric context but do not capture global dependencies efficiently, and their cost grows cubically with kernel support. Hybrid approaches are needed to enable 3D context modeling that avoids catastrophic increases in memory and computational requirements, while reusing pre-trained 2D representations and requiring only minimal fine-tuning (Shahraki et al., 31 Jan 2026).

2. Architectural Formulation

TP-Mamba augments the frozen SAM ViT encoder with adapter modules, each consisting of both a multi-scale 3D convolutional block and a tri-plane Mamba SSM block. Only the adapters' parameters are trained; the underlying SAM weights remain fixed.

Multi-Scale 3D Convolutional Adapters

Let the input tensor at a given SAM ViT block be $F_\mathrm{in}\in\mathbb{R}^{B\times C\times D\times h\times w}$. The adapter first projects $C \rightarrow r$ channels ($r \ll C$) via a $1\times 1\times 3\times 1\times 1$ convolution along the depth axis, followed by four parallel 3D convolutions with depth-wise dilations $d \in \{1, 2, 4, 8\}$. These efficiently capture multi-scale local depth-wise features while incurring a parameter cost orders of magnitude below that of a full 3D projection (overhead per block $\sim 4\cdot r^2\cdot 3 + 2\cdot r\cdot C\cdot 3$, with $r \sim C/4$) (Wang et al., 2024).
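Plugging representative numbers into the overhead formula makes the savings tangible. The ViT-B channel width $C = 768$ is assumed here for illustration:

```python
# Rough parameter budget for the multi-scale 3D adapter vs. a naive dense
# 3D projection. C = 768 (ViT-B width, an assumption), r = C // 4.
C = 768
r = C // 4  # reduced channel dimension, r << C

# Per-block overhead ~ 4*r^2*3 + 2*r*C*3: four dilated branches with
# kernel size 3 along depth, plus down/up channel projections (per the text).
adapter_params = 4 * r**2 * 3 + 2 * r * C * 3

# A dense 3D projection with a 3x3x3 kernel over the full channel width:
full_3d_params = C * C * 27

print(adapter_params)                   # 1327104  (~1.3M)
print(full_3d_params)                   # 15925248 (~15.9M)
print(full_3d_params / adapter_params)  # 12.0x fewer adapter parameters
```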

Tri-Plane Mamba Module

After local feature extraction, the output volume $F_\mathrm{loc}$ is recast into three orthogonal 2D planes (axial/height–width, coronal/depth–height, sagittal/depth–width) by flattening along each major axis. Each planar token sequence is then processed with a dedicated Mamba SSM block. Mamba, a selective, input-dependent SSM, replaces quadratic-time self-attention with $O(N)$ token processing, enabling efficient modeling of long-range dependencies:

$$h_k = A_k h_{k-1} + B_k x_k, \qquad y_k = C_k h_k.$$

For each plane, the output is reconstituted and fused (by simple summation or concatenation) and then mapped to the target channel dimension. The 3D-aware output is residually summed with the original ViT output, preserving pre-trained 2D features (Wang et al., 2024, Shahraki et al., 31 Jan 2026).
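The plane-wise recasting and summation fusion can be sketched with plain array reshapes. The per-plane Mamba block is replaced by an identity stand-in here (an assumption for illustration), which makes the round-trip verifiable:

```python
import numpy as np

def tri_plane_fuse(x, plane_op=lambda seq: seq):
    """Sketch of tri-plane processing on a (B, C, D, H, W) volume.
    `plane_op` stands in for the per-plane Mamba SSM block (identity
    here); the three planar outputs are fused by simple summation."""
    B, C, D, H, W = x.shape
    outs = []
    # Axis orders that put each plane's two in-plane axes last:
    for perm in [(0, 1, 2, 3, 4),   # axial:    tokens over H*W per depth
                 (0, 1, 4, 2, 3),   # coronal:  tokens over D*H per width
                 (0, 1, 3, 2, 4)]:  # sagittal: tokens over D*W per height
        xp = np.transpose(x, perm)
        seq = xp.reshape(B, C, xp.shape[2], -1)  # flatten in-plane tokens
        seq = plane_op(seq)                      # per-plane SSM stand-in
        outs.append(np.transpose(seq.reshape(xp.shape), np.argsort(perm)))
    return sum(outs)  # summation fusion; residual add happens upstream

vol = np.random.rand(1, 4, 6, 8, 8)
fused = tri_plane_fuse(vol)
print(fused.shape)  # (1, 4, 6, 8, 8) -- volume shape is preserved
```

With the identity stand-in each plane returns the input unchanged, so the fused output is exactly three times the input, confirming the reshape round-trip is lossless.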

3. Integration and Computational Considerations

In each SAM ViT block, adapter modules are injected post-MLP/LayerNorm, following two strategies:

  1. LoRA (Low-Rank Adaptation) adapters inside the Multi-Head Self-Attention (MSA)—injecting minimal trainable parameters to adapt pre-trained representations.
  2. Insertion of the multi-scale 3D convolutional adapter and tri-plane Mamba block as a residual stream.
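Strategy 1 can be sketched in a few lines. The widths and rank below are illustrative assumptions, not values from the cited papers:

```python
import numpy as np

rng = np.random.default_rng(0)
C, rank = 64, 4  # example projection width and LoRA rank (assumed)

W_frozen = rng.standard_normal((C, C))     # pre-trained SAM projection, frozen
A = rng.standard_normal((rank, C)) * 0.01  # trainable LoRA down-projection
B = np.zeros((C, rank))                    # trainable up-projection, zero-init

def lora_forward(x):
    """Adapted projection: frozen weight plus low-rank trainable delta B @ A."""
    return x @ W_frozen.T + x @ (B @ A).T

x = rng.standard_normal((10, C))
# With B zero-initialized, the adapted layer reproduces the frozen layer
# exactly, so fine-tuning starts from the pre-trained behaviour.
print(np.allclose(lora_forward(x), x @ W_frozen.T))  # True
```

Only $A$ and $B$ are trained, which is what keeps the injected parameter count minimal relative to the $C \times C$ frozen weight.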

The SAM mask decoder remains unchanged, receiving input features now endowed with 3D volumetric context. The additional computational burden of TP-Mamba is marginal (∼0.5 GFlops per block for the full adapter module), compared to ∼18.9 GFlops for a full 3D-ViT block (Wang et al., 2024). Memory and computational requirements are thus compatible with large-scale clinical datasets, supporting efficient batched inference and training protocols (e.g., batch sizes fitted to 40–100 GB GPU memory) (Shahraki et al., 31 Jan 2026).

4. State-Space Model Details and Frequency-Aware Extensions

The Mamba SSM, parameterized as input-dependent (“selective state”) linear recurrences, is discretized as:

$$h_k = \bar{A}\,h_{k-1} + \bar{B}\,x_k, \qquad y_k = C\,h_k + D\,x_k,$$

and extended, per TP-Mamba, to model inter-slice dependencies in all three anatomical planes. A key advantage is strictly linear complexity in sequence length versus the $O((DHW)^2)$ scaling of full 3D attention.
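The discretized recurrence above translates directly into a linear-time scan. The sketch below uses a scalar state for readability; in Mamba proper, $\bar{A}$ and $\bar{B}$ are input-dependent ("selective") and the state is vector-valued:

```python
def ssm_scan(x, A_bar, B_bar, C, D):
    """Linear-time scan of the discretized SSM recurrence
    h_k = A_bar * h_{k-1} + B_bar * x_k,  y_k = C * h_k + D * x_k.
    Scalar state for illustration only."""
    h, ys = 0.0, []
    for xk in x:            # one O(1) update per token -> O(N) overall
        h = A_bar * h + B_bar * xk
        ys.append(C * h + D * xk)
    return ys

# Toy example with a constant input sequence:
print(ssm_scan([1.0, 1.0, 1.0], A_bar=0.5, B_bar=1.0, C=1.0, D=0.0))
# [1.0, 1.5, 1.75]
```

Each token touches the hidden state exactly once, which is the source of the $O(N)$ complexity contrasted with quadratic attention above.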

The TP-MFGC (Tri-Plane Multi-Frequency Gated Convolution) modification replaces the local 3D CNN context path with a joint spatial-frequency analysis, leveraging the 3D Discrete Cosine Transform (DCT) to manipulate local and harmonic components with adaptive channel- and frequency-domain gating. Gates are generated from pooled DCT coefficients and applied prior to reconstruction, enhancing feature discrimination without significant runtime penalties (Shahraki et al., 31 Jan 2026).
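A minimal sketch of frequency-domain gating follows. For brevity it applies a 1D DCT along the depth axis only, with a per-frequency gate vector; both simplifications are assumptions, since TP-MFGC uses a full 3D DCT with channel- and frequency-wise gates:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n matrix."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    M = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    M[0] /= np.sqrt(2.0)  # orthonormal scaling of the DC row
    return M

def frequency_gated(x, gate_logits):
    """Gate a (C, D, H, W) feature map in the depth-frequency domain:
    DCT over depth, sigmoid gate per frequency, inverse DCT.
    `gate_logits` (shape (D,)) stands in for a learned gating branch."""
    C, D, H, W = x.shape
    M = dct_matrix(D)
    coeffs = np.einsum('kd,cdhw->ckhw', M, x)       # depth-axis DCT
    gate = 1.0 / (1.0 + np.exp(-gate_logits))       # sigmoid gating
    coeffs = coeffs * gate[None, :, None, None]     # scale each frequency
    return np.einsum('dk,ckhw->cdhw', M.T, coeffs)  # inverse (orthonormal)

x = np.random.rand(2, 8, 4, 4)
# Saturated gates (logits >> 0) pass every frequency, recovering the input;
# negative logits would attenuate the corresponding harmonics instead.
y = frequency_gated(x, gate_logits=np.full(8, 20.0))
print(np.allclose(y, x))  # True
```

Because the DCT basis is orthonormal, the transform pair is lossless, and the gate is the only place where features are modified, mirroring the "gate, then reconstruct" order described above.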

5. Empirical Results and Data Efficiency

On the BTCV abdominal CT dataset (Wang et al., 2024), TP-Mamba achieves superior Dice segmentation scores across all data regimes. With full data (30 volumes), the average Dice is 84.8% (+1.7% over the best conventional 3D network); with only three training samples (12%), it remains 12.3% higher (65.8% vs. 53.5%). TP-Mamba converges to optimal performance within 200 epochs, five times faster than baselines.

On the ACDC cardiac MRI dataset (Shahraki et al., 31 Jan 2026), the adapter-based TP-MFGC variant achieves 0.880 mean Dice at 4.77 volumes/second inference speed with moderate VRAM requirements (∼12.6 GB), while the LoRA-only adapter achieves 0.796 mean Dice and minimal memory load (∼1.9 GB). The dual-branch MambaSAM-Base, an alternative that fuses SAM and VMamba features via cross-attention, attains the top mean Dice (0.906) but at reduced inference throughput.

6. Implications, Limitations, and Extensions

TP-Mamba demonstrates that coupling parameter-efficient adapter designs with SSM-based long-range context modeling can bridge the gap between 2D foundation models and 3D clinical imaging needs. Over 99% of SAM weights remain fixed, with only ∼2–20% of parameters tuned (depending on variant), yielding strong anatomical segmentation—even in extreme few-shot or low-annotation regimes.

Limitations include customization of the image encoder only; the prompt encoder and mask decoder remain unmodified. Volumes with highly anisotropic axes or very deep slice stacks may require further adaptation, such as scale-adaptive tri-plane strategies. This suggests that further hybridization or prompt-aware adaptation may enhance interactive performance and generality in clinical scenarios (Wang et al., 2024, Shahraki et al., 31 Jan 2026).

7. Comparative Summary

| Variant | Trainable Params | Mean Dice | Inference Speed | Memory Use (GB) |
|---|---|---|---|---|
| TP-MFGC | ~22.9M (20%) | 0.880 | 4.77 FPS | ~12.6 |
| LoRA-only Adapter | minimal | 0.796 | 0.72 FPS | ~1.9 |
| Dual-branch Base | ~22.9M | 0.906 | 2.78 FPS | not specified |

The empirical evidence establishes TP-Mamba as an effective, scalable, and resource-conscious approach to 3D segmentation, leveraging the architectural and weight-efficiency of SSM augmentation and multi-scale local/global fusion strategies (Wang et al., 2024, Shahraki et al., 31 Jan 2026).
