Spatial Mamba Vision Models
- Spatial Mamba is a family of vision models that leverages morphological operations, state-space models, and attention mechanisms for efficient spatial representation learning.
- The models utilize patch-based tokenization, center-based gating, and linear-complexity recurrence to capture both local and long-range dependencies while preserving structural boundaries.
- Their applications span hyperspectral image classification and other high-dimensional visual tasks, achieving state-of-the-art performance with a fraction of the parameters of traditional architectures.
Spatial Mamba (SMamba) refers to a family of vision models that employ selective state-space architectures—often combining classical mathematical morphology, state-space modeling, and linear-complexity recurrence—for efficient and robust spatial representation learning in high-dimensional data. Originally proposed for patch-based hyperspectral image (HSI) classification, Spatial Mamba architectures have since been extended to broader visual tasks requiring 2D (or higher-dimensional) long-range context modeling, often operating with far fewer parameters and linear, rather than quadratic, computational complexity relative to Transformer-style self-attention (Ahmad et al., 2024, Rahman et al., 2024, Li et al., 9 Jan 2025). SMamba models are distinct in fusing morphological priors (through erosion and dilation), structured state-space models (SSMs), and learnable feature fusion, enabling accurate large-context classification at scale.
1. Architectural Overview and Core Pipeline
Spatial Mamba models for HSI typically process an input data cube patch-by-patch, converting each patch into a tokenized representation amenable to both local and long-range spatial reasoning. The canonical SMamba (as realized in MorpMamba) follows the pipeline below (Ahmad et al., 2024):
- Patch Extraction: Extract overlapping or non-overlapping 3D patches of shape $P \times P \times B$ (spatial height, spatial width, and spectral bands).
- Token Generation: Each patch is simultaneously decomposed into spatial and spectral ‘views’ via morphological operation streams.
- Morphology Block: Apply 2D depthwise separable convolutions to each view to perform mathematical erosion and dilation, capturing shape and boundary priors.
- Token Enhancement: Re-weight tokens based on global context using a learnable “center” token as a gating mechanism.
- Multi-Head Self-Attention: Independently enrich spatial and spectral tokens with Transformer-style self-attention, capturing complex intra-block relationships.
- State Space Module (SSM): Model global sequence dependencies with a recurrent SSM across the enhanced token stream, achieving the effect of long-range receptive fields.
- Classification Head: The final representation is passed through normalization and a linear classifier, with a softmax yielding class probabilities.
This approach is encapsulated in the following dataflow:
Input HSI patch → {Spatial, Spectral morphological streams} → Erosion/Dilation → Tokenization → Center-based gating → Self-Attention → SSM Block → Classification (Ahmad et al., 2024).
2. Morphological Tokenization and Feature Encoding
The morphological token generation module is a signature innovation of SMamba models (Ahmad et al., 2024). Tokens are generated as follows:
- Spatial Stream: Treats the $P \times P$ patch dimensions as spatial axes, applying 2D channel-wise (depthwise) convolutions to perform grayscale erosion and dilation:

$$\text{erosion}(X)(i,j) = \min_{(u,v) \in W} \big[ X(i+u,\, j+v) - SE(u,v) \big], \qquad \text{dilation}(X)(i,j) = \max_{(u,v) \in W} \big[ X(i+u,\, j+v) + SE(u,v) \big],$$

where $SE$ is the structuring element encoded in the convolution kernel and $W$ is its support window.
- Spectral Stream: Transposes the patch so that the spectral axis is treated spatially (reshaping $P \times P \times B$ to $B \times P \times P$), applies the same erosions and dilations, and projects with a 1×1 depthwise separable convolution back to token space.
- Token Projection: The concatenated outputs of erosion and dilation are passed through a learnable depthwise convolution to produce the spatial or spectral token tensor, respectively.
The design preserves intrinsic spatial boundaries and enhances noise-robustness, providing a strong inductive bias for structured image domains.
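The erosion and dilation streams can be illustrated with a plain grayscale-morphology reference in NumPy. This is a sketch only: a flat 3×3 structuring element stands in for the learned depthwise kernel of the actual model.

```python
import numpy as np

def grayscale_morphology(x, se):
    """Grayscale erosion and dilation of a 2D array with structuring
    element `se` (a stand-in for the learned depthwise kernel)."""
    k = se.shape[0]
    pad = k // 2
    xp = np.pad(x, pad, mode="edge")
    eroded = np.empty_like(x, dtype=float)
    dilated = np.empty_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            window = xp[i:i + k, j:j + k]
            eroded[i, j] = np.min(window - se)   # erosion: min over the window
            dilated[i, j] = np.max(window + se)  # dilation: max over the window
    return eroded, dilated

# A 3x3 square of ones inside a 5x5 patch.
x = np.array([[0., 0., 0., 0., 0.],
              [0., 1., 1., 1., 0.],
              [0., 1., 1., 1., 0.],
              [0., 1., 1., 1., 0.],
              [0., 0., 0., 0., 0.]])
se = np.zeros((3, 3))  # flat structuring element
eroded, dilated = grayscale_morphology(x, se)
print(int(eroded.sum()), int(dilated.sum()))  # → 1 25
```

Erosion shrinks the square to its single interior pixel while dilation grows it to fill the patch, which is exactly the boundary-sensitive behavior the token generator exploits.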
3. Token Enhancement, Attention, and State-Space Modeling
After feature extraction, tokens are adapted by gating with a learned center token, e.g. of the form

$$\tilde{T}_{\text{spa}} = T_{\text{spa}} \odot \sigma(t_c),$$

where $t_c$ is the center token and $\sigma$ a sigmoid gate, with an analogous equation for the spectral branch. This approach leverages the importance of the patch center for robust feature fusion (Ahmad et al., 2024).
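A minimal sketch of center-token gating, assuming a sigmoid-gate form (the learned projection on the center token is omitted here):

```python
import numpy as np

def center_gate(tokens):
    """Gate every token with a sigmoid of the patch's center token.

    tokens: (n_tokens, d) array; the center token acts as a global
    context probe that re-weights all feature channels.
    """
    center = tokens[tokens.shape[0] // 2]   # center token of the patch
    gate = 1.0 / (1.0 + np.exp(-center))    # sigmoid gate in (0, 1)
    return tokens * gate                    # broadcast gate over all tokens

rng = np.random.default_rng(1)
tokens = rng.standard_normal((9, 4))
gated = center_gate(tokens)
print(gated.shape)  # → (9, 4)
```

Because the gate lies in (0, 1), every channel is attenuated in proportion to how strongly the center token activates it.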
Multi-head self-attention is then performed per token stream, using the standard scaled dot-product form:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V.$$

All attention heads are concatenated and projected, providing contextually enriched representations prior to state-space modeling.
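The per-stream attention can be illustrated with a bare NumPy multi-head self-attention; the learned Q/K/V and output projections are omitted for brevity (identity projections stand in for them):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(tokens, n_heads=4):
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    n, d = tokens.shape
    d_h = d // n_heads
    heads = []
    for h in range(n_heads):
        x = tokens[:, h * d_h:(h + 1) * d_h]   # per-head slice as Q = K = V
        scores = x @ x.T / np.sqrt(d_h)        # (n, n) attention map
        heads.append(softmax(scores) @ x)      # weighted sum of values
    return np.concatenate(heads, axis=-1)      # concat heads back to (n, d)

rng = np.random.default_rng(0)
out = multi_head_self_attention(rng.standard_normal((9, 32)))
print(out.shape)  # → (9, 32)
```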
The SSM block subsequently updates a hidden state recurrently over the token sequence:

$$h_t = \bar{A} h_{t-1} + \bar{B} x_t, \qquad y_t = C h_t,$$

where $\bar{A}$ and $\bar{B}$ are the discretized state matrices. The SSM efficiently models dependencies with linear complexity in the sequence length, in contrast to the quadratic scaling of attention.
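The recurrence amounts to a single scan over the sequence. In the sketch below the matrices are random, fixed stand-ins; in Mamba proper they are learned and input-dependent:

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Discretized linear state-space recurrence:
        h_t = A @ h_{t-1} + B @ x_t
        y_t = C @ h_t
    One pass over the sequence: O(T) time, O(1) state memory in T."""
    T, _ = x.shape
    h = np.zeros(A.shape[0])
    y = np.empty((T, C.shape[0]))
    for t in range(T):
        h = A @ h + B @ x[t]
        y[t] = C @ h
    return y

rng = np.random.default_rng(0)
T, d_in, n, d_out = 16, 8, 4, 8
A = 0.9 * np.eye(n)                       # stable state transition
B = rng.standard_normal((n, d_in)) * 0.1  # input projection
C = rng.standard_normal((d_out, n))       # readout
y = ssm_scan(rng.standard_normal((T, d_in)), A, B, C)
print(y.shape)  # → (16, 8)
```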
4. Computational Efficiency: Linear Complexity
A central contribution of SMamba is its strictly linear scaling with respect to the sequence length $T$. At each step:
- Morphological and attention blocks: $\mathcal{O}(T)$ per layer, since attention operates over fixed-size patch token sets rather than the full sequence.
- SSM block: $\mathcal{O}(T)$.
In total, the $\mathcal{O}(T)$ term dominates, compared to pure Transformer models at $\mathcal{O}(T^2)$ due to the quadratic cost of forming attention maps (Ahmad et al., 2024, Rahman et al., 2024). As a result, SMamba remains computationally efficient even as the patch or band count grows, making it suitable for high-resolution HSI and dense vision tasks.
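The scaling gap can be made concrete with illustrative operation counts (the constants below are arbitrary; only the growth in sequence length matters):

```python
# Dominant-cost comparison: an SSM layer performs T state updates,
# while full self-attention materializes a T x T score matrix.
def ssm_ops(T, n):
    return T * n * n       # T updates of an n-dim state

def attn_ops(T, d):
    return T * T * d       # T x T attention map over d-dim tokens

for T in (64, 256, 1024):
    print(T, ssm_ops(T, n=16), attn_ops(T, d=16))
```

Doubling the sequence length doubles the SSM cost but quadruples the attention cost, so the gap widens quickly at HSI-scale sequence lengths.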
5. Empirical Results and Comparative Performance
SMamba models (MorpMamba) establish state-of-the-art or near-SOTA accuracy on widely used HSI benchmarks (UH, LK, PU, PC, SA) (Ahmad et al., 2024). Representative results at a 20% training ratio and a fixed patch size:
| Dataset | 3D CNN OA (%) | SMamba OA (%) | 3D CNN Params | SMamba Params |
|---|---|---|---|---|
| UH | 99.01 | 98.28 | 4,042,751 | 67,013 |
| LK | 99.81 | 99.70 | 4,041,977 | 66,239 |
| PU | 98.70 | 97.67 | — | — |
| PC | 99.87 | 99.71 | — | — |
| SA | 98.86 | 98.52 | — | — |
Notably, SMamba achieves comparable or better accuracy with a parameter count one to two orders of magnitude lower than CNN or Transformer baselines, affirming both the representational and efficiency advantages of the framework. Training and inference time scale linearly with patch size, in line with theoretical predictions (Ahmad et al., 2024).
6. Practical Advantages and Applicability
- Linear Complexity: Owing to its SSM core and efficient tokenization, SMamba is intrinsically more scalable than convolutional or self-attention backbones for long-sequence visual tasks (Ahmad et al., 2024, Rahman et al., 2024).
- Morphological Operations: The incorporation of mathematical morphology enhances robustness to image noise and preserves structural information, particularly valuable for real-world remote sensing and biomedical HSI data.
- Adaptive Fusion: The center-based gating and per-block attention allow for dynamic fusion of spatial and spectral cues, improving representation power across diverse scenes (Ahmad et al., 2024).
- SOTA Efficiency: Model size and FLOPs are dramatically reduced, making SMamba suitable for deployment in low-resource environments.
- Extensibility: The modular architecture admits integration with additional feature enhancement (token gating, mixture-of-experts) and domain-specific priors (Rahman et al., 2024, Li et al., 9 Jan 2025).
7. Position within the Mamba and State-Space Model Ecosystem
Spatial Mamba constitutes one of several variants targeted at visual sequence modeling, distinguished from other SSM-based approaches (e.g., patchwise SSM, bidirectional state updates, structure-aware fusion) by its explicit use of morphological preprocessing and structured spatial/spectral tokenization (Rahman et al., 2024). While scanning-order artifacts and reduced pre-training availability remain open challenges, SMamba’s unique blend of morphologically-informed feature extraction and linear state-space modeling provides a highly competitive backbone for hyperspectral and general image analysis.
8. Concluding Summary
Spatial Mamba (SMamba) synthesizes the computational efficiency of state-space models, the inductive bias of mathematical morphology, and the context-capturing power of attention to deliver lightweight but high-performing classifiers for hyperspectral image analysis and other high-dimensional vision domains. The architecture's linear complexity, adaptive fusion mechanisms, and robust tokenization strategies collectively set a new benchmark for efficient large-context modeling, as evidenced by SOTA results across multiple public benchmarks and proven scalability with increasing spatial or spectral data dimensionality (Ahmad et al., 2024, Rahman et al., 2024).
References:
- (Ahmad et al., 2024): Spatial and Spatial-Spectral Morphological Mamba for Hyperspectral Image Classification
- (Rahman et al., 2024): Mamba in Vision: A Comprehensive Survey of Techniques and Applications
- (Li et al., 9 Jan 2025): MambaHSI: Spatial-Spectral Mamba for Hyperspectral Image Classification