Saliency Mamba: SSM-Powered Saliency Models
- Saliency Mamba is a family of state-space models for saliency detection that leverages linear-time recurrences to capture long-range dependencies.
- Innovations such as the Saliency-Guided Mamba Block and Context-Aware Upsampling enhance spatial scanning and refine feature resolution.
- The framework demonstrates efficiency and adaptability across diverse applications, including image SOD, EEG representation, and mesh saliency prediction.
Saliency Mamba (Samba) refers to a family of models and architectural approaches that employ state-space models—specifically, the Mamba framework—to predict and analyze visual saliency, attention, or related cognitively relevant signals across vision, multimodal, and temporal domains. While the term "Samba" appears in multiple related works, the most comprehensive and unified version is Samba+ (Zhao et al., 2 Feb 2026), a general Mamba-based salient object detection (SOD) framework. Distinct architectural variants under the "Saliency Mamba" designation have also been proposed for image-based SOD (Zhao et al., 2 Feb 2026), lightweight driver-attention estimation (Zhao et al., 22 Feb 2025), EEG representation learning (Hong et al., 23 Nov 2025), multi-focal light-field salient object detection (Liu et al., 2024), mesh saliency prediction (Zhang et al., 2 Apr 2025), and temporal video grounding (Li et al., 2024). This article focuses on the methodologies, algorithmic innovations, empirical outcomes, and the role of Mamba SSMs in driving recent advances in saliency modeling under the "Samba" paradigm.
1. Foundations: State-Space Models and the Mamba Backbone
Saliency Mamba architectures are grounded in linear-time state-space models (SSMs), with the core dynamical system given by

$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t),$$

where the state $h(t)$ is updated via the input $x(t)$ and the output map $C$ projects the current state to the observed signal $y(t)$. Zero-order-hold discretization with step size $\Delta$ yields the efficient recurrence

$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t, \qquad \bar{A} = \exp(\Delta A), \quad \bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\Delta B,$$

with parameters learned and potentially modulated by the input itself. The Mamba architecture further makes $B$, $C$, and $\Delta$ input-dependent, introducing selective gating and modulation of the system matrices to enable flexible, adaptive sequence modeling at linear runtime cost.
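The discretized recurrence above can be illustrated with a toy one-dimensional scan. This is a minimal sketch, not the actual Mamba implementation: it assumes a scalar state, fixes $C = 1$, and stands in for Mamba's learned $\Delta$ projection with a simple softplus of the input.

```python
import math

def selective_ssm_scan(u, a=-1.0, dt_bias=0.1):
    """Toy 1-D selective SSM scan with zero-order-hold discretization.

    `a` plays the role of the (scalar) state matrix A; the step size
    Delta is input-dependent, mimicking Mamba's selectivity. Illustrative
    only -- real Mamba uses learned projections and multi-channel states.
    """
    h, ys = 0.0, []
    for u_t in u:
        # Input-dependent step size (softplus keeps it positive), a
        # hypothetical stand-in for Mamba's learned Delta projection.
        delta = math.log1p(math.exp(u_t)) + dt_bias
        a_bar = math.exp(delta * a)        # ZOH: A_bar = exp(Delta * A)
        b_bar = (a_bar - 1.0) / a          # ZOH B_bar for B = 1 (scalar case)
        h = a_bar * h + b_bar * u_t        # linear-time state update
        ys.append(h)                       # C = 1: output equals the state
    return ys
```

Each step costs O(1), so the whole scan is linear in sequence length, which is the efficiency property the Samba models rely on.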
In Samba and its variants, SSMs replace conventional convolutional (CNN) or Transformer-based self-attention modules as the principal global context modeling mechanism, enabling both efficiency and long-range dependency capture. This SSM backbone is leveraged in 2D images (patch-wise scan), multichannel time series (EEG), temporal video, graph-structured mesh data, and cross-modal multi-stream settings (Zhao et al., 2 Feb 2026, Zhao et al., 22 Feb 2025, Hong et al., 23 Nov 2025, Liu et al., 2024, Zhang et al., 2 Apr 2025, Li et al., 2024).
2. Saliency-Guided Modeling and Architectural Components
Saliency-Guided Mamba Block (SGMB) and Spatial Neighborhood Scanning (SNS)
A fundamental innovation in Samba+ (Zhao et al., 2 Feb 2026) is the Saliency-Guided Mamba Block (SGMB), which introduces a spatial neighborhood scanning (SNS) procedure to align the flattened SSM input sequence with the contiguous layout of salient regions in images. SNS constructs patch orderings that traverse salient areas coherently, preserving spatial structure when the model operates on vectorized feature representations. Multiple scan variants (directional, reverse) are stacked to increase robustness before merging feature outputs.
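The idea of aligning the flattened sequence with spatial layout can be sketched with a serpentine (boustrophedon) patch ordering, in which consecutive sequence positions are always spatial neighbors. This is a simplified stand-in for SNS, which additionally orders the traversal around salient regions:

```python
def serpentine_scan(height, width):
    """Return a patch ordering that alternates row direction, so every
    pair of consecutive sequence positions is spatially adjacent.

    A simplified stand-in for Samba+'s spatial neighborhood scanning
    (SNS); the paper's scheme additionally prioritises salient regions
    and stacks directional/reverse variants.
    """
    order = []
    for r in range(height):
        # Even rows scan left-to-right, odd rows right-to-left, so the
        # flattened sequence never "jumps" across the image.
        cols = range(width) if r % 2 == 0 else range(width - 1, -1, -1)
        order.extend((r, c) for c in cols)
    return order
```

Running `serpentine_scan(2, 3)` yields `[(0,0), (0,1), (0,2), (1,2), (1,1), (1,0)]`; a plain row-major flattening would instead jump from `(0,2)` to `(1,0)`, breaking spatial contiguity at every row boundary.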
Context-Aware Upsampling (CAU)
Standard upsampling can introduce spatial artifacts and lose contextual detail. Samba's CAU mechanism combines features from low- and high-resolution decoder stages by grouping local neighborhoods and processing them jointly via an SSM, allowing context-aware super-resolution through learned causal prediction. Merged outputs reconstruct high-frequency saliency that aligns with local structure (Zhao et al., 2 Feb 2026).
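The grouping step can be illustrated with a minimal sketch: each low-resolution feature is expanded to the high-resolution grid and combined with the co-located high-resolution features. In the actual CAU the grouped neighborhood is refined by an SSM rather than the fixed blend used here for clarity; `mix` is a hypothetical parameter.

```python
def context_aware_upsample(low, high, mix=0.5):
    """Sketch of context-aware upsampling on 2D feature grids.

    Each low-res value is paired with its co-located 2x2 block of
    high-res values; here the pair is merged with a fixed blend, whereas
    Samba+'s CAU processes the grouped neighbourhood with an SSM to
    predict the merged output.
    """
    H2, W2 = len(high), len(high[0])
    out = []
    for r in range(H2):
        row = []
        for c in range(W2):
            # Group low-res context (r//2, c//2) with high-res detail (r, c).
            row.append(mix * low[r // 2][c // 2] + (1 - mix) * high[r][c])
        out.append(row)
    return out
```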
Multi-Modal Fusion: Hub-and-Spoke Graph Attention (HGA)
For multi-modality or multi-stream inputs (e.g., RGB, RGB-D, RGB-T, video, thermal), Samba+ employs the HGA module: each input is treated as a "spoke," fusing through a shared "hub" node with graph-attention layers derived from GATv2 [Brody et al., 2021]. The hub state aggregates information from all modalities and is projected into the bottleneck representation for unified SOD prediction.
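The hub-and-spoke aggregation pattern can be sketched as an attention-weighted sum over per-modality features. In Samba+ the attention scores come from GATv2 layers; here they are supplied directly (or uniform), so this is a structural illustration only.

```python
import math

def hub_aggregate(spokes, scores=None):
    """Hub-and-spoke fusion sketch: the hub state is a softmax-weighted
    sum of per-modality ('spoke') feature vectors.

    In Samba+'s HGA the scores are produced by learned GATv2 attention;
    here they are passed in explicitly (uniform if omitted).
    """
    if scores is None:
        scores = [0.0] * len(spokes)          # uniform attention
    m = max(scores)                           # stabilise the softmax
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    w = [x / z for x in w]
    dim = len(spokes[0])
    # Hub state: convex combination of the spoke features.
    return [sum(w[i] * spokes[i][d] for i in range(len(spokes)))
            for d in range(dim)]
```

With uniform scores the hub is the mean of the modalities; a dominant score makes the hub follow that modality, which is how attention lets the fusion adapt per input.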
Modality-Anchored Continual Learning (MACL)
Samba+ addresses catastrophic forgetting and inter-modal conflict in continual, multi-task training through MACL. Training is staged: first, an anchor backbone is established with RGB data, then auxiliary modalities are incorporated with replay and anchor weight regularization, and finally a full multi-modal curriculum imposes both anchor and distillation losses to ensure knowledge retention and stable adaptation (Zhao et al., 2 Feb 2026).
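The anchor-regularization idea can be sketched as a penalty on drift from the RGB-trained anchor weights. This is a minimal conceptual sketch, not the paper's objective: MACL additionally combines replay and distillation terms, and `lam` is a hypothetical weighting.

```python
def anchor_regularized_loss(task_loss, weights, anchor_weights, lam=0.1):
    """Conceptual MACL-style objective: the current task loss plus an
    L2 penalty on drift from the frozen anchor (RGB-stage) weights.

    A minimal sketch; Samba+'s full curriculum also uses replay buffers
    and distillation losses to prevent catastrophic forgetting.
    """
    drift = sum((w - a) ** 2 for w, a in zip(weights, anchor_weights))
    return task_loss + lam * drift
```

When the model has not moved from the anchor, the penalty vanishes and only the task loss remains; as auxiliary modalities pull the weights away, the drift term resists forgetting of the anchor-stage knowledge.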
3. Unified and Specialized Samba Architectures
Although Samba+ is the most general incarnation, numerous application-specific and architectural variants implement the Saliency Mamba principle:
- SUM (Saliency Unification through Mamba) (Hosseini et al., 2024): Unified visual attention modeling across domains (natural scenes, e-commerce, UI, web) via a patchwise U-Net backbone with Conditional Visual State Space (C-VSS) decoder blocks. Learns lightweight prompt tokens per domain.
- SalM² (Zhao et al., 22 Feb 2025): An extremely lightweight (0.08M parameters) dual-branch Mamba CLIP network for real-time driver-attention prediction, fusing bottom-up (image) and top-down (semantic) information via cross-modal attention.
- SAMba-UNet (Huo et al., 22 May 2025): Combines a frozen foundation model (SAM2) and Mamba-UNet dual encoder, with a heterogeneous attention fusion module for boundary-sensitive medical image segmentation.
- LFSamba (Liu et al., 2024): Adapts Mamba SSMs to multi-focus light-field imaging by inter-slice and inter-modal fusion for salient object detection, supporting both full and scribble-supervised training.
- SAMBA (EEG) (Hong et al., 23 Nov 2025): A U-shaped Mamba model for long-sequence EEG, with 3D spatial-adaptive input embedding and a Multi-Head Differential Mamba module that suppresses redundancy and highlights salient temporal phenomena.
- Mesh Mamba (Zhang et al., 2 Apr 2025): Models mesh saliency using SSMs over topology-preserving subgraph embeddings, unifying geometric and texture cues.
- SpikeMba (Li et al., 2024): Integrates spiking neural networks (SNNs) as a saliency detector with SSM-based contextual reasoning for temporal video grounding.
4. Mathematical and Algorithmic Foundations
Core mathematical innovations underlying Samba and related models include:
- SSM Recurrence: Adopted for both global context modeling and fine-grained temporal/structural signal propagation, with parameters often modulated by input or auxiliary signals.
- Saliency-Informed Sequence Construction: SNS or analogous procedures permute state-space sequence order to encode spatial adjacency or salient regions.
- Task-Specific Losses: Segmentation, attention, or saliency maps are predicted via combinations of binary cross-entropy, weighted IoU, Dice, KL, CC, SIM, and MSE losses. To reconcile heterogeneous domains, composite objectives with tailored weighting are used.
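Two of the listed terms can be combined in a small sketch: per-pixel binary cross-entropy for the segmentation-style objective plus a KL divergence between the normalized predicted and ground-truth attention distributions. The weights and the flattened-map representation are illustrative assumptions, not the papers' settings.

```python
import math

def saliency_loss(pred, target, w_bce=1.0, w_kl=1.0, eps=1e-8):
    """Composite saliency objective sketch on flattened maps in [0, 1].

    Combines mean per-pixel BCE with KL divergence between the
    normalised attention distributions; the weighting is illustrative
    of the composite objectives used to reconcile heterogeneous domains.
    """
    n = len(pred)
    # Segmentation-style term: mean binary cross-entropy.
    bce = -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
               for p, t in zip(pred, target)) / n
    # Distribution term: KL between normalised target and prediction.
    ps, ts = sum(pred) + eps, sum(target) + eps
    kl = sum((t / ts) * math.log((t / ts + eps) / (p / ps + eps))
             for p, t in zip(pred, target) if t > 0)
    return w_bce * bce + w_kl * kl
```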
- Unification via Conditional Modulation: Prompt tokens and conditional normalization in SUM and others allow domain-specific adaptation in a single parameter set.
- Differential and Phase-Aware Attention: In motion generation (T2M Mamba (Zhan et al., 1 Feb 2026)), coupling saliency weights, phase encodings, and differential rotary attention modules achieves direct control over event/periodicity alignment.
5. Empirical Results and Benchmarks
Quantitative results consistently place Samba-based models at state-of-the-art or competitive levels with significantly reduced parameter counts or computational cost:
| Model/Paper | Task/Domain | Key Metric(s) | Standout Result (example) |
|---|---|---|---|
| Samba+ (Zhao et al., 2 Feb 2026) | General SOD (6 tasks) | S-measure, F-measure, MAE | Leading results on VDT-2048 |
| SUM (Hosseini et al., 2024) | Visual Attention, 6 datasets | CC, KLD, NSS, SIM, AUC | SOTA in 27/30 metrics |
| SalM² (Zhao et al., 22 Feb 2025) | Driver Attention | AUC, CC, SIM, KLD, NSS | 0.09–11% params of SOTA, 98% performance |
| SAMBA (Hong et al., 23 Nov 2025) | EEG representation | ACC, AUROC | Up to +10–16% ACC boost in transfer |
| SAMba-UNet (Huo et al., 22 May 2025) | Cardiac MRI Segmentation | mDice, HD95 | mDice 0.9103, outperforming best prior |
| LFSamba (Liu et al., 2024) | Light-Field SOD | S-measure, F-measure, MAE | MAE = 0.039 on LFSD |
| Mesh Mamba (Zhang et al., 2 Apr 2025) | 3D Mesh Saliency | CC, SIM, KLD, SE | CC 0.6763 vs prev. best 0.6181 |
Ablation studies demonstrate the efficacy of saliency-guided encoding (SGMB/SNS), context-aware upsampling (CAU), and continual learning (MACL): removing or replacing these modules causes significant metric degradation across datasets and modalities (Zhao et al., 2 Feb 2026). In lightweight contexts (SalM²), model sizes are reduced by up to two orders of magnitude with no substantial performance loss (Zhao et al., 22 Feb 2025).
6. Applications and Generalization
- Image and Video SOD: Samba+ sets new accuracy–efficiency tradeoffs across classic, RGB-D, thermal, video, and unified tasks (Zhao et al., 2 Feb 2026).
- Unified Multimodal Saliency: Single trained models robustly process arbitrary modality combinations (e.g., Samba+ on VDT SOD), and transfer to camouflaged object detection and medical segmentation without retraining (Zhao et al., 2 Feb 2026).
- Time-Series and Non-Euclidean Structures: EEG foundation modeling (SAMBA (Hong et al., 23 Nov 2025)), mesh saliency (Mesh Mamba (Zhang et al., 2 Apr 2025)), and temporal video understanding (SpikeMba (Li et al., 2024)) demonstrate the breadth of SSM-based saliency approaches.
- Real-Time and Embedded Systems: The efficiency of Mamba blocks enables deployment in real-time, low-latency environments such as driver monitoring (SalM²) and ultrasound navigation (UltrasODM, not detailed here).
7. Limitations and Prospects
Despite unification and parameter efficiency, label diversity remains a prerequisite—SUM and related models still require annotated data from all domains for optimal generalization (Hosseini et al., 2024). Extending zero-shot generalization to unseen modalities and further reducing latency via sparse or mixed-precision SSMs are identified as critical future directions. Spatiotemporal extensions and adaptive scanning schemes promise improved transfer to video and beyond. The Samba line of research establishes SSM-based architectures as a highly flexible backbone for both specialized and universal saliency modeling across diverse application regimes (Zhao et al., 2 Feb 2026, Hosseini et al., 2024, Zhao et al., 22 Feb 2025, Hong et al., 23 Nov 2025, Liu et al., 2024, Zhang et al., 2 Apr 2025, Li et al., 2024).