
MambAttention: Efficient Hybrid Attention

Updated 17 February 2026
  • MambAttention is a class of attention-like mechanisms that leverage Mamba selective state-space models to efficiently capture both local and long-range dependencies via a selective scan.
  • It integrates multi-head and multi-scale attention modules—such as A2Mamba and MambaCAFU—to combine explicit spatial, temporal, and cross-modal features for enhanced performance.
  • The approach demonstrates linear scaling with reduced computational overhead, while providing interpretable attention matrices and robust empirical results across vision, audio, and sequential tasks.

MambAttention refers to a general class of attention-like mechanisms built on top of or in concert with the Mamba selective state-space model (SSM), which has emerged as a scalable alternative to the Transformer self-attention mechanism. MambAttention encompasses a spectrum of architectures that integrate Mamba blocks and attention operations, either by reinterpreting Mamba’s intrinsic recurrence as implicit attention, fusing Mamba modules with explicit multi-head attention, augmenting SSMs with multi-scale or cross-modal affinity maps, or hybridizing SSMs and attention mechanisms for domain-specific tasks in vision, audio, speech, and beyond.

1. Selective Scan Mechanism and its Attention Interpretation

At its core, a Mamba block implements a selective scan over an input token sequence $x_1, x_2, \ldots, x_n \in \mathbb{R}^d$ using a short learned kernel, i.e. a vector of trainable scan weights $w = [w_0, w_1, \ldots, w_{L-1}]$, typically with $L \ll n$. Mamba computes, for each token position $t$:

$$s_t = \sum_{\ell=0}^{L-1} w_\ell \, x_{t-\ell} \quad (\text{with } x_{t-\ell} = 0 \text{ for } t-\ell < 1)$$

and optionally applies a feedforward or pointwise nonlinearity. This can be interpreted as introducing, for each output token $t$, a non-negative normalized attention weight on the prior $L$ tokens:

$$\alpha_{t,j} = \begin{cases} \dfrac{\exp(w_{t-j})}{\sum_{k=1}^{t} \exp(w_{t-k})}, & 1 \leq j \leq t \\ 0, & \text{otherwise} \end{cases}$$

Thus, the Mamba scan implicitly defines a lower-triangular (causal) attention matrix, with scan weights $w$ trained so the network can adaptively model long-range and local dependencies. The global computation is implemented via a convolutional scan or parallel associative scan for hardware efficiency, yielding $O(ndL) \approx O(nd)$ compute per layer, substantially lower than the $O(n^2 d)$ complexity of standard Transformer self-attention (Wang et al., 28 Feb 2025; Ali et al., 2024).
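This scan-as-attention equivalence can be checked with a minimal NumPy sketch (toy sizes and variable names are illustrative, not from any paper's code): the causal convolution $s_t = \sum_\ell w_\ell x_{t-\ell}$ and the explicit lower-triangular matrix it induces produce identical outputs.

```python
import numpy as np

# Hypothetical toy sizes: n tokens of dimension d, kernel length L (L << n).
n, d, L = 8, 4, 3
rng = np.random.default_rng(0)
x = rng.standard_normal((n, d))          # token sequence x_1..x_n
w = rng.standard_normal(L)               # trainable scan weights w_0..w_{L-1}

# (1) Selective scan as a causal convolution: s_t = sum_l w_l * x_{t-l}.
s = np.zeros_like(x)
for t in range(n):
    for l in range(L):
        if t - l >= 0:                   # x_{t-l} = 0 outside the sequence
            s[t] += w[l] * x[t - l]

# (2) The same operator as an explicit lower-triangular (causal) attention
#     matrix: A[t, j] = w_{t-j} for 0 <= t-j < L, else 0.
A = np.zeros((n, n))
for t in range(n):
    for j in range(max(0, t - L + 1), t + 1):
        A[t, j] = w[t - j]

assert np.allclose(s, A @ x)             # scan == implicit attention
assert np.allclose(A, np.tril(A))        # strictly causal (lower-triangular)
```

Normalizing each row of `A` (e.g. a softmax over its nonzero band) recovers the non-negative weights $\alpha_{t,j}$ above.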

2. Lifting MambAttention to Structured Domains: Vision and Beyond

To apply MambAttention to images, the standard procedure is to decompose an image into a sequence of non-overlapping $p \times p$ patches, flatten each to a $d$-dimensional vector, and stream them into the selective scan. Since the scan is causal (the output at step $t$ depends only on previous inputs), the order in which 2D patches are serialized into a 1D sequence strongly affects the spatial relationships Mamba can model. In vision-centric models, this ordering imposes an inductive bias:

  • Four-direction (cross-scan): left→right, top→down, right→left, bottom→up, each yielding a scan of all patches. The output attention matrices for each direction are averaged, guaranteeing that each patch can "see" every other patch after merging (Wang et al., 28 Feb 2025).
  • Alternative orders: diagonal scan (tracing $\sqrt{2}$ diagonals), Morton (Z-order) space-filling curves, or spiral scans preserve various localities and neighbor relationships (see the table below).
| Patch Order | Qualitative Effect on Attention Clusters |
| --- | --- |
| Cross-scan | Merged, broad global context |
| Diagonal | Tight clusters along diagonals |
| Morton (Z-order) | Clusters matched to grid blocks in Z hierarchy |
| Spiral | Prioritizes immediate-predecessor focus in spiral |

Despite significant differences in the local clustering of attention matrices, all ordering schemes yield similar top-1 ImageNet accuracy ($\sim$82.6%), confirming the robustness of the scan-based approach. The clustering structure, however, changes to match the imposed precedence (Wang et al., 28 Feb 2025).
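The scan orderings above amount to different permutations of the patch index grid. A minimal NumPy sketch (grid size and names are illustrative) generates the four cross-scan directions and a Morton (Z-order) serialization:

```python
import numpy as np

# Hypothetical 4x4 patch grid; each scheme maps grid positions to a 1D order.
H = W = 4
idx = np.arange(H * W).reshape(H, W)     # patch ids in row-major layout

# Cross-scan: four directional serializations whose attention maps are merged.
scans = {
    "left_to_right": idx.reshape(-1),                 # row-major
    "right_to_left": idx.reshape(-1)[::-1],
    "top_to_down":   idx.T.reshape(-1),               # column-major
    "bottom_to_up":  idx.T.reshape(-1)[::-1],
}

def morton_order(h, w):
    """Z-order (Morton) curve: interleave the bits of row and column."""
    def key(r, c):
        k = 0
        for b in range(16):
            k |= ((r >> b) & 1) << (2 * b + 1)
            k |= ((c >> b) & 1) << (2 * b)
        return k
    cells = sorted(((r, c) for r in range(h) for c in range(w)),
                   key=lambda rc: key(*rc))
    return np.array([idx[r, c] for r, c in cells])

scans["morton"] = morton_order(H, W)

# Every scheme visits each patch exactly once; only the precedence differs.
for order in scans.values():
    assert sorted(order.tolist()) == list(range(H * W))
```

Feeding the same patches through the scan in each of these orders, then averaging the resulting attention maps, is the merging step the cross-scan bullet describes.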

3. Hybrid MambAttention Design: Multi-Modal, Multi-Scale, and Multi-Attention Fusion

Extensions of MambAttention fuse the selective scan with explicit attention modules or extra spatial/semantic modalities, enabling more expressive modeling for structured tasks:

  • A2Mamba (MASS mixer): Employs a two-stage token mixer in which Adaptive Multi-scale Attention (AMA) extracts local (sliding-window) and global (dilated sliding) affinity maps. An Attention-augmented SSM (A2SSM) receives these multi-scale maps, applying cross-attention to spatially aggregate SSM hidden states at fine and coarse granularities before gating and combining outputs. This hybridization recovers true 2D spatial dependencies, overcomes the causality of the 1D scan, and delivers superior performance on classification, segmentation, and detection tasks compared to ConvNet, Transformer, and pure Mamba baselines (Lou et al., 22 Jul 2025).
  • MambaCAFU (MAF): Fuses local CNN, global Transformer, and Mamba SSM streams. Local (ResNet-derived) features undergo spatial attention, Transformer features encode global context, and MambaConv (using a 2D SSM scan) provides linear-complexity long-range dependency capture. A co-attention gate (CoAG) exchanges salient regions between streams. Ablation results show a 1–2% drop in segmentation accuracy when any component is removed, highlighting the necessity of joint multi-attention (Bui et al., 4 Oct 2025).
  • StableMamba: Interleaves Mamba and Transformer-attention layers at a fixed ratio (e.g., 7:1). This design resolves instability and scale limitations of large SSMs, maintains near-linear scaling, and delivers top-tier performance in image and video classification, as well as robustness against input artifacts (Suleman et al., 2024).
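As a toy illustration of the interleaving idea, the sketch below builds a StableMamba-style layer schedule at a fixed 7:1 ratio; the function name and string labels are illustrative, not the paper's API:

```python
def interleave_layers(depth, ratio=7):
    """Build a layer schedule that inserts one attention layer after every
    `ratio` Mamba layers (a StableMamba-style 7:1 interleave)."""
    schedule = []
    for i in range(1, depth + 1):
        schedule.append("mamba")
        if i % ratio == 0:
            schedule.append("attention")
    return schedule

layers = interleave_layers(depth=14, ratio=7)
assert layers.count("attention") == 2          # one per 7 Mamba layers
assert layers.index("attention") == 7          # first attention after 7 scans
```

Because attention layers are sparse in the stack, overall compute stays near-linear in sequence length while the occasional global mixing stabilizes training at scale.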

4. Algorithmic and Mathematical Structure

In all variants, the defining property of MambAttention is the selective, input-dependent gating and recurrent scan, possibly combined with classic (multi-head) attention. The general mathematical template can be summarized as:

$$s_t = A s_{t-1} + B x_t$$

$$m_t = s_t \odot g(x_t)$$

where $A, B$ are learned or input-dependent matrices and $g(x_t)$ is a learned gating function. Unrolling this recurrence over the sequence yields

$$y_t = \sum_{j=1}^{t} \Big[ C_t \Big( \prod_{k=j+1}^{t} \bar{A}_k \Big) \bar{B}_j \Big] x_j$$

where the multiplicative term can be viewed as a soft, possibly non-stationary attention weight.

  • Explicit multi-head attention fusion: In hybrid designs, token features from Mamba blocks are passed through standard self- or cross-attention mechanisms (e.g., Transformer multi-head attention), and outputs are merged via addition, gating, or concatenation, enforcing either time/frequency or modal invariance as needed (Kühne et al., 1 Jul 2025).

In structured domains, multi-directional scans (2D or 3D), bidirectional recurrence, and cross-modal fusion modules (e.g., Shape Extractor Module, co-attention gates) are incorporated. Complexity remains $O(NdL)$ per layer for pure scan-based Mamba, rising to $O(N^2 d)$ only if a Transformer attention block is present.
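The equivalence between the recurrent and unrolled forms of the SSM can be checked numerically. The sketch below uses hypothetical scalar gates (`abar`, `bbar`, `c`) standing in for the input-dependent matrices $\bar A_t$, $\bar B_t$, $C_t$:

```python
import numpy as np

n = 6
rng = np.random.default_rng(1)
x    = rng.standard_normal(n)
abar = rng.uniform(0.5, 0.9, n)          # decay gates in (0, 1)
bbar = rng.standard_normal(n)
c    = rng.standard_normal(n)

# (1) Recurrent form: s_t = abar_t * s_{t-1} + bbar_t * x_t,  y_t = c_t * s_t.
s, y_rec = 0.0, np.zeros(n)
for t in range(n):
    s = abar[t] * s + bbar[t] * x[t]
    y_rec[t] = c[t] * s

# (2) Unrolled form: y_t = sum_j [ c_t * prod_{k=j+1..t} abar_k * bbar_j ] x_j,
#     i.e. a lower-triangular attention matrix with multiplicative decay.
attn = np.zeros((n, n))
for t in range(n):
    for j in range(t + 1):
        attn[t, j] = c[t] * np.prod(abar[j + 1:t + 1]) * bbar[j]
y_attn = attn @ x

assert np.allclose(y_rec, y_attn)        # recurrence == implicit attention
assert np.allclose(attn, np.tril(attn))  # causal, with continuous decay
```

Note the rows of `attn` decay continuously with distance rather than being softmax-normalized, which is exactly the contrast with causal self-attention drawn in Section 5.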

5. Visualization, Analysis, and Explainability

Interpreting MambAttention involves extracting and interpreting the effective attention weights embodied in the scan kernel. Visual analytics tools have been developed to map learned attention matrices to embeddings and patch grids, revealing how block-wise and intra-block attention clusters evolve during training and across layers:

  • Scatterplot view: Projects block-level or patch-level attention patterns onto low-dimensional embeddings (via PCA, t-SNE, UMAP). Inter-block clustering indicates that even structurally similar Mamba blocks in the same stage can develop orthogonal attention patterns.
  • Patch view: Maps attention strengths to spatial locations, highlighting "fan-in" patterns of which patches attend to a selected token. Early layers maintain smooth, locality-preserving attention; deeper layers encode more content-driven, context-sensitive dependencies. This visual–analytic approach reveals the emergent behavior of scan-computed attention matrices (Wang et al., 28 Feb 2025).

Theoretical analysis confirms that the Mamba SSM, unrolled into an explicit operator, yields a lower-triangular attention matrix algebraically similar to causal self-attention, but with continuous decay rather than softmax normalization and more flexibility in integrating temporal and spatial bias terms (Ali et al., 2024).
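The scatterplot view described above can be approximated in a few lines: build the attention matrix induced by a scan kernel, then project its rows (per-token attention patterns) to 2D with PCA via an SVD. Sizes and names are illustrative:

```python
import numpy as np

# Hypothetical: attention matrix induced by a length-L scan kernel over n tokens.
n, L = 32, 4
rng = np.random.default_rng(2)
w = rng.standard_normal(L)

attn = np.zeros((n, n))
for t in range(n):
    for j in range(max(0, t - L + 1), t + 1):
        attn[t, j] = w[t - j]

# PCA on the rows: center, take the top two right singular vectors.
rows = attn - attn.mean(axis=0, keepdims=True)
_, _, vt = np.linalg.svd(rows, full_matrices=False)
embedding = rows @ vt[:2].T              # 2D scatterplot coordinates per token

assert embedding.shape == (n, 2)
```

Tools like t-SNE or UMAP can replace the PCA step for the nonlinear embeddings mentioned above; the pipeline (extract attention rows, embed, plot) is the same.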

6. Empirical Performance and Scalability

Across vision, speech, 3D point cloud, sequential prediction, and audio separation tasks, MambAttention modules deliver strong empirical gains:

  • Vision: Comparable or superior ImageNet-1K top-1 accuracy (>86% for A2Mamba-L), superior segmentation mIoU, and better throughput than traditional Transformer or ConvNet backbones (Lou et al., 22 Jul 2025; Suleman et al., 2024).
  • Medical/biomedical imaging: MambaCAFU and RMA-Mamba set state-of-the-art Dice coefficients across multiple segmentation datasets (e.g., 96.76% on GlaS, 92.08% on CirrMRI600+) at competitive or lower FLOPs (Bui et al., 4 Oct 2025, Zeng et al., 23 Feb 2025).
  • Speech and audio: MambAttention hybrids (e.g., SSM + shared time-frequency MHA) outperform pure Mamba and traditional LSTM/conformer models by substantial margins (e.g., +5.87 dB SI-SDR on out-of-domain data) and demonstrate enhanced cross-corpus generalization (Kühne et al., 1 Jul 2025, Zhang et al., 2024).
  • Long-context reasoning: Hierarchical Sparse Attention (HSA) fuses precise random chunk access and O(L) sequence modeling, demonstrating 100% passkey retrieval at up to 64M length, far beyond full-attention or other sparse alternatives (Hu et al., 23 Apr 2025).
  • Trajectory, channel estimation: MambAttention reduces parameters by 40%+ and FLOPs by up to 4× while matching or exceeding SOTA in prediction accuracy and run-time in both multi-agent forecasting (Huang et al., 13 Mar 2025) and massive channel estimation (Luan et al., 23 Jan 2026).

7. Theoretical and Practical Implications

MambAttention mechanisms fundamentally generalize the representational capabilities of SSMs by permitting flexible, content-dependent, and efficient attention patterning. By trading explicit pairwise softmax attention for trainable, recurrent, often multi-directional scans, MambAttention achieves linear scaling—key for long sequences, large images, videos, or batch sizes—while maintaining the adaptability of attention:

  • Overcomes key RNN limitations (lack of random access, poor scaling to very long contexts) via fusion with sparse/hierarchical or explicit attention modules (e.g., HSA in RAMba, MASS mixer, MambaCAFU).
  • Enables flexible preservation or violation of spatial, temporal, and modal localities via patch-ordering, directional scan scheduling, and fusion with cross-modal affinity maps.
  • Admits extraction of interpretable hidden attention matrices for analysis and explainability, supporting interpretability at both block and token level (Ali et al., 2024, Wang et al., 28 Feb 2025).
  • Shows robustness to input modifications and stability at scale beyond pure SSMs and, when interleaved with sparse attention, enables both generalization and stable very-large-model training (Suleman et al., 2024, Hu et al., 23 Apr 2025).

MambAttention thus encapsulates a design space for efficient, expressive, and scalable attention-like mechanisms suitable for large-scale, structured representation learning across domains.
