
Mamba Family of SSM LMs

Updated 21 December 2025
  • The Mamba family comprises selective state-space models that replace quadratic attention with efficient, linear-time recurrent architectures featuring constant memory usage.
  • They enable robust long-context retention and dynamic input selectivity, supporting diverse applications in language, code, and multimodal tasks.
  • Hybrid and efficiency-oriented variants such as TransMamba, BlackMamba, and Bi-Mamba integrate attention, mixture-of-experts routing, and quantization, leading to faster inference and improved performance compared to standard models.

The Mamba family comprises LLMs built on selective, hardware-efficient structured state space models (SSMs). Mamba models replace quadratic-complexity attention mechanisms with fully recurrent, input-selective SSMs engineered for linear time and constant memory across sequence length. This family of models targets efficient large-scale language modeling, long-context retention, and flexible hybridization with attention-based approaches. Mamba and its variants match or surpass transformer-level quality in language, code, and multimodal tasks, while enabling substantial inference speedup and memory savings.

1. Core Architecture and Selective State Space Recurrence

Mamba models eschew both attention and dense multi-layer perceptron (MLP) blocks for a unified block built around the Selective Structured State Space (S6) recurrent core. Each block processes an input matrix $X\in\mathbb{R}^{B\times D}$ and expands its feature dimension before splitting into two streams: one processed by a linear SSM, the other by a gating activation (SiLU/Swish), followed by elementwise multiplication and projection, ultimately wrapped in a residual connection and LayerNorm (Gu et al., 2023). The core SSM step discretizes a continuous-time system,

$$\frac{dh}{dt} = A_c h(t) + B_c x(t), \qquad y(t) = C h(t)$$

using zero-order-hold and per-timestep input-selective parameterizations, yielding the discrete recurrence:

$$h_t = A_t h_{t-1} + B_t x_t, \qquad y_t = C_t h_t,$$

with $A_t, B_t, C_t$ determined as input-dependent functions via shallow convolutions and softplus activation. This mechanism enables the gating of information flow and selective memory at each step.
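
To make the recurrence concrete, here is a minimal NumPy sketch of the zero-order-hold discretization and selective scan for a single scalar input channel with a diagonal $A_c$. The names `B_proj`, `C_proj`, and `dt_proj` are simplified stand-ins for the model's learned projections (and $C$ is held input-independent here), not Mamba's actual parameterization:

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_ssm_scan(x, A_c, B_proj, C_proj, dt_proj):
    """Sequential reference for a single-channel selective SSM with diagonal A_c.

    x: (L,) scalar input sequence; A_c: (N,) negative diagonal of the state matrix;
    B_proj, C_proj (shape (N,)) and dt_proj (scalar) stand in for learned projections.
    """
    L, N = x.shape[0], A_c.shape[0]
    h = np.zeros(N)
    y = np.empty(L)
    for t in range(L):
        dt = softplus(dt_proj * x[t])       # input-dependent step size Delta_t
        A_t = np.exp(dt * A_c)              # zero-order hold: A_t = exp(Delta_t * A_c)
        B_t = (A_t - 1.0) / A_c * B_proj    # ZOH input matrix, diagonal case
        h = A_t * h + B_t * x[t]            # h_t = A_t h_{t-1} + B_t x_t
        y[t] = C_proj @ h                   # y_t = C h_t (C simplified to static here)
    return y
```

Because $\Delta_t$ is a function of the input, tokens can dilate or shrink their own effective time step, which is precisely the selectivity mechanism described above.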

Efficient computation is achieved via hardware-aware fused CUDA kernels implementing a parallel prefix scan, allowing for $O(BLDN)$ time and constant memory per token irrespective of sequence length. This scan is up to 40× faster than naïve PyTorch and >5× faster than FlashAttention for long sequences (Gu et al., 2023).
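
The parallelism rests on the fact that affine state updates compose associatively. A toy sketch (not the fused CUDA kernel) of the combine operator and a divide-and-conquer reduction over it:

```python
import numpy as np

def combine(e1, e2):
    # Compose the maps h -> a1*h + b1 and then h -> a2*h + b2.
    # The result is again affine, and the operator is associative,
    # which is what licenses a parallel prefix scan.
    a1, b1 = e1
    a2, b2 = e2
    return a2 * a1, a2 * b1 + b2

def reduce_segment(a, b, lo, hi):
    # Reduce the updates on [lo, hi) to a single (a, b) pair;
    # the two halves are independent and could run in parallel.
    if hi - lo == 1:
        return a[lo], b[lo]
    mid = (lo + hi) // 2
    return combine(reduce_segment(a, b, lo, mid),
                   reduce_segment(a, b, mid, hi))
```

Applying the reduced pair $(a, b)$ to the initial state reproduces the final state of the sequential recurrence, so the tree-shaped evaluation order changes nothing but the wall-clock depth.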

2. Expressivity, Memory, and Approximation Power

Mamba's input-dependent parameterization is essential for content-sensitive reasoning. Theoretical analysis demonstrates that the S6 layer can approximate discontinuous functions—specifically, projections onto Haar wavelets—by dynamically choosing which input chunks to remember or forget, outperforming purely diagonal SSMs (S4D) in representing non-smooth functions (Huang et al., 13 Jun 2025). The model adaptively adjusts $\Delta_t$ to freeze or release memory, supporting arbitrarily long recall of selected information. Empirical validation confirms that for tasks including selective copying, associative recall (MQAR), and induction heads, Mamba achieves phase-transition thresholds in memory and embedding size consistent with theory, while S4D cannot recover certain functions or scales poorly with task complexity.

Memory decay in Mamba is exponential in token distance, governed by the product $\prod_k \exp(-\lambda \Delta_k)$, but input selectivity can modulate this, enabling persistent memory where necessary. Models with state-dimension-wise selectivity ("Mamba-$\perp\Delta$") demonstrably extend associative recall capacity with richer state-dependent decay patterns.
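
The interplay between the decay product and selectivity can be illustrated numerically: shrinking the selected step sizes $\Delta_k$ toward zero keeps the product near one (memory frozen), while large steps drive it to zero (memory flushed). A toy sketch with an illustrative decay rate, not model parameters:

```python
import numpy as np

LAM = 1.0  # decay rate (-A_c) for one state channel, toy value

def retention(deltas):
    # Fraction of the state surviving the window: prod_k exp(-LAM * Delta_k)
    return float(np.exp(-LAM * np.sum(deltas)))

frozen = retention(np.full(100, 1e-4))   # near-zero steps "freeze" the state
flushed = retention(np.full(100, 1.0))   # large steps flush it
```

The same 100-step window retains essentially all of the state in one regime and essentially none in the other, which is the lever input selectivity pulls.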

3. Long-Range Dependency and Theoretical Foundations

Mamba SSMs are rigorously characterized as Lyapunov-stable learners: the maximal Lyapunov exponent $\lambda_{\max}\le 0$ ensures any small perturbation vanishes exponentially over time, minimizing the risk of numerical or optimization instability in recurrent updates (Halloran et al., 2024). Mixed-precision and parameter-efficient fine-tuning (via low-rank updates on SSM buffers) are robustly supported, conferring throughput and memory improvements without destabilizing training.

However, the standard SSM block's long-range dependency (LRD), measured by the gradient of hidden state with respect to a remote input, decays exponentially with distance, paralleling RNN memory patterns and contrasting with the attention mechanism's potential to focus flexibly across arbitrary sequence spans (Ma et al., 4 Sep 2025). Extensions with bilinear interaction terms permit selective long-range retrieval while maintaining linear complexity and stability.
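
For a scalar linear recurrence, the gradient of the final state with respect to a remote input reduces to the product of the intervening decay factors, which makes the exponential fall-off easy to check numerically (toy values):

```python
import numpy as np

def remote_gradient(a, B1=1.0):
    # For h_t = a_t h_{t-1} + B_t x_t:  d h_T / d x_1 = (prod_{t=2}^T a_t) * B_1
    return float(np.prod(a[1:]) * B1)

a = np.full(50, 0.9)              # uniform decay factor per step, toy value
g_near = remote_gradient(a[:5])   # input 4 steps back: 0.9^4
g_far = remote_gradient(a)        # input 49 steps back: 0.9^49
```

The gradient shrinks geometrically with distance, matching the RNN-like decay pattern described above; attention has no analogous forced attenuation.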

4. Hybrid Architectures and Extensions

Several recent models build upon the Mamba backbone, exploring hybrid and multimodal extensions:

  • TransMamba unifies transformer and Mamba regimes at the architectural level using shared QKV↔CBx weights and a memory converter, allowing each layer to operate in attention or SSM mode at different "TransPoints" determined by a schedule. This approach grants both transformer-style global context and SSM efficiency, enhances multitask and long-context generalization, and reduces training time by roughly 25% (Li et al., 31 Mar 2025).
  • DiffuApriel integrates a bidirectional Mamba backbone with masked diffusion language modeling for denoising. The BiMamba variant computes both forward and backward SSM scans, fuses their outputs, and provides parallel full-sequence modeling. The DiffuApriel-H hybrid further interleaves attention blocks every $K$ Mamba blocks. These configurations yield up to 4.4× inference throughput for pure BiMamba and 2.6× for the hybrid, and match or exceed transformer-level perplexity in text generation tasks (Singh et al., 19 Nov 2025).
  • BlackMamba combines Mamba with Mixture-of-Experts (MoE) architectures, alternating selective SSM blocks with sparsely-gated expert MLP submodules. This achieves high accuracy per inference or training FLOP, linear complexity in sequence length, and low-latency generation, outperforming dense transformer and Mamba baselines on zero-shot benchmarks at equivalent compute (Anthony et al., 2024).
  • Sparse-Mamba applies classical control-theoretic concepts by parameterizing SSMs with strictly controllable/observable companion forms, enforcing stability on the state matrix $A$. This yields sparse, well-conditioned models with parameter reduction, improved perplexity, faster convergence, and architectures suitable for "Mamba 3"-style multi-channel extensibility (Hamdan et al., 2024).
  • Bi-Mamba aggressively binarizes the Mamba-2 SSM core to 1-bit weights (±1) in all large linear projections, leveraging scale/shift factors and preserving full precision in critical transformations. This approach achieves an 85–90% reduction in model storage and 8–10× energy reduction, with only modest accuracy loss relative to full-precision Mamba, outperforming all post-training quantization baselines (Tang et al., 2024).
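
To make the quantization idea concrete, here is a minimal sketch of per-row 1-bit weight binarization with a simple scale, in the spirit of (but much simpler than) Bi-Mamba's learned scale/shift scheme:

```python
import numpy as np

def binarize_rows(W):
    """Approximate W by alpha * sign(W), with one scale alpha = mean(|row|) per
    output row. A simplified sketch: Bi-Mamba additionally applies shift factors
    and keeps critical transformations in full precision."""
    alpha = np.mean(np.abs(W), axis=1, keepdims=True)
    return alpha * np.sign(W)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))     # stand-in for a linear-projection weight
W_bin = binarize_rows(W)         # 1 bit per weight plus one scale per row
```

Storage collapses from one float per weight to one bit per weight plus a single float per row, which is the source of the 85–90% memory reduction cited above.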

5. Empirical Performance, Memory, and Information Retention

Comprehensive benchmarking demonstrates that Mamba-family models (130M–2.8B parameters) match or surpass state-of-the-art transformers on language modeling (e.g., Pile, LAMBADA), zero-shot classification, and code tasks while exhibiting strictly linear inference time and constant activation memory (Gu et al., 2023; Anthony et al., 2024). Large-context evaluation (up to 1M tokens) shows continually improving perplexity, without the plateau observed in HyenaDNA or transformer architectures.

Nonetheless, Mamba’s fixed-size hidden state necessitates selective forgetting. Recent auto-encoder analyses reveal a strong recency bias in information retention and significant loss on rare or out-of-distribution tokens, particularly numerals, organization names, and dialectal variants when context length grows (Hossain et al., 17 Dec 2025). Omission rates for such tokens significantly exceed those of determiners or frequent vocabulary. Token retention is correlated with pretraining frequency ($r \approx -0.82$), indicating an inductive bias for preserving high-frequency content. This points to the need for numeracy-aware objectives, alternative tokenizations, or augmented architectures for robust long-horizon recall.

6. Complexity, Efficiency, and Future Directions

The defining strength of Mamba-family models is the resource efficiency afforded by SSM recurrence: no expanding key-value cache, linear scaling in sequence length, and competitive or superior throughput to transformer models at the same scale. BlackMamba and Bi-Mamba demonstrate orthogonal efficiency gains—through sparsity/parameter sharing or extreme quantization—suggesting the SSM abstraction is hardware-friendly and extensible (Anthony et al., 2024; Tang et al., 2024). Applications range from open-vocabulary LLMs to multi-modal fusion and are well-poised to exploit future low-power or memory-bounded platforms.

Hybrid SSM+attention formulations, such as those introduced in TransMamba and DiffuApriel, balance efficient local mixing with flexible global context, ameliorating the expressivity-efficiency trade-off imposed by exponential memory decay in vanilla SSMs (Li et al., 31 Mar 2025; Singh et al., 19 Nov 2025; Ma et al., 4 Sep 2025). Theoretical and experimental advances in controllability, observability, and stability enable robust scaling and deeper stacking (Hamdan et al., 2024).

Table: Selected Mamba Family Models and Variants

| Model | Key Innovation | Complexity | Efficiency / Accuracy | Reference |
|---|---|---|---|---|
| Mamba | Input-selective SSM (S6) | $O(L)$ | 5× faster vs. transformer | (Gu et al., 2023) |
| Mamba-2 | Dual form, per-timestep parameters | $O(L)$ | Improved compactness | (Huang et al., 13 Jun 2025) |
| TransMamba | Shared QKV↔CBx, TransPoint hybrid | $O(L)$ / $O(L^2)$ | Better multitask generalization, ~25% faster training | (Li et al., 31 Mar 2025) |
| DiffuApriel (BiMamba) | Bidirectional SSM, diffusion decoding | $O(L)$ | 4.4× throughput, best PPL | (Singh et al., 19 Nov 2025) |
| BlackMamba | MoE–SSM hybrid | $O(L)$ | +0.02–0.05 acc. vs. baselines | (Anthony et al., 2024) |
| Sparse-Mamba | Control-theoretic SSM forms | $O(L)$ | −0.16% params, −5% PPL | (Hamdan et al., 2024) |
| Bi-Mamba | 1-bit quantized SSM backbone | $O(L)$ | 85–90% memory saved, ~5 pts acc. loss | (Tang et al., 2024) |

7. Limitations and Open Questions

Despite their efficiency and stable learning dynamics, Mamba models fundamentally exhibit exponential long-range memory decay unless modulated by input-selective or hybrid mechanisms. Attention-based models maintain more flexible, non-monotonic context dependencies. SSM-LMs can “forget” rare, numerical, or disfavored content, especially at longer sequence lengths—a challenge for numeracy, code, and open-domain dialogue (Hossain et al., 17 Dec 2025; Ma et al., 4 Sep 2025). Improvements may require novel tokenizations, specialized loss functions, memory-augmented self-supervision, or deeper hybridization with attention mechanisms.

In summary, the Mamba family establishes structured state-space models as a scalable, modifiable foundation for efficient and expressive large-scale LLMs, while ongoing research explores its theoretical boundaries, compositional mechanisms, and broader applicability.
