MoE-Mamba: Sparse SSM Hybrid Models

Updated 19 February 2026
  • MoE-Mamba is a family of hybrid architectures that fuses Mixture-of-Experts with state-space models for efficient long-context sequence processing.
  • These models leverage token-wise sparse gating and selective expert routing to scale capacity while reducing compute compared to traditional Transformers.
  • Variants integrate alternating SSM–MoE layers and Transformer blocks, delivering performance gains in language modeling, vision applications, and time series forecasting.

MoE-Mamba denotes a broad family of architectures that combine Mixture-of-Experts (MoE) sparse activation with Mamba-style state-space models (SSMs), either exclusively or in hybrid designs that also include Transformer blocks. These models couple the linear-time sequence modeling and long-context extrapolation of SSMs with the parameter- and capacity-scaling benefits of MoE, typically employing token-wise sparse gating to activate only a small subset of expert sub-networks per inference step. The term encompasses pure Mamba+MoE alternations, Mamba blocks with internal expertization, and broader hybrid stacks interleaving Mamba, Transformers, and MoE feedforward modules, as found in Jamba/Jamba-1.5, Hunyuan-TurboS, Routing Mamba, AdaMamba, BlackMamba, HiFi-MambaV2, and various specialized applications in vision, medical imaging, and time series forecasting (Jeon, 7 Dec 2025, Pióro et al., 2024, Lieber et al., 2024, Team et al., 2024, Team et al., 21 May 2025, Zhan et al., 22 Jun 2025, Fang et al., 23 Nov 2025, Shabanpour et al., 9 Feb 2025, Xu et al., 29 Apr 2025, Wang et al., 9 Jun 2025, Anthony et al., 2024).

1. Core Architectural Principles

MoE-Mamba architectures are grounded in the fusion of SSMs—typically the hardware-aware, efficient Mamba variant—with sparse, token-level MoE routing. The paradigm is unified by several key technical elements:

  • Selective State-Space Modeling: Mamba blocks process sequences via discretized linear state-space recurrence, achieving $O(L d^2)$ or better per-layer compute (for sequence length $L$ and embedding size $d$), in contrast to the $O(L^2 d)$ cost of Transformers. For example, state updates follow $h_t = A h_{t-1} + B x_t$, $y_t = C h_t + x_t$ (Xu et al., 29 Apr 2025, Pióro et al., 2024).
  • Sparse Expert Routing: At each MoE layer, token embeddings are routed, via top-$k$ softmax or variants such as Sinkhorn normalization, to one or a small number of expert networks. The per-token router is typically a learned linear projection $g(x) = \operatorname{softmax}(W_g x)$, which outputs gating probabilities (Pióro et al., 2024, Lieber et al., 2024, Team et al., 21 May 2025).
  • Alternation or Integration: Architectures either alternate SSM and MoE layers (e.g., $x \rightarrow \text{Mamba} \rightarrow \widetilde{x} \rightarrow \text{MoE} \rightarrow x'$), or insert MoE layers within Mamba blocks by expertizing projections or the FFN sublayer (Jeon, 7 Dec 2025, Zhan et al., 22 Jun 2025, Pióro et al., 2024, Lieber et al., 2024).
  • Hybridization: MoE-Mamba emerges in several hybrid forms—most notably, Jamba and Hunyuan-TurboS interleave attention, SSM, and MoE sublayers in fixed ratios per block (e.g., 1:7 attention:Mamba in Jamba; 128 layers with varying AMF/MF patterns in Hunyuan-TurboS) (Lieber et al., 2024, Team et al., 2024, Team et al., 21 May 2025).
  • Load Balancing and Regularization: Auxiliary routing losses balance expert utilization across tokens, commonly taking the form $\mathcal{L}_{\mathrm{load}} = \lambda n \sum_{i=1}^n \mathrm{Load}_i^2$, where $\mathrm{Load}_i$ is the fraction of tokens routed to expert $i$ (Team et al., 2024, Pióro et al., 2024, Lieber et al., 2024, Fang et al., 23 Nov 2025).
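The linear state-space recurrence above can be sketched in a few lines of NumPy. This is a toy dense recurrence for illustration only (all names here are hypothetical; real Mamba kernels use input-dependent, discretized, hardware-aware selective scans rather than fixed $A$, $B$, $C$ matrices):

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Toy linear state-space recurrence over a sequence.

    x: (L, d) input tokens; A: (n, n) state transition;
    B: (n, d) input projection; C: (d, n) output projection.
    Implements h_t = A h_{t-1} + B x_t and y_t = C h_t + x_t,
    costing O(L) in sequence length (vs. O(L^2) for attention).
    """
    L, d = x.shape
    h = np.zeros(A.shape[0])
    y = np.empty_like(x)
    for t in range(L):
        h = A @ h + B @ x[t]
        y[t] = C @ h + x[t]   # residual skip, matching the update above
    return y

rng = np.random.default_rng(0)
L, d, n = 16, 8, 4
y = ssm_scan(rng.normal(size=(L, d)),
             0.9 * np.eye(n),              # stable diagonal transition
             rng.normal(size=(n, d)) * 0.1,
             rng.normal(size=(d, n)) * 0.1)
print(y.shape)  # (16, 8)
```

The sequential loop makes the $O(L)$ scan structure explicit; production implementations parallelize this recurrence with associative-scan kernels.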

2. Canonical MoE-Mamba Designs and Variants

The instantiations of MoE-Mamba span several design archetypes:

  • Alternating SSM–MoE Layers: As in MoE-Mamba (Pióro et al., 2024) and BlackMamba (Anthony et al., 2024), SSM and MoE blocks alternate with tokenwise sparse routing into banked FFN experts.
  • Hybrid Stacks (Transformer, SSM, MoE): Jamba/Jamba-1.5 (Lieber et al., 2024, Team et al., 2024) and Hunyuan-TurboS (Team et al., 21 May 2025) interleave attention, SSM, and MoE sublayers within each block, typically in a pattern such as Attention → Mamba → MoE-FFN, repeated; MoE layers replace every $e$-th FFN and use $n$ experts (with $K$ top-activated per token).
  • Projection Expertization (Routing Mamba): Routing Mamba (RoM) (Zhan et al., 22 Jun 2025) applies sparsely-activated experts directly to the main linear projections (Conv, Gate, Out) of Mamba blocks, using shared routing for all expertized layers—this maintains Mamba's linear complexity but scales parameter capacity.
  • Domain-Specific Variants:
    • Patch-MoE (AdaMamba): Applied to multivariate time series; patch tokens are formed, then routed to MoE-augmented Mamba layers (Jeon, 7 Dec 2025).
    • Frequency/Spatial Expertization (HiFi-MambaV2, MambaMoE for MRI/HSI): Features are decomposed (e.g., using Laplacian pyramids or multi-directional SSMs); routing is applied per-pixel or per-direction, balancing shared/routed experts for frequency or directional specialization (Fang et al., 23 Nov 2025, Xu et al., 29 Apr 2025).
    • Multimodal/Restoration Models (M2Restore): MoE gating fuses CLIP-derived priors and trainable prompts to manage Mamba/CNN expert selection in weather-robust image restoration (Wang et al., 9 Jun 2025).

3. Mathematical Formulation of Routing and Expert Dispatch

The MoE dispatch mechanism in MoE-Mamba variants universally builds on the following mathematical outline:

  • Gating: For a token $x \in \mathbb{R}^d$, the router computes a logit vector $l(x) = W_g x \in \mathbb{R}^n$ (with $n$ experts), normalized via an (optionally top-$k$-masked) softmax:

$$p_j(x) = \begin{cases} \dfrac{\exp(l_j(x))}{\sum_{k \in S(x)} \exp(l_k(x))}, & j \in S(x) \\ 0, & \text{otherwise} \end{cases}$$

where $S(x)$ is the set of top-$k$ indices for $x$.

  • Expert Output: The MoE output is $\sum_{j \in S(x)} p_j(x) \cdot \mathrm{FFN}_j(x)$. Each expert is typically a two-layer MLP (e.g., with GeLU or SwiGLU activation).
  • Load Balancing: Auxiliary losses penalize deviations of activation fractions from uniform: $\mathcal{L}_{\mathrm{load}} = \lambda \sum_{e} \left( \bar p_e - \tfrac{1}{n} \right)^2$.
  • Shared Routing and Expertization (Routing Mamba): Routing across multiple projections in a Mamba block (Conv, Gate, Out) uses the same $W_g$ mapping for all, tightly coupling expert paths (Zhan et al., 22 Jun 2025).
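The gating and dispatch outline above can be written out directly. This is a minimal per-token sketch in NumPy, with stand-in expert callables; all names and shapes are illustrative rather than drawn from any of the cited codebases:

```python
import numpy as np

def moe_dispatch(x, W_g, experts, k=2):
    """Top-k softmax gating and expert mixing for a single token.

    x: (d,) token embedding; W_g: (n_experts, d) router weights;
    experts: list of callables, each a stand-in for FFN_j.
    Returns sum_{j in S(x)} p_j(x) * FFN_j(x) along with the
    renormalized gate probabilities and the chosen expert indices.
    """
    logits = W_g @ x                       # l(x) = W_g x
    top = np.argsort(logits)[-k:]          # S(x): indices of the top-k logits
    masked = np.exp(logits[top] - logits[top].max())
    p = masked / masked.sum()              # softmax over S(x) only; zeros elsewhere
    out = sum(p_j * experts[j](x) for p_j, j in zip(p, top))
    return out, p, top

def load_balance_loss(gate_fracs, lam=0.01):
    """Penalize deviation of mean expert load from the uniform 1/n."""
    n = len(gate_fracs)
    return lam * np.sum((gate_fracs - 1.0 / n) ** 2)

rng = np.random.default_rng(0)
d, n_experts = 8, 4
W_g = rng.normal(size=(n_experts, d))
# Linear "experts" as placeholders for two-layer MLPs.
experts = [lambda x, W=rng.normal(size=(d, d)) * 0.1: W @ x
           for _ in range(n_experts)]
out, p, top = moe_dispatch(rng.normal(size=d), W_g, experts, k=2)
print(out.shape, round(p.sum(), 6))  # (8,) 1.0
```

Because the softmax is renormalized over $S(x)$ only, the unselected experts contribute exactly zero, which is what makes the per-token compute independent of the total expert count.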

4. Computational Complexity and Scaling

MoE-Mamba variants are explicitly designed to achieve linear or near-linear per-token computational complexity and efficient scaling in parameter count:

  • Mamba Block: $O(d^2)$ per token for SSM expansion (with expansion factor $E$), $O(E d^2)$ total (Pióro et al., 2024).
  • MoE Layer: For $k$ active experts per token, each with hidden size $d_e$, the MoE cost is $O(k d d_e)$ per token; only a small fraction of parameters are "active" per update.
  • Composite Model: MoE-Mamba keeps active compute close to dense Mamba and sub-linear compared to Transformer+MoE, enabling scaling to hundreds of millions or billions of total parameters (e.g., Routing Mamba supports 10B total parameters with 1.3B active per token, achieving a 23% FLOPs saving over dense Mamba for similar accuracy) (Zhan et al., 22 Jun 2025, Pióro et al., 2024, Fang et al., 23 Nov 2025).
  • Memory Footprint: Use of sparsity (top-$k$ routing), shared projections for minor components, and precision-reduced storage (e.g., ExpertsInt8 quantization in Jamba-1.5) keeps the KV cache and active memory modest even at high context lengths (e.g., under 10 GB for Jamba-1.5-Large at 256K tokens) (Team et al., 2024).
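The active-versus-total accounting above reduces to simple arithmetic. The sketch below uses hypothetical dimensions (the model width, expert size, and expert counts are illustrative, not taken from any cited system) to show why per-token cost tracks $k$ rather than $n$:

```python
# Back-of-envelope active-compute accounting for a sparse MoE layer,
# following the O(k*d*d_e) per-token MoE and O(E*d^2) Mamba terms above.
# All sizes below are hypothetical, chosen only for illustration.
d, d_e = 4096, 14336        # model width and expert hidden size
E, k, n = 2, 2, 16          # SSM expansion factor, active experts, total experts

mamba_flops_per_token = E * d * d       # O(E d^2) SSM block term
moe_active_flops = k * 2 * d * d_e      # k experts, two matmuls each
moe_total_params = n * 2 * d * d_e      # full expert bank (mostly idle per token)
active_fraction = moe_active_flops / moe_total_params

print(f"active expert fraction: {active_fraction:.3f}")  # k/n = 0.125
```

Doubling $n$ doubles total parameters but leaves `moe_active_flops` unchanged, which is the scaling property the composite-model bullet describes.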

5. Empirical Performance and Applications

MoE-Mamba models have been experimentally validated across a broad set of domains:

  • Long-Context Language Modeling: Jamba/Jamba-1.5 achieve up to 256K context windows with high throughput and state-of-the-art retrieval and QA metrics, outperforming leading Transformer and MoE hybrids, including on Infinite-Bench and RULER benchmarks (Lieber et al., 2024, Team et al., 2024).
  • Efficient Training and Inference: BlackMamba and MoE-Mamba demonstrate up to a 2.35× training speed-up (reaching parity with dense Mamba in correspondingly fewer training steps) and substantial FLOPs reduction vs. Transformers (Pióro et al., 2024, Anthony et al., 2024).
  • Time Series Forecasting: AdaMamba’s MoE-Mamba variant achieves stable, accurate multi-horizon forecasts under covariate and mean-shift, yielding a consistent 2–5% reduction in MSE on ETTm benchmarks vs. Transformer baselines (Jeon, 7 Dec 2025).
  • Vision and Medical Imaging: HiFi-MambaV2 surpasses prior CNN, Transformer, and Mamba models for high-fidelity MRI reconstruction (e.g., +0.37 dB PSNR, +0.037 SSIM on fastMRI) (Fang et al., 23 Nov 2025). MambaMoE exceeds SOTA for HSI classification (Xu et al., 29 Apr 2025). MoEMba improves robustness for EMG-based gesture recognition (Shabanpour et al., 9 Feb 2025). M2Restore sets new benchmarks in all-in-one weather image restoration (Wang et al., 9 Jun 2025).
  • Ablation Results: Across these works, removing or replacing the MoE component, or using standard FFNs, consistently degrades robustness to distributional shift, increases error (e.g., up to a 10% MSE rise in AdaMamba, a 3–4% overall-accuracy drop in MambaMoE), and slows convergence (Jeon, 7 Dec 2025, Xu et al., 29 Apr 2025, Pióro et al., 2024).

6. Limitations and Open Problems

Several limitations are consistently observed or discussed:

  • Expert Underutilization: Too few experts (e.g., $M \leq 2$) degrade performance below dense Mamba; expert collapse requires explicit auxiliary losses or architectural remedies (Pióro et al., 2024, Fang et al., 23 Nov 2025, Xu et al., 29 Apr 2025).
  • Implementation Complexity: Sparse routing and parameter sharding for large expert banks demand custom kernels, group-wise computation, and expert communication (e.g., as in BlackMamba and Routing Mamba) (Anthony et al., 2024, Zhan et al., 22 Jun 2025).
  • SSM Constraints: Fixed state compression in SSM can limit perfect in-context learning and token copying compared to attention mechanisms, motivating hybridization (Pióro et al., 2024, Lieber et al., 2024).
  • Domain Matching: Specialized expertization (e.g., per-projection in RoM, per-frequency in HiFi-MambaV2) is not always beneficial; naive application may degrade performance (see RoM ablations) (Zhan et al., 22 Jun 2025).
  • Scaling Laws and Routers: Theoretical scaling laws for optimal expert count vs. hidden/expansion sizes remain underinvestigated, as do richer router architectures beyond top-$k$ softmax or Sinkhorn (Pióro et al., 2024, Team et al., 21 May 2025).

7. Prospects and Future Directions

Recent trends and suggestions include:

  • Unified Hardware-Efficient Architectures: The hybridization of attention, SSM, and MoE (Jamba, Hunyuan-TurboS) provides a tunable spectrum between lean SSMs and high-capacity attention, supporting both efficiency and quality (Team et al., 2024, Team et al., 21 May 2025).
  • Expertization Beyond FFN: Routing Mamba demonstrates that expertization of linear projections inside SSMs can further scale capacity and preserve linear complexity; future models may deepen this integration (Zhan et al., 22 Jun 2025).
  • Advances in Routing and Quantization: Novel gating methods (e.g., CLIP-guided cross-modal routing in vision, ExpertsInt8 quantization for low-precision weights) enable specialization, robust expert usage, and inference efficiency (Team et al., 2024, Wang et al., 9 Jun 2025).
  • Task/Domain Adaptation: MoE-Mamba designs are being extended to adaptive, cross-modal, and frequency-/patch-/direction-aware expertization, with robust performance gains in non-stationary, heterogeneous, or multi-modal environments (e.g., AdaMamba, HiFi-MambaV2) (Jeon, 7 Dec 2025, Fang et al., 23 Nov 2025, Xu et al., 29 Apr 2025, Shabanpour et al., 9 Feb 2025, Wang et al., 9 Jun 2025).
  • Theoretical Understanding: Open questions persist around scaling laws, fundamental expressivity boundaries, and optimization/regularization of sparse-expert SSMs versus equivalent dense parameterizations (Pióro et al., 2024, Zhan et al., 22 Jun 2025).

Together, these developments position MoE-Mamba and its variants as a central architectural principle for scale-adaptive, long-context sequence models, offering a flexible substrate for research and deployment across language, vision, and scientific modeling domains.
