
MambaMixer Architecture: Dual Selective SSMs

Updated 21 February 2026
  • MambaMixer is a deep network architecture that leverages selective state-space models for dual token and channel mixing, enabling scalable long-sequence modeling.
  • It integrates dense skip connections and data-dependent gating to efficiently combine features and stabilize training across multiple layers.
  • Variants such as ViM2, TSM2, and MoE-enhanced models demonstrate state-of-the-art performance in vision, time-series forecasting, language modeling, and decision-making.

MambaMixer architectures are a class of deep networks based on Selective State Space Models (SSMs) that integrate flexible token- and channel-wise information mixing through data-dependent gates and dual state-space layers. These architectures have achieved state-of-the-art results in long-sequence modeling tasks across vision, time-series forecasting, language modeling, and decision-making, providing a scalable and hardware-efficient alternative to transformer-based models. Variants such as Vision MambaMixer (ViM2), Time Series MambaMixer (TSM2), and hybrid models like Jamba and BlackMamba further diversify its application landscape, benefiting from linear time and space complexity, dense skip mechanisms, and mixture-of-experts (MoE) enhancements (Behrouz et al., 2024, Anthony et al., 2024, Lieber et al., 2024, Olalde-Verano et al., 2024, Kim, 2024).

1. Core Principles of the MambaMixer Block

At the core of the MambaMixer architecture is the dual selective mixing of information across the sequence (token) and feature (channel) axes using SSMs with data-dependent parameters. Each block comprises:

  • Selective Token Mixer (STM): Applies a selective SSM, in which $B_t$, $C_t$, and $\Delta_t$ are learned per token via input-dependent linear projections, to the input sequence. An SSM recurrence is used:

$$h_t = \bar{A} h_{t-1} + B_t x_t, \qquad y_t = C_t h_t$$

Gating and MLP-based preprocessing generate token-specific features, while convolutional gating selects which tokens to emphasize.

  • Selective Channel Mixer (SCM): Operates across feature channels by transposing the input and applying a bidirectional SSM (forward and reversed channel axes), allowing for intra-channel interactions. Each channel position $d$ receives:

$$z^{\text{fwd}}_d = \text{SSM}_{\bar{A}}\left(\text{sigmoid}(\text{Conv}_{1D}(W'_\text{lin} x^\top))\right)$$

with similar processing for the backward direction. MLP-based gating modulates these outputs.

  • Dense Weighted Averaging (Skip Connections): Each block aggregates outputs from all previous blocks via learned scalar weights, enabling direct access to early representations and facilitating gradient flow:

$$x^\ell_\text{Token} = \sum_{i<\ell} \alpha_{\ell,i}\, y^{(i)}_\text{Token} + \sum_{i<\ell} \beta_{\ell,i}\, y^{(i)}_\text{Channel}$$

$$x^\ell_\text{Channel} = \sum_{i\leq\ell} \theta_{\ell,i}\, y^{(i)}_\text{Token} + \sum_{i<\ell} \gamma_{\ell,i}\, y^{(i)}_\text{Channel}$$

This dual selective mechanism is fully attention-free, relying on SSMs and MLPs for all information routing (Behrouz et al., 2024).
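As a concrete illustration, the selective token-mixing recurrence above can be sketched in a few lines of NumPy. This is a didactic sketch, not the hardware-optimized parallel-scan kernel of the paper; the projections `W_B`, `W_C`, `W_delta` and the diagonal state matrix `A` are illustrative stand-ins for the learned parameters.

```python
import numpy as np

def selective_ssm_scan(x, W_B, W_C, W_delta, A):
    """Minimal selective-SSM recurrence (illustrative, not the paper's kernel).

    x:        (L, D) input sequence of L tokens with D channels
    W_B, W_C: (D, N) projections producing per-token B_t, C_t
    W_delta:  (D, D) projection producing per-channel step sizes Delta_t
    A:        (N,)   diagonal continuous-time state matrix (negative = stable)
    """
    L, D = x.shape
    N = A.shape[0]
    h = np.zeros((D, N))                           # one hidden state per channel
    ys = np.empty_like(x)
    for t in range(L):
        delta = np.log1p(np.exp(x[t] @ W_delta))   # softplus keeps Delta_t > 0
        B_t = x[t] @ W_B                           # (N,) input-dependent
        C_t = x[t] @ W_C                           # (N,) input-dependent
        A_bar = np.exp(delta[:, None] * A[None, :])  # ZOH: data-dependent (D, N)
        h = A_bar * h + x[t][:, None] * B_t[None, :]  # h_t = A_bar h_{t-1} + B_t x_t
        ys[t] = h @ C_t                            # y_t = C_t h_t, per channel
    return ys
```

Because $\Delta_t$, $B_t$, and $C_t$ are computed from $x_t$, the update rule changes per token, which is what distinguishes the selective SSM from a time-invariant convolutio­nal SSM.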

2. Mathematical and Algorithmic Underpinnings

The SSM backbone in MambaMixer generalizes continuous-time linear systems:

$$\dot{h}(t) = A h(t) + B x(t), \qquad y(t) = C h(t) + D x(t)$$

Discretized via zero-order hold (ZOH) for computational efficiency:

$$\bar{A} = e^{\Delta A}, \qquad \bar{B} = A^{-1}(\bar{A} - I) B$$

with $h_t = \bar{A} h_{t-1} + \bar{B} x_t$ for the time-invariant core. In selective versions, $B_t$ and $C_t$ are direct functions of $x_t$, enabling content-aware updates.
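A small numeric check of the ZOH formulas, using illustrative values for a diagonal $A$, confirms that for small $\Delta$ the discrete parameters approach the Euler step $\bar{A} \approx I + \Delta A$, $\bar{B} \approx \Delta B$:

```python
import numpy as np

# Zero-order-hold discretization for a diagonal SSM (illustrative values).
A = np.diag([-1.0, -0.5])          # stable continuous-time state matrix
B = np.array([[1.0], [2.0]])
delta = 0.1                        # step size Delta

A_bar = np.diag(np.exp(delta * np.diag(A)))         # A_bar = exp(Delta A)
B_bar = np.linalg.inv(A) @ (A_bar - np.eye(2)) @ B  # B_bar = A^{-1}(A_bar - I) B

# For small Delta these approach the Euler step: A_bar ~ I + Delta*A, B_bar ~ Delta*B.
```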

State-space convolutions support linear-complexity sequence processing via a hardware-optimized parallel scan, yielding $O(T n^2)$ training time and $O(1)$ per-step inference (Kim, 2024).

Gating and MLP selection enable both token and channel selection:

  • Sequence dimension: 1D or diagonal convolutional gating for tokens (STM)
  • Feature dimension: 1D convolutional gating for channels (SCM)

Elementwise combination with MLP features yields expressivity akin to attention without quadratic scaling.
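The channel-mixing path can be illustrated with a toy bidirectional scan: transpose the input so the recurrence runs along the feature axis, scan in both directions, and sum the passes. The scalar `decay` below is a fixed stand-in for the data-dependent $\bar{A}$ of the full selective SSM, and the convolutional gating and MLP stages are omitted:

```python
import numpy as np

def bidirectional_channel_scan(x, decay=0.9):
    """Illustrative SCM core: run a first-order recurrence along the
    feature axis in both directions and sum the two passes.

    x: (L, D) sequence; returns the same shape."""
    xt = x.T                      # (D, L): channels become the scan axis
    D, L = xt.shape
    fwd = np.zeros_like(xt)
    bwd = np.zeros_like(xt)
    h = np.zeros(L)
    for d in range(D):            # forward pass over channels
        h = decay * h + xt[d]
        fwd[d] = h
    h = np.zeros(L)
    for d in reversed(range(D)):  # backward pass over channels
        h = decay * h + xt[d]
        bwd[d] = h
    return (fwd + bwd).T          # transpose back to (L, D)
```

The bidirectionality matters because, unlike the time axis, the channel axis has no causal ordering, so each channel should see information from channels on both sides.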

3. Architectural Instantiations and Application Domains

Vision MambaMixer (ViM2)

  • Patchified input with 4×4 convolutional stem.
  • STM performs four diagonal 1D SSM scans per patch grid, outputs summed.
  • Four-stage hierarchy with progressive spatial downsampling.
  • Achieves 82.7%/83.7%/83.9% Top-1 ImageNet-1K accuracy with 20M/43M/74M parameters; matches or outperforms Swin-S and VMamba-S at similar scale (Behrouz et al., 2024).

Time Series MambaMixer (TSM2)

  • Input consists of historical series and optionally auxiliary/static features.
  • STM along time, SCM across variates.
  • 2D normalization across time/feature grid.
  • Surpasses recent Transformer and MLPMixer baselines on 8/8 multivariate forecasting datasets in 26/32 settings (e.g., ETTh1, Electricity) (Behrouz et al., 2024).

SambaMixer

  • Application to multivariate sensor data, e.g. Li-ion battery SOH prediction.
  • Pipeline includes anchor-based resampling, dual positional encodings, stacked MambaMixer blocks, and mean-pool regression head.
  • Reports MAE of 1.07 with anchor-based resampling, versus 1.27 (linear) and 3.32 (random), along with increased robustness to cycle-shifted scenarios (Olalde-Verano et al., 2024).

Decision MetaMamba (DMM)

  • In offline RL, inputs for each step are embedded per modality (state, action, return-to-go), then merged via modality-specific token mixers (e.g., causal 1D convolution or linear flatten/projection).
  • Local mixing window (typically $w=6$) replaces traditional positional encoding.
  • Stacked Mamba SSM blocks with residual-multiplicative gating.
  • DMM variants achieve normalized returns up to 96.2 (linear mixer) and 94.2 (conv mixer) on Hopper-medium, outperforming parameter-matched Decision Transformers while using a fraction of the parameters (Kim, 2024).

Hybrid and MoE Variants

  • BlackMamba alternates Mamba SSM layers with MoE blocks (top-1 routing, Sinkhorn-balanced), reducing inference FLOPs by 20–40% over dense equivalents (Anthony et al., 2024).
  • Jamba interleaves a small number of Transformer layers with predominantly Mamba SSM layers and sparse MoE FFNs. Configurations such as 12B active/52B total parameters enable 256K token contexts with 3× higher throughput than Transformer-only LMs, with better or equal performance on standard NLP and long-context benchmarks (Lieber et al., 2024).

4. Information Flow, Dense Skip Connections, and Training Stability

A signature feature of MambaMixer architectures is the dense weighted averaging of all previous STM and SCM outputs in constructing block inputs. This mechanism fosters stable deep network training (analogous to DenseNet's dense connectivity), mitigates vanishing gradient issues, and allows for direct layerwise interaction:

  • Four learnable matrices per layer control the contribution of each prior token and channel outcome, facilitating both long-range dependency modeling and fine-grained local selection (Behrouz et al., 2024, Olalde-Verano et al., 2024).
  • Empirical ablations show that removing these skip connections or collapsing the token/channel separation markedly reduces performance (e.g., a 10–20 point degradation in normalized return for Decision MetaMamba) (Kim, 2024).
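The dense weighted averaging that builds a layer's token-mixer input reduces to a learned scalar-weighted sum over all earlier block outputs. A minimal sketch, with `alpha` and `beta` standing in for the learned scalars $\alpha_{\ell,i}$ and $\beta_{\ell,i}$:

```python
import numpy as np

def dense_skip_input(token_outs, channel_outs, alpha, beta):
    """Illustrative dense skip for layer l: the token-mixer input is a
    learned weighted sum of ALL previous token and channel outputs.

    token_outs, channel_outs: lists of (L, D) arrays from layers i < l
    alpha, beta:              per-layer learned scalar weights
    """
    x = np.zeros_like(token_outs[0])
    for i, (y_tok, y_ch) in enumerate(zip(token_outs, channel_outs)):
        x += alpha[i] * y_tok + beta[i] * y_ch
    return x
```

Because every layer can reweight every earlier output, gradients reach early blocks directly, which is the mechanism behind the training-stability claim above.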

5. Computational Complexity and Efficiency

MambaMixer achieves linear complexity in both time and space with respect to sequence length and channel dimension. For a single block:

  • STM: $O(B \cdot L \cdot E + E \cdot N)$, where $B$ is batch size, $L$ sequence length, $E$ gating dimension, and $N$ state size
  • SCM: $O(B \cdot D \cdot E + E \cdot L)$, where $D$ is the channel dimension
  • Skip connections: $O(L^2)$ for weight storage (negligible for moderate $L$)
  • Transformer-based methods incur $O(B L^2 D)$ time and $O(L^2 D)$ space per block, highlighting the efficiency advantage in long-sequence regimes (Behrouz et al., 2024).

Inference is recurrence-based with $O(1)$ memory, leveraging the rolling SSM state; no key-value cache is needed, further reducing the memory footprint for long-context applications (Anthony et al., 2024, Lieber et al., 2024).
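The constant-memory decoding loop can be sketched as follows. Only the rolling state $h$ persists between steps, in contrast to a Transformer's key-value cache, which grows with context length; a single-channel, time-invariant case is shown for brevity:

```python
import numpy as np

class SSMStepper:
    """Illustrative O(1)-memory decoding: the state h is the only thing
    carried between steps, regardless of how many tokens are processed."""

    def __init__(self, A_bar, B_bar, C):
        self.A_bar, self.B_bar, self.C = A_bar, B_bar, C
        self.h = np.zeros(A_bar.shape[0])   # rolling state, fixed size N

    def step(self, x_t):
        # h_t = A_bar h_{t-1} + B_bar x_t ;  y_t = C h_t
        self.h = self.A_bar @ self.h + self.B_bar * x_t
        return self.C @ self.h
```

Each call to `step` costs the same regardless of position in the sequence, which is the source of the $O(1)$ per-step inference claim.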

MoE-augmented variants sparsely activate experts per token, bounding active parameter utilization and latency, as evidenced by Jamba (12B active, state-of-the-art benchmarking) and BlackMamba (up to 2.8B total, 20–40% FLOPs savings) (Lieber et al., 2024, Anthony et al., 2024).

6. Empirical Performance and Comparative Analysis

Across multiple domains, MambaMixer and its descendants attain or exceed state-of-the-art results:

  • Vision: ViM2 matches or outperforms Swin-T/S/B and VMamba-S/B in classification and segmentation tasks at comparable or lower parameter counts (Behrouz et al., 2024).
  • Time Series: TSM2 outperforms Informer, Autoformer, and MLP-based TSMixer across benchmarks (ETTh1, Electricity, M5) with relative improvements up to 7.4% in WRMSSE (Behrouz et al., 2024).
  • Battery Health: SambaMixer achieves 37% lower MAE over vanilla Mamba by combining dense and dual-mixer design (Olalde-Verano et al., 2024).
  • Language and RL: Jamba and BlackMamba demonstrate the scalability of hybrid and MoE-enhanced MambaMixers, with superior throughput and competitive or superior accuracy compared to Transformer baselines, especially at long context lengths (e.g., 256K tokens), and strong few-shot and in-context learning (Lieber et al., 2024, Anthony et al., 2024, Kim, 2024).

7. Design Implications and Prospects

The MambaMixer framework establishes that dual selective mixing via SSMs—without explicit attention—can robustly model long-range and cross-channel dependencies with hardware-amenable complexity and network depth. Its generality is evidenced by its instantiations across modalities: images (ViM2), time series (TSM2), RL trajectories (Decision MetaMamba), and language (Jamba, BlackMamba). The empirical evidence refutes the necessity of MLP-based channel mixing or transformer-style cross-token attention for high performance in these domains (Behrouz et al., 2024, Kim, 2024).

A plausible implication is that future sequence models for dense, multi-dimensional data will increasingly rely on dynamic state-space parameterization, dual-axis mixers, and efficient skip connectivity, further extending the principles pioneered by the MambaMixer architecture.


References:

  • (Behrouz et al., 2024) "MambaMixer: Efficient Selective State Space Models with Dual Token and Channel Selection"
  • (Anthony et al., 2024) "BlackMamba: Mixture of Experts for State-Space Models"
  • (Lieber et al., 2024) "Jamba: A Hybrid Transformer-Mamba LLM"
  • (Olalde-Verano et al., 2024) "SambaMixer: State of Health Prediction of Li-ion Batteries using Mamba State Space Models"
  • (Kim, 2024) "Integrating Multi-Modal Input Token Mixer Into Mamba-Based Decision Models: Decision MetaMamba"
