Self-Supervised Audio Mamba (SSAM)
- SSAM is a neural architecture that replaces Transformer attention with adaptive Mamba blocks for efficient, linear-time audio representation and spectrum reconstruction.
- It leverages masked spectrogram modeling and pseudo-label prediction with self-supervised losses (MSE, InfoNCE, and cross-entropy) to learn robust audio features.
- The framework achieves competitive performance on downstream tasks like ASR and speaker identification while offering scalable, parameter-efficient processing.
Self-Supervised Audio Mamba (SSAM) refers to a family of neural architectures for audio representation learning that replace the attention mechanisms found in Transformers with Selective Structured State Space Models (Mamba). SSAM models—and their variants such as SSAMBA, Mamba-HuBERT, and related frameworks—apply masked spectrogram modeling or masked acoustic prediction objectives over spectrogram patches or frame-level audio features. The core innovation is to achieve linear time and space complexity in sequence length by leveraging Mamba blocks, while maintaining or exceeding the representation and transfer quality of competing attention-based models across a range of audio and speech tasks. The approach is broadly applicable to general audio event recognition, speech enhancement, speaker identification, and automatic speech recognition (ASR), with growing empirical and theoretical validation for its efficacy and limitations (Zhang et al., 2024, Shams et al., 2024, Yadav et al., 2024, Yadav et al., 23 Sep 2025, Lin et al., 14 Jun 2025, Yuksel et al., 1 Jun 2025).
1. Core Mamba-Based Architecture for Audio
SSAM models are built on the recurrent, input-adaptive Selective Structured State Space (S4/Mamba) module, which generalizes standard SSMs by introducing context- and time-varying state transitions and output projections. The continuous formulation is

$$x'(t) = A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t) + D\,u(t),$$

where $A$, $B$, $C$, $D$ parameterize the linear dynamics. Discretization (via zero-order hold with step size $\Delta$) produces

$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\Delta B,$$

and the update

$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t.$$

Mamba distinguishes itself by adapting $B$, $C$, and $\Delta$ as functions of the input at each time step or patch position (e.g., $B_t = s_B(x_t)$, $C_t = s_C(x_t)$, $\Delta_t = \tau_\Delta(s_\Delta(x_t))$). This gives each token its own SSM parameters, yielding both global and local temporal modeling capacity.
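The discretized, input-adaptive recurrence can be sketched as follows. This is a minimal NumPy illustration over a scalar input sequence with a diagonal state matrix; the learned projections producing the per-token $B_t$, $C_t$, $\Delta_t$ are collapsed into simple weight vectors, and a real Mamba block adds gating, channel mixing, and a hardware-aware parallel scan:

```python
import numpy as np

def selective_scan(u, A, w_B, w_C, w_delta):
    """Input-adaptive SSM scan over a scalar sequence u (sketch only).
    A: (N,) diagonal continuous-time state matrix (negative entries for stability).
    w_B, w_C: (N,) projections producing the per-token B_t, C_t.
    w_delta: scalar projection producing the per-token step size Delta_t."""
    L, N = u.shape[0], A.shape[0]
    h = np.zeros(N)                 # hidden state
    y = np.zeros(L)
    for t in range(L):
        B_t = w_B * u[t]                            # input-dependent B_t
        C_t = w_C * u[t]                            # input-dependent C_t
        delta_t = np.log1p(np.exp(w_delta * u[t]))  # softplus keeps Delta_t > 0
        A_bar = np.exp(delta_t * A)                 # zero-order hold: exp(Delta A)
        B_bar = (A_bar - 1.0) / A * B_t             # (Delta A)^{-1}(exp(Delta A)-I) Delta B
        h = A_bar * h + B_bar * u[t]                # recurrent state update
        y[t] = np.dot(C_t, h)                       # output projection
    return y

# tiny usage with random weights (hypothetical dimensions)
rng = np.random.default_rng(0)
A = -np.exp(rng.normal(size=8))     # negative diagonal -> stable dynamics
y = selective_scan(rng.normal(size=32), A,
                   rng.normal(size=8), rng.normal(size=8), 0.5)
```

Because the scan touches each token once with a fixed-size state, cost is linear in the sequence length, which is the property the surrounding text emphasizes.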
The architectural blueprint in SSAMBA and related models includes:
- Audio front-end: 1D convolutional stack to convert the raw waveform to dense frame-level or patch-level spectrogram features.
- Tokenization: For non-speech, patches of the spectrogram are flattened and projected; for speech, frame-level features are used directly.
- Mamba blocks: Stack of Mamba modules (unidirectional or bidirectional; often 12–24 layers).
- Decoders/Heads: Small MLP for spectrogram patch regression, or a classifier for self-supervised prediction.
Unlike vision/NLP Mamba, the audio version uses convolutional encoders tied to frame rates, and the masking/reconstruction objective is always in the time-frequency domain, not pixels or tokens (Zhang et al., 2024, Shams et al., 2024, Yadav et al., 2024, Yadav et al., 23 Sep 2025).
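The tokenization step above can be sketched concretely. The shapes and patch size here are illustrative (a 128-mel spectrogram cut into 16×16 patches, projected to a hypothetical 768-d model dimension); real models use a learned projection and add positional information:

```python
import numpy as np

def patchify(spec, ph=16, pw=16):
    """Split a log-mel spectrogram (freq, time) into flattened patches.
    Sketch only: patch size and crop policy are illustrative choices."""
    F, T = spec.shape
    F, T = F - F % ph, T - T % pw           # crop to a multiple of the patch size
    patches = (spec[:F, :T]
               .reshape(F // ph, ph, T // pw, pw)
               .transpose(0, 2, 1, 3)       # group the two patch-grid axes
               .reshape(-1, ph * pw))       # (num_patches, ph*pw)
    return patches

# project flattened patches to the model dimension (e.g., 768 for "Base")
rng = np.random.default_rng(0)
spec = rng.normal(size=(128, 512))          # 128 mel bins x 512 frames
tokens = patchify(spec) @ (0.01 * rng.normal(size=(256, 768)))
```

For speech models the same pipeline degenerates to frame-level tokens (one token per convolutional output frame) rather than 2D patches.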
2. Self-Supervised Training Objectives
The principal self-supervised objectives in SSAM models fall into two families:
a) Masked spectrogram patch modeling (SSAMBA, SSAM):
- A subset of spectrogram patches is masked at random.
- The model predicts every masked patch from the encoded (masked) sequence.
- The loss is the mean squared error over masked patches:

$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{|M|} \sum_{i \in M} \left\| \hat{x}_i - x_i \right\|_2^2,$$

where $M$ is the set of masked patch indices, $x_i$ the ground-truth patch, and $\hat{x}_i$ its reconstruction.
- SSAMBA extends this with an additional InfoNCE loss for masked patch discrimination:

$$\mathcal{L}_{\mathrm{InfoNCE}} = -\frac{1}{|M|} \sum_{i \in M} \log \frac{\exp\!\big(\mathrm{sim}(\hat{x}_i, x_i)/\tau\big)}{\sum_{j \in M} \exp\!\big(\mathrm{sim}(\hat{x}_i, x_j)/\tau\big)},$$

with total loss $\mathcal{L} = \mathcal{L}_{\mathrm{MSE}} + \lambda\,\mathcal{L}_{\mathrm{InfoNCE}}$.
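The two masked-patch losses can be sketched together. This assumes cosine similarity for the InfoNCE term and a hypothetical weighting `lam` and temperature `tau`; the published weighting may differ:

```python
import numpy as np

def masked_losses(pred, target, mask, tau=0.1, lam=1.0):
    """MSE + InfoNCE over masked patches (sketch).
    pred, target: (L, D) reconstructions / ground-truth patches; mask: bool (L,)."""
    p, t = pred[mask], target[mask]                       # (|M|, D)
    mse = np.mean(np.sum((p - t) ** 2, axis=1))
    # cosine similarity between every reconstruction and every true masked patch
    pn = p / np.linalg.norm(p, axis=1, keepdims=True)
    tn = t / np.linalg.norm(t, axis=1, keepdims=True)
    sim = pn @ tn.T / tau                                 # (|M|, |M|)
    # InfoNCE: each reconstruction should identify its own patch (the diagonal)
    m = sim.max(axis=1, keepdims=True)
    log_z = m + np.log(np.exp(sim - m).sum(axis=1, keepdims=True))
    info_nce = -np.mean(np.diag(sim - log_z))
    return mse + lam * info_nce, mse, info_nce
```

Only masked positions contribute, matching the objective above; unmasked patches act purely as context for the encoder.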
b) Masked pseudo-label prediction (Mamba-HuBERT):
- K-means clustering yields discrete units $z_t$; masked frames are predicted by a classification head.
- Cross-entropy loss:

$$\mathcal{L}_{\mathrm{CE}} = -\sum_{t \in M} \log p_\theta\!\left(z_t \mid \tilde{X}\right),$$

where $\tilde{X}$ is the masked input sequence and $M$ the set of masked frame indices.
This yields a direct analogue of HuBERT (Zhang et al., 2024, Lin et al., 14 Jun 2025).
Masking is typically unstructured (randomly selected tokens/patches); the mask ratio ranges from 15% (HuBERT-style) to 80% (real-world robust setups). Decoders are lightweight MLPs or local-attention stacks. No extra contrastive regularization is used in plain SSAM (Yadav et al., 2024, Yadav et al., 23 Sep 2025).
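The pseudo-label objective can be sketched end to end: assign each frame to its nearest k-means centroid, then take softmax cross-entropy on the masked positions. Shapes and the cluster count are hypothetical; real pipelines run k-means on features from an earlier training iteration:

```python
import numpy as np

def masked_pseudo_label_ce(logits, features, centroids, mask):
    """Cross-entropy on masked frames against k-means pseudo-labels (sketch).
    logits: (T, K) classifier outputs; features: (T, D); centroids: (K, D);
    mask: bool (T,) marking masked frames."""
    # nearest-centroid assignment yields the discrete unit z_t for each frame
    d = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
    z = d.argmin(axis=1)                                  # (T,)
    # numerically stable softmax cross-entropy, masked positions only
    lg = logits[mask]
    m = lg.max(axis=1, keepdims=True)
    log_probs = lg - m - np.log(np.exp(lg - m).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(lg.shape[0]), z[mask]])
```

Unmasked frames supply the context from which the encoder must infer the units of their masked neighbors, exactly as in HuBERT.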
3. Information-Theoretic Analysis
Empirical studies of information flow in Mamba-based speech models reveal distinctive mutual information (MI) trajectories:
- Reconstruction tasks: $I(X; T_\ell)$ (MI between the input $X$ and the $\ell$-th layer representation $T_\ell$) drops in early layers (compression), then rises in later layers (reconstruction), mirroring the typical "information bottleneck" found in generative autoencoders.
- Classification-style tasks: $I(X; T_\ell)$ decreases monotonically through the stack, even for deeper networks; this corresponds to loss of the fine time-frequency detail needed for spectrum reconstruction.
The MI is estimated using the MINE estimator:

$$\hat{I}_\theta(X; T) = \mathbb{E}_{p(x,t)}\!\left[f_\theta(x, t)\right] - \log \mathbb{E}_{p(x)\,p(t)}\!\left[e^{f_\theta(x, t)}\right],$$

where $f_\theta$ is a trainable critic network and the estimate is maximized over $\theta$.
Layerwise MI profiles therefore diagnose whether an architecture is predisposed to reconstructive or purely compressive/classification tasks.
Key finding: Mamba blocks by themselves are excellent for spectrum reconstruction but require additional classification or sequence modeling heads (e.g., Transformer/Conformer decoder) to excel in ASR (Zhang et al., 2024).
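The Donsker-Varadhan bound underlying MINE can be evaluated directly for any fixed critic; MINE then trains the critic to maximize it. The sketch below skips the training step and uses shuffling of the paired samples to simulate the marginal distribution:

```python
import numpy as np

def mine_lower_bound(f, x, t, rng):
    """Donsker-Varadhan bound E_joint[f] - log E_marginal[e^f] for a fixed
    critic f (sketch; MINE parameterizes f as a network and maximizes this).
    x, t: paired samples of shape (n, dx), (n, dt)."""
    joint = np.mean([f(a, b) for a, b in zip(x, t)])
    t_shuffled = t[rng.permutation(len(t))]        # break pairing -> marginal
    s = np.array([f(a, b) for a, b in zip(x, t_shuffled)])
    m = s.max()
    return joint - (m + np.log(np.mean(np.exp(s - m))))   # stable log-mean-exp
```

With strongly correlated variables even an untrained similarity-style critic yields a positive bound, while for independent variables the bound hovers near zero; layerwise profiles of the trained estimate produce the MI trajectories described above.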
4. Experimental Protocols and Variants
Model Training and Implementation:
| Dataset | Audio encoding | Masking | Loss/objective |
|---|---|---|---|
| AudioSet | Log-mel spec | 50–80% | MSE (SSAM), or MSE+InfoNCE (SSAMBA) |
| LibriSpeech | log-mel + conv | 15% | Cross-entropy (Mamba-HuBERT) |
- Patch/Frame Tokenization: Spectrogram patches of fixed time-frequency size (e.g., $16 \times 16$ bins) are flattened, yielding 250–500 tokens per 2–5 s of audio.
- Model sizes: "Tiny" (192-d, 12 layers), "Small" (384-d, 12 layers), "Base" (768-d, 12 layers).
- Pretraining batch: 1024, AdamW, 100 epochs, linear warmup (10 epochs) then cosine decay.
- Hardware: Multi-GPU clusters or A100 GPUs for large speech datasets.
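The stated schedule (linear warmup for 10 epochs, then cosine decay through epoch 100) corresponds to the following sketch; the peak learning rate is a hypothetical value, not one reported in the source:

```python
import math

def lr_at(epoch, peak=1e-4, warmup=10, total=100):
    """Linear warmup to `peak`, then cosine decay to ~0 at `total` epochs."""
    if epoch < warmup:
        return peak * (epoch + 1) / warmup                  # linear ramp
    progress = (epoch - warmup) / (total - warmup)          # in [0, 1)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))
```

The warmup phase stabilizes the large-batch (1024) AdamW updates early on; the cosine tail anneals the step size smoothly to near zero.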
HuBERT integration in Mamba-HuBERT:
- 7-layer conv encoder, 12 Mamba blocks
- Mask 15% of frames; cluster intermediate-layer outputs (layers 6/9) to generate discrete pseudo-labels.
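Selecting 15% of frames for masking can be sketched as below. This simplifies HuBERT's span masking to independent uniform frame selection; span-based variants mask contiguous chunks instead:

```python
import numpy as np

def random_frame_mask(T, ratio=0.15, rng=None):
    """Boolean mask selecting ~`ratio` of T frames uniformly at random (sketch)."""
    if rng is None:
        rng = np.random.default_rng()
    n = int(round(T * ratio))                    # number of frames to mask
    idx = rng.choice(T, size=n, replace=False)   # distinct masked positions
    mask = np.zeros(T, dtype=bool)
    mask[idx] = True
    return mask
```

The resulting boolean mask is what both the loss computation and the encoder's mask-embedding substitution consume.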
Evaluation:
- Downstream tasks include AudioSet-20K, ESC-50, SpeechCommands, VoxCeleb-1, FSD50K, LibriCount, SUPERB probing (Shams et al., 2024, Yadav et al., 2024, Zhang et al., 2024).
5. Quantitative Results and Comparison
Downstream task performance (average normalized aggregate score):
| Model | Params (Tiny) | Tiny score | Base score |
|---|---|---|---|
| SSAST | 5.4M | 55.3 | 68.2 |
| SSAM (Mamba) | 4.8M | 75.0 | 87.9 |
| AxLSTM | 4.3M | 70.1 | 85.8 |
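Aggregate scores of this kind are typically computed by min-max normalizing each task's metric across the compared models and averaging per model. The sketch below illustrates that convention; the exact normalization used by the cited benchmarks may differ:

```python
def aggregate_score(task_scores):
    """task_scores: dict model -> list of per-task metrics (same task order).
    Min-max normalize each task across models, average per model, scale x100."""
    models = list(task_scores)
    n_tasks = len(task_scores[models[0]])
    agg = {m: 0.0 for m in models}
    for j in range(n_tasks):
        col = [task_scores[m][j] for m in models]
        lo, hi = min(col), max(col)
        for m in models:
            agg[m] += (task_scores[m][j] - lo) / (hi - lo + 1e-12)
    return {m: 100.0 * agg[m] / n_tasks for m in models}
```

Because every task is rescaled to [0, 1] before averaging, tasks with large raw-metric ranges cannot dominate the aggregate.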
- On the HEAR and AudioSet/LibriSpeech benchmarks, SSAM/SSAMBA consistently outperforms Transformers at lower parameter count, with gains up to 30% relative on tiny models and 20% at base scale (Yadav et al., 23 Sep 2025, Yadav et al., 2024).
- SSAMBA-tiny is 92.7% faster and 95.4% more memory-efficient than SSAST-tiny at an input size of 22K tokens (Shams et al., 2024).
- For ASR, bidirectional Mamba (ConBiMamba) with a decoder achieves competitive WER; for instance, ConBiMamba+decoder reaches 6.0% on test-clean and 17.2% on test-other when trained on LibriSpeech train-clean-100 (Zhang et al., 2024).
- In pseudo-labeling/self-supervised speech recognition, Mamba-HuBERT approaches Transformer-based baselines, though final gap in WER remains unless a deep decoder is added (Zhang et al., 2024, Lin et al., 14 Jun 2025).
- For real-world, noisy, and spatial scene understanding, SSAM’s performance with dry, non-spatial pretraining degrades compared to models trained with explicit spatial or noise augmentations (Yuksel et al., 1 Jun 2025).
6. Key Insights and Theoretical Implications
- Linear Complexity: SSAM's structured recurrence and input-adaptive gating deliver $O(L)$ time and space, in contrast to the $O(L^2)$ attention in Transformers.
- Flexible Context Modeling: Mamba state spaces adapt to local and global time–frequency structure, improving generalization to variable-length or high-resolution input.
- Task Specialization: Standalone Mamba models excel at spectrum-reconstruction tasks due to their MI trajectories; classification/ASR tasks require downstream decoders to recover reconstruction capacity in intermediate representations (Zhang et al., 2024).
- Robustness: While SSAM maintains high performance on dry audio, lack of exposure to spatial/reverberant data in pretraining causes a significant drop in spatial or naturalistic settings (Yuksel et al., 1 Jun 2025).
- Bidirectionality: Bidirectional (layerwise alternation) Mamba substantially outperforms unidirectional SSMs for all downstream metrics.
- Patch/Sequence Scaling: SSAM performance improves or saturates as the number of patches or the input duration increases, in contrast to Transformers, which often degrade at large sequence lengths (Yadav et al., 2024, Yadav et al., 23 Sep 2025).
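The linear-vs-quadratic contrast in the first bullet can be made concrete by counting dominant per-layer operations (a back-of-envelope sketch; constants and minor terms omitted, state size `n` is an assumed typical value):

```python
def attention_ops(L, d):
    """Dominant self-attention cost: QK^T and attn@V, each ~L*L*d multiplies."""
    return 2 * L * L * d

def ssm_scan_ops(L, d, n=16):
    """Dominant selective-scan cost: one state update of size n per token/channel."""
    return L * d * n

# the attention/scan cost ratio grows linearly with sequence length L
r1 = attention_ops(1000, 768) / ssm_scan_ops(1000, 768)
r2 = attention_ops(10000, 768) / ssm_scan_ops(10000, 768)
```

At 1K tokens attention already costs roughly two orders of magnitude more than the scan under these assumptions, and the gap widens tenfold for every tenfold increase in sequence length, which is why long-audio and streaming settings favor the SSM formulation.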
7. Limitations, Applications, and Future Directions
Limitations:
- SSAM and its variants do not inherently exploit structure in spatial or reverberant environments; spatial localization and robust performance require explicit augmentation or domain-adapted masking objectives (Yuksel et al., 1 Jun 2025).
- Hardware and software ecosystems are less optimized for SSM-style selective scan than for batched attention.
Applications:
- Real-time and edge-deployed audio classification, keyword spotting, and speaker ID, owing to linear inference time and low memory footprint.
- Streaming and long-context ASR; document-level transcription without out-of-memory limitations (Lin et al., 14 Jun 2025, Zhang et al., 2024).
- Self-supervised unit extraction for spoken language modeling and voice anonymization.
Future Directions:
- Hybrid SSM/attention architectures to combine global context mixing with token-level adaptation.
- Joint multi-task objectives mixing classification and reconstruction to “bake in” effective decoders at the encoder level.
- Extension to spatial, noisy, and multi-channel domains via robust masking and spatial-contrastive loss formulations.
- Analysis of information-theoretic flows under domain adaptation (Zhang et al., 2024, Yuksel et al., 1 Jun 2025).
In conclusion, Self-Supervised Audio Mamba architectures provide a scalable, parameter-efficient, and theoretically interpretable foundation for general-purpose learned audio representations, with ongoing research addressing their adaptation to real-world and domain-robust settings.