XLSR-Mamba: Efficient Speech Modeling

Updated 22 February 2026
  • The XLSR-Mamba model family combines multilingual self-supervised encoders (wav2vec 2.0 XLS-R) with state-space and Mamba-based backbones to capture long-range temporal dependencies.
  • It employs dual-column bidirectional processing and hybrid SSM-attention architectures, achieving superior spoofing detection performance with faster inference than traditional transformer models.
  • Its integration with multilingual pre-training, innovative data augmentation, and tailored loss functions enables efficient, robust speech processing across diverse applications.

XLSR-Mamba refers to a family of speech sequence models that integrate a multilingual self-supervised encoder (XLSR, typically wav2vec 2.0 XLS-R) with state-space and Mamba-based backbones to efficiently capture long-range temporal dependencies for tasks such as spoofing-attack and audio-deepfake detection. The main instantiations include XLSR-Mamba—featuring a Dual-Column Bidirectional Mamba (BiMamba) state-space network—and the more recent XLSR-MamBo, a modular framework that hybridizes Mamba-style SSMs and attention mechanisms. Both serve as high-performance, computationally efficient alternatives to traditional transformer architectures in challenging speech-processing scenarios (Xiao et al., 2024, Ng et al., 6 Jan 2026, Miyazaki et al., 2024).

1. Architectural Foundation

The canonical XLSR-Mamba pipeline processes input speech in the following sequence:

  1. Input: Raw waveform audio, typically sampled at 16 kHz.
  2. XLSR (wav2vec 2.0 XLS-R) Feature Extraction: A convolutional front end maps the input to an $\tfrac{N}{320} \times C$ feature map ($C = 1024$), followed by 24 transformer encoder layers, yielding contextualized embeddings of size $T \times C$.
  3. Linear Projection: A fully connected projection reduces the embedding size from $\mathbb{R}^C$ to $\mathbb{R}^D$ (typically $D = 144$), resulting in the sequence $X = [x_1, \dots, x_T]$, $x_t \in \mathbb{R}^D$.
  4. BiMamba Backbone: The projected sequence is processed through $L = 12$ stacked Bidirectional Mamba (BiMamba) layers, each implementing a dual-column structure to capture both forward and backward temporal dependencies.
  5. Pooling & Classification: Mean pooling across temporal dimension, then a linear classifier predicts class labels (bonafide or spoof).
  6. Data Augmentation: RawBoost is applied to input waveforms during training to enhance generalization (Xiao et al., 2024).
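The data flow of steps 1–5 can be sketched at the shape level. In this minimal NumPy sketch, the XLSR encoder and BiMamba backbone are replaced by random stubs, so only the tensor shapes (not the actual computations) follow the pipeline described above:

```python
import numpy as np

# Shape-level sketch of the XLSR-Mamba pipeline; layer internals are
# random stubs, NOT the actual XLSR or BiMamba computations.
rng = np.random.default_rng(0)

def xlsr_features(wave, C=1024, stride=320):
    """Stub for the XLSR front end: downsample by 320, emit C-dim frames."""
    T = len(wave) // stride
    return rng.standard_normal((T, C))  # placeholder contextual embeddings

def pipeline(wave, C=1024, D=144, n_classes=2):
    H = xlsr_features(wave, C)                        # (T, C) embeddings
    W_proj = rng.standard_normal((C, D)) / np.sqrt(C)
    X = H @ W_proj                                    # (T, D) projected sequence
    # ... 12 BiMamba layers would transform X here (identity stub) ...
    pooled = X.mean(axis=0)                           # mean pooling over time
    W_cls = rng.standard_normal((D, n_classes)) / np.sqrt(D)
    return pooled @ W_cls                             # logits: bonafide / spoof

wave = rng.standard_normal(64600)                     # 4 s of audio at ~16 kHz
logits = pipeline(wave)
print(logits.shape)  # (2,)
```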

Recent variants such as XLSR-MamBo generalize this backbone by interleaving variants of SSMs (Mamba, Mamba2, Gated DeltaNet, and Hydra) with self-attention layers in multi-layered topologies (Ng et al., 6 Jan 2026).

2. Mathematical Formulation of Mamba and BiMamba

At its core, a Mamba block implements a time-varying, input-dependent linear state-space model (SSM) over the feature sequence:

$$h_t = A_t h_{t-1} + B_t x_t,\quad y_t = C_t h_t + D_t x_t$$

where $A_t$, $B_t$, $C_t$, $D_t$ are dynamically generated (input-conditioned) matrices for each $t$. The output for the full sequence can be computed as a time-varying convolution:

$$y = x * \mathcal{K},\quad \mathcal{K} = [C_1 B_1,\; C_2 A_2 B_1,\; \dots,\; C_T A_T \cdots A_2 B_1]$$
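The equivalence between the recurrent and unrolled (kernel) views can be checked in a scalar toy example. This is a hedged sketch: real Mamba blocks use vector-valued states, input-conditioned parameter networks, and hardware-aware parallel scans rather than a Python loop.

```python
import numpy as np

# 1-D illustration of the time-varying SSM recurrence
#   h_t = A_t h_{t-1} + B_t x_t,  y_t = C_t h_t + D_t x_t,
# checked against its explicit unrolled form
#   y_t = C_t * sum_s (A_t ... A_{s+1}) B_s x_s + D_t x_t.
rng = np.random.default_rng(1)
T = 6
x = rng.standard_normal(T)
A, B, C, D = (rng.standard_normal(T) * 0.5 for _ in range(4))

def ssm_recurrence(x, A, B, C, D):
    h, ys = 0.0, []
    for t in range(len(x)):
        h = A[t] * h + B[t] * x[t]        # state update
        ys.append(C[t] * h + D[t] * x[t]) # readout with skip term
    return np.array(ys)

def ssm_unrolled(x, A, B, C, D):
    T = len(x)
    y = np.zeros(T)
    for t in range(T):
        acc = 0.0
        for s in range(t + 1):
            acc += np.prod(A[s + 1:t + 1]) * B[s] * x[s]  # A_t ... A_{s+1}
        y[t] = C[t] * acc + D[t] * x[t]
    return y

y1 = ssm_recurrence(x, A, B, C, D)
y2 = ssm_unrolled(x, A, B, C, D)
print(np.allclose(y1, y2))  # True
```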

Bidirectionality is achieved in XLSR-Mamba using a dual-column structure:

  • The forward column processes $\{x_t\}$ sequentially.
  • The backward column processes the time-reversed sequence $\{x_{T-t+1}\}$; its outputs are then reversed to match the original time indices.
  • Outputs from both columns are concatenated at each time step: $z_t = [h^f_t; h^b_t]$.
  • A projection brings $z_t$ back to $\mathbb{R}^D$ with a residual connection (Xiao et al., 2024).
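The dual-column step above can be sketched as follows, with a toy leaky cumulative sum standing in for the full Mamba column (the real block's selective state update is far richer):

```python
import numpy as np

# Dual-column BiMamba sketch: run a causal scan forward, run the same scan
# on the time-reversed sequence, flip those outputs back, concatenate the
# two columns per frame, then project to D with a residual connection.
rng = np.random.default_rng(2)
T, D = 10, 4
X = rng.standard_normal((T, D))

def causal_scan(X, decay=0.9):
    """Toy stand-in for a Mamba column: leaky cumulative sum over time."""
    h = np.zeros(X.shape[1])
    out = np.empty_like(X)
    for t in range(X.shape[0]):
        h = decay * h + X[t]
        out[t] = h
    return out

h_fwd = causal_scan(X)                    # forward column
h_bwd = causal_scan(X[::-1])[::-1]        # backward column, re-reversed
Z = np.concatenate([h_fwd, h_bwd], axis=1)           # (T, 2D)
W = rng.standard_normal((2 * D, D)) / np.sqrt(2 * D)
Y = X + Z @ W                             # project back to D, residual add
print(Y.shape)  # (10, 4)
```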

Hydra, as introduced in XLSR-MamBo, generalizes this to native bidirectional mixing within a single SSM block. The SSM is parameterized as a quasiseparable matrix, yielding bidirectional context in $O(Td)$ time and circumventing the need for separate forward/backward passes (Ng et al., 6 Jan 2026).

3. Integration with Self-Supervised Learning and Multilingual Pre-Training

XLSR-Mamba leverages a pre-trained wav2vec 2.0 XLS-R model (0.3B parameters, multilingual) as its front end. During fine-tuning for spoofing detection, all network parameters (convolutional feature encoder, transformer layers, projection, Mamba or hybrid backbone, classifier) are updated end-to-end.

A general framework for extending Mamba with cross-lingual self-supervised objectives comprises:

  • Masked Frame Prediction: Mask a proportion of acoustic frames and predict original features using the output of Mamba blocks, with L1 or L2 loss.
  • Contrastive Loss: A projection head after specific Mamba layers predicts masked positions, with negatives sampled from other frames/tokens; minimized by InfoNCE:

$$\mathcal{L}_\text{NCE} = -\sum_t \log \frac{\exp(h_t^\top c_t / \tau)}{\sum_{t'} \exp(h_t^\top c_{t'} / \tau)}$$

  • Multilingual Parametrization: Prepend each utterance with a language embedding, which conditions the SSM selection layers ($W_B, W_C, W_\Delta$). Relative positional bias can be added when necessary (Miyazaki et al., 2024).
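The InfoNCE objective above can be sketched directly; the dimensions, temperature, and synthetic targets below are illustrative rather than values from the papers:

```python
import numpy as np

# InfoNCE sketch: each hidden vector h_t is scored against all candidate
# targets c_{t'}; the loss is the cross-entropy of picking the matching
# index t (a log-softmax over similarity scores, scaled by temperature).
rng = np.random.default_rng(3)
T, d, tau = 8, 16, 0.1
H = rng.standard_normal((T, d))
C = H + 0.05 * rng.standard_normal((T, d))   # targets near their anchors

logits = (H @ C.T) / tau                     # (T, T) similarity scores
logits -= logits.max(axis=1, keepdims=True)  # numerical stability
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -log_probs[np.arange(T), np.arange(T)].mean()
print(loss >= 0)  # True: each term is -log of a probability
```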

XLSR-Mamba and its variants are typically pre-trained with large multilingual speech corpora (e.g., MLS, CommonVoice), and fine-tuned for downstream tasks without the self-supervised heads.

4. Training Regimes, Objective Functions, and Implementation

Key components of XLSR-Mamba empirical training setups include:

  • Objective: Weighted binary cross-entropy for spoofing classification. For instance,

$$\mathcal{L}_{\mathrm{CE}} = -\frac{1}{N}\sum_{i=1}^{N} w_{y_i}\,\bigl[y_i \log p_i + (1 - y_i)\log(1 - p_i)\bigr]$$

with $w_0, w_1$ balancing class priors (Xiao et al., 2024).

  • Optimization: Adam (learning rate $1 \times 10^{-6}$, weight decay $10^{-4}$), or AdamW for XLSR-MamBo (Ng et al., 6 Jan 2026). Early stopping is used (patience = 7).
  • Augmentation: RawBoost includes linear/nonlinear convolutive, impulsive additive, and stationary additive noise.
  • Hyperparameters: Typical settings are $L = 12$ BiMamba layers ($D = 144$), batch size 20 utterances, and segment length 4 s (64,600 samples). XLSR-MamBo reduces the projection dimension ($\mathbb{R}^{1024} \to \mathbb{R}^{128}$), with SSM state dimension 64 and varying backbone depth $L \in \{5, 7\}$.
  • Evaluation metrics: Equal Error Rate (EER%) and minimum t-DCF are standard.
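The weighted cross-entropy objective can be sketched as follows; the class weights and predictions are illustrative, not values from the papers:

```python
import numpy as np

# Weighted binary cross-entropy sketch: per-class weights w0, w1 rebalance
# the rarer class in spoofing classification. Weights here are made up.
def weighted_bce(p, y, w0=1.0, w1=9.0):
    """p: predicted P(bonafide); y: labels, 1 = bonafide, 0 = spoof."""
    w = np.where(y == 1, w1, w0)
    eps = 1e-12  # guard against log(0)
    return -(w * (y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))).mean()

y = np.array([1, 0, 0, 0, 0])             # imbalanced: few bonafide samples
p = np.array([0.9, 0.1, 0.2, 0.05, 0.1])  # well-calibrated predictions
print(round(weighted_bce(p, y), 4))       # 0.2867
```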

5. Comparative Empirical Performance

Empirical evaluation demonstrates XLSR-Mamba’s superiority in both accuracy and efficiency relative to transformer-based approaches:

| System | Params | 21LA EER% | 21DF EER% | ITW EER% | min t-DCF |
|---|---|---|---|---|---|
| XLSR-Mamba (Xiao et al., 2024) | 319 M | 0.93 | 1.88 | 6.71 | 0.208 |
| XLSR-Conformer+TCM | 320 M | 1.03 | 2.06 | 7.79 | |
| Fake-Mamba | 320 M | 0.97 | 1.74 | 5.85 | |
| MamBo-3-Hydra-N3 (Ng et al., 6 Jan 2026) | 319 M | 0.81 | 1.70 | 4.97 | |

On In-the-Wild data, XLSR-Mamba outperforms XLSR+SLS and XLSR-Conformer+TCM baselines. XLSR-MamBo further reduces EER, especially in more diverse or unseen conditions (e.g., DFADD dataset) owing to stronger bidirectional context modeling via Hydra. Runtime analysis shows 30–40% faster inference (RTF ≈ 0.15 for XLSR-Mamba vs 0.25 for XLSR-Conformer) (Xiao et al., 2024, Ng et al., 6 Jan 2026).
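EER values like those above can in principle be computed from raw detection scores by sweeping a decision threshold to the point where the false-accept and false-reject rates meet. The scores below are synthetic, purely to illustrate the procedure:

```python
import numpy as np

# Equal Error Rate sketch: sweep thresholds over the score range and take
# the operating point where FAR and FRR are (approximately) equal.
def eer(scores, labels):
    """scores: higher = more bonafide; labels: 1 = bonafide, 0 = spoof."""
    best = 1.0
    for th in np.sort(np.unique(scores)):
        far = np.mean(scores[labels == 0] >= th)  # spoof accepted
        frr = np.mean(scores[labels == 1] < th)   # bonafide rejected
        best = min(best, max(far, frr))           # closest FAR/FRR crossing
    return best

rng = np.random.default_rng(4)
bona = rng.normal(2.0, 1.0, 500)   # bonafide scores, higher on average
spoof = rng.normal(0.0, 1.0, 500)
scores = np.concatenate([bona, spoof])
labels = np.concatenate([np.ones(500), np.zeros(500)]).astype(int)
e = eer(scores, labels)
print(0.0 <= e <= 0.5)  # True
```

With two unit-variance Gaussians separated by two standard deviations, the EER lands near 0.16, matching the analytic value for that separation.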

6. Scaling Properties, Variants, and Generalization

XLSR-MamBo systematically explores SSM-attention hybridization and scaling effects:

  • Topologies: Interleaving SSM and attention layers (MamBo-3) with deeper stacking (N steps per Mamba block) improves both mean EER and variance across checkpoints.
  • SSM Variants: Hydra’s native bidirectional mixing outperforms dual-branch directional Mamba in both accuracy and computational efficiency.
  • Depth and Stacking: Higher backbone depth (LL) and greater SSM recurrence (NN) compress the spread in top-n checkpoint EERs, yielding more robust models under distribution shift.
  • Ablation: MamBo-1/2 (denser intra-block hybridization) and MamBo-4 (maximum alternation) show smaller or less stable improvements. Deeper models offset checkpoint instability.

In challenging scenarios (unseen synthetic attacks, such as diffusion or flow-matching synthesis), all MamBo variants generalize well, but bidirectional SSMs (notably Hydra) at higher depth achieve near-zero EERs on most subsets.

7. Broader Applications and Context

The Mamba architecture, when evaluated across tasks such as ASR, text-to-speech, spoken language understanding, and summarization, matches or outperforms state-of-the-art transformers. Its main strengths include linear-time complexity ($O(NL)$ per block), the ability to process very long contexts, and scalability to long-form speech (e.g., 600 s summarization inputs, on which transformers run out of memory) (Miyazaki et al., 2024). In speech applications where global context and efficient handling of long sequences are critical, the XLSR-Mamba line constitutes a class of highly efficient, robust, and generalizable neural architectures.

This suggests that XLSR-Mamba and its hybrids are positioned as competitive, computationally efficient backbones for multilingual and robust speech sequence modeling, especially in adversarial or out-of-distribution settings, and scale favorably with backbone depth and hybridization.
