XLSR-Mamba: Efficient Speech Modeling
- XLSR-Mamba is a family of speech sequence models that combine a multilingual self-supervised encoder (wav2vec 2.0 XLS-R) with state-space and Mamba-based backbones to capture long-range temporal dependencies.
- It employs dual-column bidirectional processing and hybrid SSM-attention architectures, achieving superior spoofing detection performance with faster inference than traditional transformer models.
- Its integration with multilingual pre-training, innovative data augmentation, and tailored loss functions enables efficient, robust speech processing across diverse applications.
XLSR-Mamba refers to a family of speech sequence models that integrate a multilingual self-supervised encoder (XLSR, typically wav2vec 2.0 XLS-R) with state-space and Mamba-based backbones to efficiently capture long-range temporal dependencies for tasks such as spoofing and audio-deepfake detection. The main instantiations include XLSR-Mamba—featuring a Dual-Column Bidirectional Mamba (BiMamba) state-space network—and the more recent XLSR-MamBo, a modular framework that hybridizes Mamba-style SSMs and attention mechanisms. Both serve as high-performance, computationally efficient alternatives to traditional transformer architectures in challenging speech-processing scenarios (Xiao et al., 2024, Ng et al., 6 Jan 2026, Miyazaki et al., 2024).
1. Architectural Foundation
The canonical XLSR-Mamba pipeline processes input speech in the following sequence:
- Input: Raw waveform audio, typically sampled at 16 kHz.
- XLSR (wav2vec 2.0 XLS-R) Feature Extraction: A convolutional front end maps the input waveform to a downsampled feature map, followed by 24 transformer encoder layers, yielding contextualized frame embeddings (of size 1024 in the 0.3B model).
- Linear Projection: A fully connected projection reduces the embedding dimension from the XLSR output size to a smaller model dimension $D$, producing the sequence fed to the backbone.
- BiMamba Backbone: The projected sequence is processed through stacked Bidirectional Mamba (BiMamba) layers, each implementing a dual-column structure to capture both forward and backward temporal dependencies.
- Pooling & Classification: Mean pooling across the temporal dimension, followed by a linear classifier that predicts class labels (bonafide or spoof).
- Data Augmentation: RawBoost is applied to input waveforms during training to enhance generalization (Xiao et al., 2024).
Recent variants such as XLSR-MamBo generalize this backbone by interleaving variants of SSMs (Mamba, Mamba2, Gated DeltaNet, and Hydra) with self-attention layers in multi-layered topologies (Ng et al., 6 Jan 2026).
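As a shape-level sketch of the pipeline above, the following NumPy snippet traces the tensor flow with random stand-in weights; the dimensions `T`, `d_xlsr`, and `D` are illustrative choices, not the papers' exact values (the backbone itself is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration: a 4 s clip at 16 kHz is ~64k
# samples; the conv front end downsamples it to T frames. Hidden size 1024
# matches the XLS-R 0.3B encoder; D is an assumed projection size.
T, d_xlsr, D = 200, 1024, 144

# Stand-in for XLSR output: T contextualized frame embeddings of size d_xlsr.
xlsr_feats = rng.standard_normal((T, d_xlsr))

# 1) Linear projection d_xlsr -> D.
W_proj = rng.standard_normal((d_xlsr, D)) / np.sqrt(d_xlsr)
h = xlsr_feats @ W_proj                      # (T, D)

# 2) Backbone omitted: in the real model, a stack of BiMamba layers
#    transforms the (T, D) sequence here.

# 3) Mean pooling over time, then a binary linear classifier.
pooled = h.mean(axis=0)                      # (D,)
W_cls, b_cls = rng.standard_normal((D, 2)), np.zeros(2)
logits = pooled @ W_cls + b_cls              # (2,) -> bonafide vs. spoof

print(h.shape, pooled.shape, logits.shape)
```

The point is simply that the classifier sees a single pooled vector per utterance, so the backbone carries all of the temporal modeling.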
2. Mathematical Formulation of Mamba and BiMamba
At its core, a Mamba block implements a time-varying, input-dependent linear state-space model (SSM) over the feature sequence:

$$h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t, \qquad y_t = C_t h_t,$$

where $\bar{A}_t$, $\bar{B}_t$, $C_t$, and the discretization step $\Delta_t$ are dynamically generated (input-conditioned) for each step $t$. Unrolling the recurrence, the output for the full sequence can be computed as a time-varying convolution:

$$y_t = \sum_{s=1}^{t} C_t \left( \prod_{r=s+1}^{t} \bar{A}_r \right) \bar{B}_s \, x_s.$$
Bidirectionality is achieved in XLSR-Mamba using a dual-column structure:
- The forward column processes $x_1, \ldots, x_T$ sequentially.
- The backward column processes the time-reversed sequence $x_T, \ldots, x_1$; its outputs are then reversed to match the original time indices.
- Outputs from both columns are concatenated at each time step, doubling the feature dimension to $2D$.
- A projection maps the concatenation back to dimension $D$, with a residual connection (Xiao et al., 2024).
Hydra, as introduced in XLSR-MamBo, generalizes this to native bidirectional mixing within a single SSM block. The SSM is parameterized as a quasiseparable matrix yielding bidirectional context in time, circumventing the need for separate forward/backward passes (Ng et al., 6 Jan 2026).
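The quasiseparable idea can be illustrated with a scalar toy example (not Hydra's actual parameterization): a causal scan plus an anticausal scan, minus the double-counted diagonal term, equals multiplication by a single matrix $M$ with $M_{ij} = a^{|i-j|}$ — bidirectional mixing in one matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
T, a = 6, 0.7                    # sequence length, assumed scalar decay
x = rng.standard_normal(T)

def scan(seq):
    # Causal scan: h_t = a * h_{t-1} + seq_t
    h, out = 0.0, np.empty(T)
    for t in range(T):
        h = a * h + seq[t]
        out[t] = h
    return out

fwd = scan(x)                 # lower-triangular (causal) mixing
bwd = scan(x[::-1])[::-1]     # upper-triangular (anticausal) mixing

# The same map, written as one quasiseparable matrix: M[i, j] = a**|i - j|.
i, j = np.indices((T, T))
M = a ** np.abs(i - j)

# fwd + bwd counts the diagonal term x_t twice, hence the -x correction.
print(np.allclose(M @ x, fwd + bwd - x))   # prints True
```

Hydra generalizes this structure with input-dependent, matrix-valued parameters, so the bidirectional mixing is computed inside one SSM block rather than by two separate directional passes.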
3. Integration with Self-Supervised Learning and Multilingual Pre-Training
XLSR-Mamba leverages a pre-trained wav2vec 2.0 XLS-R model (0.3B parameters, multilingual) as its front end. During fine-tuning for spoofing detection, all network parameters (convolutional feature encoder, transformer layers, projection, Mamba or hybrid backbone, classifier) are updated end-to-end.
A general framework for extending Mamba with cross-lingual self-supervised objectives comprises:
- Masked Frame Prediction: Mask a proportion of acoustic frames and predict original features using the output of Mamba blocks, with L1 or L2 loss.
- Contrastive Loss: A projection head after specific Mamba layers predicts masked positions, with negatives sampled from other frames/tokens, minimized by the InfoNCE objective:

$$\mathcal{L}_{\text{NCE}} = -\log \frac{\exp(\mathrm{sim}(z_t, q_t)/\tau)}{\sum_{\tilde{q} \in Q_t} \exp(\mathrm{sim}(z_t, \tilde{q})/\tau)},$$

where $z_t$ is the context output at masked position $t$, $q_t$ the true target, $Q_t$ the candidate set (target plus distractors), $\mathrm{sim}$ cosine similarity, and $\tau$ a temperature.
- Multilingual Parametrization: Prepend each utterance with a language embedding, which conditions the SSM selection layers (the input-dependent $\Delta$, $B$, $C$ projections). Relative positional bias can be added when necessary (Miyazaki et al., 2024).
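A NumPy sketch of the InfoNCE computation for a single masked position (the dimensions, temperature, and number of distractors below are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
d, K, tau = 16, 8, 0.1          # embedding dim, distractors, temperature

z = rng.standard_normal(d)              # context output at a masked position
pos = z + 0.1 * rng.standard_normal(d)  # true target, close to z by design
negs = rng.standard_normal((K, d))      # distractors from other frames

def cosine(u, V):
    """Cosine similarity between vector u and each row of V."""
    V = np.atleast_2d(V)
    return (V @ u) / (np.linalg.norm(V, axis=1) * np.linalg.norm(u))

# Logits: positive first, then the K negatives, scaled by the temperature.
sims = np.concatenate([cosine(z, pos), cosine(z, negs)]) / tau

# -log softmax probability of the positive candidate.
loss = -sims[0] + np.log(np.exp(sims).sum())

print(loss)
```

Since the log-sum-exp over all candidates always exceeds the positive logit, the loss is strictly positive and shrinks as the positive pair becomes more similar relative to the distractors.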
XLSR-Mamba and its variants are typically pre-trained with large multilingual speech corpora (e.g., MLS, CommonVoice), and fine-tuned for downstream tasks without the self-supervised heads.
4. Training Regimes, Objective Functions, and Implementation
Key components of XLSR-Mamba empirical training setups include:
- Objective: Weighted binary cross-entropy for binary spoofing classification. For instance,

$$\mathcal{L}_{\text{WCE}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ w_1\, y_i \log p_i + w_0\, (1 - y_i) \log(1 - p_i) \right],$$

with weights $w_0, w_1$ balancing class priors (Xiao et al., 2024).
- Optimization: Adam, or AdamW for XLSR-MamBo (Ng et al., 6 Jan 2026), with a fixed learning rate and weight decay. Early stopping is used (patience = 7).
- Augmentation: RawBoost includes linear/nonlinear convolutive, impulsive additive, and stationary additive noise.
- Hyperparameters: Typical settings use a stack of BiMamba layers at model dimension $D$, batch size 20 utterances, and segment length 4 s (64,600 samples). XLSR-MamBo uses a smaller model dimension $D$, SSM state dimension 64, and varying backbone depth $N$.
- Evaluation metrics: Equal Error Rate (EER%) and minimum t-DCF are standard.
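Both the objective and the primary metric can be sketched in a few lines of NumPy (the weights, labels, and scores below are made up for illustration, and the EER routine is a coarse threshold sweep rather than a calibrated ROC interpolation):

```python
import numpy as np

# --- Weighted binary cross-entropy with class-prior balancing ---
# Hypothetical weights: spoof utterances typically outnumber bonafide
# ones, so the bonafide class gets the larger weight.
y = np.array([1, 0, 0, 0, 1])            # 1 = bonafide, 0 = spoof
p = np.array([0.9, 0.2, 0.1, 0.3, 0.7])  # predicted P(bonafide)
w1, w0 = 0.9, 0.1
wce = -np.mean(w1 * y * np.log(p) + w0 * (1 - y) * np.log(1 - p))

# --- Equal Error Rate: operating point where FAR equals FRR ---
def eer(bona_scores, spoof_scores):
    thresholds = np.sort(np.concatenate([bona_scores, spoof_scores]))
    best = 1.0
    for th in thresholds:
        far = np.mean(spoof_scores >= th)   # spoof wrongly accepted
        frr = np.mean(bona_scores < th)     # bonafide wrongly rejected
        best = min(best, max(far, frr))     # approximate crossing point
    return best

bona = np.array([0.9, 0.8, 0.75, 0.6])    # higher score = more bonafide
spoof = np.array([0.7, 0.4, 0.3, 0.2, 0.1])
print(round(wce, 4), eer(bona, spoof))
```

On these toy scores the sweep finds an EER of 0.2: at the best threshold, one of five spoof trials is accepted while no bonafide trial is rejected.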
5. Comparative Empirical Performance
Empirical evaluation demonstrates XLSR-Mamba’s superiority in both accuracy and efficiency relative to transformer-based approaches:
| System | Params | ASVspoof'21 LA EER% | ASVspoof'21 DF EER% | In-the-Wild EER% | min t-DCF |
|---|---|---|---|---|---|
| XLSR-Mamba (Xiao et al., 2024) | 319 M | 0.93 | 1.88 | 6.71 | 0.208 |
| XLSR-Conformer+TCM | 320 M | 1.03 | 2.06 | 7.79 | — |
| Fake-Mamba | 320 M | 0.97 | 1.74 | 5.85 | — |
| MamBo-3-Hydra-N3 (Ng et al., 6 Jan 2026) | 319 M | 0.81 | 1.70 | 4.97 | — |
On In-the-Wild data, XLSR-Mamba outperforms XLSR+SLS and XLSR-Conformer+TCM baselines. XLSR-MamBo further reduces EER, especially in more diverse or unseen conditions (e.g., DFADD dataset) owing to stronger bidirectional context modeling via Hydra. Runtime analysis shows 30–40% faster inference (RTF ≈ 0.15 for XLSR-Mamba vs 0.25 for XLSR-Conformer) (Xiao et al., 2024, Ng et al., 6 Jan 2026).
6. Scaling Properties, Variants, and Generalization
XLSR-MamBo systematically explores SSM-attention hybridization and scaling effects:
- Topologies: Interleaving SSM and attention layers (MamBo-3) with deeper stacking (N steps per Mamba block) improves both mean EER and variance across checkpoints.
- SSM Variants: Hydra’s native bidirectional mixing outperforms dual-branch directional Mamba in both accuracy and computational efficiency.
- Depth and Stacking: Higher backbone depth and greater SSM recurrence (more steps $N$ per block) compress the spread in top-n checkpoint EERs, yielding more robust models under distribution shift.
- Ablation: MamBo-1/2 (denser intra-block hybridization) and MamBo-4 (maximum alternation) show smaller or less stable improvements. Deeper models offset checkpoint instability.
In challenging scenarios (unseen synthetic attacks, such as diffusion or flow-matching synthesis), all MamBo variants generalize well, but bidirectional SSMs (notably Hydra) at higher depth achieve near-zero EERs on most subsets.
7. Broader Applications and Context
The Mamba architecture, when evaluated across tasks such as ASR, text-to-speech, spoken language understanding, and summarization, matches or outperforms state-of-the-art transformers. Its main strengths include linear-time complexity ($O(L)$ in sequence length per block), the ability to process very long contexts, and scalability to long-form speech (e.g., 600 s video summarization, where transformers run out of memory) (Miyazaki et al., 2024). In speech applications where global context and efficient handling of long sequences are critical, the XLSR-Mamba line constitutes a class of highly efficient, robust, and generalizable neural architectures.
XLSR-Mamba and its hybrids are thus positioned as competitive, computationally efficient backbones for multilingual and robust speech sequence modeling, especially in adversarial or out-of-distribution settings, and they scale favorably with backbone depth and hybridization.