DuaBiMamba: Dual-Column Anti-Spoofing Model
- DuaBiMamba is a dual-column bidirectional architecture that uses parallel forward and backward Mamba blocks to efficiently model long speech sequences.
- It integrates self-supervised representations from systems like wav2vec 2.0, enhancing its ability to detect speech spoofing artifacts with low computational cost.
- Empirical studies show that DuaBiMamba achieves lower error rates and improved inference efficiency compared to traditional transformer and state-space models.
DuaBiMamba refers to the dual-column bidirectional Mamba block architecture, prominently introduced in XLSR-Mamba for the task of speech spoofing attack detection. DuaBiMamba replaces traditional transformer-based modules with parallel forward and backward Mamba stacks, operating on contextualized self-supervised representations to enhance both model efficiency and sequence modeling capacity. This approach is designed to process long speech sequences, leverage temporal symmetry, and exploit the expressive power of selective state space modeling. The DuaBiMamba architecture integrates with self-supervised pre-trained frontends, forming a high-performing but computationally lightweight system for discriminative speech analysis, as validated on major anti-spoofing benchmarks (Xiao et al., 2024).
1. Architectural Design and Dual-Column Framework
The DuaBiMamba backbone comprises two parallel columns of Mamba blocks (Column F for forward time and Column B for backward time), which differ only in the direction of sequence processing. Each column consists of L stacked Mamba layers with tied structure and dimension. Input sequences are processed in their natural order by Column F and in time-reversed order by Column B. The outputs of Column F and the reverse-aligned outputs of Column B are concatenated along the channel axis and linearly projected before being passed to downstream tasks.
Within each Mamba block, the input features are first projected, then processed by a selective state space layer (SSM) with gating, followed by either a pointwise convolution or linear transformation to capture local dependencies. Residual connections and layer normalization ensure training stability. Fusion occurs after the dual columns, yielding rich temporal features that integrate information from both temporal directions.
Pseudocode for a single DuaBiMamba block:
```python
def DuaBiMambaBlock(X):              # X: (T, D)
    H_f = MambaStack_F(X)            # Forward column
    X_rev = reverse_time(X)
    H_b_rev = MambaStack_B(X_rev)    # Backward column
    H_b = reverse_time(H_b_rev)      # Realign to natural order
    H = concat(H_f, H_b)             # Concatenate feature-wise
    return LayerNorm(X + Linear(H))  # Project and residual
```
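The pseudocode above can be made concrete with a minimal NumPy sketch. Here `toy_ssm_scan` substitutes a plain linear recurrence for the full selective SSM (no input-dependent gating or local convolution), and all parameter shapes and values are illustrative rather than taken from the paper:

```python
import numpy as np

def toy_ssm_scan(X, A_bar, B_bar, C):
    """Simplified linear SSM recurrence: h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t.
    X: (T, D); A_bar: (N, N); B_bar: (N, D); C: (D, N)."""
    T, D = X.shape
    h = np.zeros(A_bar.shape[0])
    Y = np.empty((T, D))
    for t in range(T):
        h = A_bar @ h + B_bar @ X[t]
        Y[t] = C @ h
    return Y

def dua_bi_mamba_block(X, params_f, params_b, W_proj):
    """Dual-column block: forward scan plus time-reversed scan,
    realigned, concatenated feature-wise, projected, with residual + LayerNorm.
    W_proj: (2D, D) fuses the two columns back to model width D."""
    H_f = toy_ssm_scan(X, *params_f)              # Column F: natural order
    H_b = toy_ssm_scan(X[::-1], *params_b)[::-1]  # Column B: reversed, then realigned
    H = np.concatenate([H_f, H_b], axis=-1)       # (T, 2D)
    out = X + H @ W_proj                          # project and residual
    mu = out.mean(-1, keepdims=True)              # layer norm over features
    sd = out.std(-1, keepdims=True) + 1e-6
    return (out - mu) / sd

rng = np.random.default_rng(0)
T, D, N = 8, 4, 6
X = rng.standard_normal((T, D))
mk = lambda: (0.9 * np.eye(N),                    # stable A_bar
              rng.standard_normal((N, D)) * 0.1,  # B_bar
              rng.standard_normal((D, N)) * 0.1)  # C
Y = dua_bi_mamba_block(X, mk(), mk(), rng.standard_normal((2 * D, D)) * 0.1)
print(Y.shape)  # (8, 4)
```

Note that each column keeps its own parameters (`mk()` is called twice), matching the untied forward/backward stacks described above.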
2. Mathematical Foundations of DuaBiMamba
Each Mamba block implements a continuous-time state space model (SSM) of the form

$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t),$$

where $B$, $C$, and the step size $\Delta$ are parameterized by gating functions conditioned on $x_t$, the input at time $t$ (the selective mechanism). Discretization at sampling interval $\Delta$ produces

$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t,$$

with

$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B.$$
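The zero-order-hold discretization can be sanity-checked numerically in the scalar case: for piecewise-constant input, the discrete recurrence reproduces the continuous step response exactly. The values of `a`, `b`, and `delta` below are arbitrary illustrations:

```python
import math

def zoh_discretize(a, b, delta):
    """Zero-order-hold discretization of h'(t) = a h(t) + b x(t) (scalar case):
    a_bar = exp(delta * a), b_bar = (exp(delta * a) - 1) / a * b."""
    a_bar = math.exp(delta * a)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar

# Step response x(t) = 1 has exact solution h(t) = (b / a) * (exp(a t) - 1).
a, b, delta, steps = -0.5, 2.0, 0.01, 100
a_bar, b_bar = zoh_discretize(a, b, delta)
h = 0.0
for _ in range(steps):
    h = a_bar * h + b_bar * 1.0   # discrete recurrence with constant input
t = steps * delta
h_exact = (b / a) * (math.exp(a * t) - 1.0)
print(abs(h - h_exact))  # agrees to machine precision
```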
Bidirectionality is achieved by running the process twice: forward (natural order) and backward (time reversed). The outputs from both directions are fused after realignment, enabling sequence-to-sequence temporal modeling that merges information from context on both sides of each timestep.
3. Integration with Self-Supervised Frontends
DuaBiMamba is embedded within XLSR-Mamba's anti-spoofing framework. The model takes as input representations from a pre-trained wav2vec 2.0/XLSR frontend, which provides rich frame-level acoustic features via a 7-layer CNN feature encoder followed by 24 transformer layers. These representations are projected to dimension D and serve as input to L DuaBiMamba layers, yielding enhanced features for the classification head.
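A shape-level sketch of this pipeline may help; the values of T, the SSL dimension, and D below are illustrative (not the paper's configuration), and random matrices stand in for trained parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative shapes only: XLSR emits 1024-d frames; T and D are made-up values.
T, ssl_dim, D, n_classes = 50, 1024, 128, 2

ssl_features = rng.standard_normal((T, ssl_dim))  # wav2vec 2.0 / XLSR frontend output
W_in = rng.standard_normal((ssl_dim, D)) * 0.01   # linear projection to model width D
X = ssl_features @ W_in                           # (T, D) input to the DuaBiMamba stack

# ... L DuaBiMamba layers transform X here, preserving its (T, D) shape ...

pooled = X.mean(axis=0)                           # temporal pooling before classification
W_cls = rng.standard_normal((D, n_classes)) * 0.01
logits = pooled @ W_cls                           # bonafide-vs-spoof scores
print(logits.shape)  # (2,)
```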
Self-supervised pretraining ensures that low-level acoustic modeling is addressed upstream, allowing the DuaBiMamba backend to specialize in detecting artifacts and discriminative patterns relevant for spoofing detection (Xiao et al., 2024).
4. Loss Functions and Supervision
The primary training objective is the weighted cross-entropy loss

$$\mathcal{L}_{\mathrm{WCE}} = -\sum_{c} w_c\, y_c \log \hat{y}_c,$$

where $\hat{y}_c$ is the predicted probability for class $c$, $y_c$ is the one-hot target, and $w_c$ is the per-class weight. Regularization is applied via ℓ₂ weight decay (10⁻⁴). No auxiliary contrastive or multi-task losses are included in the standard XLSR-Mamba configuration.
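For a single example with a one-hot target, the weighted cross-entropy reduces to the weighted negative log-probability of the true class. A minimal sketch, with illustrative class weights that are not the paper's values:

```python
import math

def weighted_cross_entropy(probs, target, weights):
    """Weighted CE for one example with one-hot target: -w_c * log(p_c)
    where c is the true class index."""
    return -weights[target] * math.log(probs[target])

# Illustrative weights: class 0 = bonafide, class 1 = spoof.
w = {0: 0.9, 1: 0.1}
loss = weighted_cross_entropy([0.8, 0.2], target=0, weights=w)
print(round(loss, 4))  # 0.9 * -ln(0.8) ≈ 0.2008
```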
5. Empirical Performance and Efficiency
DuaBiMamba achieves superior results on major anti-spoofing benchmarks:
| System | 2021 LA EER | 2021 LA min t-DCF | 2021 DF EER | In-the-Wild EER |
|---|---|---|---|---|
| XLSR+DuaBiMamba | 0.93% | 0.208 | 1.88% | 6.71% |
| XLSR+Conformer | 0.97% | 0.212 | 2.58% | 8.42% |
| XLSR+AASIST | 1.00% | 0.212 | 3.69% | 10.46% |
| RawBMamba | 3.21% | 0.271 | 15.85% | – |
DuaBiMamba outperforms not only previous state-space and transformer-based backbones but also specialized models such as RawNet2 and SE-Rawformer. On In-the-Wild data, DuaBiMamba achieves the lowest generalization error, demonstrating robustness beyond the laboratory setting. Inference efficiency is also improved, with a 20–30% reduction in real-time factor over XLSR-Conformer.
6. Ablation Studies and Analytical Findings
Experiments reveal that:
- Removing bidirectionality increases the average equal error rate (EER) from 1.41% to 1.79%.
- Dual-column (DuaBiMamba) fusion outperforms inner or external bidirectional schemes, with an absolute gain of ≈0.4% EER on LA and a decrease of ≈0.36% in average EER.
- Decreasing model depth (L) or width (D) leads to systematic, albeit moderate, degradation in detection accuracy.
- All stacked components, including the dual columns and SSM blocks, contribute to state-of-the-art performance and computational scalability.
A plausible implication is that explicitly parallel, direction-specialized pathways capture sequence information complementary to what can be achieved with unidirectional or weight-shared bidirectional models. The concatenation and subsequent linear projection fuse distinct predictive cues from past and future context, leading to gains in robustness and artifact sensitivity (Xiao et al., 2024).
7. Context, Applications, and Significance
DuaBiMamba, as deployed in XLSR-Mamba, demonstrates effective modeling of long-range temporal dependencies in speech without incurring the quadratic cost associated with multi-head self-attention. Its competitive accuracy and efficiency are achieved through careful architectural integration of state-space models and self-supervised representations. The design is especially relevant for applications requiring real-time detection and generalization across unseen spoofing techniques.
The dual-column bidirectional approach is a marked departure from earlier bidirectional state-space or transformer designs, formalizing a flexible, efficient, and empirically validated method for sequential speech analysis. Its generalizability to other modalities or structured time-series data suggests potential for further cross-domain adoption, though this requires additional empirical validation.
In conclusion, DuaBiMamba embodies an architectural paradigm that synergistically combines dual temporal directionality, state-space modeling, and self-supervised frontends to achieve state-of-the-art performance in anti-spoofing detection and related sequential signal processing tasks (Xiao et al., 2024).