
SincQDR-VAD: Noise-Robust, Efficient VAD

Updated 28 January 2026
  • The paper presents a learnable Sinc-filter front-end that replaces fixed filterbanks to extract robust, interpretable features directly from raw waveforms in noisy environments.
  • The paper introduces a quadratic disparity ranking loss that directly optimizes AUROC by enforcing a margin between speech and non-speech frames, enhancing detection performance.
  • The paper demonstrates that the compact SincQDR-VAD architecture, with only 8.0K parameters, outperforms existing lightweight VAD systems under challenging acoustic conditions.

SincQDR-VAD is a noise-robust voice activity detection (VAD) framework that integrates a learnable Sinc-extractor front-end and a quadratic disparity ranking (QDR) loss to optimize detection accuracy and efficiency, particularly in low-SNR and resource-constrained environments. SincQDR-VAD distinguishes itself by directly addressing the disconnect between frame-wise classification objectives and AUROC-based evaluation, while maintaining a compact architecture with significant parameter reduction compared to prior lightweight designs (Wang et al., 28 Aug 2025).

1. Sinc-Extractor Front-End with Learnable Filters

SincQDR-VAD replaces traditional fixed filterbanks or flat convolutional kernels with a bank of 64 learnable band-pass Sinc filters applied directly to the raw input waveform. Each filter is built from an ideal band-pass impulse response $\tilde{s}_i[n]$:

$$\tilde{s}_i[n] = \frac{\omega_{c2}^i}{\pi}\,\operatorname{sinc}(\omega_{c2}^i n) - \frac{\omega_{c1}^i}{\pi}\,\operatorname{sinc}(\omega_{c1}^i n), \quad -\infty < n < \infty,$$

where $\omega_{c1}^i, \omega_{c2}^i$ are the learnable low and high cutoff frequencies. Truncation, centering, and Hamming windowing produce the final filter:

$$s_i[n] = b_i\,\tilde{s}_i[n-R]\,h[n], \quad 0 \leq n < L,$$

for filter length $L = 2R+1$, with $b_i$ a learnable gain and $h[n]$ the Hamming window. For each frame $t$, the log-energy of sub-band $i$ is:

$$\hat{x}_{t,i} = \log\!\Bigl(\sum_n \bigl|x_t[n] * s_i[n]\bigr|^2\Bigr).$$

All filter parameters ($\omega_{c1}^i$, $\omega_{c2}^i$, $b_i$) are updated via backpropagation alongside the rest of the network. This front-end provides robust, interpretable feature extraction adapted for speech detection in noisy conditions.
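The filter construction above can be sketched in NumPy as follows. This is an illustrative reimplementation, not the authors' code; the cutoff values and the filter length $R = 125$ (251 taps) are chosen for illustration:

```python
import numpy as np

def sinc_filter(w_c1, w_c2, R=125, gain=1.0):
    """Truncated, centered, Hamming-windowed band-pass Sinc filter s_i[n]."""
    L = 2 * R + 1
    n = np.arange(L) - R                       # center the response at n = R
    # np.sinc(x) = sin(pi*x)/(pi*x), so sinc(w*n) = np.sinc(w*n/pi)
    ideal = (w_c2 / np.pi) * np.sinc(w_c2 * n / np.pi) \
          - (w_c1 / np.pi) * np.sinc(w_c1 * n / np.pi)
    return gain * ideal * np.hamming(L)

def log_subband_energy(frame, filt, eps=1e-10):
    """log(sum_n |x_t[n] * s_i[n]|^2) for one frame and one sub-band."""
    y = np.convolve(frame, filt, mode="same")
    return np.log(np.sum(y ** 2) + eps)

# One 25 ms frame at 16 kHz (400 samples) and one mid-band filter.
rng = np.random.default_rng(0)
frame = rng.standard_normal(400)
filt = sinc_filter(w_c1=0.2 * np.pi, w_c2=0.4 * np.pi)  # cutoffs in rad/sample
feat = log_subband_energy(frame, filt)
```

In a trained model the cutoffs and gain would be tensors updated by backpropagation; here they are plain floats to keep the sketch dependency-free.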

2. Quadratic Disparity Ranking Loss for AUROC Optimization

To address the inadequacy of frame-wise binary cross-entropy (BCE) for directly optimizing AUROC, which is a pairwise metric, SincQDR-VAD introduces the QDR loss. For positive (speech) frames $\mathcal{P}$, negative (non-speech) frames $\mathcal{N}$, output scores $\hat{y}_i$, and margin $m = 1.0$:

$$\mathcal{L}_{\text{QDR}} = \frac{1}{|\mathcal{P}|\,|\mathcal{N}|} \sum_{i \in \mathcal{P}} \sum_{j \in \mathcal{N}} \bigl[\max\bigl(0,\, m - (\hat{y}_i - \hat{y}_j)\bigr)\bigr]^2.$$

This loss encourages speech-frame scores to exceed non-speech-frame scores by at least the margin, closely aligning the optimization target with AUROC. The overall training loss is a hybrid of QDR and BCE, weighted by $\lambda = 0.25$:

$$\mathcal{L}_{\text{Total}} = \lambda\,\mathcal{L}_{\text{QDR}} + (1-\lambda)\,\mathcal{L}_{\text{BCE}}.$$

This dual objective preserves score calibration while promoting ranking separability.
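The two equations above can be sketched in NumPy as follows. The score and label arrays are illustrative, and a real implementation would operate on differentiable tensors rather than NumPy arrays:

```python
import numpy as np

def qdr_loss(scores, labels, m=1.0):
    """Quadratic disparity ranking loss over all speech/non-speech pairs."""
    pos = scores[labels == 1]              # speech-frame scores (P)
    neg = scores[labels == 0]              # non-speech-frame scores (N)
    diff = pos[:, None] - neg[None, :]     # y_i - y_j for every (i, j) pair
    return np.mean(np.maximum(0.0, m - diff) ** 2)

def bce_loss(scores, labels, eps=1e-7):
    """Frame-wise binary cross-entropy on sigmoid-style scores in (0, 1)."""
    p = np.clip(scores, eps, 1.0 - eps)
    return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))

def total_loss(scores, labels, lam=0.25):
    """Hybrid objective: lambda * QDR + (1 - lambda) * BCE."""
    return lam * qdr_loss(scores, labels) + (1 - lam) * bce_loss(scores, labels)

# Illustrative frame scores: speech frames ranked above non-speech frames.
scores = np.array([0.9, 0.8, 0.2, 0.1])
labels = np.array([1, 1, 0, 0])
loss = total_loss(scores, labels)
```

Note that with sigmoid outputs in $[0, 1]$ and $m = 1.0$, the hinge term is strictly positive unless scores saturate, so the QDR term keeps pushing speech and non-speech scores apart.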

3. Network Architecture and Parameter Efficiency

The Sinc-extractor outputs a $T \times 64$ time-frequency map, processed by:

  • An $8\times 8$ patchify layer (non-overlapping convolution) to produce local spatio-temporal blocks.
  • Three encoder layers, each with a split-transform-merge design:
    • One branch: $3\times 3$ depthwise convolutions and grouped $1\times 1$ convolutions (group size 8).
    • Second branch: an identity skip connection.
    • Concatenation and a residual connection fuse the two outputs.
  • Global average pooling, a linear layer, and a sigmoid produce per-frame speech probabilities.

The model contains only 8.0 K parameters, 69% of TinyVAD’s 11.6 K and far fewer than MarbleNet (88.9 K) or ResNet/BiLSTM-based VADs (hundreds of thousands). The low parameter count arises from the Sinc front-end (3 learnable parameters per filter) and extensive use of grouped/depthwise convolutions.
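A back-of-envelope comparison shows why these choices keep the budget small. The channel width C = 32 is a hypothetical value chosen for illustration; the paper's exact per-layer widths are not reproduced here:

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a k x k convolution (biases omitted)."""
    return (c_in // groups) * c_out * k * k

sinc_frontend = 64 * 3                   # 64 filters x 3 learnable params each
C = 32                                   # hypothetical channel width
standard_3x3 = conv_params(C, C, 3)                # dense 3x3 conv
depthwise_3x3 = conv_params(C, C, 3, groups=C)     # depthwise 3x3 conv
grouped_1x1 = conv_params(C, C, 1, groups=C // 8)  # 8 channels per group
```

At this width a dense 3x3 convolution costs 9,216 weights, while the depthwise 3x3 plus grouped 1x1 pair costs 288 + 256 = 544, which is how an entire encoder stack can fit in a few thousand parameters.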

4. Training Procedure and Data Augmentation

Training uses the SCF dataset, consisting of Speech Commands V2 clips mixed with 2,800 environmental noise clips (from Freesound), and labels speech within the central 0.2–0.83 s of each 1 s clip as positive. The split is 80% train, 10% validation, 10% test.

Key procedures:

  • Augmentation: random time shifts ($\pm$5 ms, 80% probability) and additive white noise (from $-90$ dB to $-46$ dB SNR).
  • Windowing: 25 ms frame, 10 ms hop, at 16 kHz.
  • Optimization: SGD with momentum 0.9, weight decay $10^{-3}$, batch size 256, 150 epochs, and a warmup → constant → polynomial-decay learning-rate schedule.
  • Regularization: Hamming window in Sinc filters acts as a smoothing constraint.

All parameters, including the Sinc-extractor, are optimized end-to-end.
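The learning-rate schedule can be sketched as a piecewise function. The warmup length, hold length, base rate, and decay power below are assumptions for illustration, not values stated in the paper:

```python
def lr_at_epoch(epoch, base_lr=0.1, warmup=5, hold=45, total=150, power=2.0):
    """Warmup -> constant -> polynomial decay (all lengths illustrative)."""
    if epoch < warmup:                   # linear warmup to base_lr
        return base_lr * (epoch + 1) / warmup
    if epoch < warmup + hold:            # constant plateau
        return base_lr
    frac = (total - epoch) / (total - warmup - hold)
    return base_lr * max(frac, 0.0) ** power   # polynomial decay toward zero

schedule = [lr_at_epoch(e) for e in range(150)]
```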

5. Experimental Results and Ablation Studies

SincQDR-VAD’s performance was benchmarked on AVA-Speech, noisy AVA-Speech (ESC-50 noise added at 10, 5, 0, $-5$, and $-10$ dB SNR), and the ACAM dataset. Primary metrics are AUROC and $F_2$-score (threshold 0.5).

Performance summary:

Model          Params   AVA AUROC/F₂   Noisy AVA avg. AUROC   ACAM AUROC/F₂
SincQDR-VAD     8.0 K   0.914/0.911    0.815                  0.97/0.92
TinyVAD        11.6 K   0.864/0.645    0.799                  0.96/0.65
MarbleNet      88.9 K   0.858/0.635    0.747                  —

At $-10$ dB SNR (noisy AVA), SincQDR-VAD achieves AUROC 0.709, versus TinyVAD’s 0.691 and MarbleNet’s 0.620.

Ablation studies demonstrate:

  • Without the Sinc-extractor: AUROC drops from 0.914 to 0.889 on AVA and from 0.815 to 0.784 on noisy AVA; $F_2$ drops from 0.911 to 0.881.
  • Without the QDR loss: AUROC drops from 0.914 to 0.872 on AVA and from 0.815 to 0.739 on noisy AVA; $F_2$ drops to 0.883.

These results establish that both the learnable Sinc front-end and the ranking-aware loss are critical contributors to noise robustness and overall detection performance.

6. Computational Considerations and Deployment

SincQDR-VAD’s efficiency stems from its front-end and modular lightweight layers. Each Sinc filter involves three learnable parameters, while grouped and depthwise convolutions minimize redundancy. With modest filter lengths (e.g., 251 taps) and a shallow architecture, the model supports highly parallel inference on CPUs or low-power DSPs. Postprocessing applies a median filter with 87.5% window overlap to smooth the output scores with negligible latency, facilitating real-time, on-device deployment.
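The smoothing step might look like the following sketch. The window length in frames is an assumption derived from the 87.5% overlap figure (consecutive windows sharing 7/8 of their frames implies an 8-frame window at a 1-frame stride); the paper does not spell out the exact length:

```python
import numpy as np

def median_smooth(probs, window=8):
    """Median-filter per-frame speech probabilities over a sliding window."""
    probs = np.asarray(probs, dtype=float)
    half = window // 2
    padded = np.pad(probs, half, mode="edge")          # edge-pad the ends
    return np.array([np.median(padded[i:i + window])
                     for i in range(len(probs))])

# A single spurious spike at frame 2 is suppressed by a 3-frame median.
raw = np.array([0.1, 0.1, 0.9, 0.1, 0.1, 0.8, 0.9, 0.9, 0.9, 0.9])
smooth = median_smooth(raw, window=3)
```

Because the filter only reorders values within a short local window, it adds essentially no latency beyond the window length itself.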

7. Significance and Practical Implications

SincQDR-VAD demonstrates that integrating task-adaptive, interpretable feature extraction with a pairwise ranking-oriented loss yields substantial gains in VAD robustness under challenging noise conditions while maintaining extreme parameter efficiency (Wang et al., 28 Aug 2025). The direct coupling of training objective and evaluation metric via QDR loss underscores a methodological direction emphasizing metric-aligned optimization. The combination of end-to-end learnable Sinc-filterbanks and computational parsimony positions SincQDR-VAD as a leading candidate for deployment in real-time, low-power, and harsh-acoustic environments, extending the operational capacity of VAD-centric speech technologies.
