SincQDR-VAD: Noise-Robust, Efficient VAD
- The paper presents a learnable Sinc-filter front-end that replaces fixed filterbanks to extract robust, interpretable features directly from raw waveforms in noisy environments.
- The paper introduces a quadratic disparity ranking loss that directly optimizes AUROC by enforcing a margin between speech and non-speech frames, enhancing detection performance.
- The paper demonstrates that the compact SincQDR-VAD architecture, with only 8.0K parameters, outperforms existing lightweight VAD systems under challenging acoustic conditions.
SincQDR-VAD is a noise-robust voice activity detection (VAD) framework that integrates a learnable Sinc-extractor front-end and a quadratic disparity ranking (QDR) loss to optimize detection accuracy and efficiency, particularly in low-SNR and resource-constrained environments. SincQDR-VAD distinguishes itself by directly addressing the disconnect between frame-wise classification objectives and AUROC-based evaluation, while maintaining a compact architecture with significant parameter reduction compared to prior lightweight designs (Wang et al., 28 Aug 2025).
1. Sinc-Extractor Front-End with Learnable Filters
SincQDR-VAD replaces traditional fixed filterbanks or flat convolutional kernels with a bank of 64 learnable band-pass Sinc filters applied directly to the raw input waveform. Each filter’s impulse response is defined as:

$$g_k[n] = 2 f_{2,k}\,\operatorname{sinc}(2\pi f_{2,k} n) - 2 f_{1,k}\,\operatorname{sinc}(2\pi f_{1,k} n),$$

where $f_{1,k}$ and $f_{2,k}$ are the learnable low/high cutoff frequencies of the $k$-th filter. Truncation, centering, and Hamming windowing produce the final filter:

$$\tilde{g}_k[n] = a_k\, g_k[n]\, w[n], \qquad n = -\tfrac{L-1}{2}, \dots, \tfrac{L-1}{2},$$

for filter length $L$, with learnable gain $a_k$ and Hamming window $w[n]$. For each frame $x_t$, the log-energy per sub-band is:

$$e_{t,k} = \log\!\left(\sum_{n} \big(x_t * \tilde{g}_k\big)[n]^2 + \epsilon\right),$$

where $\epsilon$ is a small constant for numerical stability. All filter parameters ($f_{1,k}$, $f_{2,k}$, $a_k$) are updated via backpropagation alongside the rest of the network. This front-end provides robust, interpretable feature extraction adapted for speech detection in noisy conditions.
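As a concrete sketch, a windowed band-pass Sinc filter of this kind can be built in a few lines of NumPy. The difference-of-sincs parameterization and the 251-tap length follow standard SincNet conventions and are assumptions, not the paper's exact implementation:

```python
import numpy as np

def sinc_bandpass(f_low, f_high, length=251, fs=16000, gain=1.0):
    """One windowed band-pass Sinc filter (SincNet-style sketch).

    f_low, f_high : cutoff frequencies in Hz (learnable in the paper)
    gain          : per-filter learnable gain
    """
    n = np.arange(length) - (length - 1) / 2          # centered sample indices
    f1, f2 = f_low / fs, f_high / fs                  # normalized cutoffs
    # Difference of two low-pass sinc kernels yields a band-pass response.
    h = 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)
    # Hamming windowing acts as the smoothing constraint noted in the paper.
    return gain * h * np.hamming(length)

h = sinc_bandpass(300.0, 3400.0)   # example telephone-band filter
```

In training, `f_low`, `f_high`, and `gain` would be tensors updated by backpropagation; here they are plain floats for illustration.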
2. Quadratic Disparity Ranking Loss for AUROC Optimization
To address the inadequacy of frame-wise binary cross-entropy (BCE) for directly optimizing AUROC—a pairwise metric—SincQDR-VAD introduces the QDR loss. For positive (speech) frames $i \in P$, negative (non-speech) frames $j \in N$, output scores $s_i, s_j$, and margin $m$:

$$\mathcal{L}_{\mathrm{QDR}} = \frac{1}{|P|\,|N|} \sum_{i \in P} \sum_{j \in N} \max\big(0,\; m - (s_i - s_j)\big)^2$$

This loss encourages speech-frame scores to exceed non-speech-frame scores by at least the margin $m$, closely aligning the optimization target with AUROC. The overall training loss is a hybrid of QDR and BCE, weighted by $\lambda$:

$$\mathcal{L} = \lambda\, \mathcal{L}_{\mathrm{QDR}} + (1 - \lambda)\, \mathcal{L}_{\mathrm{BCE}}$$

This dual objective preserves score calibration while promoting ranking separability.
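A minimal NumPy sketch of the pairwise ranking idea, assuming a squared-hinge form for the quadratic margin penalty and a convex combination for the hybrid loss (both assumptions consistent with, but not verbatim from, the paper):

```python
import numpy as np

def qdr_loss(pos_scores, neg_scores, margin=0.5):
    """Quadratic disparity ranking loss over all speech/non-speech pairs."""
    # All pairwise gaps s_i - s_j between positive and negative frames.
    diff = pos_scores[:, None] - neg_scores[None, :]
    # Squared hinge: penalize pairs whose gap falls short of the margin.
    return np.mean(np.maximum(0.0, margin - diff) ** 2)

def bce_loss(scores, labels, eps=1e-7):
    """Frame-wise binary cross-entropy on sigmoid outputs."""
    s = np.clip(scores, eps, 1 - eps)
    return -np.mean(labels * np.log(s) + (1 - labels) * np.log(1 - s))

def hybrid_loss(scores, labels, margin=0.5, lam=0.5):
    """Convex combination of QDR (ranking) and BCE (calibration)."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    return lam * qdr_loss(pos, neg, margin) + (1 - lam) * bce_loss(scores, labels)
```

A pair already separated by more than the margin contributes zero, so well-ranked frames stop influencing the gradient while violating pairs are penalized quadratically.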
3. Network Architecture and Parameter Efficiency
The Sinc-extractor outputs a time-frequency map, processed by:
- A patchify layer (non-overlapping convolution) that produces local spatio-temporal blocks.
- Three encoder layers, each with a split-transform-merge design:
- One branch: depthwise convolutions and grouped convolutions (group size 8).
- Second branch: identity skip connection.
- Concatenation and residual connection fuse the outputs.
- Global average pooling, a linear layer, and a sigmoid produce per-frame speech probabilities.
The model contains only 8.0 K parameters—69% of TinyVAD's 11.6 K, and far fewer than MarbleNet (88.9 K) or ResNet/BiLSTM-based VADs (hundreds of thousands). The low parameter count arises from the Sinc front-end (three learnable parameters per filter) and extensive use of grouped/depthwise convolutions.
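The savings from grouped and depthwise convolutions can be seen with a simple parameter count. Channel widths and kernel sizes below are illustrative, not the paper's exact configuration, and `groups=8` is one reading of "group size 8":

```python
def conv_params(c_in, c_out, k, groups=1, bias=True):
    """Parameter count of a 2-D convolution with the given grouping."""
    # Each output channel sees only c_in/groups input channels.
    return (c_in // groups) * c_out * k * k + (c_out if bias else 0)

standard  = conv_params(32, 32, 3)               # dense 3x3 convolution
depthwise = conv_params(32, 32, 3, groups=32)    # one filter per channel
grouped   = conv_params(32, 32, 3, groups=8)     # grouped convolution
```

At these widths the depthwise variant uses roughly 29x fewer parameters than the dense convolution, which is why stacking such layers keeps the whole network in the single-digit-K range.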
4. Training Procedure and Data Augmentation
Training uses the SCF dataset, consisting of Speech Commands V2 clips mixed with 2,800 environmental noise clips (from Freesound), and labels speech within the central 0.2–0.83 s of each 1 s clip as positive. The split is 80% train, 10% validation, 10% test.
Key procedures:
- Augmentation: random time shifts (up to 5 ms, applied with 80% probability) and additive white noise over a range of SNRs.
- Windowing: 25 ms frame, 10 ms hop, at 16 kHz.
- Optimization: SGD with momentum 0.9, weight decay, batch size 256, and 150 epochs, using a learning-rate schedule with warmup, a constant phase, and polynomial decay.
- Regularization: Hamming window in Sinc filters acts as a smoothing constraint.
All parameters, including the Sinc-extractor, are optimized end-to-end.
5. Experimental Results and Ablation Studies
SincQDR-VAD’s performance was benchmarked on AVA-Speech, noisy AVA-Speech (adding ESC-50 noise at 10, 5, 0, −5, and −10 dB SNR), and the ACAM dataset. Primary metrics are AUROC and F-score (threshold 0.5).
Performance summary:
| Model | Params | AVA AUROC/F-score | Noisy AVA Avg AUROC | ACAM AUROC/F-score |
|---|---|---|---|---|
| SincQDR-VAD | 8.0 K | 0.914/0.911 | 0.815 | 0.97/0.92 |
| TinyVAD | 11.6 K | 0.864/0.645 | 0.799 | 0.96/0.65 |
| MarbleNet | 88.9 K | 0.858/0.635 | 0.747 | — |
At the lowest tested SNR on noisy AVA-Speech, SincQDR-VAD achieves AUROC 0.709 vs. TinyVAD’s 0.691 and MarbleNet’s 0.620.
Ablation studies demonstrate:
- Without the Sinc-extractor: AUROC drops from 0.914 to 0.889 on AVA and from 0.815 to 0.784 on noisy AVA; F-score drops from 0.911 to 0.881.
- Without the QDR loss: AUROC drops from 0.914 to 0.872 on AVA and from 0.815 to 0.739 on noisy AVA; F-score drops to 0.883.
These results establish that both the learnable Sinc front-end and the ranking-aware loss are critical contributors to noise robustness and overall detection performance.
6. Computational Considerations and Deployment
SincQDR-VAD’s efficiency stems from its front-end and modular lightweight layers. Each Sinc-filter involves three learnable parameters, while grouped and depthwise convolutions minimize redundancy. With modest filter lengths (e.g., 251 taps) and shallow architecture, the model supports highly parallel inference on CPUs or low-power DSPs. Postprocessing applies an 87.5% overlap median filter for output smoothing with negligible latency, facilitating real-time, on-device deployment.
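The median-filter postprocessing can be sketched as follows. With 87.5% overlap, an 8-frame window advances one frame at a time (8 × 0.125 = 1); the window length itself is an assumption for illustration:

```python
import numpy as np

def smooth_scores(scores, kernel=8):
    """Sliding-median smoothing of frame-level speech scores.

    87.5% overlap with an 8-frame window means a hop of one frame, so the
    output has one smoothed value per input frame.
    """
    pad = kernel // 2
    padded = np.pad(scores, pad, mode="edge")   # extend edges to avoid shrinkage
    return np.array([np.median(padded[i:i + kernel])
                     for i in range(len(scores))])

scores = np.zeros(16)
scores[8] = 1.0                    # a single spurious detection
smoothed = smooth_scores(scores)   # the isolated spike is suppressed
```

A median (rather than mean) filter removes isolated false triggers without softening genuine speech/non-speech transitions, which is why it adds robustness at negligible latency.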
7. Significance and Practical Implications
SincQDR-VAD demonstrates that integrating task-adaptive, interpretable feature extraction with a pairwise ranking-oriented loss yields substantial gains in VAD robustness under challenging noise conditions while maintaining extreme parameter efficiency (Wang et al., 28 Aug 2025). The direct coupling of training objective and evaluation metric via QDR loss underscores a methodological direction emphasizing metric-aligned optimization. The combination of end-to-end learnable Sinc-filterbanks and computational parsimony positions SincQDR-VAD as a leading candidate for deployment in real-time, low-power, and harsh-acoustic environments, extending the operational capacity of VAD-centric speech technologies.