SG-VAD: Stochastic Gate Voice Activity Detection
- The paper introduces a novel denoising-based VAD approach that employs a stochastic gating mechanism to achieve state-of-the-art performance with only 7.8K parameters.
- The architecture utilizes time-channel separable convolutions and a trainable binary gating mask to selectively filter spectral–temporal features for speech detection.
- Empirical results show superior AUC-ROC on AVA-Speech and competitive performance on HAVIC, demonstrating its efficiency in low-resource and edge deployment scenarios.
SG-VAD (Stochastic Gates Based Voice Activity Detection) is a neural architecture for voice activity detection (VAD) that approaches the problem as a denoising and feature selection task. SG-VAD is designed for low-resource scenarios, capable of operating with approximately 7.8K parameters while achieving state-of-the-art or competitive performance relative to larger models on multiple standard datasets. The model employs a trainable binary gating mechanism over spectral–temporal features, suppressing nuisance inputs and yielding efficient, high-accuracy speech/non-speech discrimination (Svirsky et al., 2022).
1. Reformulation of Voice Activity Detection
Standard VAD requires mapping an input audio segment $X \in \mathbb{R}^{T \times F}$ (where $T$ is the number of time frames and $F$ is the number of spectral features) to a binary speech/non-speech label per frame or for the whole segment. The SG-VAD framework reframes this as a denoising process: it models $X$ as the union of speech-informative components $X_s$ and nuisance features $X_n$. Central to the approach is learning a binary gating mask $M \in \{0,1\}^{T \times F}$ that selects time–channel positions relevant to speech. The SG-VAD network predicts $M$, zeros out the masked features in $X$, and forwards the filtered representation $M \odot X$ to a downstream classifier. A sparsity penalty on $M$ enforces denoising, as non-speech frames are suppressed via gating.
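The gating step itself is plain element-wise masking; a minimal numpy sketch (the symbols $X$ and $M$ follow the formulation above, but the toy values and shapes are illustrative):

```python
import numpy as np

# Toy spectral-temporal input: T=4 frames, F=3 features.
X = np.array([[0.9, 0.1, 0.5],
              [0.8, 0.2, 0.4],
              [0.0, 0.0, 0.1],   # mostly background
              [0.1, 0.0, 0.0]])  # mostly background

# Binary gate mask predicted by the network (1 = keep, 0 = suppress).
M = np.array([[1, 0, 1],
              [1, 0, 1],
              [0, 0, 0],
              [0, 0, 0]])

X_filtered = M * X           # element-wise gating (M ⊙ X)
sparsity = M.mean()          # fraction of open gates, penalized on background
print(X_filtered)
print(sparsity)
```

The sparsity penalty drives the mask toward all-zero rows on non-speech frames, which is exactly the denoising behavior described above.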
2. Network Architecture
SG-VAD comprises two modules:
2.1. SG-VAD Feature-Selector Network
This low-parameter core network ingests a 32-dimensional MFCC feature sequence $X \in \mathbb{R}^{32 \times T}$. The architecture features:
- Block 1: 1D time-channel separable convolution, batch normalization (BN), Tanh activation.
- Block 2: Two residual separable convolutional layers (kernel sizes $5 \times 1$ and $3 \times 1$), each with BN and Tanh.
- Block 3: $1 \times 1$ pointwise convolution generating the gate parameters $\mu$.
- Stochastic gating layer: applies a hard threshold with Gaussian noise to $\mu$, producing the binary mask $M$.
- The overall VAD score for frame $t$ is computed by aggregating the gate values over channels, e.g. $s_t = \frac{1}{F}\sum_{f} M_{t,f}$, thresholded to a binary speech/non-speech decision.
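The time-channel separable convolution is what keeps the parameter count so low: a per-channel temporal (depthwise) filter followed by a $1 \times 1$ channel-mixing (pointwise) step. A minimal numpy sketch of the idea (the function name, random weights, and kernel width are illustrative, not the paper's exact layer):

```python
import numpy as np

def tc_separable_conv(x, depth_k, point_w):
    """Time-channel separable 1D convolution (sketch).

    x:        (C, T) feature map
    depth_k:  (C, K) one temporal kernel per channel (depthwise step)
    point_w:  (C_out, C) 1x1 kernel mixing channels (pointwise step)
    """
    C, T = x.shape
    K = depth_k.shape[1]
    pad = K // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    # Depthwise: convolve each channel with its own temporal kernel.
    depth = np.stack([np.convolve(xp[c], depth_k[c], mode="valid")
                      for c in range(C)])[:, :T]
    # Pointwise: 1x1 convolution = matrix multiply across channels.
    return point_w @ depth

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 100))          # 32 MFCC channels, 100 frames
y = tc_separable_conv(x, rng.standard_normal((32, 5)),
                      rng.standard_normal((32, 32)))

# Parameter comparison for C_in = C_out = 32, K = 5:
sep_params  = 32 * 5 + 32 * 32              # depthwise + pointwise
full_params = 32 * 32 * 5                   # standard 1D convolution
print(y.shape, sep_params, full_params)
```

With these (hypothetical) widths, the separable layer needs 1,184 weights versus 5,120 for a standard convolution, which is how a full network can stay under 10K parameters.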
2.2. Auxiliary Classifier (Training Only)
During training, the gated features $M \odot X$ are processed by a MarbleNet-based classifier (three separable convolutional blocks, 64 channels), outputting a softmax over 36 classes (35 spoken commands + background).
Table 1: SG-VAD Model Blocks
| Block | Operation |
|---|---|
| Input | 32-dim MFCC sequence |
| Block 1 | 1D separable conv, BN, Tanh |
| Block 2 | Residual conv (5×1, 3×1), BN, Tanh |
| Block 3 | 1×1 pointwise conv |
| Gating | Stochastic hard-threshold, Gaussian noise |
3. Stochastic Gate Mechanism
The gating mechanism is based on the STG (Stochastic Gates) Gaussian relaxation of Bernoulli variables introduced by Yamada et al. (2020). For each gate $i$,

$$z_i = \max(0, \min(1, \mu_i + \epsilon_i)), \qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2),$$

where $\mu_i$ is the learned gate parameter and $\sigma$ is fixed during training.
This hard-thresholded (clipped) stochastic transformation implements a differentiable relaxation of sampling from a Bernoulli distribution. It is used in training for regularization and to enable backpropagation via the reparameterization trick. At inference, the noise term is dropped ($\epsilon_i = 0$) and $z_i = \max(0, \min(1, \mu_i))$ is used.
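A minimal numpy sketch of the gate (the function name and the $\sigma$ value are illustrative):

```python
import numpy as np

def stochastic_gate(mu, sigma=0.5, training=True, rng=None):
    """Hard-thresholded Gaussian relaxation of Bernoulli gates (Yamada et al., 2020).

    mu:    learned gate parameters, any shape
    sigma: fixed noise standard deviation used during training
    """
    eps = 0.0
    if training:
        rng = rng or np.random.default_rng()
        eps = rng.normal(0.0, sigma, size=np.shape(mu))
    # Clip mu + noise into [0, 1]; gradients w.r.t. mu pass through
    # wherever the sum lies inside (0, 1), which makes the gate trainable.
    return np.clip(mu + eps, 0.0, 1.0)

mu = np.array([-0.4, 0.3, 0.7, 1.5])
print(stochastic_gate(mu, training=False))   # deterministic inference gates
```

At inference the gates are deterministic: parameters pushed below 0 give hard zeros (feature suppressed), those above 1 give hard ones (feature kept).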
4. Loss Functions and Supervision
SG-VAD is trained end-to-end with supervision to both the auxiliary classifier and the gates, using:
- Cross-entropy loss on the classifier's softmax output $\hat{y}$ for all segments: $\mathcal{L}_{CE} = -\sum_{c} y_c \log \hat{y}_c$.
- Gate-sparsity loss applied only to background segments, penalizing the expected number of open gates: $\mathcal{L}_{g} = \sum_{i} \Phi(\mu_i / \sigma)$, where $\Phi$ is the standard Gaussian CDF.
- Total loss: $\mathcal{L} = \mathcal{L}_{CE} + \lambda \mathcal{L}_{g}$, with $\lambda$ fixed in experiments.
This regime forces the network to close all gates on background (non-speech) segments, while allowing selective opening guided by cross-entropy during speech.
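Assuming the sparsity term is the standard STG regularizer $\sum_i \Phi(\mu_i/\sigma)$ (the expected number of open gates), the combined loss can be sketched as follows (function names, $\lambda$, $\sigma$, and the toy values are illustrative):

```python
import numpy as np
from math import erf, sqrt

def gate_sparsity(mu, sigma=0.5):
    """Expected number of open gates: sum_i Phi(mu_i / sigma)."""
    phi = lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0)))   # standard Gaussian CDF
    return sum(phi(m / sigma) for m in np.ravel(mu))

def total_loss(probs, target, mu, is_background, lam=1.0, sigma=0.5):
    """Cross-entropy on all segments; gate sparsity only on background ones."""
    ce = -np.log(probs[target] + 1e-12)
    reg = gate_sparsity(mu, sigma) if is_background else 0.0
    return ce + lam * reg

# Background segment: the sparsity term pushes all gate parameters
# negative, i.e. closes the gates.
probs = np.array([0.1, 0.9])          # toy classifier softmax, 2 classes
mu = np.array([-2.0, -2.0, 1.0])
print(total_loss(probs, target=1, mu=mu, is_background=True))
```

On speech segments `is_background` is false, so only the cross-entropy gradient reaches the gates, which is what allows them to open selectively.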
5. Training Protocol and Regularization
SG-VAD is trained using the Google Speech Commands V2 dataset (∼23 hours, 35 classes) combined with FS2K background noise clips. Audio augmentation includes random time-shift, additive white noise (–90 to –46 dB), SpecAugment (time/frequency masking), and SpecCutout (rectangular masks). The optimizer is SGD (momentum = 0.9) with weight decay, the batch size is 128, and the learning-rate schedule incorporates warmup, hold, and polynomial decay. Training lasts 150 epochs on a single GPU.
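The warmup–hold–polynomial-decay schedule can be sketched as follows (the warmup/hold fractions, decay power, and base learning rate are illustrative, not the paper's values):

```python
def lr_schedule(step, total_steps, base_lr=0.01, warmup=0.1, hold=0.1, power=2.0):
    """Warmup -> hold -> polynomial-decay learning-rate schedule (sketch)."""
    warm_steps = int(warmup * total_steps)
    hold_steps = int(hold * total_steps)
    if step < warm_steps:                      # linear warmup from 0
        return base_lr * step / max(1, warm_steps)
    if step < warm_steps + hold_steps:         # hold at base_lr
        return base_lr
    # polynomial decay to zero over the remaining steps
    remain = total_steps - warm_steps - hold_steps
    frac = (step - warm_steps - hold_steps) / max(1, remain)
    return base_lr * (1.0 - frac) ** power

lrs = [lr_schedule(s, 1000) for s in range(1000)]
print(lrs[0], lrs[150], lrs[999])
```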
6. Empirical Performance and Ablation Analysis
Performance is assessed on AVA-Speech and HAVIC datasets using AUC-ROC as the primary metric. Comparisons with other compact and large-footprint VAD models reveal that SG-VAD achieves state-of-the-art AUC-ROC (94.3) on AVA-Speech with 7.8K parameters, outperforming ResectNet architectures trained on 20× more data. On HAVIC, SG-VAD delivers competitive AUC-ROC (83.3), rivaling larger models.
Table 2: SG-VAD and Baseline Model Performance
| Model | Params | AVA-Speech AUC | HAVIC AUC |
|---|---|---|---|
| ResectNet-0.5× | 4.5K | 88.6 | 83.5 |
| ResectNet-1.0× | 11.1K | 90.0 | 84.9 |
| Braun et al. (2021) | 1.8M | 92.4 | 86.8 |
| MarbleNet | 88K | 85.8 | 80.4 |
| SG-VAD | 7.8K | 94.3 | 83.3 |
Ablation studies confirm that proper loss scheduling—applying gate sparsity only on background—is essential. Misapplied sparsity or omitting it yields markedly reduced AUC (<63) despite high GSCV2 validation accuracy, indicating the necessity of SG-VAD’s regularization and training strategy for generalization.
7. Practical Considerations and Extensions
SG-VAD’s 7.8K parameter inference network is suited for edge/IoT deployment scenarios. Segment-level detection makes post-processing unnecessary. The stochastic gate paradigm is extensible: alternative relaxations such as Concrete or Gumbel-Softmax, or learning per-gate variance , are viable. Gating can be adapted upstream in other ASR or sound event detection pipelines for denoising feature representations. Open-source code and pretrained models are provided at https://github.com/jsvir/vad (Svirsky et al., 2022).
A plausible implication is that the stochastic gating approach generalizes beyond VAD, offering a lightweight, easily integrated denoising mechanism for other audio or time-frequency tasks where nuisance feature suppression is beneficial.