SG-VAD: Stochastic Gate Voice Activity Detection

Updated 28 January 2026
  • The paper introduces a novel denoising-based VAD approach that employs a stochastic gating mechanism to achieve state-of-the-art performance with only 7.8K parameters.
  • The architecture utilizes time-channel separable convolutions and a trainable binary gating mask to selectively filter spectral–temporal features for speech detection.
  • Empirical results show superior AUC-ROC on AVA-Speech and competitive performance on HAVIC, demonstrating its efficiency in low-resource and edge deployment scenarios.

SG-VAD (Stochastic Gates Based Voice Activity Detection) is a neural architecture for voice activity detection (VAD) that approaches the problem as a denoising and feature selection task. SG-VAD is designed for low-resource scenarios, capable of operating with approximately 7.8K parameters while achieving state-of-the-art or competitive performance relative to larger models on multiple standard datasets. The model employs a trainable binary gating mechanism over spectral–temporal features, suppressing nuisance inputs and yielding efficient, high-accuracy speech/non-speech discrimination (Svirsky et al., 2022).

1. Reformulation of Voice Activity Detection

Standard VAD requires mapping an input audio segment $x \in \mathbb{R}^{T \times F}$ (where $T$ is the number of time frames and $F$ is the number of spectral features) to a binary speech/non-speech label per frame or for the whole segment. The SG-VAD framework reframes this as a denoising process: it models $x$ as the union of speech-informative components $s$ and nuisance features $n$. Central to the approach is learning a binary gating mask $z \in \{0,1\}^{T \times C}$ that selects time–channel positions relevant to speech. The SG-VAD network predicts $z$, zeros out the masked features in $x$, and forwards the filtered representation to a downstream classifier. A sparsity penalty on $z$ enforces denoising, as non-speech frames are suppressed via gating.
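The masking view above can be illustrated with a toy sketch. The shapes here ($T = 4$ frames, $F = 3$ features) are illustrative assumptions; the actual model operates on $T \times 32$ MFCC sequences.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))     # toy spectral-temporal features
z = np.array([[1, 0, 1],
              [0, 0, 0],            # a fully gated (non-speech) frame
              [1, 1, 0],
              [0, 1, 1]])           # binary gate mask z in {0,1}^{T x F}

x_gated = x * z                     # elementwise gating: masked positions zeroed
# x_gated[1] is all zeros: the nuisance frame is fully suppressed
```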

2. Network Architecture

SG-VAD comprises two modules:

2.1. SG-VAD Feature-Selector Network

This low-parameter core network ingests a 32-dimensional MFCC feature sequence $x \in \mathbb{R}^{T \times 32}$. The architecture features:

  • Block 1: 1D time-channel separable convolution (kernel size $3 \times 1$), batch normalization (BN), Tanh activation.
  • Block 2: Two residual separable convolutional layers ($5 \times 1$ and $3 \times 1$), each with BN and Tanh.
  • Block 3: $1 \times 1$ pointwise convolution generating $\mu \in \mathbb{R}^{T \times 32}$.
  • Stochastic gating layer: applies a hard threshold with Gaussian noise to produce binary $z \in \{0,1\}^{T \times 32}$.
  • The overall VAD score is computed as:

$$\hat{y}_{\text{vad}}(x) = \frac{1}{T} \sum_{i=1}^{T} \sum_{j=1}^{32} z_{i,j}$$

2.2. Auxiliary Classifier (Training Only)

During training, the gated features $x_{\text{gated}} = x \odot z$ are processed by a MarbleNet-based classifier (three separable convolutional blocks, 64 channels), outputting a $K = 36$-class softmax (35 spoken commands + background).

Table 1: SG-VAD Model Blocks

| Block | Operation | Output shape |
|---|---|---|
| Input | 32-dim MFCC sequence | $(T, 32)$ |
| Block 1 | 1D separable conv, BN, Tanh | $(T, 32)$ |
| Block 2 | Residual conv ($5 \times 1$, $3 \times 1$), BN, Tanh | $(T, 32)$ |
| Block 3 | $1 \times 1$ pointwise conv $\rightarrow \mu$ | $(T, 32)$ |
| Gating | Stochastic hard threshold, Gaussian noise | $z \in \{0,1\}^{T \times 32}$ |

3. Stochastic Gate Mechanism

The gating mechanism is based on the STG (Stochastic Gates) Gaussian relaxation of Bernoulli variables introduced by Yamada et al. (2020). For each gate $(i, j)$:

$$\epsilon_{i,j} \sim \mathcal{N}(0, \sigma^2), \quad \sigma = 0.5$$

$$z_{i,j} = \max\left(0,\, \min\left(1,\, 0.5 + \mu_{i,j} + \epsilon_{i,j}\right)\right)$$

This hard-thresholded (clipped) stochastic transformation implements a differentiable relaxation of sampling from a Bernoulli distribution. During training it provides regularization and enables backpropagation via the reparameterization trick. At inference, $\sigma^2 \rightarrow 0$ (the noise term is dropped), and $z_{i,j} = \mathbf{1}[\mu_{i,j} > 0.5]$ is used.
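A minimal sketch of the two gate modes, following the equations above (the toy $\mu$ values are assumptions):

```python
import numpy as np

def gate_train(mu: np.ndarray, rng, sigma: float = 0.5) -> np.ndarray:
    """Training-time relaxed gate: clip(0.5 + mu + eps, 0, 1), eps ~ N(0, sigma^2)."""
    eps = rng.normal(0.0, sigma, size=mu.shape)
    return np.clip(0.5 + mu + eps, 0.0, 1.0)

def gate_infer(mu: np.ndarray) -> np.ndarray:
    """Inference-time gate: drop the noise and hard-threshold at mu > 0.5."""
    return (mu > 0.5).astype(float)

rng = np.random.default_rng(0)
mu = np.array([[0.9, -0.3],
               [0.1,  0.7]])
z_soft = gate_train(mu, rng)   # relaxed values in [0, 1], differentiable in mu
z_hard = gate_infer(mu)        # binary: [[1., 0.], [0., 1.]]
```

The clipping keeps the relaxed gate in $[0, 1]$ while remaining differentiable with respect to $\mu$ almost everywhere, which is what makes end-to-end training possible.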

4. Loss Functions and Supervision

SG-VAD is trained end-to-end with supervision to both the auxiliary classifier and the gates, using:

  • Cross-entropy loss $L_{\text{ce}}$ on the classifier's softmax output, applied to all segments:

$$L_{\text{ce}}(x_i) = -\sum_{k=0}^{35} \mathbf{1}\{y_i = k\} \log \hat{p}_i(k)$$

  • Gate-sparsity ($\ell_0$) loss $L_{\text{sg}}$, applied only to background segments ($y_i = 0$); for binary $z$ the sum of gates equals the $\ell_0$ norm:

$$L_{\text{sg}}(x_i) = \sum_{t=1}^{T} \sum_{j=1}^{32} z_{t,j} = \|z\|_0$$

  • Total loss:

$$L(x_i) = \begin{cases} L_{\text{ce}}(x_i) + \lambda L_{\text{sg}}(x_i), & \text{if } y_i = 0 \\ L_{\text{ce}}(x_i), & \text{otherwise} \end{cases}$$

with $\lambda = 1$ in experiments.

This regime forces the network to close all gates on background (non-speech) segments, while allowing selective opening guided by cross-entropy during speech.
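The supervision regime above can be sketched as follows. The toy softmax output and 3-class setup are assumptions for illustration (the actual classifier is 36-way):

```python
import numpy as np

def total_loss(p_hat: np.ndarray, y: int, z: np.ndarray, lam: float = 1.0) -> float:
    """Cross-entropy plus gate sparsity; the sparsity term (||z||_0 for binary z)
    is added only for background segments (y == 0)."""
    ce = float(-np.log(p_hat[y]))
    if y == 0:
        return ce + lam * float(z.sum())
    return ce

p_hat = np.array([0.5, 0.25, 0.25])   # toy softmax output
z = np.ones((2, 2))                   # 4 open gates

loss_bg = total_loss(p_hat, 0, z)     # -log(0.5) + 4: sparsity penalizes open gates
loss_sp = total_loss(p_hat, 1, z)     # -log(0.25): no sparsity term on speech
```

On background segments, any open gate adds directly to the loss, so the network is pushed to close all gates; on speech segments only the cross-entropy guides which gates open.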

5. Training Protocol and Regularization

SG-VAD is trained using the Google Speech Commands V2 dataset (~23 hours, 35 classes) combined with FS2K background noise clips. Audio augmentation includes time shift ($\pm 5$ ms), additive white noise ($-90$ to $-46$ dB, $p = 0.8$), SpecAugment (time/frequency masking), and SpecCutout (rectangular masks). The optimizer is SGD (momentum 0.9, weight decay $10^{-3}$), batch size is 128, and the training schedule incorporates warmup, hold, and polynomial decay of the learning rate. Training lasts 150 epochs on a single GPU.
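A warmup–hold–polynomial-decay schedule of this kind can be sketched as below; the specific fractions, base rate, and decay power here are illustrative assumptions, not the paper's values.

```python
def lr_schedule(step: int, total_steps: int, base_lr: float = 0.01,
                warmup_frac: float = 0.1, hold_frac: float = 0.1,
                power: float = 2.0) -> float:
    """Linear warmup, constant hold, then polynomial decay to zero."""
    warm_steps = int(warmup_frac * total_steps)
    hold_steps = int(hold_frac * total_steps)
    if step < warm_steps:                       # linear warmup
        return base_lr * step / max(1, warm_steps)
    if step < warm_steps + hold_steps:          # hold at base_lr
        return base_lr
    decay_steps = max(1, total_steps - warm_steps - hold_steps)
    frac = (step - warm_steps - hold_steps) / decay_steps
    return base_lr * (1.0 - frac) ** power      # polynomial decay

# Example: 100-step run -> ramps to 0.01 by step 10, holds to step 20, decays to 0
```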

6. Empirical Performance and Ablation Analysis

Performance is assessed on AVA-Speech and HAVIC datasets using AUC-ROC as the primary metric. Comparisons with other compact and large-footprint VAD models reveal that SG-VAD achieves state-of-the-art AUC-ROC (94.3) on AVA-Speech with 7.8K parameters, outperforming ResectNet architectures trained on 20× more data. On HAVIC, SG-VAD delivers competitive AUC-ROC (83.3), rivaling larger models.

Table 2: SG-VAD and Baseline Model Performance

| Model | Params | AVA-Speech AUC | HAVIC AUC |
|---|---|---|---|
| ResectNet-0.5× | 4.5K | 88.6 | 83.5 |
| ResectNet-1.0× | 11.1K | 90.0 | 84.9 |
| Braun et al. (2021) | 1.8M | 92.4 | 86.8 |
| MarbleNet | 88K | 85.8 | 80.4 |
| SG-VAD | 7.8K | 94.3 | 83.3 |

Ablation studies confirm that proper loss scheduling—applying gate sparsity only on background—is essential. Misapplied sparsity or omitting it yields markedly reduced AUC (<63) despite high GSCV2 validation accuracy, indicating the necessity of SG-VAD’s regularization and training strategy for generalization.

7. Practical Considerations and Extensions

SG-VAD’s 7.8K parameter inference network is suited for edge/IoT deployment scenarios. Segment-level detection makes post-processing unnecessary. The stochastic gate paradigm is extensible: alternative relaxations such as Concrete or Gumbel-Softmax, or learning per-gate variance σ\sigma, are viable. Gating can be adapted upstream in other ASR or sound event detection pipelines for denoising feature representations. Open-source code and pretrained models are provided at https://github.com/jsvir/vad (Svirsky et al., 2022).

A plausible implication is that the stochastic gating approach generalizes beyond VAD, offering a lightweight, easily integrated denoising mechanism for other audio or time-frequency tasks where nuisance feature suppression is beneficial.
