SG-VAD: Stochastic Gate Voice Activity Detection

Updated 28 January 2026
  • The paper introduces a novel denoising-based VAD approach that employs a stochastic gating mechanism to achieve state-of-the-art performance with only 7.8K parameters.
  • The architecture utilizes time-channel separable convolutions and a trainable binary gating mask to selectively filter spectral–temporal features for speech detection.
  • Empirical results show superior AUC-ROC on AVA-Speech and competitive performance on HAVIC, demonstrating its efficiency in low-resource and edge deployment scenarios.

SG-VAD (Stochastic Gates Based Voice Activity Detection) is a neural architecture for voice activity detection (VAD) that approaches the problem as a denoising and feature selection task. SG-VAD is designed for low-resource scenarios, capable of operating with approximately 7.8K parameters while achieving state-of-the-art or competitive performance relative to larger models on multiple standard datasets. The model employs a trainable binary gating mechanism over spectral–temporal features, suppressing nuisance inputs and yielding efficient, high-accuracy speech/non-speech discrimination (Svirsky et al., 2022).

1. Reformulation of Voice Activity Detection

Standard VAD requires mapping an input audio segment $x \in \mathbb{R}^{T \times F}$ (where $T$ is the number of time frames and $F$ is the number of spectral features) to a binary speech/non-speech label per frame or for the whole segment. The SG-VAD framework reframes this as a denoising process: it models $x$ as the union of speech-informative components $s$ and nuisance features $n$. Central to the approach is learning a binary gating mask $z \in \{0,1\}^{T \times C}$ that selects time–channel positions relevant to speech. The SG-VAD network predicts $z$, zeros out the masked features in $x$, and forwards the filtered representation to a downstream classifier. A sparsity penalty on $z$ enforces denoising, as non-speech frames are suppressed via gating.
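The masking view above can be illustrated with a toy sketch. The shapes here ($T = 4$ frames, $F = 3$ features) are illustrative assumptions; the actual model operates on $T \times 32$ MFCC sequences.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))     # toy spectral-temporal features
z = np.array([[1, 0, 1],
              [0, 0, 0],            # a fully gated (non-speech) frame
              [1, 1, 0],
              [0, 1, 1]])           # binary gate mask z in {0,1}^{T x F}

x_gated = x * z                     # elementwise gating: masked positions zeroed
# x_gated[1] is all zeros: the nuisance frame is fully suppressed
```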

2. Network Architecture

SG-VAD comprises two modules:

2.1. SG-VAD Feature-Selector Network

This low-parameter core network ingests a 32-dimensional MFCC feature sequence $x \in \mathbb{R}^{T \times 32}$. The architecture features:

  • Block 1: 1D time-channel separable convolution (kernel size $3 \times 1$), batch normalization (BN), Tanh activation.
  • Block 2: Two residual separable convolutional layers ($5 \times 1$ and $3 \times 1$), each with BN and Tanh.
  • Block 3: $1 \times 1$ pointwise convolution generating $\mu \in \mathbb{R}^{T \times 32}$.
  • Stochastic gating layer: applies a hard threshold with Gaussian noise to produce binary $z \in \{0,1\}^{T \times 32}$.
  • The overall VAD score is computed as:

$$\hat{y}_{\text{vad}}(x) = \frac{1}{T} \sum_{i=1}^{T} \sum_{j=1}^{32} z_{i,j}$$

2.2. Auxiliary Classifier (Training Only)

During training, the gated features $x_{\text{gated}} = x \odot z$ are processed by a MarbleNet-based classifier (three separable convolutional blocks, 64 channels), outputting a $K = 36$-class softmax (35 spoken commands + background).

Table 1: SG-VAD Model Blocks

| Block | Operation | Output shape |
|---|---|---|
| Input | 32-dim MFCC sequence | $(T, 32)$ |
| Block 1 | 1D separable conv, BN, Tanh | $(T, 32)$ |
| Block 2 | Residual conv ($5 \times 1$, $3 \times 1$), BN, Tanh | $(T, 32)$ |
| Block 3 | $1 \times 1$ pointwise conv $\rightarrow \mu$ | $(T, 32)$ |
| Gating | Stochastic hard threshold, Gaussian noise | $z \in \{0,1\}^{T \times 32}$ |

3. Stochastic Gate Mechanism

The gating mechanism is based on the STG (Stochastic Gates) Gaussian relaxation of Bernoulli variables introduced by Yamada et al. (2020). For each gate $(i, j)$:

$$\epsilon_{i,j} \sim \mathcal{N}(0, \sigma^2), \quad \sigma = 0.5$$

$$z_{i,j} = \max\left(0,\, \min\left(1,\, 0.5 + \mu_{i,j} + \epsilon_{i,j}\right)\right)$$

This hard-thresholded (clipped) stochastic transformation implements a differentiable relaxation of sampling from a Bernoulli distribution. During training it provides regularization and enables backpropagation via the reparameterization trick. At inference, $\sigma^2 \rightarrow 0$ (the noise term is dropped), and $z_{i,j} = \mathbf{1}[\mu_{i,j} > 0.5]$ is used.
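A minimal sketch of the two gate modes, following the equations above (the toy $\mu$ values are assumptions):

```python
import numpy as np

def gate_train(mu: np.ndarray, rng, sigma: float = 0.5) -> np.ndarray:
    """Training-time relaxed gate: clip(0.5 + mu + eps, 0, 1), eps ~ N(0, sigma^2)."""
    eps = rng.normal(0.0, sigma, size=mu.shape)
    return np.clip(0.5 + mu + eps, 0.0, 1.0)

def gate_infer(mu: np.ndarray) -> np.ndarray:
    """Inference-time gate: drop the noise and hard-threshold at mu > 0.5."""
    return (mu > 0.5).astype(float)

rng = np.random.default_rng(0)
mu = np.array([[0.9, -0.3],
               [0.1,  0.7]])
z_soft = gate_train(mu, rng)   # relaxed values in [0, 1], differentiable in mu
z_hard = gate_infer(mu)        # binary: [[1., 0.], [0., 1.]]
```

The clipping keeps the relaxed gate in $[0, 1]$ while remaining differentiable with respect to $\mu$ almost everywhere, which is what makes end-to-end training possible.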

4. Loss Functions and Supervision

SG-VAD is trained end-to-end with supervision to both the auxiliary classifier and the gates, using:

  • Cross-entropy loss $L_{\text{ce}}$ on the classifier's softmax output, applied to all segments:

$$L_{\text{ce}}(x_i) = -\sum_{k=0}^{35} \mathbf{1}\{y_i = k\} \log \hat{p}_i(k)$$

  • Gate-sparsity ($\ell_0$) loss $L_{\text{sg}}$, applied only to background segments ($y_i = 0$); for binary $z$ the sum of gates equals the $\ell_0$ norm:

$$L_{\text{sg}}(x_i) = \sum_{t=1}^{T} \sum_{j=1}^{32} z_{t,j} = \|z\|_0$$

  • Total loss:

$$L(x_i) = \begin{cases} L_{\text{ce}}(x_i) + \lambda L_{\text{sg}}(x_i), & \text{if } y_i = 0 \\ L_{\text{ce}}(x_i), & \text{otherwise} \end{cases}$$

with $\lambda = 1$ in experiments.

This regime forces the network to close all gates on background (non-speech) segments, while allowing selective opening guided by cross-entropy during speech.
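The supervision regime above can be sketched as follows. The toy softmax output and 3-class setup are assumptions for illustration (the actual classifier is 36-way):

```python
import numpy as np

def total_loss(p_hat: np.ndarray, y: int, z: np.ndarray, lam: float = 1.0) -> float:
    """Cross-entropy plus gate sparsity; the sparsity term (||z||_0 for binary z)
    is added only for background segments (y == 0)."""
    ce = float(-np.log(p_hat[y]))
    if y == 0:
        return ce + lam * float(z.sum())
    return ce

p_hat = np.array([0.5, 0.25, 0.25])   # toy softmax output
z = np.ones((2, 2))                   # 4 open gates

loss_bg = total_loss(p_hat, 0, z)     # -log(0.5) + 4: sparsity penalizes open gates
loss_sp = total_loss(p_hat, 1, z)     # -log(0.25): no sparsity term on speech
```

On background segments, any open gate adds directly to the loss, so the network is pushed to close all gates; on speech segments only the cross-entropy guides which gates open.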

5. Training Protocol and Regularization

SG-VAD is trained using the Google Speech Commands V2 dataset (~23 hours, 35 classes) combined with FS2K background noise clips. Audio augmentation includes time shift ($\pm 5$ ms), additive white noise ($-90$ to $-46$ dB, $p = 0.8$), SpecAugment (time/frequency masking), and SpecCutout (rectangular masks). The optimizer is SGD (momentum 0.9, weight decay $10^{-3}$), batch size is 128, and the training schedule incorporates warmup, hold, and polynomial decay of the learning rate. Training lasts 150 epochs on a single GPU.
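A warmup–hold–polynomial-decay schedule of this kind can be sketched as below; the specific fractions, base rate, and decay power here are illustrative assumptions, not the paper's values.

```python
def lr_schedule(step: int, total_steps: int, base_lr: float = 0.01,
                warmup_frac: float = 0.1, hold_frac: float = 0.1,
                power: float = 2.0) -> float:
    """Linear warmup, constant hold, then polynomial decay to zero."""
    warm_steps = int(warmup_frac * total_steps)
    hold_steps = int(hold_frac * total_steps)
    if step < warm_steps:                       # linear warmup
        return base_lr * step / max(1, warm_steps)
    if step < warm_steps + hold_steps:          # hold at base_lr
        return base_lr
    decay_steps = max(1, total_steps - warm_steps - hold_steps)
    frac = (step - warm_steps - hold_steps) / decay_steps
    return base_lr * (1.0 - frac) ** power      # polynomial decay

# Example: 100-step run -> ramps to 0.01 by step 10, holds to step 20, decays to 0
```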

6. Empirical Performance and Ablation Analysis

Performance is assessed on AVA-Speech and HAVIC datasets using AUC-ROC as the primary metric. Comparisons with other compact and large-footprint VAD models reveal that SG-VAD achieves state-of-the-art AUC-ROC (94.3) on AVA-Speech with 7.8K parameters, outperforming ResectNet architectures trained on 20× more data. On HAVIC, SG-VAD delivers competitive AUC-ROC (83.3), rivaling larger models.

Table 2: SG-VAD and Baseline Model Performance

| Model | Params | AVA-Speech AUC | HAVIC AUC |
|---|---|---|---|
| ResectNet-0.5× | 4.5K | 88.6 | 83.5 |
| ResectNet-1.0× | 11.1K | 90.0 | 84.9 |
| Braun et al. (2021) | 1.8M | 92.4 | 86.8 |
| MarbleNet | 88K | 85.8 | 80.4 |
| SG-VAD | 7.8K | 94.3 | 83.3 |

Ablation studies confirm that proper loss scheduling—applying gate sparsity only on background—is essential. Misapplied sparsity or omitting it yields markedly reduced AUC (<63) despite high GSCV2 validation accuracy, indicating the necessity of SG-VAD’s regularization and training strategy for generalization.

7. Practical Considerations and Extensions

SG-VAD’s 7.8K parameter inference network is suited for edge/IoT deployment scenarios. Segment-level detection makes post-processing unnecessary. The stochastic gate paradigm is extensible: alternative relaxations such as Concrete or Gumbel-Softmax, or learning per-gate variance σ\sigma, are viable. Gating can be adapted upstream in other ASR or sound event detection pipelines for denoising feature representations. Open-source code and pretrained models are provided at https://github.com/jsvir/vad (Svirsky et al., 2022).

A plausible implication is that the stochastic gating approach generalizes beyond VAD, offering a lightweight, easily integrated denoising mechanism for other audio or time-frequency tasks where nuisance feature suppression is beneficial.
