Speech Commands Dataset

Updated 16 February 2026
  • Speech Commands Dataset is a public audio corpus with 35 spoken-word classes designed for benchmarking limited-vocabulary ASR and keyword spotting systems.
  • It includes standardized training, validation, and testing splits to ensure reproducible, fair comparisons across diverse speech models.
  • Evaluation methodologies feature top-one accuracy and streaming metrics, with a CNN baseline achieving 88.2% accuracy on the v2 test set.

The Speech Commands Dataset is a large-scale, publicly available collection of short spoken-word utterances specifically designed to benchmark limited-vocabulary speech recognition and keyword spotting systems. It provides a standardized audio corpus with well-defined splits, quality control, and evaluation methodologies, establishing itself as the primary reference benchmark for isolated-keyword recognition in noisy, speaker-diverse conditions (Warden, 2018).

1. Dataset Composition and Structure

Speech Commands is available in multiple versions, with version 2 (v2) being the canonical release for most research. The v2 dataset consists of 35 distinct word classes:

  • Core command words (20): “zero”–“nine”, “yes”, “no”, “up”, “down”, “left”, “right”, “on”, “off”, “stop”, “go”
  • v2 additions (5): “backward”, “forward”, “follow”, “learn”, “visual”
  • Auxiliary words (10): “bed”, “bird”, “cat”, “dog”, “happy”, “house”, “marvin”, “sheila”, “tree”, “wow” (used to populate the “unknown” class)
  • Silence class: Constructed from short background noise clips

The corpus contains 105 829 utterances (v2), with each word typically represented by between ~1 600 and ~4 000 clips (see Table 1 of (Warden, 2018) for per-class counts). Each audio sample is a single-channel, 16 kHz, 16-bit linear PCM WAV of nominal duration 1 s. All files are normalized and trimmed to the loudest 1 s window using a custom tool to maximize foreground speech energy.

The total uncompressed size is approximately 3.8 GB (~2.7 GB gzipped).
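
The per-clip format described above (mono, 16 kHz, 16-bit linear PCM) can be verified with Python's standard `wave` module. This is a minimal sketch; the example path is hypothetical, so the demonstration below runs on a synthetic 1 s silent clip in the same format:

```python
import io
import wave

def check_clip(f):
    """Verify a clip matches the documented Speech Commands format:
    single-channel, 16 kHz, 16-bit linear PCM. Returns duration in seconds."""
    with wave.open(f, "rb") as w:
        assert w.getnchannels() == 1      # mono
        assert w.getframerate() == 16000  # 16 kHz sample rate
        assert w.getsampwidth() == 2      # 16-bit samples
        return w.getnframes() / w.getframerate()

# Build a synthetic 1 s silent clip in memory; a real argument would be
# a path such as "speech_commands_v0.02/yes/<file>.wav" (hypothetical name).
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setframerate(16000)
    w.setsampwidth(2)
    w.writeframes(b"\x00\x00" * 16000)  # 16,000 zero samples = 1 s
buf.seek(0)
duration = check_clip(buf)
print(duration)  # 1.0
```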

2. Data Collection and Quality Control

Audio was crowd-sourced via a web application built on the WebAudio API, running on commodity desktop and Android devices. Contributors were shown randomized prompts; within a session, each core word was requested five times and each auxiliary word once (135 prompts per session). Speakers were asked to record in quiet rooms, and no studio hardware was required.
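
One reading consistent with the 135-prompt session figure (an assumption here, not stated explicitly above) is that 25 words are each prompted five times and the 10 auxiliary words once:

```python
# Hypothetical breakdown consistent with the 135-prompt session figure:
# 25 repeated words at five takes each, plus 10 auxiliary words once.
repeated_words = 25
auxiliary_words = 10
prompts_per_session = repeated_words * 5 + auxiliary_words * 1
print(prompts_per_session)  # 135
```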

Quality control pipeline:

  1. Automatic file-size screening: Remove OGG uploads <5 kB.
  2. Volume filtering: Discard clips whose mean absolute normalized amplitude $v = \frac{1}{N}\sum_{i=1}^{N} |x_i|$ falls below $0.004$, where the samples $x_i \in [-1, 1]$.
  3. Loudest window extraction: Retain only the 1 s segment with maximum mean absolute amplitude.
  4. Manual transcription (single-pass): Reject utterances when the label does not match the transcript.
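
Steps 2 and 3 above can be sketched in NumPy, assuming clips are already decoded into float arrays normalized to [-1, 1] (the function names are illustrative, not the dataset's actual tooling):

```python
import numpy as np

SR = 16000  # samples per second

def mean_abs_volume(x):
    """Step 2: mean absolute amplitude of a clip normalized to [-1, 1]."""
    return float(np.mean(np.abs(x)))

def loudest_window(x, win=SR):
    """Step 3: the 1 s segment with maximum mean absolute amplitude,
    found with a cumulative sum so each candidate window costs O(1)."""
    if len(x) <= win:
        return x
    c = np.concatenate([[0.0], np.cumsum(np.abs(x))])
    window_sums = c[win:] - c[:-win]   # sum of |x| over each window
    start = int(np.argmax(window_sums))
    return x[start:start + win]

# A clip quieter than the 0.004 threshold would be discarded in step 2:
quiet_clip = np.full(SR, 0.001)
keep = mean_abs_volume(quiet_clip) >= 0.004   # False
```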

A total of 2 618 unique (anonymized) speakers contributed data; the open online collection process broadened demographic and accent diversity. All utterances are English; gender and accent were not explicitly annotated.

3. Data Splits and Usage Protocols

Partitioning is critical for reproducibility and fair benchmarking. The dataset provides canonical filename lists:

  • validation_list.txt: ≈10% of files, for hyperparameter tuning
  • testing_list.txt: ≈10%, held for final evaluation
  • Training: ≈80% of data, all other files

Splits are determined by hashing the speaker ID embedded in each filename, so partition assignments remain stable across releases and experiments, and all utterances from a given speaker fall into a single split. Users are strongly advised to use these lists to ensure comparability of results; no random seeding or custom splitting is necessary for standard benchmarking.
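
The hash-based assignment can be sketched as follows, adapted from the scheme described in Warden (2018); minor details (e.g., the exact modulus arithmetic) may differ from the reference implementation, and the filenames below are hypothetical:

```python
import hashlib
import re

MAX_NUM_WAVS_PER_CLASS = 2 ** 27 - 1  # keeps the hash bucketing stable

def which_set(filename, validation_percentage=10.0, testing_percentage=10.0):
    """Assign a clip to 'training', 'validation', or 'testing' by hashing
    the speaker ID embedded in its filename, so every utterance from one
    speaker lands in the same split."""
    base = filename.rsplit("/", 1)[-1]
    # Everything after "_nohash_" distinguishes repeated utterances by the
    # same speaker and must be ignored when deciding the split.
    speaker_id = re.sub(r"_nohash_.*$", "", base)
    digest = hashlib.sha1(speaker_id.encode("utf-8")).hexdigest()
    percentage = (int(digest, 16) % (MAX_NUM_WAVS_PER_CLASS + 1)) * (
        100.0 / MAX_NUM_WAVS_PER_CLASS)
    if percentage < validation_percentage:
        return "validation"
    if percentage < validation_percentage + testing_percentage:
        return "testing"
    return "training"

# Two recordings by the same speaker always share a split:
split_a = which_set("yes/0a7c2a8d_nohash_0.wav")  # hypothetical filenames
split_b = which_set("no/0a7c2a8d_nohash_4.wav")
```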

All content is licensed under Creative Commons Attribution 4.0 (CC BY 4.0), permitting unrestricted academic and commercial use with attribution.

4. Evaluation Methodologies and Metrics

Recognition performance is standardized along three axes:

  • Top-One (per-utterance) Classification Accuracy:

$$\mathrm{Accuracy} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\{\hat{y}_i = y_i\}$$

where $\hat{y}_i$ and $y_i$ are the predicted and ground-truth labels for utterance $i$.

  • Open-World 12-way Task: At test time, models are evaluated on 12 balanced classes: 10 core commands + 1 “unknown” class + 1 “silence” class, each accounting for ~8.3% of the test set.
  • Streaming Evaluation: For continuous-audio use cases, models are run over a 10-min mixture of background noise and utterances, evaluated with event-matching metrics (matched/wrong/false-positive rates), using a sliding window and suppression rules with a 750 ms default time tolerance. A baseline v2 model yields 49.0% matched rate (46.0% correct), with 0% false positives for default settings.
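
The event-matching idea behind the streaming evaluation can be sketched with a simple greedy matcher; this is an illustrative simplification under the 750 ms tolerance, not the reference scoring tool:

```python
def match_events(detections, ground_truth, tol=0.75):
    """Score streaming detections against ground-truth events.
    Each event is a (time_seconds, label) pair. A detection within `tol`
    seconds of a still-unmatched ground-truth event counts as "matched"
    (and additionally "wrong" if the labels disagree, so
    correct = matched - wrong); detections matching nothing are
    false positives."""
    used = set()
    stats = {"matched": 0, "wrong": 0, "false_positive": 0}
    for t, label in detections:
        candidates = [(abs(t - gt_time), i, gt_label)
                      for i, (gt_time, gt_label) in enumerate(ground_truth)
                      if i not in used and abs(t - gt_time) <= tol]
        if not candidates:
            stats["false_positive"] += 1
            continue
        _, i, gt_label = min(candidates)   # nearest unmatched event
        used.add(i)
        stats["matched"] += 1
        if label != gt_label:
            stats["wrong"] += 1
    return stats

# The 1.2 s detection matches the "yes" at 1.0 s (within 750 ms); the
# 3.5 s detection matches the 3.0 s event but with the wrong label.
events = [(1.0, "yes"), (3.0, "no")]
found = [(1.2, "yes"), (3.5, "go"), (7.0, "stop")]
result = match_events(found, events)
```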

Feature extraction for most baselines employs 32- or 40-dimensional MFCCs (typically a 30 ms window with a 10 ms hop), followed by per-utterance mean-variance normalization. Models may additionally stack frames into spectrogram-like 2D arrays.

5. Baseline Models and Published Results

The canonical baseline is a compact CNN as used in TensorFlow’s tutorial:

  • Architecture: Two convolutional layers (8×8, stride 1-2; 4×4, stride 1-2, 64 channels each), followed by a 256-unit fully connected layer and final softmax over 12 outputs.
  • Training: Adam optimizer, learning rate 0.001, batch size 100, 18 000 steps (≈20–30 epochs).
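
A quick back-of-the-envelope check that 18 000 steps at batch size 100 lands in the stated "≈20–30 epochs" range, assuming ~80% of the 105 829 clips form the training split:

```python
# Sanity check of the training schedule quoted above.
train_examples = int(105_829 * 0.8)   # ≈ 84,663 training clips
examples_seen = 18_000 * 100          # 1,800,000 examples processed
epochs = examples_seen / train_examples
print(round(epochs, 1))  # 21.3
```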

Published accuracy:

| Training set | Test on v1 | Test on v2 |
|--------------|------------|------------|
| v1           | 85.4%      | 82.7%      |
| v2           | 89.7%      | 88.2%      |

The v2-trained baseline achieves 88.2% top-one accuracy on the v2 test set, and 49% event-matched rate on streaming audio with 0% false positives (Warden, 2018).

6. Limitations, Challenges, and Extensions

Primary challenges include:

  • Phonetically similar ("confusable") pairs: e.g., “tree” vs. “three”
  • Noisy/silent input: Increases the risk of false rejections
  • Streaming inference: Demands suppression of repeated triggers and accurate timing

Recommendations for dataset improvements:

  • Expansion of the fixed vocabulary (beyond 35 words)
  • Augmentation with extended background noise (urban, domestic, vehicular)
  • Data augmentation at training: random time-shifts, additive noise at controlled SNRs, pitch/time warping
  • Multilingual collection: Non-English pronunciations, additional languages
  • Addition of speaker metadata (gender, age, accent) for bias/fairness research

This suggests the dataset, while comprehensive within its scope, may require community-driven extensions for transfer learning, cross-lingual benchmarking, fairness auditing, and real-world ambient noise robustness.

7. Significance and Impact

Speech Commands has become the de facto benchmark for keyword spotting and limited-vocabulary ASR. Its design enables reproducible, fair comparisons across models, facilitating rapid progress in neural architectures for on-device speech processing—especially under strict compute and memory constraints. The dataset's open license, rigor in data collection and partitioning, and inclusion of unknown/silence classes make it uniquely applicable for evaluating small-footprint, robust speech-understanding systems in both academic and commercial research. Its limitations also highlight key open problems—phonetic ambiguity, domain adaptation, noise, and demographic fairness—driving further innovation in dataset curation and keyword-spotting algorithms (Warden, 2018).

References

  1. Warden, P. (2018). Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv:1804.03209.
