
MUSAN: A Music, Speech, and Noise Corpus

Published 28 Oct 2015 in cs.SD | (1510.08484v1)

Abstract: This report introduces a new corpus of music, speech, and noise. This dataset is suitable for training models for voice activity detection (VAD) and music/speech discrimination. Our corpus is released under a flexible Creative Commons license. The dataset consists of music from several genres, speech from twelve languages, and a wide assortment of technical and non-technical noises. We demonstrate use of this corpus for music/speech discrimination on Broadcast news and VAD for speaker identification.

Citations (1,294)

Summary

  • The paper introduces MUSAN, a 109-hour corpus for training voice activity detection (VAD) and music/speech discrimination systems.
  • The dataset aggregates 60 hours of speech in twelve languages, 42 hours of music annotated for genre and vocals, and 6 hours of assorted noises, all under Creative Commons or US Public Domain licenses.
  • GMM-based evaluations show relative improvements in equal error rate (EER) of roughly 6.7% to 23.2% for VAD, with the largest gains when little speech is available.

The research by Snyder et al. introduces the MUSAN corpus, a dataset of music, speech, and noise tailored for voice activity detection (VAD) and music/speech discrimination. The corpus addresses the intellectual-property restrictions that limit many existing audio datasets by releasing all material under flexible Creative Commons and US Public Domain licenses.

Dataset Composition

The MUSAN corpus comprises roughly 109 hours of audio partitioned into three categories: speech, music, and noise. The speech subset contains 60 hours of read speech in twelve languages, sourced from LibriVox and US government recordings. The music subset contains 42 hours drawn from platforms such as Jamendo and the Free Music Archive, annotated for genre and the presence of vocals. The noise subset contains 6 hours of sounds ranging from technical signals such as DTMF tones to ambient noises such as rustling paper and an idling car.
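MUSAN is distributed as WAV files grouped into top-level category directories. The sketch below, using only the Python standard library, shows how one might tally per-category hours from such a layout; the directory names mirror the released corpus, but the file tree here is synthetic (silence stand-ins), and the walker itself is illustrative rather than any official tooling.

```python
import os
import tempfile
import wave

def write_silence_wav(path, seconds, rate=16000):
    """Write a mono 16-bit WAV of silence (a stand-in for a real MUSAN file)."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(rate * seconds))

def catalog_hours(root):
    """Sum WAV durations (in hours) under each top-level category directory."""
    hours = {}
    for category in sorted(os.listdir(root)):
        total = 0.0
        for dirpath, _, files in os.walk(os.path.join(root, category)):
            for name in files:
                if name.endswith(".wav"):
                    with wave.open(os.path.join(dirpath, name), "rb") as w:
                        total += w.getnframes() / w.getframerate()
        hours[category] = total / 3600.0
    return hours

# Demo on a synthetic tree mimicking MUSAN's music/ speech/ noise/ layout.
root = tempfile.mkdtemp()
for cat, secs in [("music", 5), ("speech", 10), ("noise", 2)]:
    os.makedirs(os.path.join(root, cat))
    write_silence_wav(os.path.join(root, cat, "example.wav"), secs)

print(catalog_hours(root))
```

Run against a real MUSAN download, the same walk would report approximately the 60/42/6-hour split described above.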

Utilization and Licensing

The unique value proposition of MUSAN lies in its adherence to open licensing, which permits broad usability, including commercial applications. Each audio file in the corpus is meticulously annotated with relevant metadata, licensing information, and, where applicable, attributions, making it straightforward for users to navigate and understand the provenance and legal considerations of the data.

Experimental Evaluations

To validate the utility of the MUSAN corpus, the authors conducted experiments in music/speech discrimination and VAD using Gaussian mixture model (GMM) based systems, chosen for their simplicity and effectiveness in prior studies.

Music/Speech Discrimination:

The authors trained GMMs on MUSAN's speech and music subsets and, for comparison, on the GTZAN music/speech dataset, evaluating both on a Broadcast News corpus with equal error rate (EER) as the criterion. The MUSAN-trained models performed comparably to the GTZAN-trained models across a range of GMM sizes, indicating that the corpus generalizes well to unseen broadcast data.
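The classic recipe sketched here trains one GMM per class on frame-level features and classifies by log-likelihood ratio, with EER as the metric. This is a minimal illustration on synthetic Gaussian "features" (the paper's actual feature extraction, GMM sizes, and data are not reproduced); the `llr` and `eer` helpers are invented for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-ins for frame-level features (e.g. MFCCs) from each class;
# real features would be extracted from MUSAN audio.
speech_train = rng.normal(0.0, 1.0, size=(500, 13))
music_train = rng.normal(2.0, 1.0, size=(500, 13))

# One GMM per class, as in classic music/speech discrimination.
gmm_speech = GaussianMixture(n_components=4, random_state=0).fit(speech_train)
gmm_music = GaussianMixture(n_components=4, random_state=0).fit(music_train)

def llr(x):
    """Log-likelihood ratio per frame: positive favors speech."""
    return gmm_speech.score_samples(x) - gmm_music.score_samples(x)

def eer(scores, labels):
    """Equal error rate: point where miss rate equals false-alarm rate."""
    order = np.argsort(scores)[::-1]
    labels = np.asarray(labels)[order]
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    fa = np.cumsum(1 - labels) / n_neg       # false alarms as threshold drops
    miss = 1.0 - np.cumsum(labels) / n_pos   # misses at each threshold
    i = np.argmin(np.abs(fa - miss))
    return (fa[i] + miss[i]) / 2.0

# Score held-out frames from both classes.
speech_test = rng.normal(0.0, 1.0, size=(200, 13))
music_test = rng.normal(2.0, 1.0, size=(200, 13))
scores = np.concatenate([llr(speech_test), llr(music_test)])
labels = np.concatenate([np.ones(200), np.zeros(200)])
print(f"EER: {eer(scores, labels):.3f}")
```

On these well-separated synthetic classes the EER is near zero; real music/speech features overlap far more, which is why corpus quality and coverage matter.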

Voice Activity Detection:

The study also examined VAD in the context of speaker recognition, combining a GMM-based VAD with a conventional energy-based VAD. On the NIST SRE 2010 evaluation data, adding the GMM-based component improved performance, especially when little speech was available (down to 1 second), yielding relative EER improvements from approximately 6.7% to 23.2% across varying amounts of available speech.
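To make the energy-based half of that combination concrete, here is a minimal frame-energy VAD sketch in NumPy. The frame/hop lengths and the threshold relative to the loudest frame are illustrative choices, not the paper's configuration, and the signal is synthetic.

```python
import numpy as np

def energy_vad(signal, rate=16000, frame_ms=25, hop_ms=10, threshold_db=-30.0):
    """Flag frames whose log energy is within threshold_db of the loudest frame."""
    frame = int(rate * frame_ms / 1000)
    hop = int(rate * hop_ms / 1000)
    n = 1 + max(0, (len(signal) - frame) // hop)
    energies = np.array([
        np.sum(signal[i * hop : i * hop + frame] ** 2) for i in range(n)
    ])
    log_e = 10.0 * np.log10(energies + 1e-12)
    return log_e > (log_e.max() + threshold_db)

rate = 16000
rng = np.random.default_rng(1)
quiet = 0.001 * rng.standard_normal(rate)  # 1 s of near-silence
loud = 0.5 * rng.standard_normal(rate)     # 1 s of speech-level energy
decisions = energy_vad(np.concatenate([quiet, loud]), rate)
print(decisions.mean())  # fraction of frames flagged as speech
```

A GMM-based VAD replaces the fixed energy threshold with per-frame likelihoods from speech and non-speech models; MUSAN's noise and speech subsets supply training material for exactly those models.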

Implications and Future Directions

The MUSAN corpus offers significant practical and theoretical implications. Practically, its open licensing fosters wide adoption in both academic and commercial settings, accelerating research and development in VAD and music/speech discrimination. Theoretically, the availability of diverse, annotated audio data enables rigorous experimentation and benchmarking, aiding the advancement of robust audio classification algorithms.

Future research could build on this work by exploring more complex models such as deep learning approaches, which have shown promise in recent years. Additionally, expanding the dataset to include more languages and noise types could further enhance its utility and generalizability. The MUSAN dataset sets a solid foundation for the ongoing development and refinement of audio classification systems and underscores the importance of accessible and openly licensed datasets in advancing the field.

Overall, Snyder et al.'s contribution of the MUSAN corpus represents a substantial resource for the speech and audio processing community, facilitating further innovation and application development in this critical area.