- The paper’s main contribution is its systematic audit of 175 audio datasets, revealing imbalances in usage and representation in generative audio research.
- The study employs a comprehensive literature review from May 2023 to May 2024 to analyze dataset sources, scraping practices, and potential copyright infringements.
- The findings highlight ethical concerns such as biased representations and toxic content, urging improved documentation and community-driven dataset reforms.
An Overview of "Sound Check: Auditing Audio Datasets"
This paper, "Sound Check: Auditing Audio Datasets," addresses significant gaps in our understanding of the audio datasets used to train generative audio models. The authors note the rapidly growing capabilities and adoption of generative audio technologies, and contrast this with limited understanding of potential ethical issues in the underlying datasets, such as bias, toxicity, and intellectual property (IP) challenges.
Research Questions and Methodology
The paper's central inquiry revolves around four primary research questions:
- What audio datasets are currently utilized, and what is the distribution of their use?
- What are the sources of these datasets, and what licenses govern them?
- Do these datasets contain toxic content?
- Who is represented or underrepresented in these datasets?
To address these questions, the authors conducted a comprehensive literature review targeting audio datasets referenced in research papers from May 2023 to May 2024. This investigation revealed 175 unique datasets across categories such as music, speech, and environmental sounds. The analysis highlights the fragmented nature of audio datasets compared to the text and image domains, where a few large datasets dominate.
Core Findings
Dataset Characteristics and Usage
The paper finds that most datasets are used infrequently, typically in only one or two papers, underscoring the long-tail nature of the distribution. Notably, VCTK and AudioSet are among the most frequently used datasets. Measured by hours of audio and by citation counts, the field also skews toward datasets such as LibriSpeech and GTZAN, which receive academic recognition well beyond their direct use in the surveyed papers.
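To make the long-tail pattern concrete, here is a minimal sketch (not the authors' actual tooling; the paper-to-dataset mentions are hypothetical) of how usage counts from a literature survey can be tallied to see how many datasets appear in only one or two papers.

```python
# Minimal sketch of tallying per-dataset usage counts from a literature survey
# to expose a long-tail distribution. The mention data below is hypothetical.
from collections import Counter

# Hypothetical (paper, dataset) mentions extracted from a literature review.
mentions = [
    ("paper_01", "VCTK"), ("paper_02", "VCTK"), ("paper_03", "AudioSet"),
    ("paper_04", "AudioSet"), ("paper_05", "LibriSpeech"),
    ("paper_06", "GTZAN"), ("paper_07", "SomeNicheCorpus"),
]

usage = Counter(dataset for _, dataset in mentions)

# How many datasets appear in at most two papers (the long tail)?
tail = sum(1 for count in usage.values() if count <= 2)
print(f"{tail} of {len(usage)} datasets appear in at most two papers")
for dataset, count in usage.most_common():
    print(f"{dataset:>18}: {count}")
```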
Moreover, the analysis dissects dataset creation methods, finding that around 30% are scraped from web sources, which raises questions about the legality and ethics of such practices. The authors also note that roughly 27% of datasets potentially infringe copyright, a concern exacerbated by the presence of content sourced from platforms like YouTube and TED Talks.
Bias and Toxicity
The paper's audit of seven prominent datasets reveals substantial representation issues. Keywords related to gender, race, and other socially sensitive attributes are unevenly represented, with marginalized communities notably underrepresented in mentions. The authors also uncover associations between generic terms and specific gendered terms that reinforce stereotypical biases, particularly against women.
Toxic content, characterized by hate speech and profanity, was detected across these datasets, though it constitutes only a small percentage of the total content. Nonetheless, because generative models can amplify even rare patterns in their training data, this remains a concern.
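As a rough illustration of how such an audit can be approached, the sketch below counts mentions of socially sensitive keywords and computes the fraction of transcripts containing blocklisted terms. The keyword lists, transcripts, and method are illustrative assumptions, not the paper's actual audit pipeline.

```python
# Illustrative sketch only: keyword counts and a simple toxicity flag rate over
# dataset transcripts. The word lists and transcripts are placeholders, and the
# paper's actual audit methodology is not reproduced here.
from collections import Counter
import re

REPRESENTATION_KEYWORDS = {
    "gender": {"woman", "man", "nonbinary"},
    "race": {"black", "white", "asian", "latino"},
}
TOXICITY_BLOCKLIST = {"slur_placeholder", "profanity_placeholder"}

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z]+", text.lower())

def keyword_counts(transcripts: list[str]) -> Counter:
    """Count how often each keyword group is mentioned across all transcripts."""
    counts = Counter()
    for text in transcripts:
        tokens = tokenize(text)
        for group, words in REPRESENTATION_KEYWORDS.items():
            counts[group] += sum(1 for t in tokens if t in words)
    return counts

def toxic_fraction(transcripts: list[str]) -> float:
    """Fraction of transcripts containing at least one blocklisted term."""
    flagged = sum(1 for text in transcripts
                  if any(t in TOXICITY_BLOCKLIST for t in tokenize(text)))
    return flagged / len(transcripts) if transcripts else 0.0

transcripts = [
    "The man explained the recording setup.",
    "A woman sings on the second clip.",
]
print(keyword_counts(transcripts))
print(f"flagged fraction: {toxic_fraction(transcripts):.2%}")
```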
Licensing and Ethical Concerns
Licensing is a key focus of the audit. Diverse licensing terms, notably those barring commercial use or adaptations, or requiring attribution, pose challenges for using these datasets in training or commercial settings. The authors' exploration of dataset provenance further elucidates the IP complications that arise in generative audio contexts.
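A hedged sketch of how license metadata might be screened for such constraints is shown below; the dataset records, field names, and coarse Creative Commons parsing are illustrative assumptions, not the paper's method.

```python
# Sketch of screening dataset metadata for license terms that constrain
# training or commercial use. Records and parsing rules are illustrative.
datasets = [
    {"name": "CorpusA", "license": "CC-BY-4.0"},
    {"name": "CorpusB", "license": "CC-BY-NC-ND-4.0"},
    {"name": "CorpusC", "license": "unknown"},
]

def license_flags(license_id: str) -> dict:
    """Return coarse flags for common Creative Commons style license strings."""
    parts = license_id.upper().split("-")
    return {
        "noncommercial": "NC" in parts,      # bars commercial use
        "no_derivatives": "ND" in parts,     # bars adaptations
        "attribution": "BY" in parts,        # requires attribution
        "unknown": license_id.lower() == "unknown",
    }

for d in datasets:
    print(d["name"], license_flags(d["license"]))
```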
Theoretical and Practical Implications
The paper's findings underscore the need for enhanced documentation standards, akin to "datasheets for datasets," tailored to audio data’s unique challenges. Such documentation would facilitate better assessment of biases and foster ethical dataset consumption.
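As one possible shape for such documentation, the sketch below defines a minimal audio-oriented datasheet record; the fields are illustrative assumptions in the spirit of datasheets for datasets, not a template proposed by the paper.

```python
# A minimal, illustrative datasheet record for an audio dataset. Field names
# are assumptions, not the paper's proposal.
from dataclasses import dataclass, field

@dataclass
class AudioDatasheet:
    name: str
    source: str                    # e.g. studio recordings, web scrape, crowdsourcing
    license: str                   # license governing redistribution and training use
    total_hours: float             # total duration of audio
    consent_documented: bool       # whether speakers/artists consented to inclusion
    toxicity_screening: str = ""   # summary of any profanity or hate-speech screening
    demographic_coverage: dict = field(default_factory=dict)  # e.g. {"gender": {...}}

sheet = AudioDatasheet(
    name="ExampleSpeechCorpus",
    source="crowdsourced recordings",
    license="CC-BY-4.0",
    total_hours=120.5,
    consent_documented=True,
)
print(sheet)
```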
Practically, the implications for model developers and data curators are profound. There is a pressing need to build datasets that fairly represent diverse populations and respect legal boundaries of data use. Moreover, as the emerging paradigm of multimodal models leverages audio data, mitigating these datasets’ biases and ethical pitfalls becomes critical for wider AI fairness and applicability.
Future Directions
The closing sections suggest that addressing these challenges requires collaborative, community-driven dataset creation, emphasizing transparency and consent. The establishment of platforms enabling individuals to verify their inclusion in datasets is posited as a step towards more participatory and ethical AI development practices.
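Purely as an illustration of what such a verification platform might expose, the sketch below checks whether a fingerprint of a recording appears in a published dataset manifest; the hashing scheme and manifest format are assumptions, and a real system would likely rely on perceptual audio fingerprints rather than exact hashes.

```python
# Illustrative membership check against a hypothetical dataset manifest.
import hashlib

def fingerprint(audio_bytes: bytes) -> str:
    """Hash raw audio bytes; a real system would use a perceptual fingerprint."""
    return hashlib.sha256(audio_bytes).hexdigest()

# Hypothetical manifest of fingerprints published alongside a dataset.
manifest = {fingerprint(b"example clip 1"), fingerprint(b"example clip 2")}

def is_included(audio_bytes: bytes) -> bool:
    """Return True if this recording's fingerprint appears in the manifest."""
    return fingerprint(audio_bytes) in manifest

print(is_included(b"example clip 1"))            # True
print(is_included(b"my unpublished recording"))  # False
```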
In conclusion, this paper serves as a clarion call for the audio AI community, advocating a conscientious rethinking of dataset creation, licensing, and representation in an increasingly AI-driven world. The audit presented here lays a foundation for ongoing dialogue and refinement in the ethical advancement of audio AI.