
The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation

Published 16 Nov 2023 in cs.SD, cs.AI, cs.CL, and eess.AS | arXiv:2311.10057v3

Abstract: We introduce the Song Describer dataset (SDD), a new crowdsourced corpus of high-quality audio-caption pairs, designed for the evaluation of music-and-language models. The dataset consists of 1.1k human-written natural language descriptions of 706 music recordings, all publicly accessible and released under Creative Commons licenses. To showcase the use of our dataset, we benchmark popular models on three key music-and-language tasks (music captioning, text-to-music generation and music-language retrieval). Our experiments highlight the importance of cross-dataset evaluation and offer insights into how researchers can use SDD to gain a broader understanding of model performance.


Summary

  • The paper introduces a curated audio-caption corpus with 1,106 descriptions for 706 music recordings to enhance music-language model evaluation.
  • It benchmarks models in music captioning, text-to-music generation, and retrieval using metrics like BLEU, METEOR, FAD, and recall to reveal domain shifts.
  • The dataset’s open licensing and rich annotations offer a robust foundation to mitigate overfitting and drive generalizable research.

An Overview of the Song Describer Dataset for Music-and-Language Model Evaluation

The paper introduces the Song Describer Dataset (SDD), a curated corpus of audio-caption pairs aimed at facilitating the evaluation of music-and-language (M&L) models. This dataset addresses a critical challenge in M&L research: the lack of openly accessible and richly annotated datasets that allow for systematic evaluation across diverse tasks like music captioning, text-to-music generation, and music-language retrieval.

Dataset Composition and Characteristics

The Song Describer Dataset comprises 1,106 human-written descriptions associated with 706 freely available music recordings. These recordings are covered under Creative Commons licenses, ensuring that the dataset is open for use within the research community. SDD captions are rich and descriptive, covering musical dimensions such as instrumentation, genre, and mood.

SDD stands out for its annotations on extended track segments, with most tracks lasting two minutes, unlike the shorter clips found in comparable datasets like MusicCaps and YT8M-MusicTextClips. SDD recordings are accompanied by persistent, openly licensed audio, thereby mitigating issues related to data persistence often faced with datasets relying on platforms like YouTube. Furthermore, SDD includes contributions from a diverse pool of annotators, providing varied perspectives that enrich the dataset’s descriptive quality.
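The caption-to-recording relationship described above is many-to-one: 1,106 captions cover 706 tracks, so some tracks carry more than one description. A minimal sketch of how one might tabulate this on SDD-style data; the toy pairs below and the `(track_id, caption)` layout are illustrative assumptions, not the dataset's actual schema:

```python
from collections import Counter

# Toy stand-in for SDD-style (track_id, caption) pairs; the real SDD has
# 1,106 captions over 706 tracks, so some tracks carry multiple captions.
pairs = [
    ("track_001", "A mellow acoustic guitar ballad with soft vocals."),
    ("track_001", "Gentle folk song featuring fingerpicked guitar."),
    ("track_002", "Upbeat electronic dance track with a driving beat."),
    ("track_003", "Slow ambient piece built on sustained synth pads."),
]

# Count captions per track, then summarise the corpus structure.
captions_per_track = Counter(track_id for track_id, _ in pairs)
n_tracks = len(captions_per_track)
n_captions = sum(captions_per_track.values())
multi_captioned = sum(1 for c in captions_per_track.values() if c > 1)

print(n_tracks, n_captions, multi_captioned)  # prints: 3 4 1
```

Multi-captioned tracks are useful for evaluation because they provide several valid references per recording, which matters for reference-based captioning metrics.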

Tasks and Evaluation

The paper demonstrates SDD's utility by benchmarking several state-of-the-art models across three tasks:

  1. Music Captioning: Output from models such as MusCaps and LP-MusicCaps was assessed using automatic metrics like BLEU, METEOR, and BERTScore. Both models exhibited a significant performance drop when evaluated out-of-domain on SDD, underlining the necessity of diverse evaluation datasets.
  2. Text-to-Music Generation: Using metrics including Fréchet Audio Distance (FAD) and Inception Score (IS), models like Riffusion and AudioLDM were evaluated. While these models performed variably across metrics, the cross-dataset comparison highlighted the need to evaluate semantic relevance and audio quality jointly rather than in isolation.
  3. Text-to-Music Retrieval: Here, models such as TTMR and CLAP were assessed using information retrieval metrics like Recall@K and Median Rank (MedR). CLAP outperformed TTMR consistently across both datasets, yet the gap between in-domain and SDD results points to biases introduced by in-distribution testing.
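The retrieval metrics named in the list above are straightforward to compute once the rank of each query's correct match is known. A minimal sketch under that assumption; the function name and the example ranks are illustrative, not values from the paper:

```python
import statistics

def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """Compute Recall@K and Median Rank (MedR) from the 1-indexed rank
    of the correct item in each query's ranked result list."""
    n = len(ranks)
    recall = {k: sum(r <= k for r in ranks) / n for k in ks}
    return recall, statistics.median(ranks)

# Hypothetical ranks at which each text query's correct track was retrieved.
ranks = [1, 3, 7, 12, 2, 1, 25, 4]
recall, medr = retrieval_metrics(ranks)
print(recall[1], recall[5], recall[10], medr)  # prints: 0.25 0.625 0.75 3.5
```

MedR is often preferred alongside Recall@K because it is robust to a few catastrophic misses, whereas mean rank is dominated by them.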

Implications and Future Directions

The introduction of SDD holds significant implications for both practical implementations and theoretical advancements in music-and-language research. By providing a well-documented and publicly accessible dataset, the authors give researchers a new tool for benchmarking models against real-world data. SDD helps to mitigate overfitting concerns prevalent when models are exclusively trained and tested on similar data distributions, as seen with private datasets.

The authors note the limitations in the scale of current data and highlight the value of incorporating more varied musical cultures to further enrich the dataset. Moreover, future SDD iterations may include additional data to solidify its reliability as a standard benchmarking tool.

Conclusion

The Song Describer Dataset presents a well-constructed resource for the M&L research community, offering enhanced evaluation capabilities across key model tasks. By providing diverse, high-quality data, SDD aligns with the broader objective of innovating in multimodal music understanding and generation, paving the way for more robust and generalizable machine learning applications in this domain.
