
The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation

Published 16 Nov 2023 in cs.SD, cs.AI, cs.CL, and eess.AS | arXiv:2311.10057v3

Abstract: We introduce the Song Describer dataset (SDD), a new crowdsourced corpus of high-quality audio-caption pairs, designed for the evaluation of music-and-language models. The dataset consists of 1.1k human-written natural language descriptions of 706 music recordings, all publicly accessible and released under Creative Commons licenses. To showcase the use of our dataset, we benchmark popular models on three key music-and-language tasks (music captioning, text-to-music generation and music-language retrieval). Our experiments highlight the importance of cross-dataset evaluation and offer insights into how researchers can use SDD to gain a broader understanding of model performance.


Summary

  • The paper introduces a curated audio-caption corpus with 1,106 descriptions for 706 music recordings to enhance music-language model evaluation.
  • It benchmarks models in music captioning, text-to-music generation, and retrieval using metrics like BLEU, METEOR, FAD, and recall to reveal domain shifts.
  • The dataset’s open licensing and rich annotations offer a robust foundation to mitigate overfitting and drive generalizable research.

An Overview of the Song Describer Dataset for Music-and-Language Model Evaluation

The paper introduces the Song Describer Dataset (SDD), a curated corpus of audio-caption pairs aimed at facilitating the evaluation of music-and-language (M&L) models. This dataset addresses a critical challenge in M&L research: the lack of openly accessible and richly annotated datasets that allow for systematic evaluation across diverse tasks like music captioning, text-to-music generation, and music-language retrieval.

Dataset Composition and Characteristics

The Song Describer Dataset comprises 1,106 human-written descriptions associated with 706 freely available music recordings. These recordings are covered under Creative Commons licenses, ensuring that the dataset is open for use within the research community. SDD captions are rich and descriptive, covering musical dimensions such as instrumentation, genre, and mood.

SDD stands out for its annotations on extended track segments, with most tracks lasting two minutes, unlike the shorter clips found in comparable datasets like MusicCaps and YT8M-MusicTextClips. SDD recordings are accompanied by persistent, openly licensed audio, thereby mitigating issues related to data persistence often faced with datasets relying on platforms like YouTube. Furthermore, SDD includes contributions from a diverse pool of annotators, providing varied perspectives that enrich the dataset’s descriptive quality.
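The caption-to-recording relationship described above is many-to-one: 1,106 captions cover 706 tracks, so some tracks carry more than one description. A minimal sketch of how one might tabulate this on SDD-style data; the toy pairs below and the `(track_id, caption)` layout are illustrative assumptions, not the dataset's actual schema:

```python
from collections import Counter

# Toy stand-in for SDD-style (track_id, caption) pairs; the real SDD has
# 1,106 captions over 706 tracks, so some tracks carry multiple captions.
pairs = [
    ("track_001", "A mellow acoustic guitar ballad with soft vocals."),
    ("track_001", "Gentle folk song featuring fingerpicked guitar."),
    ("track_002", "Upbeat electronic dance track with a driving beat."),
    ("track_003", "Slow ambient piece built on sustained synth pads."),
]

# Count captions per track, then summarise the corpus structure.
captions_per_track = Counter(track_id for track_id, _ in pairs)
n_tracks = len(captions_per_track)
n_captions = sum(captions_per_track.values())
multi_captioned = sum(1 for c in captions_per_track.values() if c > 1)

print(n_tracks, n_captions, multi_captioned)  # prints: 3 4 1
```

Multi-captioned tracks are useful for evaluation because they provide several valid references per recording, which matters for reference-based captioning metrics.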

Tasks and Evaluation

The paper demonstrates SDD's utility by benchmarking several state-of-the-art models across three tasks:

  1. Music Captioning: Output from models such as MusCaps and LP-MusicCaps was assessed using automatic metrics like BLEU, METEOR, and BERTScore. Both models exhibited a significant performance drop when evaluated out-of-domain on SDD, underlining the necessity of diverse evaluation datasets.
  2. Text-to-Music Generation: Using metrics including Fréchet Audio Distance (FAD) and Inception Score (IS), models like Riffusion and AudioLDM were evaluated. While these models performed variably across metrics, the cross-dataset comparison highlighted the need to evaluate semantic relevance and audio quality jointly rather than in isolation.
  3. Text-to-Music Retrieval: Here, models such as TTMR and CLAP were assessed using information retrieval metrics like Recall@K and Median Rank (MedR). CLAP outperformed TTMR consistently across both datasets, yet the gap between in-domain and SDD results points to biases introduced by in-distribution testing.
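The retrieval metrics named in the list above are straightforward to compute once the rank of each query's correct match is known. A minimal sketch under that assumption; the function name and the example ranks are illustrative, not values from the paper:

```python
import statistics

def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """Compute Recall@K and Median Rank (MedR) from the 1-indexed rank
    of the correct item in each query's ranked result list."""
    n = len(ranks)
    recall = {k: sum(r <= k for r in ranks) / n for k in ks}
    return recall, statistics.median(ranks)

# Hypothetical ranks at which each text query's correct track was retrieved.
ranks = [1, 3, 7, 12, 2, 1, 25, 4]
recall, medr = retrieval_metrics(ranks)
print(recall[1], recall[5], recall[10], medr)  # prints: 0.25 0.625 0.75 3.5
```

MedR is often preferred alongside Recall@K because it is robust to a few catastrophic misses, whereas mean rank is dominated by them.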

Implications and Future Directions

The introduction of SDD holds significant implications for both practical implementations and theoretical advancements in music-and-language research. By providing a well-documented and publicly accessible dataset, the authors give researchers a new tool for benchmarking models against real-world data. SDD helps to mitigate overfitting concerns prevalent when models are exclusively trained and tested on similar data distributions, as seen with private datasets.

The authors note the limitations in the scale of current data and highlight the value of incorporating more varied musical cultures to further enrich the dataset. Moreover, future SDD iterations may include additional data to solidify its reliability as a standard benchmarking tool.

Conclusion

The Song Describer Dataset presents a well-constructed resource for the M&L research community, offering enhanced evaluation capabilities across key model tasks. By providing diverse, high-quality data, SDD aligns with the broader objective of innovating in multimodal music understanding and generation, paving the way for more robust and generalizable machine learning applications in this domain.
