The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation
Abstract: We introduce the Song Describer dataset (SDD), a new crowdsourced corpus of high-quality audio-caption pairs, designed for the evaluation of music-and-language models. The dataset consists of 1.1k human-written natural language descriptions of 706 music recordings, all publicly accessible and released under Creative Commons licenses. To showcase the use of our dataset, we benchmark popular models on three key music-and-language tasks (music captioning, text-to-music generation and music-language retrieval). Our experiments highlight the importance of cross-dataset evaluation and offer insights into how researchers can use SDD to gain a broader understanding of model performance.