SongEval: A Benchmark Dataset for Song Aesthetics Evaluation

Published 16 May 2025 in eess.AS | (2505.10793v1)

Abstract: Aesthetics serve as an implicit and important criterion in song generation tasks that reflect human perception beyond objective metrics. However, evaluating the aesthetics of generated songs remains a fundamental challenge, as the appreciation of music is highly subjective. Existing evaluation metrics, such as embedding-based distances, are limited in reflecting the subjective and perceptual aspects that define musical appeal. To address this issue, we introduce SongEval, the first open-source, large-scale benchmark dataset for evaluating the aesthetics of full-length songs. SongEval includes over 2,399 songs in full length, summing up to more than 140 hours, with aesthetic ratings from 16 professional annotators with musical backgrounds. Each song is evaluated across five key dimensions: overall coherence, memorability, naturalness of vocal breathing and phrasing, clarity of song structure, and overall musicality. The dataset covers both English and Chinese songs, spanning nine mainstream genres. Moreover, to assess the effectiveness of song aesthetic evaluation, we conduct experiments using SongEval to predict aesthetic scores and demonstrate better performance than existing objective evaluation metrics in predicting human-perceived musical quality.

Summary

The paper introduces SongEval, a dataset that provides five-dimensional expert evaluations for full-length songs.
It employs professional annotators to rate overall coherence, memorability, vocal phrasing, song structure, and musicality.
Results show that models trained on SongEval outperform traditional audio metrics, advancing music generation evaluation.

Introduction to SongEval

The paper introduces SongEval, a dataset for evaluating the aesthetics of full-length songs, built to address the limitations of existing metrics in capturing subjective musical appeal. It includes over 2,399 songs (more than 140 hours of audio), rated by 16 professional annotators across five aesthetic dimensions: overall coherence, memorability, naturalness of vocal breathing and phrasing, clarity of song structure, and overall musicality.

Dataset Characteristics

SongEval covers English and Chinese songs across nine mainstream genres, giving a broad resource for assessing song generation models. Every song is rated by professional annotators with musical backgrounds, and the range of languages and styles makes the dataset applicable to a variety of music generation settings.

Figure 1: Aesthetic evaluation dimensions and structural components of a song. (a) Structural components of a song. (b) Five aesthetic dimensions used in SongEval for full-length song evaluation.

The data collection process involves generating lyrics and genre-aligned prompts, followed by full-length song synthesis using established commercial systems. This ensures diversity in vocal and instrumental characteristics.

Figure 2: The data collection pipeline of SongEval. Lyrics are an optional input, as some commercial systems can generate songs using only a genre prompt.

Aesthetic Annotation

Each song is evaluated across five key dimensions, providing a rich framework for understanding musical quality beyond traditional metrics. The annotated dimensions are designed to capture subjective qualities that influence musical perception significantly.

Figure 3: Distribution of overall subjective scores over five evaluation dimensions.

The annotations come from 16 professional annotators with musical backgrounds, so the scores reflect the judgments of trained listeners. Rating the five dimensions separately allows a more fine-grained assessment of generative models than a single overall score would.
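As an illustration of how several annotators' ratings could be turned into one score per song and dimension, here is a minimal sketch. It assumes simple averaging, which the text above does not confirm as the dataset's actual aggregation rule. The dimension keys, the record layout, and the toy scores are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# The five dimensions named in the text; these short keys are hypothetical.
DIMENSIONS = ("coherence", "memorability", "naturalness", "clarity", "musicality")


def aggregate_ratings(ratings):
    """Average per-annotator ratings into one score per song and dimension.

    `ratings` is an iterable of (song_id, annotator_id, scores) tuples, where
    `scores` maps each dimension key to a numeric rating. Averaging is an
    assumption made for this sketch.
    """
    collected = defaultdict(lambda: defaultdict(list))
    for song_id, _annotator_id, scores in ratings:
        for dim in DIMENSIONS:
            collected[song_id][dim].append(scores[dim])
    return {
        song_id: {dim: mean(values) for dim, values in per_dim.items()}
        for song_id, per_dim in collected.items()
    }


# Toy data: two hypothetical annotators rating one hypothetical song.
ratings = [
    ("song_001", "a1", {"coherence": 4, "memorability": 3, "naturalness": 4,
                        "clarity": 5, "musicality": 4}),
    ("song_001", "a2", {"coherence": 2, "memorability": 3, "naturalness": 4,
                        "clarity": 3, "musicality": 4}),
]
song_scores = aggregate_ratings(ratings)
```

Keeping the five dimensions as separate keys, rather than collapsing them into one number, mirrors the multidimensional design described above.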

Experimental Setup

SongEval is used to train aesthetic prediction models for each of the five dimensions. The architectures tested include MOSNet, LDNet, SSL-based models, and UTMOS-based frameworks, several of which originate in speech quality (MOS) prediction. These predictors are reported to track human ratings well, which indicates that the dataset can support modeling of human musical perception.

Comparison with Objective Metrics

Models trained on SongEval correlate better with human-perceived aesthetics than conventional objective audio metrics do, underscoring SongEval's value for song evaluation.
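One common way to quantify agreement between predicted and human scores is a rank correlation. The sketch below implements Spearman's coefficient in plain Python. The choice of Spearman is an assumption, since the text above does not say which correlation measure was used, and the score lists are made-up toy values.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy example: hypothetical human scores and model predictions for five songs.
human = [1.5, 2.0, 3.5, 4.0, 4.5]
predicted = [1.8, 2.4, 3.0, 4.2, 4.1]
rho = spearman(human, predicted)
```

A metric or model whose ranking of songs matches the annotators' ranking more closely yields a value nearer to 1.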

Figure 4: Violin plots of the aesthetic evaluation results between human annotation and different prediction systems.

Figure 5: Duration distribution across different languages and generation models. Since the songs generated by DiffRhythm have a fixed duration of 285 seconds, the distribution shows a noticeable concentration of songs around the four-minute mark.

Conclusion

SongEval sets a new benchmark in song aesthetics evaluation, offering a reliable, professional standard for examining generative models. This dataset provides critical insights into musical quality, offering a foundational resource for advancing music generation technology and evaluation methods.

Limitations and Future Work

Though SongEval provides a robust framework for aesthetic evaluation, it is limited by potential overlaps between dimensions. Future research aims to refine evaluative tools, enhancing model predictions across diverse music styles and genres.

Figure 6: Screenshot of subjective annotation interface used for evaluating musical aesthetics.

This comprehensive dataset serves as a valuable resource for academia and industry, fostering developments in automatic song aesthetic evaluation and improving generative model capabilities for music production.