- The paper reveals significant discrepancies between traditional objective metrics and human perception in evaluating music source separation.
- It pairs a large-scale listener study on the MUSDB18 test set with objective metrics, from energy ratios such as SI-SAR to embedding-based measures such as FAD, across legacy and deep-learning separation systems.
- Listeners consistently preferred state-of-the-art systems over legacy methods, while no single metric predicted perception across all stems, underscoring the need for perception-informed evaluation.
Musical Source Separation Bake-Off: Comparative Analysis of Objective Metrics and Human Perception
Introduction
The paper "Musical Source Separation Bake-Off: Comparing Objective Metrics with Human Perception" (2507.06917) explores the field of music source separation, where the task is to isolate individual sound sources, or stems (e.g., vocals, drums, guitar), from a mixed audio recording. This process has far-reaching implications for applications like karaoke and various music information retrieval tasks. Traditionally, evaluation of the separation process relies heavily on objective energy-ratio metrics like the Source-to-Distortion Ratio (SDR), Source-to-Artifacts Ratio (SAR), and Source-to-Interference Ratio (SIR). However, these metrics frequently misalign with subjective human perception, necessitating the exploration of alternative, perception-oriented metrics.
Methodology
The study conducted a large-scale listener evaluation on the MUSDB18 test set, collecting approximately 30 ratings per track from listeners with at least two years of musical experience, aggregated across seven listener groups. Listeners assessed outputs from a range of separation systems, spanning legacy methods and modern deep-learning models: state-of-the-art systems such as HTDemucs-ft and SCNet-large were compared against the oracle method IRM1 and legacy systems such as Open-Unmix and REP1. Evaluations were conducted online through the webMUSHRA platform.
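To illustrate how such ratings might be summarized, here is a minimal pandas sketch; the table schema and numbers are hypothetical, not the paper's released data.

```python
# Hypothetical aggregation of MUSHRA-style ratings (0-100) into
# per-system scores; column names are illustrative, not the paper's schema.
import pandas as pd

ratings = pd.DataFrame({
    "system": ["HTDemucs-ft", "HTDemucs-ft", "Open-Unmix", "Open-Unmix"],
    "stem":   ["vocals", "vocals", "vocals", "vocals"],
    "track":  ["song_01", "song_01", "song_01", "song_01"],
    "score":  [84, 78, 55, 61],
})

# Median over listeners per (system, stem, track), then mean over tracks:
# a common way to summarize MUSHRA data robustly against outlier listeners.
per_track = ratings.groupby(["system", "stem", "track"])["score"].median()
summary = per_track.groupby(["system", "stem"]).mean()
print(summary)
```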
Comparison of Objective Metrics
The work compared several objective metrics, spanning traditional energy ratios and more recent embedding-based measures. For the latter, Fréchet Audio Distance (FAD) was evaluated with several embeddings (CLAP-LAION-music, EnCodec, VGGish, Wav2Vec2, HuBERT). Results showed that the traditional SDR metric remained the best predictor of perceived quality for vocal estimates, whereas the Scale-Invariant Signal-to-Artifacts Ratio (SI-SAR) was more predictive for drums and bass.
Figure 1: Rating distribution of state-of-the-art systems, by stem type. Bass estimates often suffer due to masking by other sounds, impacting their perceived quality.
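To make the scale-invariant quantities concrete, the following numpy sketch implements the standard scale-invariant BSS Eval decomposition (in the spirit of Le Roux et al., 2019). It is a simplified illustration, not the paper's exact evaluation code, which may use windowed or filtered variants.

```python
import numpy as np

def si_bss_eval(estimate: np.ndarray, references: np.ndarray, target_idx: int):
    """Scale-invariant BSS Eval sketch (after Le Roux et al., 2019).

    estimate:   (n_samples,) separated signal
    references: (n_sources, n_samples) ground-truth stems
    target_idx: row of `references` the estimate should match
    """
    s = references[target_idx]
    # Scale-invariant target: project the estimate onto the true source.
    s_target = (estimate @ s) / (s @ s) * s
    # Project onto the span of all references; the part of the estimate
    # explained by other sources (beyond s_target) is interference.
    coeffs, *_ = np.linalg.lstsq(references.T, estimate, rcond=None)
    p_all = references.T @ coeffs
    e_interf = p_all - s_target
    e_artif = estimate - p_all          # unexplained residual = artifacts

    db = lambda num, den: 10 * np.log10(np.sum(num**2) / np.sum(den**2))
    si_sdr = db(s_target, e_interf + e_artif)
    si_sir = db(s_target, e_interf)
    si_sar = db(s_target + e_interf, e_artif)
    return si_sdr, si_sir, si_sar
```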
Listener Study Results
The listener study revealed a pronounced preference for outputs from state-of-the-art separation systems over legacy methods. Hidden references were consistently rated highest, followed by SCNet-large and HTDemucs-ft, which were rated above even the oracle IRM1 method. This suggests that listeners weigh perceptual musical qualities more heavily than the mathematical fidelity that ideal ratio masks optimize.
Figure 2: Rating distribution of all stems, by separation system. Listeners prefer state-of-the-art systems over legacy ones.
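As a concrete picture of what the IRM1 oracle does, here is a minimal scipy sketch: it builds an ideal ratio mask with exponent 1 from the ground-truth stem magnitudes and applies it to the mixture STFT. The STFT settings and the assumption that the stems sum exactly to the mixture are illustrative simplifications, not the paper's setup.

```python
import numpy as np
from scipy.signal import stft, istft

def irm1_oracle(stems: np.ndarray, fs: float = 44100.0, nperseg: int = 4096):
    """Oracle IRM1 sketch: ideal ratio mask with exponent 1.

    stems: (n_sources, n_samples) ground-truth sources whose sum is the mix.
    Returns (n_sources, n_samples_out) masked estimates.
    """
    mix = stems.sum(axis=0)
    _, _, X = stft(mix, fs=fs, nperseg=nperseg)           # mixture STFT
    mags = np.abs(np.stack([stft(s, fs=fs, nperseg=nperseg)[2] for s in stems]))
    masks = mags / (mags.sum(axis=0, keepdims=True) + 1e-12)  # IRM, exponent 1
    estimates = [istft(m * X, fs=fs, nperseg=nperseg)[1] for m in masks]
    return np.stack([e[: mix.shape[-1]] for e in estimates])
```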
Correlation Studies
Using Kendall's τ as the correlation measure, the study found that artifact errors predicted human perception more strongly than interference errors for most stem types, with bass as the exception. The scale-invariant variants SI-SDR and SI-SAR proved effective, particularly for instrument stems where traditional metrics fell short.
Figure 3: Reweightings of the scale-invariant SDR error terms (SI-SIR vs. SI-SAR). Artifact errors better predict perception across multiple stem types.
Despite SDR's dominance for vocal estimates, SI-SAR and FAD with CLAP embeddings showed improved perceptual correlation for instrument stems, underscoring the need for stem-specific, listener-informed metrics.
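A correlation of this kind is straightforward to compute. The sketch below uses scipy's Kendall's τ on made-up per-track numbers purely to show the mechanics; it is not the paper's data.

```python
# Hypothetical check of how well a metric's per-track scores track listener
# ratings, using Kendall's tau as in the paper; the numbers are invented.
from scipy.stats import kendalltau

metric_scores  = [6.1, 4.8, 7.3, 5.0, 8.2]   # e.g., SI-SAR in dB per track
listener_means = [62,  55,  80,  49,  85]    # mean MUSHRA rating per track

tau, p_value = kendalltau(metric_scores, listener_means)
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")
```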
Fréchet Audio Distance Results
Notably, none of the embedding-based metrics, including those derived from speech-trained models such as HuBERT and Wav2Vec2, correlated positively with human perception on vocal separation, despite some success on instrument stems. Among the embeddings tested, CLAP-LAION-music yielded the strongest results for instrument stems.
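For completeness, FAD itself follows the standard Fréchet distance between Gaussians fit to embedding statistics (Kilgour et al., 2019). The sketch below assumes embeddings have already been extracted with some model such as VGGish or CLAP; extraction is out of scope here.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(emb_ref: np.ndarray, emb_est: np.ndarray) -> float:
    """FAD sketch: Frechet distance between Gaussians fit to embedding sets.

    emb_ref, emb_est: (n_frames, dim) embeddings of reference and separated
    audio, e.g., VGGish or CLAP frame embeddings.
    """
    mu_r, mu_e = emb_ref.mean(axis=0), emb_est.mean(axis=0)
    cov_r = np.cov(emb_ref, rowvar=False)
    cov_e = np.cov(emb_est, rowvar=False)
    covmean = sqrtm(cov_r @ cov_e)
    if np.iscomplexobj(covmean):        # discard tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_e
    return float(diff @ diff + np.trace(cov_r + cov_e - 2 * covmean))
```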
Conclusion
The paper identifies significant discrepancies between objective and subjective assessments of music source separation quality. It underscores the limitations of current metrics such as SDR for evaluating perceptual quality and highlights the growing importance of perception-informed metrics such as SI-SAR and FAD for instrument stems. These findings stress the importance of comprehensive, listener-driven evaluation frameworks to guide future development in music source separation technologies. By making their raw listener data publicly available, the authors support transparency and provide a foundation for further research into improving both evaluation methodologies and separation techniques.