Evaluating Responses to Complex Open-Ended Music Reasoning Questions

Develop rigorous methodologies and benchmarks to evaluate the quality of responses produced by multimodal audio–text language models to complex, open-ended musical reasoning questions.

Background

The authors observe that straightforward comparisons of model outputs are unreliable for evaluating reasoning capability, in part because models can produce hallucinated or generic responses that are not grounded in the audio yet still read as plausible. Non-expert human raters are susceptible to these failure modes, which complicates evaluation.

They design two alternative experiments to mitigate these problems, both sketched below: an audio-to-text matching test, which checks whether a generated response can be paired with the audio clip it was actually produced for, and GPT-4 judgments of the degree of musical detail in responses. Even with these mitigations, they emphasize that establishing robust, widely accepted methodologies and benchmarks for complex, open-ended music reasoning remains an unresolved challenge.
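One way to make the audio-to-text matching idea concrete is a retrieval-style check in a joint audio–text embedding space (for example, a CLAP-style encoder): a grounded response should be more similar to the clip it was generated for than to the other clips in the batch. The sketch below is a minimal, hypothetical version that assumes unit-normalized embeddings have already been computed; it is not the paper's exact procedure.

```python
import numpy as np

def matching_accuracy(audio_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """audio_emb and text_emb are (N, d) unit-normalized embeddings, where
    row i of text_emb is the model's response for the audio in row i of
    audio_emb. Returns the fraction of responses whose most similar audio
    clip is the one they were generated for; chance level is 1/N."""
    sims = text_emb @ audio_emb.T          # (N, N) cosine similarities
    predicted = sims.argmax(axis=1)        # best-matching clip per response
    return float((predicted == np.arange(len(text_emb))).mean())

if __name__ == "__main__":
    # Toy demo with random embeddings; expect roughly chance-level accuracy.
    rng = np.random.default_rng(0)
    a = rng.normal(size=(8, 512))
    t = rng.normal(size=(8, 512))
    a /= np.linalg.norm(a, axis=1, keepdims=True)
    t /= np.linalg.norm(t, axis=1, keepdims=True)
    print(matching_accuracy(a, t))
```

A response consisting of generic filler should score near chance under this test, since nothing in the text ties it to a specific clip.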
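The GPT-4 judgment can likewise be framed as a pairwise LLM-as-judge comparison. The sketch below uses the OpenAI chat completions API with an illustrative rubric; the prompt and model choice are assumptions for demonstration, not the setup used in the paper.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative rubric; the paper's actual judging prompt may differ.
JUDGE_SYSTEM_PROMPT = (
    "You are an expert musicologist. Given two answers to the same question "
    "about a piece of music, decide which contains more specific, verifiable "
    "musical detail (tempo, key, instrumentation, form, production). "
    'Reply with exactly "A" or "B".'
)

def judge_musical_detail(question: str, answer_a: str, answer_b: str) -> str:
    """Return the judge model's preference, 'A' or 'B'."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # keep judgments as repeatable as possible
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
            {
                "role": "user",
                "content": (
                    f"Question: {question}\n\n"
                    f"Answer A: {answer_a}\n\n"
                    f"Answer B: {answer_b}"
                ),
            },
        ],
    )
    return response.choices[0].message.content.strip()
```

Pairwise judges are known to exhibit position bias, so in practice each comparison would typically be run twice with the answer order swapped, keeping only consistent verdicts.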

References

Evaluating the quality of a model's responses to complex, open-ended questions is an open and unresolved research challenge.

LLark: A Multimodal Instruction-Following Language Model for Music (Gardner et al., 2023, arXiv:2310.07160), Section 6.4 (Reasoning Tasks)