MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models

Published 23 Oct 2024 in cs.CL (arXiv:2410.17578v2)

Abstract: As LLMs are now capable of producing fluent and coherent content in languages other than English, it is imperative to precisely evaluate these non-English outputs. However, when assessing the outputs from multilingual LLMs, prior works often employed LLM-based evaluators that excel at assessing English outputs, without a thorough examination of whether these evaluators could effectively assess non-English text as well. Moreover, existing benchmarks to test evaluator LLMs (referred to as "meta-evaluation benchmarks") are mostly English-centric. To bridge this gap and examine whether evaluator LLMs can reliably assess the outputs of multilingual LLMs, we introduce MM-Eval, a multilingual meta-evaluation benchmark comprising five core subsets covering 18 languages and a Language Consistency subset spanning 122 languages. A core attribute of MM-Eval is that, instead of merely translating existing English meta-evaluation benchmarks, it is designed with multilingual-specific challenges in mind. Additionally, unlike existing meta-evaluation benchmarks that focus solely on ranking accuracy over pairwise data, MM-Eval also evaluates the consistency and fairness of absolute score values across a wide range of languages. Our results show that existing evaluator LLMs that excel in English contexts have considerable room for improvement when assessing non-English outputs. Furthermore, we find that evaluators are unfair and inconsistent when evaluating lower-resourced languages. Finally, we validate MM-Eval by measuring its correlation with Best-of-N rankings, finding a significantly stronger correlation compared to other meta-evaluation benchmarks. We publicly release our benchmark and code.


Summary

  • The paper introduces MM-Eval, a benchmark evaluating LLMs as judges across 18 languages and six evaluation domains.
  • It systematically assesses 12 LLMs on 4,981 instances, revealing an average accuracy of 68.9% and key multilingual limitations.
  • It emphasizes challenges in low-resource languages, highlighting model gaps in Safety and Linguistics tasks that warrant further research.

Evaluating Multilingual Capabilities of LLM Judges and Reward Models with MM-Eval

This paper introduces MM-Eval, a comprehensive multilingual meta-evaluation benchmark designed to assess the reliability and effectiveness of LLMs operating as evaluators, specifically in non-English contexts. MM-Eval addresses a crucial gap in existing benchmarks, which predominantly emphasize English and thus provide limited insight into multilingual evaluation capabilities.

Motivation and Design of MM-Eval

LLMs have become instrumental in various evaluation tasks, known as LLM-as-a-Judge, and in reward models used within reinforcement learning frameworks. However, their efficacy in non-English settings had not been thoroughly examined, which motivated a broader benchmark. MM-Eval evaluates 18 languages across six subsets: Chat, Reasoning, Safety, Linguistics, Language Hallucination, and Language Resource. Notably, it includes low-resource languages, offering a more comprehensive evaluation spectrum. The Language Resource subset further extends coverage to 122 languages for broader analysis.
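To make the setup concrete, here is a minimal sketch of what a pairwise meta-evaluation instance in a benchmark like MM-Eval might look like; the field names and schema are illustrative assumptions, not the paper's released format.

```python
from dataclasses import dataclass

@dataclass
class PairwiseInstance:
    """Illustrative shape of one pairwise meta-evaluation item:
    an evaluator should prefer `chosen` over `rejected`."""
    prompt: str      # user instruction, in the target language
    chosen: str      # higher-quality response
    rejected: str    # lower-quality response
    language: str    # e.g. "de", "ko", "sw"
    subset: str      # e.g. "Chat", "Reasoning", "Safety", ...

def select(instances, subset=None, language=None):
    """Slice the benchmark by subset and/or language for per-slice analysis."""
    return [
        ex for ex in instances
        if (subset is None or ex.subset == subset)
        and (language is None or ex.language == language)
    ]
```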

Evaluation Insights

The study evaluates 12 LLMs, comprising both proprietary and open-source models, over 4,981 instances from the MM-Eval benchmark. These models achieve an average accuracy of 68.9%, leaving a notable margin for improvement. Proprietary and open-source models demonstrate similar performance, underscoring the competitiveness of open models. However, both encounter significant challenges when assessing outputs in non-English, and especially low-resource, languages: notable performance drops on the Safety and Linguistics subsets for low-resource languages highlight deficiencies in handling linguistic intricacies.
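The headline accuracy numbers of a pairwise meta-evaluation can be computed as in the sketch below, which reuses the hypothetical `PairwiseInstance` fields above; `judge` stands in for any LLM-as-a-Judge or reward-model scoring call, and ties are counted as errors for simplicity (the paper's exact protocol may differ).

```python
def pairwise_accuracy(instances, judge):
    """Fraction of pairs for which the judge ranks `chosen` above `rejected`.

    `judge(prompt, response) -> float` is a placeholder scoring function
    (an LLM-as-a-Judge rating or a reward-model score).
    """
    correct = sum(
        judge(ex.prompt, ex.chosen) > judge(ex.prompt, ex.rejected)
        for ex in instances
    )
    return correct / len(instances)
```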

Implications and Future Directions

The findings suggest that even state-of-the-art LLMs need enhanced multilingual evaluation capabilities. The tendency to assign undifferentiated scores in low-resource languages presents a key challenge. Moreover, model feedback often suffers from hallucinations, resulting in flawed evaluation judgments. Further research should therefore focus on training LLMs with diverse, high-quality multilingual corpora and on incorporating language-specific strengths and nuances to improve their evaluative ability.
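One way to probe the consistency and fairness of absolute scores across languages, in the spirit of the paper's analysis, is to compare the per-language gap between scores given to chosen and rejected responses. The sketch below is an illustration under the same assumptions as before, again using the hypothetical `judge` and `PairwiseInstance` from the earlier snippets.

```python
from collections import defaultdict
from statistics import mean

def score_gap_by_language(instances, judge):
    """Mean (chosen - rejected) absolute-score gap per language.

    A gap near zero indicates the judge assigns undifferentiated scores
    in that language, one symptom reported for lower-resourced languages.
    """
    gaps = defaultdict(list)
    for ex in instances:
        gaps[ex.language].append(
            judge(ex.prompt, ex.chosen) - judge(ex.prompt, ex.rejected)
        )
    return {lang: mean(vals) for lang, vals in gaps.items()}
```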

Looking forward, MM-Eval sets the stage for developing even more comprehensive frameworks that integrate emergent challenges, such as handling code-switching and cultural context understanding. Future advancements in AI should aim for balanced linguistic competency across diverse languages to address global linguistic diversity effectively.

Conclusion

MM-Eval stands as a pivotal tool for evaluating LLMs in varied linguistic contexts, identifying critical gaps and guiding future improvements. The benchmark serves both a practical purpose, supporting the development of more reliable LLM evaluators, and a theoretical one, probing the nuances of multilingual evaluation. As the research landscape advances, MM-Eval can guide innovations in cross-lingual natural language processing and expand the horizon of AI applicability.
