RoleEval: A Bilingual Role Evaluation Benchmark for Large Language Models
Abstract: The rapid evolution of LLMs necessitates effective benchmarks for evaluating their role knowledge, which is essential for establishing connections with the real world and providing more immersive interactions. This paper introduces RoleEval, a bilingual benchmark designed to assess the memorization, utilization, and reasoning capabilities of role knowledge. RoleEval comprises RoleEval-Global (covering internationally recognized characters) and RoleEval-Chinese (covering characters popular in China), with 6,000 Chinese-English parallel multiple-choice questions focusing on 300 influential people and fictional characters drawn from a variety of domains, including celebrities, anime, comics, movies, TV series, games, and fiction. These questions cover basic knowledge and multi-hop reasoning abilities, aiming to systematically probe various aspects of the characters, such as personal information, relationships, abilities, and experiences. To maintain high standards, we perform a hybrid quality-check process combining automatic and human verification, ensuring that the questions are diverse, challenging, and discriminative. Our extensive evaluations with RoleEval across various open-source and proprietary LLMs, under both zero- and few-shot settings, reveal insightful findings. Notably, while GPT-4 outperforms other models on RoleEval-Global, Chinese LLMs excel on RoleEval-Chinese, highlighting significant differences in knowledge distribution. We expect RoleEval to highlight the significance of assessing role knowledge for LLMs across various languages and cultural settings.
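The evaluation setup the abstract describes (scoring models on parallel multiple-choice questions under a zero-shot setting) can be sketched as follows. This is a minimal illustrative harness, not the benchmark's actual code: the question content, field names, and the `query_model` callable are all assumptions for demonstration.

```python
# Sketch of zero-shot multiple-choice accuracy scoring, in the style of
# benchmarks like RoleEval. Item fields and the model stub are hypothetical.

def format_prompt(item):
    """Render one multiple-choice item as a zero-shot prompt string."""
    options = "\n".join(
        f"{label}. {text}" for label, text in zip("ABCD", item["choices"])
    )
    return f"Question: {item['question']}\n{options}\nAnswer:"

def accuracy(items, query_model):
    """Fraction of items where the model's answer letter matches the gold key."""
    correct = sum(
        query_model(format_prompt(it)).strip().upper().startswith(it["answer"])
        for it in items
    )
    return correct / len(items)

if __name__ == "__main__":
    # A single illustrative character-knowledge item (invented, not from RoleEval).
    sample = [
        {
            "question": "Which house is Harry Potter sorted into at Hogwarts?",
            "choices": ["Slytherin", "Gryffindor", "Ravenclaw", "Hufflepuff"],
            "answer": "B",
        },
    ]
    # Stand-in for a real LLM call; here it always answers "B".
    print(accuracy(sample, lambda prompt: "B"))  # -> 1.0
```

A few-shot variant would simply prepend worked question-answer pairs to the prompt returned by `format_prompt` before querying the model.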