RoleEval: A Bilingual Role Evaluation Benchmark for Large Language Models
Abstract: The rapid evolution of LLMs necessitates effective benchmarks for evaluating their role knowledge, which is essential for establishing connections with the real world and providing more immersive interactions. This paper introduces RoleEval, a bilingual benchmark designed to assess the memorization, utilization, and reasoning capabilities of role knowledge. RoleEval comprises RoleEval-Global (covering internationally recognized characters) and RoleEval-Chinese (covering characters popular in China), with 6,000 Chinese-English parallel multiple-choice questions focusing on 300 influential people and fictional characters drawn from a variety of domains, including celebrities, anime, comics, movies, TV series, games, and fiction. These questions cover basic knowledge and multi-hop reasoning abilities, aiming to systematically probe various aspects of the characters, such as personal information, relationships, abilities, and experiences. To maintain high standards, we perform a hybrid quality-check process combining automatic and human verification, ensuring that the questions are diverse, challenging, and discriminative. Our extensive evaluations with RoleEval across various open-source and proprietary LLMs, under both zero- and few-shot settings, reveal insightful findings. Notably, while GPT-4 outperforms other models on RoleEval-Global, Chinese LLMs excel on RoleEval-Chinese, highlighting significant differences in knowledge distribution. We expect RoleEval to highlight the significance of assessing role knowledge for LLMs across various languages and cultural settings.
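The evaluation setup the abstract describes (scoring models on parallel multiple-choice questions under a zero-shot setting) can be sketched as follows. This is a minimal illustrative harness, not the benchmark's actual code: the question content, field names, and the `query_model` callable are all assumptions for demonstration.

```python
# Sketch of zero-shot multiple-choice accuracy scoring, in the style of
# benchmarks like RoleEval. Item fields and the model stub are hypothetical.

def format_prompt(item):
    """Render one multiple-choice item as a zero-shot prompt string."""
    options = "\n".join(
        f"{label}. {text}" for label, text in zip("ABCD", item["choices"])
    )
    return f"Question: {item['question']}\n{options}\nAnswer:"

def accuracy(items, query_model):
    """Fraction of items where the model's answer letter matches the gold key."""
    correct = sum(
        query_model(format_prompt(it)).strip().upper().startswith(it["answer"])
        for it in items
    )
    return correct / len(items)

if __name__ == "__main__":
    # A single illustrative character-knowledge item (invented, not from RoleEval).
    sample = [
        {
            "question": "Which house is Harry Potter sorted into at Hogwarts?",
            "choices": ["Slytherin", "Gryffindor", "Ravenclaw", "Hufflepuff"],
            "answer": "B",
        },
    ]
    # Stand-in for a real LLM call; here it always answers "B".
    print(accuracy(sample, lambda prompt: "B"))  # -> 1.0
```

A few-shot variant would simply prepend worked question-answer pairs to the prompt returned by `format_prompt` before querying the model.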