CharacterEval: A Chinese Benchmark for Role-Playing Conversational Agent Evaluation
Abstract: Recently, the advent of large language models (LLMs) has revolutionized generative agents. Among them, Role-Playing Conversational Agents (RPCAs) have attracted considerable attention for their ability to engage users emotionally. However, the absence of a comprehensive benchmark impedes progress in this field. To bridge this gap, we introduce CharacterEval, a Chinese benchmark for comprehensive RPCA assessment, complemented by a tailored high-quality dataset. The dataset comprises 1,785 multi-turn role-playing dialogues, encompassing 23,020 examples and featuring 77 characters drawn from Chinese novels and scripts. It was carefully constructed: dialogues were first extracted with GPT-4, then filtered through rigorous human-led quality control, and finally enriched with in-depth character profiles sourced from Baidu Baike. CharacterEval employs a multifaceted evaluation approach comprising thirteen targeted metrics across four dimensions. Comprehensive experiments on CharacterEval demonstrate that Chinese LLMs exhibit more promising capabilities than GPT-4 in Chinese role-playing conversation. The source code, data source, and reward model will be publicly accessible at https://github.com/morecry/CharacterEval.