
Clembench: Using Game Play to Evaluate Chat-Optimized Language Models as Conversational Agents

Published 22 May 2023 in cs.CL (arXiv:2305.13455v3)

Abstract: Recent work has proposed a methodology for the systematic evaluation of "Situated Language Understanding Agents"-agents that operate in rich linguistic and non-linguistic contexts-through testing them in carefully constructed interactive settings. Other recent work has argued that LLMs, if suitably set up, can be understood as (simulators of) such agents. A connection suggests itself, which this paper explores: Can LLMs be evaluated meaningfully by exposing them to constrained game-like settings that are built to challenge specific capabilities? As a proof of concept, this paper investigates five interaction settings, showing that current chat-optimised LLMs are, to an extent, capable of following game-play instructions. Both this capability and the quality of the game play, measured by how well the objectives of the different games are met, follow the development cycle, with newer models performing better. The metrics even for the comparatively simple example games are far from being saturated, suggesting that the proposed instrument will retain diagnostic value. Our general framework for implementing and evaluating games with LLMs is available at https://github.com/clembench.


Summary

  • The paper presents a novel game-based evaluation framework that uses interactive games to assess conversational proficiency in chat-optimized language models.
  • It employs five distinct dialogue games, including Taboo and Wordle variants, to systematically test language understanding, rule adherence, and strategic decision-making.
  • Results show improvements from GPT-3 to GPT-4 while highlighting gaps in human-like interaction, underscoring the need for further model refinements.

Evaluating Interactivity in Chat-Optimized LLMs through Clembench

Introduction

The paper "Clembench: Using Game Play to Evaluate Chat-Optimized LLMs as Conversational Agents" (2305.13455) proposes a novel framework for investigating the capabilities of chat-optimized LLMs (cLLMs). It evaluates these models through a series of interactive, language-based games designed to test their understanding and execution of conversational tasks. This approach enables systematic exploration of the underlying skills of LLMs, in contrast to the anecdotal evidence produced by unguided exploration of tasks typically encountered in public use.

Figure 1: Overview of benchmark results

Methodology and Games

The approach uses a variety of dialogue games aimed at assessing different dimensions of cLLMs as interactive agents. The core idea is to place models in controlled settings where they must interact both linguistically and strategically to achieve predetermined goals. Five distinct games were implemented:

  1. Taboo: A word guessing game where one model describes a concept without using certain related words, and another guesses the word. This challenges both linguistic creativity and rule adherence.
  2. Wordle Variants: Traditional and clue-enhanced versions of Wordle were used to assess how well models use linguistic clues and feedback to narrow down guesses. A critic-augmented variant examines collaborative interactions between models in decision-making.
  3. Drawing Instructions: Models communicate instructions to replicate a grid-based image, testing spatial reasoning and communication clarity.
  4. Picture Reference Game: Grids are used as stimuli, where the model must create and interpret unique identifiers for these stimuli, emphasizing analogical reasoning.
  5. Scorekeeping Game: In a slot-filling task, models track shared and private information, simulating conversational grounding. This probes the model's capability to update beliefs and maintain an accurate discourse model.
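As a concrete illustration, the turn structure of the first game above (Taboo) can be sketched as a loop between two model roles. The `describer` and `guesser` callables below are hypothetical stand-ins for prompted model calls; the actual clembench implementation differs in its details.

```python
def violates_taboo(clue: str, target: str, related: list[str]) -> bool:
    """A clue is invalid if it mentions the target word or any forbidden related word."""
    lowered = clue.lower()
    return any(word.lower() in lowered for word in [target] + related)

def play_taboo(describer, guesser, target: str, related: list[str], max_turns: int = 3):
    """Run one Taboo-style episode; a rule violation aborts it immediately."""
    for turn in range(max_turns):
        clue = describer(target, related)
        if violates_taboo(clue, target, related):
            return {"outcome": "aborted", "turns": turn + 1}
        guess = guesser(clue)
        if guess.strip().lower() == target.lower():
            return {"outcome": "success", "turns": turn + 1}
    return {"outcome": "failure", "turns": max_turns}
```

The point of the abort branch is that rule adherence is scored separately from game success: a model that mentions a forbidden word fails to "play" the game at all, regardless of whether the guesser would have succeeded.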

The framework enables fast evaluation by implementing a Game Master that manages game flow, ensures interactions follow specified formats, and provides feedback iteratively.
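The Game Master's mediation step might be sketched as follows. The required reply prefix and the abort-on-format-violation behaviour are illustrative assumptions about the protocol, not the framework's exact API.

```python
class FormatError(Exception):
    """Raised when a player's reply does not follow the required surface format."""

class GameMaster:
    def __init__(self, required_prefix: str):
        self.required_prefix = required_prefix
        self.transcript = []  # record of (prompt, raw reply) pairs for scoring

    def mediate(self, player, prompt: str) -> str:
        """Forward the prompt, log the raw reply, and validate its format."""
        reply = player(prompt)
        self.transcript.append((prompt, reply))
        if not reply.startswith(self.required_prefix):
            raise FormatError(f"reply must start with {self.required_prefix!r}")
        return reply[len(self.required_prefix):].strip()
```

Centralising format checks in the mediator keeps the games themselves simple: each game only sees well-formed moves, and formatting failures are recorded uniformly across games.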

Figure 2: Overview of % played games and micro-average quality score for all models and games. Perfect performance in the benchmark would be represented with all markers overlapping in the top right corner.
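The two axes of that figure can be reproduced from episode records roughly as follows: "% played" counts episodes the model completed without rule violations, and the quality score averages per-episode goal attainment over played episodes. Combining them multiplicatively into one overall score is an assumption of this sketch.

```python
def aggregate(episodes):
    """episodes: list of dicts with 'aborted' (bool) and 'quality' (0-100 or None)."""
    played = [e for e in episodes if not e["aborted"]]
    pct_played = 100.0 * len(played) / len(episodes) if episodes else 0.0
    qualities = [e["quality"] for e in played if e["quality"] is not None]
    avg_quality = sum(qualities) / len(qualities) if qualities else 0.0
    # Overall score rewards models that both follow the rules and play well.
    overall = (pct_played / 100.0) * avg_quality
    return {"pct_played": pct_played, "avg_quality": avg_quality, "overall": overall}
```

Separating the two measures matters diagnostically: a model can fail by not understanding the game's format at all, or by playing legally but poorly, and these are different deficits.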

Results

The results underscore significant performance variations across models and games. Advanced models like GPT-4 demonstrated superior capabilities in adhering to game rules and optimizing interaction quality. However, notable disparities remain between human performance and current model capabilities, especially in games demanding strategic foresight or dynamic feedback integration. Performance improved markedly from GPT-3 to GPT-4, highlighting the effectiveness of instruction tuning and RLHF.

Figure 3: Closeness Scores Progression for all episodes of GPT-4 play
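As a hypothetical illustration of a closeness score such as the one tracked for the drawing game, one could measure the fraction of grid cells that match between the target image and the drawn image; the paper's exact scoring function may differ.

```python
def closeness(target: list[list[str]], drawn: list[list[str]]) -> float:
    """Fraction of cells in the drawn grid that match the target grid."""
    total = sum(len(row) for row in target)
    matches = sum(
        1
        for t_row, d_row in zip(target, drawn)
        for t, d in zip(t_row, d_row)
        if t == d
    )
    return matches / total if total else 0.0
```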

Implications

The findings from Clembench suggest that cLLMs are capable of simulating interactive agents, but their proficiency varies across tasks. The disparity between model and expected human performance indicates the need for further refinement in interaction design and learning mechanisms. The clembench framework provides a structured pathway to evaluate cLLMs' contextual understanding and decision-making processes, promising diagnostic value for the foreseeable future.

Future Directions

The paper proposes advancements in expanding the benchmarking framework, including multilingual support, more complex interactive scenarios, and hybrid games incorporating multimodal inputs. There is also potential to use the insights gained to fine-tune models for particular tasks, enhancing both practical applications and theoretical understanding of AI capabilities.

Conclusion

"Clembench" establishes a promising strategy for systematically exploring the conversational capacities of modern cLLMs. By leveraging interactive language games, it surfaces nuanced insights into the rule adherence, abstract reasoning, and decision-making of these models. As AI continues to evolve, Clembench provides a strong foundation for evaluating and enhancing interactive AI systems, helping to ensure alignment with both human expectations and domain-specific demands. Ultimately, this work points toward deeply interactive AI that engages naturally with human cognitive and communicative patterns, though considerable progress is still needed.
