
Clembench: Using Game Play to Evaluate Chat-Optimized Language Models as Conversational Agents

Published 22 May 2023 in cs.CL (arXiv:2305.13455v3)

Abstract: Recent work has proposed a methodology for the systematic evaluation of "Situated Language Understanding Agents"-agents that operate in rich linguistic and non-linguistic contexts-through testing them in carefully constructed interactive settings. Other recent work has argued that LLMs, if suitably set up, can be understood as (simulators of) such agents. A connection suggests itself, which this paper explores: Can LLMs be evaluated meaningfully by exposing them to constrained game-like settings that are built to challenge specific capabilities? As a proof of concept, this paper investigates five interaction settings, showing that current chat-optimised LLMs are, to an extent, capable of following game-play instructions. Both this capability and the quality of the game play, measured by how well the objectives of the different games are met, follow the development cycle, with newer models performing better. The metrics even for the comparatively simple example games are far from being saturated, suggesting that the proposed instrument will retain diagnostic value. Our general framework for implementing and evaluating games with LLMs is available at https://github.com/clembench.


Summary

  • The paper presents a novel game-based evaluation framework that uses interactive games to assess conversational proficiency in chat-optimized language models.
  • It employs five distinct dialogue games, including Taboo and Wordle variants, to systematically test language understanding, rule adherence, and strategic decision-making.
  • Results show improvements from GPT-3 to GPT-4 while highlighting gaps in human-like interaction, underscoring the need for further model refinements.

Evaluating Interactivity in Chat-Optimized LLMs through Clembench

Introduction

The paper "Clembench: Using Game Play to Evaluate Chat-Optimized LLMs as Conversational Agents" (2305.13455) proposes a novel framework for investigating the capabilities of chat-optimized LLMs (cLLMs). It evaluates these models through a series of interactive, language-based games designed to test their understanding and execution of conversational tasks. This approach enables systematic exploration of the underlying skills of LLMs, in contrast to the anecdotal evidence produced by unguided exploration of tasks typically encountered in public use.

Figure 1: Overview of benchmark results

Methodology and Games

The approach uses a variety of dialogue games aimed at assessing different dimensions of cLLMs as interactive agents. The core idea is to place models in controlled settings where they must interact both linguistically and strategically to achieve predetermined goals. Five distinct games were implemented:

  1. Taboo: A word guessing game where one model describes a concept without using certain related words, and another guesses the word. This challenges both linguistic creativity and rule adherence.
  2. Wordle Variants: Traditional and clue-enhanced versions of Wordle were used to assess how well models use linguistic clues and feedback to narrow down guesses. A critic-augmented variant examines collaborative interactions between models in decision-making.
  3. Drawing Instructions: Models communicate instructions to replicate a grid-based image, testing spatial reasoning and communication clarity.
  4. Picture Reference Game: Grids are used as stimuli, where the model must create and interpret unique identifiers for these stimuli, emphasizing analogical reasoning.
  5. Scorekeeping Game: In a slot-filling task, models track shared and private information, simulating conversational grounding. This probes the model's capability to update beliefs and maintain an accurate discourse model.
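As a concrete illustration, the turn structure of the first game above (Taboo) can be sketched as a loop between two model roles. The `describer` and `guesser` callables below are hypothetical stand-ins for prompted model calls; the actual clembench implementation differs in its details.

```python
def violates_taboo(clue: str, target: str, related: list[str]) -> bool:
    """A clue is invalid if it mentions the target word or any forbidden related word."""
    lowered = clue.lower()
    return any(word.lower() in lowered for word in [target] + related)

def play_taboo(describer, guesser, target: str, related: list[str], max_turns: int = 3):
    """Run one Taboo-style episode; a rule violation aborts it immediately."""
    for turn in range(max_turns):
        clue = describer(target, related)
        if violates_taboo(clue, target, related):
            return {"outcome": "aborted", "turns": turn + 1}
        guess = guesser(clue)
        if guess.strip().lower() == target.lower():
            return {"outcome": "success", "turns": turn + 1}
    return {"outcome": "failure", "turns": max_turns}
```

The point of the abort branch is that rule adherence is scored separately from game success: a model that mentions a forbidden word fails to "play" the game at all, regardless of whether the guesser would have succeeded.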

The framework enables fast evaluation by implementing a Game Master that manages game flow, ensures interactions follow specified formats, and provides feedback iteratively.
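The Game Master's mediation step might be sketched as follows. The required reply prefix and the abort-on-format-violation behaviour are illustrative assumptions about the protocol, not the framework's exact API.

```python
class FormatError(Exception):
    """Raised when a player's reply does not follow the required surface format."""

class GameMaster:
    def __init__(self, required_prefix: str):
        self.required_prefix = required_prefix
        self.transcript = []  # record of (prompt, raw reply) pairs for scoring

    def mediate(self, player, prompt: str) -> str:
        """Forward the prompt, log the raw reply, and validate its format."""
        reply = player(prompt)
        self.transcript.append((prompt, reply))
        if not reply.startswith(self.required_prefix):
            raise FormatError(f"reply must start with {self.required_prefix!r}")
        return reply[len(self.required_prefix):].strip()
```

Centralising format checks in the mediator keeps the games themselves simple: each game only sees well-formed moves, and formatting failures are recorded uniformly across games.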

Figure 2: Overview of % played games and micro-average quality score for all models and games. Perfect performance in the benchmark would be represented with all markers overlapping in the top right corner.
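The two axes of that figure can be reproduced from episode records roughly as follows: "% played" counts episodes the model completed without rule violations, and the quality score averages per-episode goal attainment over played episodes. Combining them multiplicatively into one overall score is an assumption of this sketch.

```python
def aggregate(episodes):
    """episodes: list of dicts with 'aborted' (bool) and 'quality' (0-100 or None)."""
    played = [e for e in episodes if not e["aborted"]]
    pct_played = 100.0 * len(played) / len(episodes) if episodes else 0.0
    qualities = [e["quality"] for e in played if e["quality"] is not None]
    avg_quality = sum(qualities) / len(qualities) if qualities else 0.0
    # Overall score rewards models that both follow the rules and play well.
    overall = (pct_played / 100.0) * avg_quality
    return {"pct_played": pct_played, "avg_quality": avg_quality, "overall": overall}
```

Separating the two measures matters diagnostically: a model can fail by not understanding the game's format at all, or by playing legally but poorly, and these are different deficits.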

Results

The results underscore significant performance variations across models and games. Advanced models like GPT-4 demonstrated superior capabilities in adhering to game rules and optimizing interaction quality. However, notable disparities remain between human performance and current model capabilities, especially in games demanding strategic foresight or dynamic feedback integration. Performance improved markedly from GPT-3 to GPT-4, highlighting the effectiveness of instruction tuning and RLHF.

Figure 3: Closeness Scores Progression for all episodes of GPT-4 play
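As a hypothetical illustration of a closeness score such as the one tracked for the drawing game, one could measure the fraction of grid cells that match between the target image and the drawn image; the paper's exact scoring function may differ.

```python
def closeness(target: list[list[str]], drawn: list[list[str]]) -> float:
    """Fraction of cells in the drawn grid that match the target grid."""
    total = sum(len(row) for row in target)
    matches = sum(
        1
        for t_row, d_row in zip(target, drawn)
        for t, d in zip(t_row, d_row)
        if t == d
    )
    return matches / total if total else 0.0
```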

Implications

The findings from Clembench suggest that cLLMs are capable of simulating interactive agents, but their proficiency varies across tasks. The disparity between model and expected human performance indicates the need for further refinement in interaction design and learning mechanisms. The clembench framework provides a structured pathway to evaluate cLLMs' contextual understanding and decision-making processes, promising diagnostic value for the foreseeable future.

Future Directions

The paper proposes advancements in expanding the benchmarking framework, including multilingual support, more complex interactive scenarios, and hybrid games incorporating multimodal inputs. There is also potential to use the insights gained to fine-tune models for particular tasks, enhancing both practical applications and theoretical understanding of AI capabilities.

Conclusion

"Clembench" establishes a promising strategy for systematically exploring the conversational capacities of modern cLLMs. By leveraging interactive language games, it surfaces nuanced insights into the rule adherence, abstract reasoning, and decision-making of these models. As AI continues to evolve, Clembench provides a strong foundation for evaluating and enhancing interactive AI systems, helping to ensure alignment with both human expectations and domain-specific demands. Ultimately, this work points toward deeply interactive AI that engages naturally with human cognitive and communicative patterns, though considerable progress is still needed.
