
How Different AI Chatbots Behave? Benchmarking Large Language Models in Behavioral Economics Games

Published 16 Dec 2024 in cs.AI and cs.CL | (2412.12362v1)

Abstract: The deployment of LLMs in diverse applications requires a thorough understanding of their decision-making strategies and behavioral patterns. As a supplement to a recent study on the behavioral Turing test, this paper presents a comprehensive analysis of five leading LLM-based chatbot families as they navigate a series of behavioral economics games. By benchmarking these AI chatbots, we aim to uncover and document both common and distinct behavioral patterns across a range of scenarios. The findings provide valuable insights into the strategic preferences of each LLM, highlighting potential implications for their deployment in critical decision-making roles.

Summary

  • The paper demonstrates that while LLM chatbots exhibit human-like decision clustering, their behaviors remain narrowly defined compared to broader human variability.
  • The study applies six behavioral economics games with fifty trials each to measure fairness, trust, risk aversion, and cooperation across five advanced AI models.
  • Results reveal both strong Turing test performances and significant behavioral inconsistencies, underscoring the need for improved model training and alignment.

Benchmarking LLMs in Behavioral Economics Games

In the field of artificial intelligence, the behavioral tendencies of LLMs are pivotal for their effective deployment in decision-making scenarios across diverse applications. The study titled "How Different AI Chatbots Behave? Benchmarking Large Language Models in Behavioral Economics Games" contributes to this understanding by scrutinizing the decisions and behavioral patterns of five leading LLM-based AI chatbot families—OpenAI's GPT, Meta's Llama, Google's Gemini, Anthropic's Claude, and Mistral's models. This analysis expands on previously established frameworks, particularly those focusing on OpenAI's ChatGPT, and provides a broader comparison across different models using a range of behavioral economics games.

Methodology

The researchers employed a suite of six behavioral economics games—Dictator, Ultimatum, Trust, Public Goods, Bomb Risk, and Prisoner’s Dilemma—to evaluate the AI chatbots across dimensions such as altruism, fairness, trust, risk aversion, and cooperation. Each model, including multiple variants, was tasked with generating responses, with fifty valid instances collected per game to construct behavioral profiles for comparison with human data from prior studies.
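The collection procedure described above—prompting each model repeatedly and keeping only parsable, in-range answers until fifty valid instances are gathered—can be sketched as follows. This is a minimal illustration, not the paper's actual code: the prompt wording, the `stub_chatbot` stand-in, and the validation rules are all assumptions for the example's sake.

```python
import re
import random

def play_dictator_game(chatbot, endowment=100, n_valid=50, max_attempts=500):
    """Collect n_valid allocations from a chatbot playing the Dictator game.

    `chatbot` is any callable mapping a prompt string to a reply string;
    replies without a parsable in-range allocation are discarded,
    mirroring the idea of keeping only "valid instances" per game.
    """
    prompt = (
        f"You have ${endowment} to split between yourself and an anonymous "
        "stranger. How much do you give the stranger? Answer with a number."
    )
    allocations = []
    for _ in range(max_attempts):
        if len(allocations) >= n_valid:
            break
        reply = chatbot(prompt)
        match = re.search(r"\d+(?:\.\d+)?", reply)
        if match:
            amount = float(match.group())
            if 0 <= amount <= endowment:  # reject out-of-range answers
                allocations.append(amount)
    return allocations

# Stand-in "chatbot" (hypothetical) that answers like a fairness-leaning model.
def stub_chatbot(prompt):
    return f"I would give ${random.choice([40, 50, 50, 50, 60])}."

profile = play_dictator_game(stub_chatbot)
```

In practice `stub_chatbot` would be replaced by a call to the model under test; the retry-until-valid loop is what turns free-text replies into a comparable behavioral profile.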

Key Findings

Several noteworthy outcomes emerged from the study:

  • Human-like Behavior Capture: All tested chatbots demonstrated the ability to capture specific modes of human behavior, leading to highly concentrated decision distributions. However, these distributions were more narrowly defined than those observed in human populations.
  • Fairness Emphasis: Chatbots, in general, showed a marked preference for decisions maximizing fairness. This was consistent across models, suggesting a design bias or characteristic common to the LLMs analyzed.
  • Behavioral Inconsistencies: Despite their advanced capabilities, AI systems exhibited significant inconsistencies in their preferences across different games, raising questions about their generalizability and adaptability in varied contexts.
  • Turing Test Performance: While chatbots like Meta’s Llama 3.1 405B recorded a relatively high success rate in passing the Turing test with humans, the behavior distributions produced by these AI systems did not fully mirror the diversity seen in human behaviors.
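One way to make the "more narrowly defined than human" claim concrete is to measure the distance between a chatbot's decision distribution and a human baseline, for instance with the 1-D earth mover's (Wasserstein) distance. The samples below are invented for illustration and are not the paper's data; the metric choice is also an assumption, not necessarily the one the authors used.

```python
def wasserstein_1d(xs, ys):
    """Earth mover's distance between two equal-weight 1-D samples.

    For equal-sized samples this reduces to the mean absolute
    difference between the sorted values.
    """
    assert len(xs) == len(ys), "sketch assumes equal sample sizes"
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)

# Illustrative (fabricated) samples: a chatbot clustering tightly around
# 50/50 splits versus a wider, partly self-interested human distribution.
chatbot_offers = [50] * 40 + [40] * 5 + [60] * 5
human_offers = [0] * 10 + [20] * 10 + [30] * 10 + [50] * 15 + [100] * 5

gap = wasserstein_1d(chatbot_offers, human_offers)
```

A small `gap` would indicate human-like spread; the narrow clustering reported in the paper corresponds to a large gap even when the chatbot's modal choice matches the human mode.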

Implications

The study's results suggest that although LLMs are advancing toward more nuanced human-like behaviors, there remains a notable gap in diversity and consistency when compared to human judgment. This has critical implications for areas where LLMs could be employed in decision-making roles that demand understanding and emulation of complex human behaviors.

Furthermore, the study highlights the necessity for continual refinement in LLMs to reduce behavioral inconsistencies and enhance distribution similarity to human behaviors. Such refinements could involve improvements in the models' training processes or the evaluation of larger and more diverse datasets during model development.

Future Prospects

Looking ahead, one potential area of exploration is the development of alignment objectives within LLMs that transcend the confines of specific game scenarios. This could involve training and alignment pipelines that capture a deeper range of human behavioral complexity and variability.

Additionally, the ongoing evolution of model checkpoints, as observed across different versions of the GPT and Claude models, indicates that behavioral patterns shift with updates; researchers therefore need to track these changes continually. Exploring such temporal shifts can yield insights useful for iterative enhancements and for understanding the trajectory of LLM behavioral adaptation.

In conclusion, this paper underscores the critical intersection of behavioral science and artificial intelligence, highlighting not only where current technologies stand but also their potential paths toward more robust human-comparable decision-making frameworks.
