
ToMBench: Benchmarking Theory of Mind in Large Language Models

Published 23 Feb 2024 in cs.CL and cs.AI | (2402.15052v2)

Abstract: Theory of Mind (ToM) is the cognitive capability to perceive and ascribe mental states to oneself and others. Recent research has sparked a debate over whether LLMs exhibit a form of ToM. However, existing ToM evaluations are hindered by challenges such as constrained scope, subjective judgment, and unintended contamination, yielding inadequate assessments. To address this gap, we introduce ToMBench with three key characteristics: a systematic evaluation framework encompassing 8 tasks and 31 abilities in social cognition, a multiple-choice question format to support automated and unbiased evaluation, and a build-from-scratch bilingual inventory to strictly avoid data leakage. Based on ToMBench, we conduct extensive experiments to evaluate the ToM performance of 10 popular LLMs across tasks and abilities. We find that even the most advanced LLMs like GPT-4 lag behind human performance by over 10 percentage points, indicating that LLMs have not achieved a human-level theory of mind yet. Our aim with ToMBench is to enable an efficient and effective evaluation of LLMs' ToM capabilities, thereby facilitating the development of LLMs with inherent social intelligence.


Summary

  • The paper introduces the ToMBench framework to benchmark LLMs’ Theory of Mind by evaluating 31 social cognitive abilities using eight automated tasks.
  • It reveals that state-of-the-art LLMs like GPT-4 lag behind human performance by over 10 percentage points in nuanced social reasoning tasks.
  • Chain-of-Thought prompting failed to enhance ToM skills, emphasizing the need for improved methodologies in assessing LLM cognitive abilities.

"ToMBench: Benchmarking Theory of Mind in LLMs"

The paper "ToMBench: Benchmarking Theory of Mind in LLMs" introduces a systematic benchmark, ToMBench, designed to evaluate the Theory of Mind (ToM) capabilities of LLMs. This benchmarking framework encompasses a wide array of tasks and abilities to address the shortcomings of previous ToM assessments.

ToMBench Framework

Evaluation Framework

ToMBench is designed as a comprehensive evaluation framework consisting of eight tasks and thirty-one abilities related to social cognition. The tasks are presented in a multiple-choice question format, enabling automated and unbiased assessment, and draw on a bilingual inventory built from scratch to mitigate data leakage.

Systematic Task Design

Figure 1

Figure 1: ToMBench is a systematic, automated, and original bilingual ToM benchmark for LLMs, covering 8 tasks and 31 abilities. ToMBench contains 2,860 testing samples involving diverse real-world social scenarios.

The framework includes tasks such as the Unexpected Outcome Test, Scalar Implicature Task, and False Belief Task, among others. These tasks, grounded in established psychological frameworks, facilitate a robust evaluation of not only task performance but also specific social cognitive abilities.

Experimentation and Findings

LLMs' Performance

Experiments revealed that state-of-the-art LLMs like GPT-4 lag behind human-level ToM capabilities by over 10 percentage points. This gap is particularly evident in tasks requiring nuanced social understanding, such as the Scalar Implicature Task, where LLM performance was lowest because it hinges on understanding quantifiers and implicated meanings.

Comparison with Human Baselines

Despite instances where LLMs outperformed human participants in specific tasks (e.g., false belief tasks), these do not translate into overarching ToM competency. The human baselines demonstrated a more consistent and comprehensive understanding of ToM across varied scenarios.

Prompting Strategies

Evaluation using Chain-of-Thought (CoT) prompting failed to significantly enhance ToM performance. This suggests that while CoT can decompose complex tasks into simpler ones, it does not align well with genuine cognitive reasoning in tasks related to ToM for LLMs.
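The comparison above hinges on how the same item is framed with and without step-by-step reasoning. The templates below are an illustrative sketch of such a setup, not the paper's actual prompts.

```python
# Sketch of direct vs. Chain-of-Thought prompting for a ToM
# multiple-choice item. Templates are illustrative assumptions,
# not the prompts used in the paper.
def direct_prompt(story: str, question: str, choices: list[str]) -> str:
    opts = "\n".join(choices)
    return (f"{story}\n{question}\n{opts}\n"
            "Answer with the option letter only.")

def cot_prompt(story: str, question: str, choices: list[str]) -> str:
    opts = "\n".join(choices)
    return (f"{story}\n{question}\n{opts}\n"
            "Let's think step by step about what each character knows, "
            "believes, and wants, then answer with the option letter.")
```

Under this setup, the paper's finding is that swapping `direct_prompt` for `cot_prompt` does not reliably raise accuracy, suggesting decomposition alone does not supply the missing mental-state reasoning.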

Analysis of Specific Abilities

Figure 2

Figure 2: Differences between human and LLM attention; color intensity denotes attention weight.

In analyzing specific abilities, LLMs performed adequately in basic emotion recognition but displayed significant deficiencies in understanding complex beliefs and desires. This performance gap highlights that LLMs struggle with tasks demanding deep cognitive reasoning and understanding beyond surface-level semantics.

Coherent Testing

ToMBench also introduces a coherent testing methodology in which an LLM must answer all questions associated with a single story correctly to demonstrate understanding. This stricter evaluation criterion revealed an even larger disparity between machines and humans, further illustrating LLMs' limitations in grasping the full context of social scenarios.

Figure 3

Figure 3: The performance variance under the coherent test.
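The coherent-test criterion described above can be sketched as follows: group predictions by story and count a story as solved only when every attached question is answered correctly. The record format here is an assumption for illustration, not ToMBench's actual schema.

```python
# Sketch of coherent-test accuracy: a story counts as correct only when
# all of its questions are answered correctly. Field names are assumed.
from collections import defaultdict

def coherent_accuracy(records: list[dict]) -> float:
    """records: [{"story_id": ..., "gold": "A", "pred": "A"}, ...]"""
    by_story = defaultdict(list)
    for r in records:
        by_story[r["story_id"]].append(r["gold"] == r["pred"])
    solved = sum(all(flags) for flags in by_story.values())
    return solved / len(by_story)

recs = [
    {"story_id": 1, "gold": "A", "pred": "A"},
    {"story_id": 1, "gold": "B", "pred": "C"},  # one miss sinks story 1
    {"story_id": 2, "gold": "D", "pred": "D"},
]
print(coherent_accuracy(recs))  # 0.5
```

Because a single wrong answer fails the whole story, coherent accuracy is always at most per-question accuracy, which is why this criterion widens the gap between LLMs and humans.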

Conclusion

ToMBench presents a critical advancement in the evaluation of LLMs' social cognitive abilities. By broadening the spectrum of assessed abilities and introducing a robust methodological framework, ToMBench provides a comprehensive toolset for advancing LLMs toward more human-like social intelligence. Future work will need to address the integration of multimodal inputs to further refine and enhance the ToM capabilities of LLMs.
