
ToMBench: Benchmarking Theory of Mind in Large Language Models

Published 23 Feb 2024 in cs.CL and cs.AI | (2402.15052v2)

Abstract: Theory of Mind (ToM) is the cognitive capability to perceive and ascribe mental states to oneself and others. Recent research has sparked a debate over whether LLMs exhibit a form of ToM. However, existing ToM evaluations are hindered by challenges such as constrained scope, subjective judgment, and unintended contamination, yielding inadequate assessments. To address this gap, we introduce ToMBench with three key characteristics: a systematic evaluation framework encompassing 8 tasks and 31 abilities in social cognition, a multiple-choice question format to support automated and unbiased evaluation, and a build-from-scratch bilingual inventory to strictly avoid data leakage. Based on ToMBench, we conduct extensive experiments to evaluate the ToM performance of 10 popular LLMs across tasks and abilities. We find that even the most advanced LLMs like GPT-4 lag behind human performance by over 10 percentage points, indicating that LLMs have not achieved a human-level theory of mind yet. Our aim with ToMBench is to enable an efficient and effective evaluation of LLMs' ToM capabilities, thereby facilitating the development of LLMs with inherent social intelligence.


Summary

  • The paper introduces the ToMBench framework to benchmark LLMs’ Theory of Mind by evaluating 31 social cognitive abilities using eight automated tasks.
  • It reveals that state-of-the-art LLMs like GPT-4 lag behind human performance by over 10 percentage points in nuanced social reasoning tasks.
  • Chain-of-Thought prompting failed to enhance ToM skills, emphasizing the need for improved methodologies in assessing LLM cognitive abilities.

"ToMBench: Benchmarking Theory of Mind in LLMs"

The paper "ToMBench: Benchmarking Theory of Mind in LLMs" introduces a systematic benchmark, ToMBench, designed to evaluate the Theory of Mind (ToM) capabilities of LLMs. This benchmarking framework encompasses a wide array of tasks and abilities to address the shortcomings of previous ToM assessments.

ToMBench Framework

Evaluation Framework

ToMBench is designed as a comprehensive evaluation framework consisting of eight tasks and thirty-one abilities related to social cognition. The tasks are presented in a multiple-choice question format, enabling automated and unbiased assessment, and draw on a bilingual inventory built from scratch to mitigate data leakage.

Systematic Task Design

Figure 1

Figure 1: ToMBench is a systematic, automated, and original bilingual ToM benchmark for LLMs, covering 8 tasks and 31 abilities. ToMBench contains 2,860 testing samples involving diverse real-world social scenarios.

The framework includes tasks such as the Unexpected Outcome Test, Scalar Implicature Task, and False Belief Task, among others. These tasks, grounded in established psychological frameworks, facilitate a robust evaluation of not only task performance but also specific social cognitive abilities.

Experimentation and Findings

LLMs' Performance

Experiments revealed that state-of-the-art LLMs like GPT-4 lag behind human-level ToM capabilities by over 10 percentage points. This gap is particularly evident in tasks requiring nuanced social understanding, such as the Scalar Implicature Task, where LLM performance was lowest because it hinges on understanding quantifiers and implicated meanings.

Comparison with Human Baselines

Despite instances where LLMs outperformed human participants in specific tasks (e.g., false belief tasks), these do not translate into overarching ToM competency. The human baselines demonstrated a more consistent and comprehensive understanding of ToM across varied scenarios.

Prompting Strategies

Evaluation using Chain-of-Thought (CoT) prompting failed to significantly enhance ToM performance. This suggests that while CoT can decompose complex tasks into simpler ones, it does not align well with genuine cognitive reasoning in tasks related to ToM for LLMs.
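The comparison above hinges on how the same item is framed with and without step-by-step reasoning. The templates below are an illustrative sketch of such a setup, not the paper's actual prompts.

```python
# Sketch of direct vs. Chain-of-Thought prompting for a ToM
# multiple-choice item. Templates are illustrative assumptions,
# not the prompts used in the paper.
def direct_prompt(story: str, question: str, choices: list[str]) -> str:
    opts = "\n".join(choices)
    return (f"{story}\n{question}\n{opts}\n"
            "Answer with the option letter only.")

def cot_prompt(story: str, question: str, choices: list[str]) -> str:
    opts = "\n".join(choices)
    return (f"{story}\n{question}\n{opts}\n"
            "Let's think step by step about what each character knows, "
            "believes, and wants, then answer with the option letter.")
```

Under this setup, the paper's finding is that swapping `direct_prompt` for `cot_prompt` does not reliably raise accuracy, suggesting decomposition alone does not supply the missing mental-state reasoning.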

Analysis of Specific Abilities

Figure 2

Figure 2: Differences between human and LLM attention; color intensity denotes attention weight.

In analyzing specific abilities, LLMs performed adequately in basic emotion recognition but displayed significant deficiencies in understanding complex beliefs and desires. This performance gap highlights that LLMs struggle with tasks demanding deep cognitive reasoning and understanding beyond surface-level semantics.

Coherent Testing

ToMBench also introduces a coherent testing methodology in which an LLM must answer all questions associated with a single story correctly to demonstrate understanding. This stricter evaluation criterion revealed an even larger disparity between machines and humans, further illustrating LLMs' limitations in grasping the full context of social scenarios.

Figure 3

Figure 3: The performance variance under the coherent test.
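The coherent-test criterion described above can be sketched as follows: group predictions by story and count a story as solved only when every attached question is answered correctly. The record format here is an assumption for illustration, not ToMBench's actual schema.

```python
# Sketch of coherent-test accuracy: a story counts as correct only when
# all of its questions are answered correctly. Field names are assumed.
from collections import defaultdict

def coherent_accuracy(records: list[dict]) -> float:
    """records: [{"story_id": ..., "gold": "A", "pred": "A"}, ...]"""
    by_story = defaultdict(list)
    for r in records:
        by_story[r["story_id"]].append(r["gold"] == r["pred"])
    solved = sum(all(flags) for flags in by_story.values())
    return solved / len(by_story)

recs = [
    {"story_id": 1, "gold": "A", "pred": "A"},
    {"story_id": 1, "gold": "B", "pred": "C"},  # one miss sinks story 1
    {"story_id": 2, "gold": "D", "pred": "D"},
]
print(coherent_accuracy(recs))  # 0.5
```

Because a single wrong answer fails the whole story, coherent accuracy is always at most per-question accuracy, which is why this criterion widens the gap between LLMs and humans.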

Conclusion

ToMBench presents a critical advancement in the evaluation of LLMs' social cognitive abilities. By broadening the spectrum of assessed abilities and introducing a robust methodological framework, ToMBench provides a comprehensive toolset for advancing LLMs toward more human-like social intelligence. Future work will need to address the integration of multimodal inputs to further refine and enhance the ToM capabilities of LLMs.
