Academically intelligent LLMs are not necessarily socially intelligent

Published 11 Mar 2024 in cs.CL and cs.CY | (2403.06591v1)

Abstract: The academic intelligence of LLMs has made remarkable progress in recent times, but their social intelligence performance remains unclear. Inspired by established human social intelligence frameworks, particularly Daniel Goleman's social intelligence theory, we have developed a standardized social intelligence test based on real-world social scenarios to comprehensively assess the social intelligence of LLMs, termed the Situational Evaluation of Social Intelligence (SESI). We conducted an extensive evaluation with 13 recent popular and state-of-the-art LLM agents on SESI. The results indicate the social intelligence of LLMs still has significant room for improvement, with superficial friendliness as a primary reason for errors. Moreover, there exists a relatively low correlation between the social intelligence and academic intelligence exhibited by LLMs, suggesting that social intelligence is distinct from academic intelligence for LLMs. Additionally, while it is observed that LLMs can't "understand" what social intelligence is, their social intelligence, similar to that of humans, is influenced by social factors.

Summary

The paper demonstrates that LLMs excel in academic tasks but struggle with nuanced social interactions as measured by SESI.
It finds a weak correlation between academic and social intelligence, highlighting LLMs' inability to adapt to real social cues.
The study shows that personality traits and assigned social roles can enhance LLMs’ social performance, suggesting new training paradigms.

Introduction

This paper investigates the social intelligence of LLMs, which have already shown strong performance on academic intelligence tasks. Using the Situational Evaluation of Social Intelligence (SESI), the authors measure how well LLMs handle real-world social scenarios. The study finds that LLMs have notable deficiencies in social intelligence and that their social intelligence correlates only weakly with their academic intelligence, and it argues that the two forms of intelligence should be investigated independently.

Inspired by human social intelligence frameworks, particularly Daniel Goleman's theory, the SESI benchmark was designed to evaluate LLMs' performance in authentic social situations derived from user interactions on platforms like Reddit. SESI tests five subcategories of social intelligence: empathy, social cognition, self-presentation, influence, and concern.

Figure 1: Overview of the situational evaluation of social intelligence.
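To make the five-subcategory structure concrete, the following is a minimal sketch of how per-subcategory accuracy on a multiple-choice benchmark of this kind could be computed. It is not the authors' evaluation code: the `Item` fields, the `accuracy_by_category` helper, and the toy items are all hypothetical and were invented for illustration. Only the five category names come from the description above.

```python
from collections import defaultdict
from dataclasses import dataclass

# The five subcategories named in the summary above.
CATEGORIES = ("empathy", "social cognition", "self-presentation",
              "influence", "concern")


@dataclass
class Item:
    """One hypothetical multiple-choice item (field names are assumptions)."""
    category: str
    context: str
    question: str
    options: list[str]
    answer: int  # index of the reference option


def accuracy_by_category(items, predictions):
    """Return {category: fraction correct} for the categories that occur."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        if item.category not in CATEGORIES:
            raise ValueError(f"unknown category: {item.category}")
        total[item.category] += 1
        correct[item.category] += int(pred == item.answer)
    return {cat: correct[cat] / total[cat] for cat in total}


# Toy items, invented only to exercise the function (not SESI data).
items = [
    Item("empathy", "ctx", "q", ["a", "b", "c", "d"], 2),
    Item("empathy", "ctx", "q", ["a", "b", "c", "d"], 0),
    Item("influence", "ctx", "q", ["a", "b", "c", "d"], 1),
]
scores = accuracy_by_category(items, [2, 1, 1])
print(scores)
```

Reporting accuracy per subcategory rather than a single aggregate makes it easier to see which of the five abilities is the bottleneck for a given model.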

The SESI benchmark addresses the limitations of existing evaluations by including comprehensive scenarios and dynamic real-world contexts, avoiding the pitfalls of static datasets and potential overfitting in LLMs' training.

Thirteen state-of-the-art LLMs were assessed, revealing that social intelligence is a distinct construct from academic intelligence. The correlation coefficient between SESI scores and academic intelligence metrics was notably lower than that among academic benchmarks alone, affirming that social intelligence requires separate attention.

Figure 2: Heatmap for correlation matrix for social and academic intelligence measures.
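A correlation matrix like the one in Figure 2 can be sketched as follows, assuming Pearson correlation over per-model scores. The summary does not state which coefficient the authors used, so that choice is an assumption. The benchmark names `academic_A` and `academic_B` and every score below are placeholders invented for illustration, not results from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(scores):
    """Pairwise Pearson correlations between benchmarks.

    `scores` maps a benchmark name to a list of scores, one per model,
    with models in the same order for every benchmark.
    """
    names = list(scores)
    return {a: {b: pearson(scores[a], scores[b]) for b in names}
            for a in names}


# Placeholder scores for five hypothetical models (not from the paper).
scores = {
    "SESI":       [0.41, 0.55, 0.48, 0.60, 0.52],
    "academic_A": [0.30, 0.45, 0.70, 0.65, 0.80],
    "academic_B": [0.35, 0.50, 0.72, 0.60, 0.85],
}
matrix = correlation_matrix(scores)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

The comparison described above is then between the off-diagonal entries that involve the social benchmark and those that involve only academic benchmarks.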

LLMs tend toward superficial friendliness: they fall back on fixed response patterns that do not adapt to the social context at hand (Figure 3). They also appear to misunderstand prompts that specify a level of social intelligence, often performing worse when the prompt assigns them a higher level, which suggests that they do not grasp what the concept means.

Figure 4: Change Ratio in the social intelligence performance of LLM agents following the manipulation of factors.

The study examines how personality, gender, social role, and perspective affect LLMs' social intelligence. LLMs assigned low agreeableness and high extraversion tend to perform better, likely because these traits counteract the models' overly friendly bias. Similarly, explicitly assigning a male gender or a specific social role such as a salesperson improves performance. These effects suggest that aligning an LLM's assigned characteristics with real-world stereotypes can improve its social responses.

Figure 5: Impact of social factors on social intelligence (SI) performance of LLM agents.

SESI Benchmark Characteristics

SESI provides long, intricate contexts and diverse questions that require nuanced understanding and application of social intelligence (Figure 6). It presents a comprehensive and balanced assessment of social cognitive processes and social facility capabilities, setting it apart from other benchmarks which predominantly focus on social awareness alone.

Figure 6: SESI benchmark statistics.

Conclusion

The research underscores that social intelligence is not inherently aligned with academic intelligence in LLMs, suggesting the need for distinct training paradigms and benchmarks. As LLMs continue to integrate into socially interactive applications, understanding and enhancing their social capabilities remains crucial for reliable and effective human-AI interactions. Future work should focus on refining social intelligence benchmarks and exploring unique training methodologies to improve LLMs' understanding of complex social dynamics.

Knowledge gaps, limitations, and open questions

Below is a consolidated list of what remains missing, uncertain, or unexplored in the paper, framed to guide actionable future work:

Ground-truth validity: “Correct” answers are derived from Reddit upvote consensus (top five comments), which may encode popularity, platform-specific norms, and demographic biases rather than socially optimal or ethical judgments. How do results change under expert-annotated or cross-cultural gold standards?
Cultural and domain bias: SESI scenarios are sourced solely from r/relationships (Western, English-speaking, romantic/family-centric). Does performance generalize to other cultures, languages, platforms, and social domains (workplace, healthcare, education, civic contexts)?
Training contamination: No deduplication or leakage analysis is reported. Could models have seen the source posts (or near-duplicates) during pretraining or instruction-tuning, inflating scores? How does SESI performance change after filtering known or likely training data overlaps?
Construction bias via GPT-3.5: The same model family used in evaluation helped generate contexts, summaries, and “reversed” answers, risking stylistic artifacts and favoritism. What is the impact when dataset construction uses human annotators or diverse model ensembles?
Psychometric validation: The benchmark lacks formal reliability and validity evidence (e.g., internal consistency, test–retest reliability, item response theory, factor analyses confirming the five-factor structure, measurement invariance across subgroups/models). Can SESI be psychometrically validated?
Human baselines: No human performance (overall and per-subskill) is reported. How do humans (expert/non-expert, cross-cultural cohorts) perform on SESI, and what is the human–LLM gap?
Multi-turn and outcome-based social facility: SESI is single-turn, multiple-choice. Does model “social facility” hold in interactive, multi-turn settings with dynamic feedback and measurable outcomes (e.g., user satisfaction, conflict resolution success)?
Confound with reading/comprehension load: Long contexts and many agents may conflate social intelligence with reading comprehension and working memory. Can ablations equate or vary context length and entity count to disentangle these factors?
Correlation claims with academic intelligence: The correlation analysis uses a small model sample (n=13) and does not control for confounds (model size, architecture, decoding settings). Are results robust under partial correlations, bootstrapping, and confidence intervals?
Decoding and reproducibility: Temperature/top-p/seed settings, response parsing, and retry policies are not specified. How sensitive are SESI scores to decoding parameters and run-to-run variance across models?
Parsing robustness: The pipeline converts free-form outputs to options; failure modes and parsing errors are not quantified. What is the error rate of the parser, and does it bias model comparisons?
Error taxonomy rigor: The “superficially friendly” diagnosis is based on manual categorization of 50 errors/model without reported inter-annotator agreement or coding protocol. Can a larger, blinded, multi-rater study validate the error taxonomy?
RLHF hypothesis untested: The paper speculates that RLHF induces superficial friendliness but does not test aligned vs base models or alignment ablations. Do de-aligned/base variants or alternative alignment strategies change error patterns?
Persona, gender, role, and perspective manipulations: Effects are shown but interactions, causal mechanisms, and safety implications remain unclear. Are results robust across prompts, seeds, and cultures? Do these manipulations amplify bias or harmful behaviors?
Gender finding validity: “Male” persona performing better conflicts with some human literature; effects vanish with implicitly gendered roles. Is this an artifact of prompts, lexical cues, or training data? How does this vary cross-culturally and with debiased prompts?
Role prompts and alignment with context: Different role-insertion methods show heterogeneous effects, yet mechanisms are unclear. Can a controlled factorial design isolate when role alignment helps vs harms, and why?
Scope and dynamics of SESI: Although claimed “dynamic,” the study uses a single 2023 snapshot. How do results evolve over time, with refreshed posts, and under distribution shifts (temporal drift)?
Answer-option construction: “Least entailed” selection is referenced but the entailment method is unspecified; reversed answers are GPT-generated, risking stylistic cues. Can adversarial filtering and style-matching remove superficial answer cues?
Scoring scheme: Group-consensus scoring can reward popular-but-wrong responses. How do conclusions change with expert scoring, hybrid scoring (expert + crowd), or outcome-based evaluation?
Format effects: Multiple-choice may cue superficial elimination strategies. How do models fare on open-ended generation with human grading or rubric-based automatic scoring?
Safety and ethics: The benchmark leverages sensitive real-world posts. Were consent, anonymization, and content safety safeguards applied? Do models produce harmful advice, and how should SESI integrate safety-aligned scoring?
Fairness and harm analysis: Findings (e.g., low agreeableness improves scores) may incentivize antisocial personas. What are the fairness, bias, and downstream harm implications of optimizing for SESI, and how can guardrails be integrated?
Generalization to other social benchmarks: The paper qualitatively contrasts SESI with SocialIQA/SOTOPIA/EmoBench but provides no quantitative cross-benchmark correlations or joint evaluations. How consistent are model rankings across social-intelligence datasets?
Subskill breakdown and diagnostics: Aggregate SESI scores are emphasized; per-subskill error patterns, item difficulties, and learning curves are not analyzed. Which subskills are bottlenecks, and what targeted interventions help?
Interventions to improve SI: The study diagnoses deficits but does not test training strategies (e.g., supervised fine-tuning on social dialogues, RL from group consensus, debate, constitution-guided alignment). Which approaches most effectively reduce “superficial friendliness”?
Memory and tool use: It remains unknown whether external memory, planning tools, or chain-of-thought reliably improve SESI performance, especially in longer, multi-actor scenarios.
Replicability and openness: Key implementation details (prompts, seeds, parsing rules, entailment models) are not fully specified in the main text; reproducibility of the full pipeline (including construction randomness) is untested.
Interaction effects among factors: The study varies persona, emotion, gender, role, and perspective mostly in isolation. Do combined factors interact nonlinearly, and which interactions are most impactful (e.g., extraversion × role × perspective)?
Measurement invariance across models: It is unclear whether SESI measures the same construct across different architectures. Can invariance tests confirm construct comparability across model families and sizes?
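The item on correlation claims asks whether a result based on 13 models is robust under bootstrapping and confidence intervals. A minimal sketch of a percentile bootstrap for a Pearson correlation is given below. The choice of Pearson correlation, the `bootstrap_ci` helper, its default parameters, and the randomly generated placeholder scores are all assumptions made for illustration. None of this reproduces the paper's analysis or data.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation, or NaN if either sequence is constant."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def bootstrap_ci(x, y, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the correlation.

    Models (paired scores) are resampled with replacement. Resamples in
    which one variable is constant are discarded and redrawn.
    """
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need paired sequences of length >= 3")
    if math.isnan(pearson(x, y)):
        raise ValueError("correlation is undefined for constant input")
    rng = random.Random(seed)
    n = len(x)
    stats = []
    while len(stats) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        r = pearson([x[i] for i in idx], [y[i] for i in idx])
        if not math.isnan(r):
            stats.append(r)
    stats.sort()
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return stats[lo_idx], stats[hi_idx]


# Placeholder scores for 13 hypothetical models (randomly generated).
rng = random.Random(42)
social = [rng.random() for _ in range(13)]
academic = [0.5 * s + 0.5 * rng.random() for s in social]

low, high = bootstrap_ci(social, academic)
print(f"r = {pearson(social, academic):.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

With so few models the interval is typically wide, which is why the item above asks for interval estimates rather than point correlations alone.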
