CulturalToM-VQA: Cross-Cultural VQA
- CulturalToM-VQA is a benchmark framework that tests vision–language models on culturally grounded Theory of Mind reasoning with diverse social and cultural cues.
- It combines expert-curated images, a multi-layer task taxonomy, and structured question formats to assess nuanced social inference.
- Evaluations using accuracy and macro-F1 metrics highlight challenges in cross-cultural generalization, false belief reasoning, and recursive multi-agent inference.
CulturalToM-VQA refers to Visual Question Answering (VQA) frameworks, models, and benchmarks designed to evaluate and advance the cross-cultural Theory of Mind (ToM) reasoning abilities of vision–language models (VLMs) and multimodal LLMs (MLLMs). These benchmarks move beyond Western-centric or purely factual VQA, instead probing the capacity of AI systems to understand, interpret, and reason about culturally grounded visual scenes, including social interactions, rituals, artifacts, and other culturally specific semantic cues (Nazi et al., 19 Dec 2025).
1. Definition and Conceptual Foundation
CulturalToM-VQA benchmarks operationalize the notion of Theory of Mind as the model’s ability to attribute, infer, and explain culturally specific beliefs, intentions, emotions, and norms as expressed through visual input. Unlike conventional VQA tasks, which often focus on object recognition or generic factual knowledge, CulturalToM-VQA tasks require models to integrate semantic, affective, and contextual cues—such as attire, rituals, gestures, and interpersonal dynamics—within a culture-dependent framework (Nazi et al., 19 Dec 2025). The result is a multidimensional testbed targeting both surface-level recognition and the deeper cognitive aspects of social and cultural inference.
2. Benchmark Construction and Taxonomy
CulturalToM-VQA benchmarks are designed via a combination of expert curation, automated augmentation, and rigorous verification. A representative instantiation is the CulturalToM-VQA dataset (Nazi et al., 19 Dec 2025), constructed through a VLM-assisted human-in-the-loop pipeline:
- Image Curation: Images are sourced from diverse datasets focused on emotionally and culturally rich scenes (e.g., FindingEmo, CulturalVQA, CVQA). Candidates are filtered to ensure the presence of multi-agent interactions and explicit cultural or social cues.
- Scene Annotation: Structured descriptions are generated combining narrative summaries, emotional cues (E), theory-of-mind cues (T), and explicit cultural markers (C), typically with VLM automation followed by human expert refinement.
- Question Generation: Six ToM task types are probed—Mental State Attribution (MSA), False Belief Reasoning (FBR), Non-literal Communication (NLC), Social Norm Violations (SNV), Perspective Coordination (PC), and Multi-Agent Reasoning (MAR)—each stratified into four levels of complexity from direct perception to recursive multi-agent reasoning.
- Question Formats: Questions are multiple-choice with near-miss distractors constructed to demand fine-grained reasoning along both ToM and cultural dimensions.
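The construction pipeline above implies a structured record per benchmark item. The following is a minimal sketch of such a record; all field names are hypothetical, and the released dataset may use a different schema:

```python
from dataclasses import dataclass, field

# Task-type abbreviations as defined in the benchmark taxonomy.
TASK_TYPES = {"MSA", "FBR", "NLC", "SNV", "PC", "MAR"}

@dataclass
class ToMQuestion:
    image_id: str
    culture: str               # e.g. "Nigeria"
    task_type: str             # one of TASK_TYPES
    level: int                 # 1 (direct perception) .. 4 (recursive multi-agent)
    question: str
    options: list              # multiple choices incl. near-miss distractors
    answer_idx: int            # index of the correct option
    cues: dict = field(default_factory=dict)  # {"E": ..., "T": ..., "C": ...}

    def __post_init__(self):
        # Basic validity checks mirroring the benchmark's taxonomy constraints.
        assert self.task_type in TASK_TYPES, f"unknown task type {self.task_type}"
        assert 1 <= self.level <= 4, "complexity level must be 1-4"
        assert 0 <= self.answer_idx < len(self.options), "answer index out of range"
```

The E/T/C cue fields mirror the scene-annotation stage (emotional, theory-of-mind, and cultural markers), so each question stays traceable to its annotated evidence.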
The benchmark achieves coverage of 41 cultures, 394 images, and 5,095 validated questions, with a taxonomy detailed below:
| Task Type | Description | Proportion (%) |
|---|---|---|
| Mental State Attribution | Emotion/desire inference | 16.9 |
| False Belief Reasoning | Distinguishing beliefs from reality | 22.0 |
| Non-literal Communication | Irony, sarcasm, hints | 15.3 |
| Social Norm Violations | Cultural faux-pas detection | 12.5 |
| Perspective Coordination | Knowledge/visibility tracking | 20.3 |
| Multi-Agent Reasoning | Nested/multi-party ToM | 13.0 |
Complexity levels are distributed as 21.7% (Level 1; direct perception), 22.1% (Level 2), 28.4% (Level 3), and 27.8% (Level 4) (Nazi et al., 19 Dec 2025).
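As a worked check, the reported proportions can be converted into approximate question counts over the 5,095 validated items (a sketch; official per-category counts may round differently):

```python
TOTAL = 5_095  # validated questions in the benchmark

# Task-type and complexity-level proportions as reported (percent).
task_pct = {"MSA": 16.9, "FBR": 22.0, "NLC": 15.3, "SNV": 12.5, "PC": 20.3, "MAR": 13.0}
level_pct = {1: 21.7, 2: 22.1, 3: 28.4, 4: 27.8}

def approx_counts(pcts, total=TOTAL):
    """Convert percentage shares into approximate absolute question counts."""
    return {k: round(total * p / 100) for k, p in pcts.items()}
```

Both distributions sum to 100%, and, for example, the 22.0% False Belief Reasoning share corresponds to roughly 1,121 questions.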
3. Evaluation Protocols and Metrics
CulturalToM-VQA adopts standard accuracy and macro-F1 metrics to quantify model performance, with additional diagnostics for per-culture disparity and complexity degradation:
- Accuracy: the fraction of questions answered correctly, Acc = (1/N) Σ_i 1[ŷ_i = y_i].
- Macro-F1: the unweighted mean of per-class F1 scores, Macro-F1 = (1/|C|) Σ_c F1_c.
- Per-level performance degradation: the accuracy drop across complexity levels, Δ_level = Acc(Level 1) − Acc(Level 4).
- Per-culture disparity: the gap between best- and worst-performing cultures, Δ_culture = max_c Acc_c − min_c Acc_c.
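The metrics above can be sketched directly from per-question records (a minimal reference implementation, assuming each record carries a predicted answer, a gold answer, and its culture/level metadata):

```python
from collections import defaultdict

def accuracy(records):
    """Fraction of questions answered correctly."""
    return sum(r["pred"] == r["gold"] for r in records) / len(records)

def macro_f1(records, labels):
    """Unweighted mean of per-label F1 scores."""
    f1s = []
    for lbl in labels:
        tp = sum(r["pred"] == lbl and r["gold"] == lbl for r in records)
        fp = sum(r["pred"] == lbl and r["gold"] != lbl for r in records)
        fn = sum(r["pred"] != lbl and r["gold"] == lbl for r in records)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

def grouped_accuracy(records, key):
    """Per-group accuracy, e.g. key='culture' or key='level'."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r)
    return {g: accuracy(rs) for g, rs in groups.items()}

def disparity(records, key):
    """Gap between best- and worst-scoring groups: per-culture disparity
    for key='culture', complexity degradation for key='level'."""
    accs = grouped_accuracy(records, key)
    return max(accs.values()) - min(accs.values())
```

The same `disparity` helper serves both diagnostics because each is a max-minus-min gap over a grouping of the evaluation set.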
Baselines include zero-shot models, zero-shot Chain-of-Thought (CoT), and compositional CoT paradigms. Annotator agreement is quantified by Cohen’s κ (noted as high), and human expert baselines approach the accuracy ceiling (98.34% on sampled items) (Nazi et al., 19 Dec 2025).
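Cohen’s κ corrects raw annotator agreement for agreement expected by chance; for two annotators it can be computed as follows (a standard implementation, not code from the benchmark release):

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two annotators labeling the same items.

    po = observed agreement rate; pe = chance agreement rate implied
    by each annotator's marginal label distribution.
    """
    assert len(a) == len(b) and a, "annotations must align and be non-empty"
    n = len(a)
    labels = set(a) | set(b)
    po = sum(x == y for x, y in zip(a, b)) / n
    pe = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (po - pe) / (1 - pe) if pe != 1 else 1.0
```

Values near 1 indicate agreement well above chance, which is the regime reported for the benchmark’s annotators.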
4. Empirical Findings and Error Analysis
State-of-the-art VLMs such as Qwen2.5-VL-7B, Phi-4-6B, and Qwen2-VL-7B reach up to 93.9% accuracy and macro-F1 in zero-shot settings on CulturalToM-VQA (Nazi et al., 19 Dec 2025). Major findings include:
- Task-wise Difficulty: Mental State Attribution tasks are nearly solved (98%), while False Belief Reasoning remains most challenging (~60% average, range 19–83% across models).
- Complexity Effects: Performance is high at Level 1 and Level 2, but models show graded drops on Level 3 (second-order reasoning) and Level 4 (recursive multi-agent), with best models at 56–87% on Level 4.
- Cross-cultural Variance: Region-wise accuracy can differ by up to 25 points; some Western and Latin American contexts (>95%) are trivialized by current models, while others (e.g., Argentina, Russia, Nigeria) remain difficult (<75%).
- Prompting Effects: Explicit and compositional CoT prompting yields large gains for mid-tier models but only marginal improvements for top-tier systems already near ceiling.
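The prompting conditions above differ only in how the question is framed. A minimal sketch of the two regimes is shown below; the exact prompt wording used in the paper is not reproduced here, so these templates are illustrative assumptions:

```python
def zero_shot(question, options):
    """Plain multiple-choice prompt: question plus lettered options."""
    opts = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    return f"{question}\n{opts}\nAnswer with a single letter."

def compositional_cot(question, options):
    """Compositional CoT: decompose the social inference into explicit
    sub-steps (scene, per-agent mental states, integration) before answering."""
    steps = (
        "Step 1: Describe the scene and any cultural markers.\n"
        "Step 2: Identify each agent's knowledge, beliefs, and intentions.\n"
        "Step 3: Combine these to choose the best answer.\n"
    )
    return steps + zero_shot(question, options)
```

The decomposition mirrors the benchmark’s finding that mid-tier models benefit most from being forced through intermediate social-reasoning steps.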
Qualitative error analysis identifies frequent failures in false belief attribution, nuanced norm violations, and scenes involving ambiguous or subculture-specific visual cues.
5. Comparison with Related Cultural VQA Frameworks
CulturalToM-VQA differs from related efforts in several key respects:
- TCC-Bench (Xu et al., 16 May 2025): Focuses on Traditional Chinese culture with a bilingual, bias-controlled corpus, emphasizing deep symbolic and contextual grounding in eight cultural domains. TCC-Bench robustly mitigates language priors and tests for genuine visual reasoning, but it does not stratify ToM tasks or social cognition taxonomies.
- VietMEAgent (Nguyen et al., 12 Nov 2025): Introduces a culturally-specific, explainable pipeline for Vietnamese VQA, leveraging a knowledge-based object detection backbone, structured program generation, and dual-modality explanations. It exemplifies cultural grounding with explicit KB integration and interpretable rationales.
- CultureMix (Kim et al., 27 Nov 2025): Probes model robustness in “culture mixing” scenarios—co-occurrence of diverse cultural cues in single scenes—demonstrating that background context can induce high label shift and entropy, revealing significant susceptibility to superficial context cues.
A summary comparison is given below:
| Benchmark | Cultural Scope | Task Focus | Social/ToM Coverage |
|---|---|---|---|
| CulturalToM-VQA | 40+ cultures, global | ToM stratified, social cues | Explicit ToM taxonomy |
| TCC-Bench | Chinese, bilingual | Visual/cultural symbolism | Cultural reasoning |
| VietMEAgent | Vietnamese, explainable | Explanation/programmatic | "Cultural ToM" |
| CultureMix | 30 countries, mixing | Cultural identity preservation | No ToM, context effects |
6. Open Challenges and Directions
CulturalToM-VQA exposes enduring challenges for VLMs:
- Compositional Social Reasoning: Level-4 (recursive multi-agent) ToM tasks and non-Western social contexts remain unsolved for most models.
- Cross-cultural Generalization: Large accuracy gaps persist across underrepresented cultures, suggesting the need for targeted data augmentation, synthetic data generation, and culturally diverse instruction tuning (Nyandwi et al., 10 Aug 2025).
- Bias Mitigation: Existing pretraining regimens induce WEIRD-skewed (Western, educated, industrialized, rich, democratic) priors, and simply scaling up models and data is insufficient for robust cultural inference.
- Task Expansion: Extending evaluation to dialogic, temporal, and open-response ToM tasks, as well as integrating dynamic video scenes and multi-turn exchanges, represents a critical research direction (Nazi et al., 19 Dec 2025).
Potential strategies include the integration of symbolic cultural knowledge with neural architectures, multi-turn and chain-of-thought prompting, and explicit per-culture model adapters or fine-tuning with high-quality cultural corpora (Nyandwi et al., 10 Aug 2025).
7. Broader Implications
CulturalToM-VQA benchmarks mark an inflection point in culturally inclusive AI, enabling systematic assessment and fine-grained diagnosis of cross-cultural and social reasoning. Applications range from education and healthcare to social robotics and digital heritage preservation. However, the present benchmarks highlight the incompleteness of cultural coverage and the importance of future work in expanding representation, mitigating annotation and model biases, and incorporating richer forms of multimodal social interaction (Nazi et al., 19 Dec 2025).