ToM-SSI: Situated Social Interactions in AI

Updated 10 February 2026

ToM-SSI is a computational framework that formalizes theory of mind for AI in dynamic, spatially situated multi-agent interactions.
It leverages multimodal benchmarks combining text, images, and spatial grids to assess perception, belief tracking, and intention inference.
Empirical evaluations reveal that current models struggle with nested reasoning and spatial challenges, guiding future research directions.

ToM-SSI (Theory of Mind in Situated Social Interactions) denotes a family of computational frameworks, benchmarks, and formalizations that evaluate socio-cognitive reasoning in artificial agents, or endow agents with it, in rich, situated, multi-agent environments. The approach goes beyond traditional dyadic, cue-based, text-only settings (e.g., the classic Sally–Anne test) and demands explicit, multimodal, and often recursive mental-state inference across agents engaged in group contexts, spatial reasoning, and dynamic interaction structures. This article surveys formal definitions, benchmark construction, model architectures, evaluation protocols, empirical findings, and current limitations of ToM-SSI, from its conceptual foundations to experimental deployments.

1. Formal Foundations and Notation

ToM-SSI grounds Theory of Mind in multi-agent, spatially situated social interaction and introduces a formalism for agents’ perspectives, belief states, and communicative protocols. Typical formalisms define a set of agents $\mathcal{A} = \{A_1,\ldots,A_n\}$, each with latent state $S_i$, information set $I_i \subseteq \Phi$, and objective $O_i$ (Bortoletto et al., 5 Sep 2025). An agent’s ToM module $\text{ToM}_{i \rightarrow j}$ maps $I_i$ to a probability distribution over $(S_j, I_j, O_j)$. Recursive inference emerges as agents form beliefs about others’ mental states, possibly nested (e.g., “what $A$ thinks about what $B$ thinks”) (Alon et al., 31 Mar 2025).
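The mapping described above can be written compactly. The following is a notational sketch, not the cited papers' own definition: $\Delta(X)$ (the set of probability distributions over $X$), the spaces $\mathcal{S}_j$ and $\mathcal{O}_j$, and the power set $2^{\Phi}$ are symbols introduced here for illustration.

$$
\text{ToM}_{i \rightarrow j} : I_i \;\longmapsto\; P_i(S_j, I_j, O_j \mid I_i) \;\in\; \Delta\!\left(\mathcal{S}_j \times 2^{\Phi} \times \mathcal{O}_j\right)
$$

Under the further assumption that agent $A_i$ simulates $A_j$ by applying $A_j$'s own module to each hypothesized information set, one nested quantity is $A_i$'s expectation of what $A_j$ infers about a third agent $A_k$:

$$
\mathbb{E}_{I_j \sim P_i(\cdot \mid I_i)}\!\left[\text{ToM}_{j \rightarrow k}(I_j)\right] \;=\; \sum_{I_j \subseteq \Phi} P_i(I_j \mid I_i)\, \text{ToM}_{j \rightarrow k}(I_j)
$$

This averages over $A_i$'s uncertainty about $I_j$. A full second-order belief would keep the whole distribution over $\text{ToM}_{j \rightarrow k}(I_j)$ rather than its mean.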

Spatial situatedness is captured by representing agents as occupying cells in a discrete grid $G$ with positions $p_i \in G$. Observations, action history $h_t$, communication reachability (based on the distance between agents' cells), and event-driven transitions define the observable and latent state spaces. Communication channels are grounded in spatial proximity, and utility functions are parameterized by agent attitudes (cooperative, obstructive, or mixed) (Bortoletto et al., 5 Sep 2025). Belief tracking over time is expressed as $b_i^t(\phi)$ for discrete information items $\phi \in \Phi$.
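A minimal sketch of proximity-gated communication on a grid may make the setup concrete. Everything named here is hypothetical: the `Agent` class, the `can_communicate` and `tell` helpers, the choice of Chebyshev distance, and the reach of one cell are assumptions for illustration, not details taken from the benchmark.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple

Cell = Tuple[int, int]  # (row, col) position of an agent in the grid


@dataclass
class Agent:
    name: str
    pos: Cell
    # Information set I_i: the items (a subset of Phi) this agent currently holds.
    known: Set[str] = field(default_factory=set)


def chebyshev(a: Cell, b: Cell) -> int:
    """Grid distance under which diagonal neighbours are adjacent (assumed metric)."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


def can_communicate(x: Agent, y: Agent, reach: int = 1) -> bool:
    """Two distinct agents share a channel only when their cells are within reach."""
    return x is not y and chebyshev(x.pos, y.pos) <= reach


def tell(sender: Agent, receiver: Agent, item: str) -> bool:
    """Transfer one information item if the sender holds it and the channel exists."""
    if item in sender.known and can_communicate(sender, receiver):
        receiver.known.add(item)
        return True
    return False


a = Agent("A1", (0, 0), {"phi_1"})
b = Agent("A2", (1, 1))   # adjacent to A1 under the assumed metric
c = Agent("A3", (3, 3))   # out of reach
sent_to_b = tell(a, b, "phi_1")
sent_to_c = tell(a, c, "phi_1")
```

In this toy run `A2` ends up holding `phi_1` while `A3` does not, so a belief question about `A3` has to account for the item never having reached it.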

2. Benchmark Construction and Task Design

Recent ToM-SSI benchmark design reflects three main advancements:

Multimodal Input: Each scene is specified as both a rendered image of the grid and a structured textual prompt that encodes agents, initial knowledge, social context, and task-specific rules.
Multi-agent, Non-dyadic Scenarios: Tasks include up to four agents, with group interactions that span pure cooperation, obstruction, and mixed attitudes. Communication is mediated by a proximity-based channel and recorded in a structured event history (Bortoletto et al., 5 Sep 2025).
Social-Cognitive Query Taxonomy: Each scenario is paired with three question types:
- Percepts ($P$): “What does $A_i$ observe?” (yes/no)
- Beliefs ($B$): “Which info does $A_i$ think it lacks?” (multiple choice)
- Intentions ($I$): “Who will $A_i$ approach or what will it communicate next?” (multiple choice)

ToM-SSI includes five core tasks: Cooperative Movement–Single Communication (CMSC), Cooperative Movement–Concurrent Communication (CMCC), Probabilistic Cooperative Communication (PCC), Obstructive Communication (OC), and Mixed Communication (MC). Each is generated programmatically from 121 social context templates, fully balanced across agent and information identities and group geometry (Bortoletto et al., 5 Sep 2025).

Task	# Agents	Communication Type	Social Attitude
CMSC	4	Single-step coop.	Cooperative
CMCC	4	Multi-step coop.	Cooperative
PCC	3	Probabilistic coop.	Cooperative
OC	3	Obstructive	Competitive
MC	3	Mixed	Coop.+Obstructive
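The task table and the three question types could be held in a small registry like the one below. This is only an illustrative sketch: `TaskSpec`, `QuestionType`, `TASKS`, `TEMPLATES`, and `build_questions` are hypothetical names, and the benchmark's actual generator and prompt wording may differ.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class QuestionType(Enum):
    PERCEPT = "P"
    BELIEF = "B"
    INTENTION = "I"


@dataclass(frozen=True)
class TaskSpec:
    n_agents: int
    communication: str
    attitude: str


# Values copied from the task table above.
TASKS: Dict[str, TaskSpec] = {
    "CMSC": TaskSpec(4, "single-step cooperative", "cooperative"),
    "CMCC": TaskSpec(4, "multi-step cooperative", "cooperative"),
    "PCC": TaskSpec(3, "probabilistic cooperative", "cooperative"),
    "OC": TaskSpec(3, "obstructive", "competitive"),
    "MC": TaskSpec(3, "mixed", "cooperative + obstructive"),
}

# Question wording follows the taxonomy given earlier in this section.
TEMPLATES: Dict[QuestionType, str] = {
    QuestionType.PERCEPT: "What does {agent} observe?",
    QuestionType.BELIEF: "Which info does {agent} think it lacks?",
    QuestionType.INTENTION: "Who will {agent} approach or what will it communicate next?",
}


def build_questions(task: str, agent: str) -> Dict[QuestionType, str]:
    """Instantiate one question of each type for the given task and target agent."""
    if task not in TASKS:
        raise KeyError(f"unknown task: {task}")
    return {qtype: tpl.format(agent=agent) for qtype, tpl in TEMPLATES.items()}


questions = build_questions("OC", "A2")
```

Keeping the task metadata separate from the question templates makes it straightforward to check that every scenario is paired with all three question types.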

3. Model Architectures and Inference Mechanisms

ToM-SSI frameworks operationalize mental-state tracking and inference via explicit probabilistic, neurosymbolic, or neural architectures.

Probabilistic Belief Tracking: Agents maintain a distribution $b_i$ (e.g., over object locations), updating beliefs via Bayes’ rule upon observing events. Higher-order beliefs $b_{i \rightarrow j}$ are recursively updated using observations $o_t$ (Alon et al., 31 Mar 2025).
Neural Recursive Inference: RNNs (GRUs/LSTMs) parameterize belief distributions, receiving streams of observed events and agent-observation masks, outputting posterior distributions over beliefs (Alon et al., 31 Mar 2025).
Explicit ToM Modules: Multimodal architectures such as MToMnet use separate “MindNet” modules for each agent, fusing contextual cues (vision, object locations, gaze, pose) and performing belief prediction via cross-agent communication or re-ranking latent outputs (Bortoletto et al., 2024).
Situated Simulation: The event-driven simulation loop integrates agent movement, communication, and social utility calculations, dynamically updating belief states and informing action policies (Bortoletto et al., 5 Sep 2025).
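The probabilistic belief-tracking idea in the list above can be sketched with a discrete Bayes update plus a per-event witness set. This is a simplified illustration, not the cited models' implementation. The hidden state is assumed static (the classic Sally–Anne scenario involves an object being moved, which would additionally need a transition step), and the names `bayes_update` and `modelled_belief`, the two-location hypothesis space, and the likelihood values are all made up for the example.

```python
from typing import Dict, List, Set, Tuple

Belief = Dict[str, float]            # hypothesis -> probability
Event = Tuple[Belief, Set[str]]      # (likelihood of the cue per hypothesis, witnesses)


def bayes_update(prior: Belief, likelihood: Belief) -> Belief:
    """Posterior proportional to prior times likelihood, renormalised to sum to 1."""
    unnorm = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnorm.values())
    if total == 0.0:
        raise ValueError("evidence has zero probability under the prior")
    return {h: p / total for h, p in unnorm.items()}


def modelled_belief(prior: Belief, events: List[Event], agent: str) -> Belief:
    """Belief attributed to `agent`: replay only the events that agent witnessed."""
    belief = dict(prior)
    for likelihood, witnesses in events:
        if agent in witnesses:
            belief = bayes_update(belief, likelihood)
    return belief


prior = {"basket": 0.5, "box": 0.5}
events: List[Event] = [
    ({"basket": 0.6, "box": 0.4}, {"A", "B"}),  # weak cue for the basket, seen by both
    ({"basket": 0.1, "box": 0.9}, {"A"}),       # strong cue for the box, seen by A only
]

belief_a = modelled_belief(prior, events, "A")  # A has seen both cues
belief_b = modelled_belief(prior, events, "B")  # B missed the second cue
```

Because `B` never witnesses the second cue, the belief attributed to `B` still favours the basket while `A` favours the box. An observer who tracks both distributions can therefore predict that `B` will act on outdated information.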

4. Quantitative Evaluation Protocols and Metrics

ToM-SSI evaluation combines per-question accuracy with human baselines, failure-mode analysis, and ablations:

Multiple-Choice Accuracy: For each question (P, B, I), performance is measured as percent correct; joint metrics (PB, PBI) assess models’ ability to unify perceptual, cognitive, and intentional inference (Bortoletto et al., 5 Sep 2025).
Human Benchmarking: Human performance in the multimodal setting is substantially higher than that of state-of-the-art models (Bortoletto et al., 5 Sep 2025). Classic Sally–Anne tests remain a core sanity check; ToM-SSI-enabled models are reported to surpass three-year-olds in success rates (Alon et al., 31 Mar 2025).
Failure Mode Taxonomy: Models frequently succeed in basic perception but fail on second-order beliefs and intention tracking, struggle with nested reasoning in CMCC, and rarely leverage visual information effectively (Bortoletto et al., 5 Sep 2025).
Ablation Analysis: Removing recursive inference or KL divergence terms significantly reduces ToM accuracy; parameter efficiency gains are documented for explicit ToM modules (Bortoletto et al., 2024).
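The per-question and joint metrics (P, B, I, PB, PBI) can be computed along the following lines. The sketch assumes that a joint metric counts a scenario as correct only when every constituent question is answered correctly. That reading is an assumption here, as are the function name `joint_accuracy` and the toy results.

```python
from typing import Dict, List


def joint_accuracy(results: List[Dict[str, bool]], keys: str) -> float:
    """Fraction of scenarios in which all question types named in `keys` are correct.

    `keys` is a string of question labels, e.g. "P", "PB" or "PBI".
    """
    if not results:
        raise ValueError("no results to score")
    hits = sum(all(scenario[k] for k in keys) for scenario in results)
    return hits / len(results)


# Toy per-scenario correctness flags (not benchmark data).
results = [
    {"P": True, "B": True, "I": True},
    {"P": True, "B": True, "I": False},
    {"P": True, "B": False, "I": True},
    {"P": False, "B": True, "I": True},
]

scores = {k: joint_accuracy(results, k) for k in ("P", "B", "I", "PB", "PBI")}
```

On this toy data each single-question accuracy is 0.75, while PB drops to 0.5 and PBI to 0.25. A joint score can thus expose inconsistency across question types that per-question accuracy hides.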

5. Empirical Findings and Observed Limitations

Systematic benchmark evaluations show critical gaps in current models:

Spatial Reasoning Deficits: Models misinterpret adjacency and spatial relations, leading to erroneous perceptual inferences and belief updates.
Multi-agent Perspective-taking: Integrated, nested mental-state tracking across more than two agents remains unreliable, especially for concurrent communication and mixed-attitude settings.
Modality Utilization: Incorporation of images or richer sensory data does not guarantee performance improvement; in some multimodal models, text-only performance exceeds vision+language variants (Bortoletto et al., 5 Sep 2025), though select architectures with explicit cross-modal fusion offer improvements (Bortoletto et al., 2024).
Generality Gap: Overfitting to prompt or scenario structures persists. For “silico-centric” ToM-SSI, LLMs often provide superfluous guidance, failing to recognize the redundancy of instructions for identical clones, despite near-perfect human-centric ToM test performance (Mukherjee et al., 2024).

6. Extensions and Future Research Directions

Ongoing and proposed advances in ToM-SSI include:

Scalable Multi-agent Architectures: Development of symbolic belief graphs, temporally structured belief chains, and scalable cross-attention/fusion mechanisms for larger groups of agents (Bortoletto et al., 2024).
Event-Driven and Implicit Evaluation: Adoption of violation-of-expectation and dot-perspective tasks from psychology, integrated in automated test suites to distinguish prompted from spontaneous ToM reasoning (Gurney et al., 2024).
Dynamic, Interactive Scenarios: Extension to video or continuous-time simulation, richer group structures, and open-ended interaction domains (Bortoletto et al., 5 Sep 2025).
Integration in Social Agents: Embedding ToM-SSI mechanisms in LLM-based dialogue systems enhances strategic reasoning, long-horizon adaptation, and collaborative goal attainment (Hwang et al., 26 Sep 2025).
Contrastive and Meta-learning Approaches: To address silico-centric failures, research on contrastive losses, self-other alignment checks, and multi-agent learning curricula is ongoing (Mukherjee et al., 2024).

A plausible implication is that truly robust artificial social intelligence will require not only sophisticated belief-modeling architectures but also training regimes specifically targeting the unique inferential demands of situated, multi-agent, multimodal social interaction.

7. Significance and Impact

ToM-SSI establishes a new standard for evaluating and advancing computational social cognition in AI. Its emphasis on multimodality, spatially grounded interaction, and multi-agent recursive reasoning addresses fundamentally under-explored aspects of real-world social intelligence. Benchmarks such as ToM-SSI enable rigorous comparison, highlight the limitations of prompt-only or dyadic models, and provide an empirical basis for claims about machine Theory of Mind. Progress in this domain is poised to drive advances in collaborative robotics, adaptive virtual agents, human-AI teaming, and AI safety through improved interpretability of social reasoning processes (Bortoletto et al., 5 Sep 2025, Alon et al., 31 Mar 2025, Bortoletto et al., 2024, Hwang et al., 26 Sep 2025, Mukherjee et al., 2024).