Social 3D Scene Graphs
- Social 3D Scene Graphs are advanced 3D representations that explicitly model humans, their attributes, and context-dependent interactions with objects.
- They fuse visual and language models with graph neural networks to integrate spatial, activity, and behavioral cues for robust scene understanding.
- Empirical evaluations showcase significant gains in relation prediction and query answering, enabling socially intelligent and compliant robot behavior.
A Social 3D Scene Graph (Social 3DSG) is an augmented scene representation that enriches conventional 3D Scene Graphs by explicitly modeling humans, their physical attributes, activities, and context-dependent relationships with both other humans and objects. This structure supports advanced understanding and reasoning in embodied AI settings, particularly for service robots that require socially aware interaction and navigation in complex environments. Social 3DSGs are characterized by open-vocabulary relation modeling, attribute-rich human nodes, and support for both local and remote activities, leveraging vision-language models (VLMs) and LLMs for comprehensive semantic integration (Bartoli et al., 29 Sep 2025).
1. Formal Structure and Components
The Social 3D Scene Graph is formally defined as

$$\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathcal{A}, \mathcal{R}),$$

where:
- $\mathcal{V} = \mathcal{V}_O \cup \mathcal{V}_H$ partitions the node set into objects ($\mathcal{V}_O$) and humans ($\mathcal{V}_H$).
- $\mathcal{E} = \mathcal{E}_S \cup \mathcal{E}_A$ encompasses both spatial edges ($\mathcal{E}_S$, e.g., "next to", "on top of") and activity edges ($\mathcal{E}_A$, e.g., "sitting", "speaking to").
- $\mathcal{A}$ is an attribute mapping assigning each node a feature vector.
- $\mathcal{R}$ is an open-vocabulary set of relation labels for edge semantics.
Node Features:
- Object nodes have features $f_o = [g_o, v_o]$, with $g_o$ (3D geometry descriptors) and $v_o$ (visual embedding from the VLM).
- Human nodes use richer features $f_h = [p_h, \theta_h, b_h]$, covering 3D position $p_h$, 6D pose $\theta_h$ (from 6DRepNet360), and a behavior embedding $b_h$ generated via GPT-5.
Edge Features:
Activity and spatial relations are captured by edge attributes, including distance $d_{ij}$, orientation difference $\Delta\theta_{ij}$, and occlusion ratio $o_{ij}$.
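As an illustration, the node and edge containers described above can be sketched as plain data classes. All field names here are assumptions for exposition, not the authors' implementation:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ObjectNode:
    node_id: int
    geometry: List[float]      # 3D geometry descriptor (g_o)
    visual_emb: List[float]    # visual embedding from the VLM (v_o)

@dataclass
class HumanNode:
    node_id: int
    position: List[float]      # 3D position
    head_pose: List[float]     # Euler angles + centroid depth (4D feature)
    behavior_emb: List[float]  # behavior embedding from the VLM/LLM pipeline

@dataclass
class Edge:
    src: int
    dst: int
    kind: str                  # "spatial" or "activity"
    label: str                 # open-vocabulary relation, e.g. "next to"
    distance: float            # inter-node distance
    orientation_diff: float    # orientation difference
    occlusion: float           # occlusion ratio

@dataclass
class Social3DSG:
    objects: Dict[int, ObjectNode] = field(default_factory=dict)
    humans: Dict[int, HumanNode] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)
```

The node partition is mirrored by the two dictionaries, while the `kind` field on edges distinguishes the spatial and activity edge subsets.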
2. Human-Centric Representation
For each human node, Social 3DSGs encode:
- Posture (upright, leaning, lying)
- Gaze (direct, averted)
- Physical state (sitting, standing, resting)
- Action verbs (from an 11-class taxonomy: LISTEN, SPEAK, READ, USE, INTERACT, COOK, REST, etc.)
The core behavioral embedding $b_h$ is obtained through a VLM pipeline combining annotated frames and behavior prompts via GPT-5. Head pose is represented as $(\alpha, \beta, \gamma, z)$, i.e., Euler angles plus centroid depth, yielding a 4D feature. These features collectively enable nuanced social state modeling.
3. Open-Vocabulary Relation Modeling
Social 3DSGs implement open-set relation detection by aligning textual and visual embeddings:
- For each relation label $r \in \mathcal{R}$, compute a text embedding $t_r$ (using GPT-5).
- For a candidate node pair $(i, j)$, obtain a visual-context embedding $v_{ij}$ using VLM encodings over joint segmentation masks.

A contrastive alignment score determines relation plausibility:

$$s(i, j, r) = \cos\big(v_{ij}, t_r\big).$$

Relations whose maximum confidence exceeds a threshold $\tau$ are retained. The model handles both "local" and "remote" activities; remote detection leverages a 3D frustum context estimator and an LLM-based activity solver for validation.
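The open-vocabulary matching step can be sketched as follows, assuming cosine similarity as the alignment score over precomputed embeddings; the function names, toy embeddings, and threshold value are illustrative, not the paper's scoring head:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-8)

def detect_relation(pair_emb, label_embs, tau=0.5):
    """Score a candidate node pair against every open-vocabulary label and
    keep the best label only if its alignment clears the threshold tau."""
    scores = {r: cosine(pair_emb, t) for r, t in label_embs.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= tau else None
```

A pair whose visual-context embedding aligns with no label above `tau` yields no relation, which is how low-confidence edges are suppressed before graph construction.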
4. Graph Construction, Learning, and Inference
Feature fusion assembles node and edge descriptors into initial embeddings. A graph neural network (GNN) with $K$ rounds of message passing implements joint reasoning:

$$h_i^{(k+1)} = \mathrm{UPDATE}\Big(h_i^{(k)},\ \textstyle\sum_{j \in \mathcal{N}(i)} \mathrm{MSG}\big(h_i^{(k)}, h_j^{(k)}, e_{ij}\big)\Big), \quad k = 0, \dots, K-1,$$

with initial embedding $h_i^{(0)} = f_i$.
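A minimal sketch of such a message-passing loop, using mean aggregation and a tanh nonlinearity as generic stand-ins for the learned UPDATE and MSG functions (the paper's actual parameterization is not specified here):

```python
import math

def message_passing(h, edges, K=3):
    """Run K rounds of mean-aggregation message passing.
    h: {node_id: feature list}; edges: list of (src, dst) pairs."""
    for _ in range(K):
        new_h = {}
        for i, hi in h.items():
            # Collect messages from all in-neighbors of node i
            msgs = [h[j] for (j, dst) in edges if dst == i]
            if msgs:
                agg = [sum(vals) / len(msgs) for vals in zip(*msgs)]
            else:
                agg = [0.0] * len(hi)
            # UPDATE: nonlinearity over self state plus aggregated messages
            new_h[i] = [math.tanh(a + b) for a, b in zip(hi, agg)]
        h = new_h
    return h
```

After K rounds, each node embedding mixes information from its K-hop neighborhood, which is what lets human and object cues jointly inform relation prediction.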
Optimization uses a combination of:
- Cross-entropy loss for edge relation classification,
- Contrastive alignment loss,
- A pruning regularizer (retaining edges with alignment score $s_{ij} \ge \tau$, where $\tau$ is a confidence threshold).
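One way the three objectives above might be combined is sketched below; the loss weights and the hinge-style pruning penalty are assumptions for illustration, not the paper's exact formulation:

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def combined_loss(edge_logits, edge_labels, align_scores, tau=0.5,
                  w_align=1.0, w_prune=0.1):
    """Weighted sum of the three training terms.
    edge_logits: per-edge relation logits; edge_labels: gold label indices;
    align_scores: contrastive alignment scores of retained edges."""
    # Cross-entropy over edge relation classes
    ce = sum(-math.log(softmax(l)[y]) for l, y in zip(edge_logits, edge_labels))
    ce /= max(len(edge_labels), 1)
    # Contrastive alignment: push retained-edge scores toward 1
    align = sum(1.0 - s for s in align_scores) / max(len(align_scores), 1)
    # Pruning: hinge penalty on edges whose confidence falls below tau
    prune = sum(max(0.0, tau - s) for s in align_scores) / max(len(align_scores), 1)
    return ce + w_align * align + w_prune * prune
```

Confident, well-classified edges drive all three terms toward zero, while low-confidence edges pay both the alignment and pruning penalties.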
5. Benchmark and Empirical Evaluation
The SocialGraph3D benchmark consists of:
- Eight Unity-based home scene environments
- 42 human instances and 110 ground-truth human–entity relations
- Registered RGB-D frames, masks, and full point clouds
Query Types: 80 total—spatial (e.g., "Who is next to the lamp?"), activity (e.g., "Who is reading on the sofa?"), and functional (e.g., "Who is likely to want to eat?").
Metrics:
- Relation prediction: precision, recall, and $F_1$; precise definitions follow standard IR conventions.
- Activity detection: a prediction counts as a true positive when both the IoU and the label/frame match criteria are satisfied.
- Scene-query retrieval: ratio of correctly retrieved answers to total gold answers.
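The standard IR-style metrics referenced above can be computed from counts of true positives, false positives, and false negatives:

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from detection counts, with
    zero-division guarded to return 0.0."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```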
Results:
- On activity-relation prediction ("Tab. 2"): The ReaSoN architecture achieved 57.6% precision, 76.9% recall, 65.9% F1, outperforming baselines by +33 percentage points in F1 (relative to CGH).
- For social-scene query answering ("Tab. 3"), ReaSoN attained means of 79.8% (spatial), 81.9% (activity), and 64.6% (functional), with a mean score of 75.4%. This exceeds the next best (CLIP_FAV: 57.2%) by over 18 percentage points.
- Explicit representation of human-activity edges and compact serialization enable significant gains on complex, LLM-mediated queries.
<table>
  <thead>
    <tr><th>Method</th><th>Precision (%)</th><th>Recall (%)</th><th>F1 (%)</th></tr>
  </thead>
  <tbody>
    <tr><td>CGH</td><td>60.0</td><td>22.2</td><td>32.4</td></tr>
    <tr><td>ReaSoN w/o BD</td><td>45.7</td><td>34.3</td><td>39.2</td></tr>
    <tr><td>ReaSoN</td><td>57.6</td><td>76.9</td><td>65.9</td></tr>
  </tbody>
</table>
6. Applications and Socially Intelligent Robot Behavior
The Social 3DSG architecture supports downstream tasks such as human activity prediction, context-aware service-robot planning, and socially compliant robot navigation. For example, inserting a "SPEAK" edge between two conversing humans enables the definition of a "social-cost" field for spatial planners that penalizes positions within the conversational space between the speakers. This cost field ensures that navigation trajectories avoid violating conversational spaces, yielding re-planned robot paths that respect implicit social boundaries.
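One plausible form for such a social-cost field is a Gaussian falloff around the line segment joining the two speakers; the functional form and the `sigma` value below are assumptions for illustration, not the paper's definition:

```python
import math

def social_cost(x, speaker_a, speaker_b, sigma=0.6):
    """Cost at 2D position x: high near the segment between two conversing
    humans (a SPEAK edge), so a planner avoids cutting through the
    conversation. sigma is an assumed interaction radius."""
    ax, ay = speaker_a
    bx, by = speaker_b
    abx, aby = bx - ax, by - ay
    denom = abx * abx + aby * aby + 1e-8
    # Project x onto the segment and clamp to its endpoints
    t = ((x[0] - ax) * abx + (x[1] - ay) * aby) / denom
    t = min(max(t, 0.0), 1.0)
    cx, cy = ax + t * abx, ay + t * aby
    d2 = (x[0] - cx) ** 2 + (x[1] - cy) ** 2
    return math.exp(-d2 / (2 * sigma * sigma))
```

A planner that adds this cost to its objective will prefer detours around the conversational space over paths that cross it.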
In summary, Social 3DSGs extend classical semantic scene representations by integrating open-vocabulary human–activity relations, leveraging advances in vision–LLMs and LLMs. This provides a robust foundation for socially competent robot perception, reasoning, and interaction, with empirical evidence showing substantial improvements in both relation prediction and complex 3D scene query answering (Bartoli et al., 29 Sep 2025).