
Panoptic Scene Graph Formalisms

Updated 24 January 2026
  • Panoptic scene graph formalism is a unified representation that fuses per-pixel segmentation masks with labeled graph edges capturing spatial, temporal, and semantic relations.
  • It generalizes traditional scene graphs by integrating dynamic elements like temporal node tubes and transformer-based relation prediction for 3D/4D environments.
  • The approach enables practical applications such as urban analytics, embodied spatial reasoning, and video event understanding through interpretable scene querying.

A panoptic scene graph formalism defines a unified structured representation that simultaneously grounds “thing” and “stuff” entities with panoptic segmentation masks and encodes their spatial, temporal, and semantic relations as labeled graph edges. This formalism generalizes standard scene graphs by integrating fine-grained per-pixel segmentation (panoptic) with explicit predicate relations, supporting both static and dynamic (video, 3D+time) environments. Panoptic scene graphs enable holistic visual understanding, actionable 3D/4D mapping, and interpretable querying of spatial and event-based interactions among entities.

1. Core Representational Structure

Panoptic scene graph (PSG) formalisms define a graph $G = (V, E, \Omega)$ where:

  • $V = \{v_1, \dots, v_N\}$ is a set of nodes, each corresponding to a unique entity. In a single image, $v_i$ is either a "thing" (object) or "stuff" (background material). In a video or 4D stream, each node is a temporally tracked instance (a "mask tube") or a volumetric segment (Yang et al., 2023, Yang et al., 2024, Wu et al., 19 Mar 2025).
  • $E \subset V \times R \times V$ is a set of directed edges, where $R$ is a vocabulary of relation predicates. Each edge $(u, r, v)$ signifies that $u$ stands in relation $r$ to $v$, possibly over a time interval $[t_s, t_e]$.
  • $\Omega$ is an optional set of auxiliary per-node and per-edge attributes: category labels, bounding boxes, panoptic masks, geometric cues, appearance and language embeddings, and temporal information (Liu et al., 22 Dec 2025).

Nodes are associated with high-dimensional scene attributes—mask, class label, spatial extent, and often textual/visual features. In 3D/4D/volumetric settings, nodes may also include geometry, depth, or articulated part labels (Han et al., 2021, Yang et al., 2024, Wu et al., 19 Mar 2025).

Edges represent both instantaneous (spatial) and temporally extended (event) relations, generalizing classical image-based scene graphs to dynamic scenarios. Edge attributes may encode category, confidence, and geometric or temporal context.
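As a minimal illustration of this structure (plain Python dataclasses, not any cited system's API; all names here are invented for the sketch), the graph $G = (V, E, \Omega)$ can be written as:

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple

@dataclass
class Node:
    """A panoptic entity: a countable 'thing' or an amorphous 'stuff' region."""
    node_id: int
    category: str                 # class label, e.g. "car" or "sidewalk"
    is_thing: bool                # True for objects, False for stuff
    attributes: dict = field(default_factory=dict)  # Ω: mask, bbox, embeddings

@dataclass
class Edge:
    """A directed relation (u, r, v), optionally over a time interval."""
    subject_id: int
    predicate: str                # r in R, e.g. "parked-on"
    object_id: int
    interval: Optional[Tuple[int, int]] = None  # (t_s, t_e) for video/4D

@dataclass
class PanopticSceneGraph:
    nodes: list
    edges: list

    def relations_of(self, node_id: int):
        """Edges in which the given node appears as the subject."""
        return [e for e in self.edges if e.subject_id == node_id]

# "car parked on sidewalk": one thing node, one stuff node, one predicate edge
g = PanopticSceneGraph(
    nodes=[Node(0, "car", True), Node(1, "sidewalk", False)],
    edges=[Edge(0, "parked-on", 1)],
)
```

Querying `g.relations_of(0)` then supports the kind of interpretable scene querying described above; real systems attach pixel masks and feature embeddings in `attributes`.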

2. Extensions from Static to Dynamic and 4D PSGs

Classical 2D PSGs restrict $V$ to per-image segments, $E$ to spatially local predicates, and mask annotations to single frames (Liu et al., 22 Dec 2025). Panoptic video scene graphs (PVSG) and 4D panoptic scene graphs (4D-PSG, PSG-4D) generalize this:

  • Temporal Node Tubes: Each node $v_i$ is grounded by a time-indexed sequence of masks $\{M_i^t\}$ (a mask tube) or a volumetric segment $m_i \in [0,1]^{T \times H \times W}$ or $[0,1]^{T \times H \times W \times 4}$ (for RGB-D) (Yang et al., 2024, Yang et al., 2023, Ruschel et al., 20 Nov 2025, Wu et al., 19 Mar 2025).
  • Temporal Relations: Edges take the form $(v_s, r, v_o, [t_s, t_e])$ with discrete predicate $r$ and an explicit temporal window, capturing both spatial and long-term interactions (e.g., "enters", "carries") (Yang et al., 2023, Yang et al., 2024, Wu et al., 19 Mar 2025).
  • Node Attributes: Dynamic PSGs/4D-PSGs may associate 3D position, appearance tube, per-frame query tokens, and open-vocabulary class labels with each $v_i$ (Yang et al., 2024, Wu et al., 19 Mar 2025).

This generalization supports spatiotemporal reasoning over objects, stuff, and events.
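To make the mask-tube grounding concrete, here is a toy sketch (invented shapes and timings, not drawn from any cited dataset) of two tracked tubes and a temporally windowed edge derived from the frames where both are visible:

```python
import numpy as np

T, H, W = 4, 6, 8  # short clip, toy resolution

# Mask tube for a tracked node: one boolean mask per frame, shape (T, H, W)
person_tube = np.zeros((T, H, W), dtype=bool)
box_tube = np.zeros((T, H, W), dtype=bool)

# Person visible in all frames; box only grounded in frames 1..2
person_tube[:, 2:5, 1:4] = True
box_tube[1:3, 2:4, 3:5] = True

def visible_frames(tube):
    """Frame indices where the mask tube is non-empty."""
    return np.flatnonzero(tube.reshape(len(tube), -1).any(axis=1))

# A temporally extended relation (v_s, r, v_o, [t_s, t_e]) is only valid on
# frames where both endpoint tubes are grounded:
both = sorted(set(visible_frames(person_tube)) & set(visible_frames(box_tube)))
carry_edge = ("person", "carries", "box", (both[0], both[-1]))
```

The resulting edge carries an explicit window `(t_s, t_e) = (1, 2)`, matching the $(v_s, r, v_o, [t_s, t_e])$ form above.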

3. Algorithmic Frameworks and Model Architectures

Panoptic scene graph generation is implemented as a multistage pipeline or joint model that includes:

  • Panoptic Segmentation: Mask2Former or analogous backbone segments frames into panoptic masks; in video and 4D, these masks are temporally linked via embedding-based tracking (e.g., UniTrack, dynamic matching over frame queries) (Yang et al., 2023, Yang et al., 2024).
  • Node-Edge Construction: Feature pooling and embedding yield rich node and edge descriptors: pooled visual features, language embeddings (e.g., SBERT), geometric attributes, and class label projections (Liu et al., 22 Dec 2025, Yang et al., 2024).
  • Relation Prediction: For candidate node pairs, spatial transformers and temporal transformers aggregate cross-entity context, followed by MLPs or LLM heads that classify spatial and temporal predicates. Some models further employ set-based transformers for promptable interactive relation discovery (e.g., Click2Graph DIDM) (Ruschel et al., 20 Nov 2025).
  • Interactive and Open-vocabulary Extensions: Promptable backbones (e.g., SAM2) enable user guidance and open-vocabulary SG parsing via CLIP-style contrastive terms or LLMs (Ruschel et al., 20 Nov 2025, Wu et al., 19 Mar 2025, Liu et al., 22 Dec 2025).

A representative example is PSG4DFormer, which applies spatial transformer encoding per frame, temporal encoding along each object tube, and relation MLPs over node-pair embeddings (Yang et al., 2024).
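As a schematic sketch of the relation-prediction stage only (this is not the actual PSG4DFormer code; the weights below are random stand-ins for a trained model, and the embedding sizes are arbitrary), a pairwise MLP head over concatenated node embeddings looks like:

```python
import numpy as np

rng = np.random.default_rng(0)
PREDICATES = ["on", "carries", "next-to"]  # toy predicate vocabulary R
D = 16                                     # node embedding dimension

def relation_logits(subj_emb, obj_emb, W1, W2):
    """Toy relation head: concatenate subject/object embeddings, 2-layer MLP."""
    x = np.concatenate([subj_emb, obj_emb])      # (2D,) pair descriptor
    h = np.maximum(W1 @ x, 0.0)                  # ReLU hidden layer
    return W2 @ h                                # logits over predicates

# Randomly initialized stand-in weights (a trained model would learn these)
W1 = rng.normal(size=(32, 2 * D))
W2 = rng.normal(size=(len(PREDICATES), 32))

# Pooled node features, as produced by the spatial/temporal encoders
node_embs = {i: rng.normal(size=D) for i in range(3)}

# Score every ordered node pair, as the relation-prediction stage would
scores = {(u, v): relation_logits(node_embs[u], node_embs[v], W1, W2)
          for u in node_embs for v in node_embs if u != v}
best = {pair: PREDICATES[int(np.argmax(s))] for pair, s in scores.items()}
```

In the full architectures, the pair descriptor would also aggregate cross-entity context via spatial/temporal transformers before the classification head.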

4. Loss Functions and Training Objectives

The total training objective in PSG formalism is a weighted sum of segmentation, tracking, relation, and auxiliary losses:

  • Segmentation Loss ($\mathcal{L}_{\mathrm{panoptic}}$): Cross-entropy and Dice mask losses for panoptic segmentation (e.g., Mask2Former/Mask R-CNN heads).
  • Tracking Loss ($\mathcal{L}_{\mathrm{track}}$): Embedding or association loss aligning node tubes across frames (typically a matching or contrastive loss) (Yang et al., 2023, Yang et al., 2024).
  • Relation Loss ($\mathcal{L}_{\mathrm{rel}}$): Cross-entropy or BCE on predicted predicate categories across node pairs and timepoints.
  • Interaction Discovery / Prompt Regression Loss: L2 loss between predicted and ground-truth interaction points under human guidance (Ruschel et al., 20 Nov 2025).
  • Open-set Loss and Graph Autoencoding: CLIP-style contrastive objective for open-vocabulary recognition and self-supervised node/edge feature reconstruction for compact scene representations (Liu et al., 22 Dec 2025).

Typical joint objective format:

$\mathcal{L}_{\mathrm{total}} = \lambda_{\mathrm{mask}}\,\mathcal{L}_{\mathrm{panoptic}} + \lambda_{\mathrm{track}}\,\mathcal{L}_{\mathrm{track}} + \lambda_{\mathrm{rel}}\,\mathcal{L}_{\mathrm{rel}} + \cdots$

Hungarian matching or differentiable assignment aligns predicted and ground truth node/edge sets before loss aggregation.
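The alignment step can be illustrated in miniature. For the tiny sets here, a brute-force search over permutations recovers the same optimal one-to-one assignment the Hungarian algorithm would (the masks and their sizes are hypothetical):

```python
import itertools
import numpy as np

def match_cost(pred_masks, gt_masks):
    """Pairwise cost matrix: 1 - IoU between predicted and ground-truth masks."""
    C = np.zeros((len(pred_masks), len(gt_masks)))
    for i, p in enumerate(pred_masks):
        for j, g in enumerate(gt_masks):
            inter = np.logical_and(p, g).sum()
            union = np.logical_or(p, g).sum()
            C[i, j] = 1.0 - (inter / union if union else 0.0)
    return C

def best_assignment(C):
    """Brute-force optimal one-to-one assignment (Hungarian result for tiny N)."""
    n = C.shape[0]
    return min(itertools.permutations(range(n)),
               key=lambda p: sum(C[i, p[i]] for i in range(n)))

# Two toy predicted masks vs. two ground-truth masks, listed in swapped order
a = np.zeros((4, 4), dtype=bool); a[:2] = True   # top half of the image
b = np.zeros((4, 4), dtype=bool); b[2:] = True   # bottom half
assign = best_assignment(match_cost([a, b], [b, a]))
# Prediction 0 matches GT 1 and vice versa, so losses compare aligned pairs
```

Production systems replace the permutation search with `scipy.optimize.linear_sum_assignment` (or a differentiable relaxation) and include classification terms in the cost.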

5. Evaluation Metrics

Evaluation protocols for PSG and its extensions emphasize both mask accuracy and semantic/relational correctness. Major metrics include:

  • Recall@K (R@K): Fraction of correct triplets among the top-K predictions; a triplet counts as correct only if subject, predicate, and object all match, with mask/tube IoU ≥ threshold. Used in PVSG and 4D-PSG (Yang et al., 2023, Ruschel et al., 20 Nov 2025, Wu et al., 19 Mar 2025).
  • Mean Recall@K (mR@K): Per-predicate-class average of R@K, balancing predicate class imbalance. Used in PVSG and 4D-PSG.
  • Spatial Interaction Recall (SpIR): Proportion of correctly localized subject–object pairs (IoU ≥ threshold), agnostic to class and predicate labels (Ruschel et al., 20 Nov 2025).
  • Volume-IoU (vIoU): $\text{vIoU}(m, m') = \frac{\sum_t |m^t \cap m'^t|}{\sum_t |m^t \cup m'^t|}$. Used for node matching in 4D (Yang et al., 2024, Yang et al., 2023).
  • Prompt Localization Recall (PLR): Fraction of predicted prompts lying inside true object masks. Used in user-interactive PVSG (Ruschel et al., 20 Nov 2025).

Additional measures: edge-level average precision (AP_edge), pairwise comparison accuracy (Bradley–Terry score for urban perception) (Liu et al., 22 Dec 2025), support area/contact error (4D/3D graphs) (Han et al., 2021), mask panoptic quality (PQ), and open-set recall (Wu et al., 19 Mar 2025).
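A minimal sketch of vIoU and triplet Recall@K, assuming boolean mask tubes and an already-ranked prediction list (the toy tubes below are invented for illustration, and real evaluators also deduplicate matches and handle class labels):

```python
import numpy as np

def volume_iou(tube_a, tube_b):
    """vIoU: intersection over union accumulated across all frames of two tubes."""
    inter = np.logical_and(tube_a, tube_b).sum()
    union = np.logical_or(tube_a, tube_b).sum()
    return inter / union if union else 0.0

def recall_at_k(ranked_triplets, gt_triplets, k, iou_fn, thresh=0.5):
    """Fraction of GT triplets matched by some top-K prediction.

    Each triplet is (subject_tube, predicate, object_tube); a match requires
    the same predicate and IoU >= thresh on both subject and object tubes.
    """
    hits = 0
    for gs, gp, go in gt_triplets:
        for ps, pp, po in ranked_triplets[:k]:
            if pp == gp and iou_fn(ps, gs) >= thresh and iou_fn(po, go) >= thresh:
                hits += 1
                break
    return hits / len(gt_triplets)

# Toy tubes: t matches itself perfectly; u is a shifted mask (vIoU with t = 1/3)
t = np.zeros((2, 4, 4), dtype=bool); t[:, :2] = True
u = np.zeros((2, 4, 4), dtype=bool); u[:, 1:3] = True

gt = [(t, "on", t)]
preds = [(t, "on", t), (t, "on", u)]  # ranked by confidence, best first
```

With these toy inputs, the top-1 prediction already recovers the single ground-truth triplet, so R@1 is 1.0.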

6. Applications and Generalization of PSG Formalism

The panoptic scene graph paradigm underpins advanced tasks and architectures in:

  • Urban perception and analytics: PSGs parsed from street-view images (OpenPSG) yield relational cues for perception prediction, generalizing across cities and clarifying region-specific patterns (e.g., "car parked on sidewalk") (Liu et al., 22 Dec 2025).
  • Embodied spatial reasoning: PSGs with volumetric/physical constraints enable actionable scene construction for robotics (URDF export, virtual environment import), encoding support, collision, and affordance structure (Han et al., 2021).
  • Video and dynamic event understanding: PSGs with temporally localized relationships (e.g., "person carries box during [t1, t2]") drive holistic, temporally-aware video scene graph generation and 4D understanding (Yang et al., 2023, Yang et al., 2024, Wu et al., 19 Mar 2025).
  • Human-in-the-loop visual querying: Promptable, interactive PSG models enable interpretable, controllable video and image understanding by combining segmentation guidance with automatic relational inference (Ruschel et al., 20 Nov 2025).

Transfer learning between 2D SGs and 4D PSGs, as in 2D→4D visual scene transfer frameworks, leverages abundant image annotations to compensate for 4D label scarcity (Wu et al., 19 Mar 2025).

7. Key Models and Research Directions

Ongoing research directions center on generalization to open vocabularies, scaling to large and diverse spatiotemporal datasets, integrating physical and social affordances, and end-to-end joint learning for segmentation, relational inference, and downstream reasoning tasks.


