
Hierarchical 3D Scene Graphs

Updated 21 January 2026
  • Hierarchical 3D Scene Graphs are multi-level representations that organize spatial entities from low-level geometry to high-level semantic concepts for both static and dynamic environments.
  • They are constructed through integrated pipelines combining RGB-D fusion, CNN-based segmentation, clustering, and spatial relation encoding to ensure scalability and interpretability.
  • These graphs facilitate enhanced robotic perception, navigation, task planning, and language grounding by providing actionable, semantically-rich, and dynamic spatial models.

Hierarchical 3D Scene Graphs formalize multi-resolution, multi-entity representations for complex indoor and outdoor environments, providing the foundation for both static and dynamic spatial modeling, robot perception, task and motion planning, language grounding, and generative scene synthesis. A Hierarchical 3D Scene Graph (3DSG) systematically organizes the environment as a layered, directed graph, where nodes capture spatial entities at varying abstractions—ranging from geometric primitives up to structural, functional, and even dynamical/temporal concepts—with edges encoding semantic, geometric, and functional relations. This paradigm underpins contemporary approaches in embodied AI, large-scale robotic mapping, open-vocabulary grounding, and interactive 3D scene understanding.

1. Formal Structure and Abstraction Levels

A hierarchical 3DSG is defined as a directed, layered graph G = (V, E, L, A) (or G^t in time-variant settings), where:

  • V = ⨄_{ℓ=1}^{L} V_ℓ, with each set V_ℓ representing spatial concepts at level ℓ (e.g., low-level geometry, objects, rooms, regions, buildings).
  • E collects intra-level (adjacency), inter-level (containment, part-of), and functional/semantic relations between nodes.
  • L is the set of abstraction levels, typically (but not exclusively): geometric primitives, objects, structural elements, navigational/functional areas, semantic zones (rooms, floors), and global context (building, environment).
  • A assigns attributes (geometry, centroid, bounding box, semantic vector, affordances, motion models).

Canonical instantiations differ in their exact layering, but edges consistently encode both hierarchical containment (e.g., "object in room", "room in floor") and peer-level semantic, spatial, and topological relations (e.g., "adjacency", "accessibility", "affordance", and "operability", as in HERO (Wang et al., 17 Dec 2025)).
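The layered definition above can be sketched as a minimal data structure. This is an illustrative toy, not the representation of any cited system; the class, level, and relation names (`SceneGraph`, `contains`, `on_top_of`) are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A spatial entity at one abstraction level (object, room, floor, ...)."""
    node_id: str
    level: int                                      # index into the abstraction set L
    attributes: dict = field(default_factory=dict)  # centroid, bbox, semantics, ... (the map A)

@dataclass
class SceneGraph:
    """Layered, directed graph G = (V, E, L, A)."""
    nodes: dict = field(default_factory=dict)       # node_id -> Node (the set V)
    edges: list = field(default_factory=list)       # (src, relation, dst) triples (the set E)

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, src, relation, dst):
        self.edges.append((src, relation, dst))

    def children(self, node_id, relation="contains"):
        """Follow inter-level containment edges one step down the hierarchy."""
        return [d for s, r, d in self.edges if s == node_id and r == relation]

# Build a tiny three-level hierarchy: a floor containing a room containing two objects.
g = SceneGraph()
g.add_node(Node("floor_1", level=3))
g.add_node(Node("room_kitchen", level=2, attributes={"label": "kitchen"}))
g.add_node(Node("obj_mug", level=1, attributes={"label": "mug"}))
g.add_node(Node("obj_table", level=1, attributes={"label": "table"}))
g.add_edge("floor_1", "contains", "room_kitchen")   # inter-level containment
g.add_edge("room_kitchen", "contains", "obj_mug")
g.add_edge("room_kitchen", "contains", "obj_table")
g.add_edge("obj_mug", "on_top_of", "obj_table")     # peer-level semantic relation

print(g.children("room_kitchen"))  # ['obj_mug', 'obj_table']
```

Real systems attach far richer attributes (meshes, CLIP embeddings, affordance labels) to each node, but the containment-plus-peer-relation edge structure is the common core.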

2. Construction Pipelines and Algorithmic Frameworks

Building a hierarchical 3DSG from perception integrates RGB-D fusion, geometry processing, semantic segmentation, and symbolic abstraction:

  • Low-Level Geometry Fusion: Multi-frame depth or LiDAR scans are fused by SLAM pipelines to produce globally registered point clouds or TSDF fields; mesh extraction via Marching Cubes provides collision and visualization models (Cheng et al., 19 Mar 2025, Ravichandran et al., 2021).
  • Instance and Object Segmentation: 2D/3D CNNs (e.g., YOLOv8, VoteNet, FastSAM) provide per-pixel/mask or proposal-based instance segmentation; CLIP/VLM embeddings enrich nodes with semantic features (Samuelson et al., 6 Jun 2025, Linok et al., 16 Jul 2025).
  • Hierarchical Grouping: Meshes are clustered via connectivity, persistent homology, or agglomerative clustering to yield places, rooms, regions, and buildings. Place graphs and room detection exploit topological skeletonization, such as Voronoi medial axes in free space (Ray et al., 2024, Samuelson et al., 23 Sep 2025).
  • Edge Construction and Attribute Assignment: Edges are constructed by spatial adjacency, geometric overlap, or functional relation (e.g., "has_part", "operable", "affords" (Rotondi et al., 10 Mar 2025, Wang et al., 17 Dec 2025)); node/edge attributes include pose estimates, semantic/affordance labels, and, in temporal/dynamic settings, motion flow descriptors (Catalano et al., 10 Dec 2025).
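The hierarchical grouping step above can be illustrated with a distance-threshold union-find that clusters object centroids into candidate places. This is a simple stand-in for the agglomerative or persistent-homology clustering the cited pipelines use; the function name and threshold are assumptions:

```python
import math

def group_objects_into_places(centroids, max_dist=2.0):
    """Greedy single-linkage grouping: objects whose centroids lie within
    max_dist of each other are merged into one cluster via union-find.
    Returns clusters as lists of indices into `centroids`."""
    parent = list(range(len(centroids)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            if math.dist(centroids[i], centroids[j]) <= max_dist:
                union(i, j)

    clusters = {}
    for i in range(len(centroids)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

# Two spatial groups: three objects near the origin, two far away.
pts = [(0, 0), (1, 0), (0.5, 1), (10, 10), (11, 10)]
print(group_objects_into_places(pts))  # two clusters: [0, 1, 2] and [3, 4]
```

The resulting clusters would then become place or room nodes one level above the object layer, with containment edges down to their members.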

Multi-robot or collaborative settings maintain local graphs that are merged via reconciliation proposals, robust pose-graph optimization, and semantic/geometry-based alignment (Chang et al., 2023).

3. Temporal and Dynamic Extensions

Hierarchical 3DSGs are not limited to static scenes. Temporal flow dynamics are integrated via:

  • Dynamic Node Augmentation: Each navigational node v_{n,i}^t is endowed with a time-indexed motion descriptor s_i(t) ∈ ℝ^{B×λ}, storing angularly-binned motion statistics (directionality, count, spectral components via FreMEn) (Catalano et al., 10 Dec 2025).
  • Ownership Transfer and Consistency: Temporal data is first accumulated in sparse spatial hash cells and later bound to stable scene-graph nodes upon loop closure to ensure robust anchoring under pose-graph correction (Catalano et al., 10 Dec 2025).
  • Flow-Based Planning and Prediction: Planning operates over augmented graphs, with edge costs or action likelihoods informed by temporally-predicted densities, entropy, and dominant flows—enabling dynamic avoidance of congested or adversarially moving regions (Catalano et al., 10 Dec 2025).
  • Hierarchical Aggregation: Flow estimates can be recursively aggregated up the scene-graph hierarchy, supporting room-level, floor-level, or building-level predictions without explicit inter-layer message passing.
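The angularly-binned descriptor s_i(t) can be approximated with a minimal sketch that keeps only per-bin counts and summed speeds; the cited work stores richer per-bin statistics (including FreMEn spectral components), so the bin count and helper names here are illustrative assumptions:

```python
import math

def update_motion_descriptor(descriptor, heading_rad, speed, num_bins=8):
    """Accumulate one observed agent motion into an angularly-binned
    descriptor: per bin we keep [count, summed_speed]. A simplification
    of the full per-bin statistics used in temporal 3DSGs."""
    frac = (heading_rad % (2 * math.pi)) / (2 * math.pi)
    b = int(frac * num_bins) % num_bins
    descriptor[b][0] += 1
    descriptor[b][1] += speed
    return descriptor

def dominant_flow(descriptor, num_bins=8):
    """Return (bin_center_rad, count) for the most-travelled direction,
    the kind of quantity a flow-aware planner would query per node."""
    b = max(range(num_bins), key=lambda i: descriptor[i][0])
    return (2 * math.pi * (b + 0.5) / num_bins, descriptor[b][0])

# Three observations: two agents heading roughly east, one heading west.
d = [[0, 0.0] for _ in range(8)]
for heading, speed in [(0.1, 1.2), (0.2, 1.0), (3.1, 0.8)]:
    update_motion_descriptor(d, heading, speed)
print(dominant_flow(d))  # eastward bin center (pi/8 rad) with count 2
```

Aggregating such descriptors bin-wise over all nodes contained in a room or floor gives the recursive, layer-by-layer flow summaries described above.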

This approach generalizes beyond the classical agent-centric tracking of dynamic SGs (e.g., CURB-SG) and grid-based maps-of-dynamics by providing rich, scalable, and semantically-grounded dynamic representations.

4. Applications in Robot Perception, Navigation, and Planning

Hierarchical 3DSGs enable a wide range of embodied intelligence functions:

  • Navigation and Policy Learning: Hierarchical scene graphs serve as state representations for RL-based navigation. GNNs trained over such graphs leverage node-level features (relative position, occupancy, semantic class, visited flag) and explicit memory (trajectory nodes, visitation flags), yielding significant improvements in object search efficiency, area coverage, and collision reduction over mid-level visuomotor baselines (Ravichandran et al., 2021).
  • Task and Motion Planning Integration: Scene graphs are systematically converted to symbolic planning domains (e.g., PDDLStream), with effective scalability obtained via (i) pre-pruning irrelevant nodes based on symbolic redundancy and (ii) lazy incremental addition of objects only when physically relevant to the trajectory, accelerating TAMP in large-scale environments (Ray et al., 2024).
  • Context-Aware Reasoning and Language Grounding: Open-vocabulary object grounding, referential query resolution, and context-aware task planning are supported by fusing multi-modal node attributes and explicit inter/intra-layer relationships. LLMs or VLMs interact with the 3DSG through subgraph embedding, multi-step reasoning, and textual prompt augmentation (Linok et al., 16 Jul 2025, Werby et al., 1 Oct 2025).
  • Interaction and Affordance-Aware Planning: Explicit modeling of affordances at both object and region level (e.g., "pushable", "graspable", "transition zone") directly influences navigation costs and action feasibility (Xu et al., 2024, Wang et al., 17 Dec 2025, Rotondi et al., 10 Mar 2025).
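As a sketch of the scene-graph-to-symbolic-planning conversion with pre-pruning, the function below turns containment edges into PDDL-style ground facts only for task-relevant rooms. The predicate and function names are hypothetical, not the actual PDDLStream interface of the cited work:

```python
def scene_graph_to_facts(contains_edges, relevant_rooms):
    """Convert (room, object) containment edges into PDDL-style ground
    facts, pre-pruning objects whose rooms are irrelevant to the current
    task -- a toy version of the pruning used to scale TAMP."""
    facts = []
    for room, obj in contains_edges:
        if room in relevant_rooms:
            facts.append(f"(in {obj} {room})")
    return facts

edges = [("kitchen", "mug"), ("kitchen", "table"), ("garage", "bike")]
print(scene_graph_to_facts(edges, {"kitchen"}))
# ['(in mug kitchen)', '(in table kitchen)'] -- the garage subgraph is pruned
```

The lazy-addition strategy mentioned above would complement this by re-admitting pruned objects only when a candidate trajectory actually passes near them.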

5. Scalability, Interpretation, and Evaluation

Hierarchical 3DSGs provide critical advantages in scalability and interpretability:

  • Hierarchical Compression: Sparsification via node hierarchy (e.g., S-Graph marginalization, room-local graph compression) reduces computation and storage by orders of magnitude compared to flat or grid-based approaches—speedups of ≈40% in SLAM optimization and >70% representation compression have been reported (Bavle et al., 2023, Werby et al., 2024).
  • Semantic Interpretability: Nodes correspond to human-meaningful abstractions (rooms, corridors, objects); graph attributes (predicted flows, occupancy, function) are directly actionable for planning or human-interaction (Catalano et al., 10 Dec 2025).
  • Cross-Domain Generality: Systems such as KeySG, Terra, and HERO show that hierarchical 3DSGs generalize to open-vocabulary object segmentation, hierarchical retrieval, and both indoor and outdoor navigation, with consistent improvements in mIoU, retrieval accuracy, and path efficiency over baselines, as well as robust handling of multi-story and multi-agent settings (Werby et al., 1 Oct 2025, Samuelson et al., 23 Sep 2025, Wang et al., 17 Dec 2025, Samuelson et al., 6 Jun 2025).
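Room-local compression can be illustrated by marginalizing a room's object nodes into a compact summary stored on the room node. This toy version keeps only a count and a merged bounding box; it is a sketch of the idea, not the S-Graphs marginalization itself:

```python
def compress_room(objects):
    """Summarize a room's object nodes into compact room-level attributes:
    object count plus a merged axis-aligned bounding box. Detailed child
    nodes can then be dropped or swapped out of memory.
    `objects` is a list of (xmin, ymin, xmax, ymax) boxes."""
    xmin = min(o[0] for o in objects)
    ymin = min(o[1] for o in objects)
    xmax = max(o[2] for o in objects)
    ymax = max(o[3] for o in objects)
    return {"num_objects": len(objects), "bbox": (xmin, ymin, xmax, ymax)}

print(compress_room([(0, 0, 1, 1), (2, 1, 3, 4)]))
# {'num_objects': 2, 'bbox': (0, 0, 3, 4)}
```

Queries that only need room-level answers (e.g., "which room is crowded?") then never touch the object layer, which is where the reported storage and optimization savings come from.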

Selected benchmarking results:

| Method | Application Domain | Key Metric | Value / Improvement |
|---|---|---|---|
| Aion (Catalano et al., 10 Dec 2025) | Temporal flow prediction | High-density occupancy duration | 25% reduction vs. grid-based MoD |
| KeySG (Werby et al., 1 Oct 2025) | Semantic segmentation | mAcc / F-mIoU (Replica) | 45.8 / 46.2 |
| S-Graphs (Bavle et al., 2023) | SLAM trajectory optimization | Computation time | 39.8% reduction |
| HERO (Wang et al., 17 Dec 2025) | Navigability (movable obstacles) | SR / PL vs. Voronoi-only baseline | SR +79.4%, PL −35.1% |

6. Limitations and Open Challenges

Several limitations persist in current hierarchical 3DSG frameworks:

  • Sensing and Observation Sparsity: Sparse agent visitation can yield incomplete or noisy dynamic estimates; temporal models may fail in unseen or under-observed regions (Catalano et al., 10 Dec 2025).
  • Dynamic and Multi-Modal Flow Modeling: Most current models focus on directional flow and do not yet robustly capture variable agent speeds or multi-modal flow distributions.
  • Real-Time Update and Adaptation: Many advanced graph construction and query mechanisms (e.g., KeySG’s RAG pipeline, FunGraph’s part-level augmentation) are performed offline and require substantial computation and LLM/VLM invocation (Werby et al., 1 Oct 2025, Rotondi et al., 10 Mar 2025).
  • Region and Affordance Clustering: Functional region formation often uses heuristic or affordance-based clustering; geometric continuity and relationship modeling between higher-order regions remain open challenges (Xu et al., 2024).
  • Physical Realism and Adaptivity: Interaction modeling (e.g., movable obstacle cost in HERO) assumes uniform or manually specified penalties, not directly accounting for mass, friction, or multi-agent interaction (Wang et al., 17 Dec 2025).

Future directions include dynamic graph updates, region-level reasoning beyond strict containment, multi-robot graph merging, temporal graph expansion (4D), and end-to-end learning of affordance and function in large, real-world datasets.

7. Comparative Perspective and Family of Approaches

Hierarchical 3DSGs unify and surpass several prior modeling traditions:

  • Classical Scene Graphs: Static, instance-level and semantic relations; no hierarchy or dynamics [Rosinol et al., Kimera].
  • Dynamic SGs (object-centric): Agent/vehicle-focused temporal modeling; limited to instance trajectories, without collective or navigational flow (Greve et al., 2023).
  • Maps of Dynamics (MoD): Cell/grid-based, temporally-aware, but lacking in semantic or hierarchical abstraction [Molina et al.].
  • Hierarchical, Semantic, and Temporal Graphs: Modern approaches, including Aion, HERO, KeySG, TB-HSU, OVIGo-3DHSG, and SceneHGN, deliver multi-level, semantically-enriched, dynamic, and open-vocabulary models that support scalable, interpretable, and generalizable embodied AI pipelines (Catalano et al., 10 Dec 2025, Wang et al., 17 Dec 2025, Werby et al., 1 Oct 2025, Xu et al., 2024, Linok et al., 16 Jul 2025, Gao et al., 2023).

This integration of spatial hierarchy, temporal dynamics, semantic richness, and computational scalability positions hierarchical 3D Scene Graphs as a foundational representation for advanced robot perception, planning, and spatial-language interaction.
