PHiSSG: Hierarchical Spatial–Semantic Graph
- PHiSSG is a hierarchical data structure that represents 3D environments with nodes encoding spatial positions, orientations, semantic embeddings, and refinement levels.
- It incrementally updates the graph through anchor selection and local subscene generation, using recursive propagation of rigid-body transforms to maintain spatial and semantic consistency.
- Empirical evaluations show PHiSSG’s effectiveness in real-time mapping and generative scene synthesis, achieving robust object detection, localization, and stability under physical constraints.
The Progressive Hierarchical Spatial–Semantic Graph (PHiSSG) is a data structure that underpins hierarchical representations of 3D environments, enabling incremental, scalable, and semantically consistent scene understanding and generation. Originally formalized within the context of spatial perception for robotics (Hughes et al., 2023) and hierarchical scene synthesis (Hong et al., 31 Oct 2025), PHiSSG encodes both spatial geometry and semantic relationships while maintaining a record of hierarchical refinement across progressive steps or levels of detail.
1. Formal Structure and Layered Hierarchy
PHiSSG represents the environment or generated scene as a directed, layered graph $G = (V, E)$, where each node $v_i \in V$ corresponds to a unique 3D object instance introduced at a specific refinement step $t$. The graph structure enforces a hierarchy:
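- Node Set ($V$): Each node $v_i$ encodes
  - 3D center position $p_i \in \mathbb{R}^3$,
  - orientation as a unit quaternion $q_i \in \mathbb{R}^4$ with $\|q_i\| = 1$,
  - semantic category embedding $s_i \in \mathbb{R}^d$,
  - hierarchical index $\ell_i$ (generation step).
  Collectively, all features are stored as a node feature matrix $X$ whose $i$-th row is $[p_i, q_i, s_i, \ell_i]$.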
- Edge Tensor ($E$): Each directed edge $(v_i, v_j, r)$ links node $v_i$ to $v_j$ by some relation $r$ (e.g., “On,” “Inside,” “Near”). For each relation type $r$, $E^{(r)} \in \{0,1\}^{N \times N}$ is a binary adjacency matrix over the $N$ nodes.
- Layer Partitioning and Levels: Nodes are tagged by their level $\ell_i$, supporting hierarchical visualization and stepwise refinement or expansion.
Edges between nodes encode spatial (“strong,” e.g., physical support or containment) or semantic (“soft,” e.g., stylistic similarity, adjacency) relations. Strong dependencies carry relative rigid-body transforms that link a child’s pose rigidly to its parent.
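The node and edge layout described above can be sketched as a small container. This is a minimal illustration, not the authors' implementation: the names `Node`, `SceneGraph`, `RELATIONS`, `adjacency`, and `layer` are hypothetical, the quaternion ordering `(w, x, y, z)` is an assumption, and edges are stored as index pairs from which the per-relation binary adjacency matrix is rebuilt on demand.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

RELATIONS = ("On", "Inside", "Near")  # example relation types named in the text


@dataclass
class Node:
    position: Tuple[float, float, float]            # 3D center
    orientation: Tuple[float, float, float, float]  # unit quaternion (w, x, y, z), assumed order
    embedding: List[float]                          # semantic category embedding
    level: int                                      # refinement step that introduced the node


@dataclass
class SceneGraph:
    nodes: List[Node] = field(default_factory=list)
    # edges[r] holds (source, target) index pairs: the nonzero entries of the
    # binary adjacency matrix for relation r.
    edges: Dict[str, Set[Tuple[int, int]]] = field(
        default_factory=lambda: {r: set() for r in RELATIONS}
    )

    def add_node(self, node: Node) -> int:
        """Append a node and return its index."""
        self.nodes.append(node)
        return len(self.nodes) - 1

    def add_edge(self, relation: str, source: int, target: int) -> None:
        """Record a directed edge source -> target for the given relation."""
        if relation not in self.edges:
            raise ValueError(f"unknown relation: {relation}")
        self.edges[relation].add((source, target))

    def adjacency(self, relation: str) -> List[List[int]]:
        """Build the dense binary adjacency matrix for one relation type."""
        n = len(self.nodes)
        matrix = [[0] * n for _ in range(n)]
        for i, j in self.edges[relation]:
            matrix[i][j] = 1
        return matrix

    def layer(self, level: int) -> List[int]:
        """Return the indices of all nodes tagged with the given level."""
        return [i for i, node in enumerate(self.nodes) if node.level == level]


graph = SceneGraph()
table = graph.add_node(Node((0.0, 0.0, 0.4), (1.0, 0.0, 0.0, 0.0), [1.0, 0.0], level=0))
lamp = graph.add_node(Node((0.0, 0.0, 0.9), (1.0, 0.0, 0.0, 0.0), [0.0, 1.0], level=1))
# Stored as parent -> child; this direction convention is an assumption.
graph.add_edge("On", table, lamp)
```

Storing sparse index pairs and materializing the matrix only when needed keeps insertion cheap as the node count grows; a dense tensor would be the closer match to the formal definition.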
2. Graph Construction and Update Algorithms
The graph evolves incrementally, supporting both new object insertion and scene refinement:
- Initialization ($t = 0$): Start with global objects (e.g., room architecture, furniture); extract nodes and relations via vision models, and set the level $\ell_i = 0$ for these nodes.
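- Incremental Update (step $t$):
  - Assign level $\ell_j = t$ to each new node $v_j$ and append it to the node set $V$.
  - Set $E^{(r)}_{jk} = 1$ for each pairwise relation $r$ detected between new nodes $v_j$ and $v_k$ in the local subscene.
  - Create cross-level edges from anchors to new nodes (typically with strong dependencies).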
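- Recursive Layout Optimization: Any update to a parent’s pose is propagated to its children along “strong” dependency edges using recorded transforms: $T_c \leftarrow T_p \, \hat{T}_{p \to c}$, where $T_p$ is the parent’s updated pose and $\hat{T}_{p \to c}$ is the relative rigid-body transform recorded when the edge was created. The rule is then applied recursively to the children of $c$.
- Stability Correction: Ensures that “On” relations respect physical plausibility; the child’s center, projected horizontally as $\pi_{xy}(p_c)$, is adjusted by the minimal horizontal shift that places it inside the parent’s support polygon $S_p$: $\Delta = \arg\min_{\delta \in \mathbb{R}^2} \|\delta\|$ subject to $\pi_{xy}(p_c) + \delta \in S_p$.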
A high-level pseudocode captures the entire process, supporting initialization, local refinement, edge updates, recursive position propagation, and stability correction (Hong et al., 31 Oct 2025).
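The insertion and recursive propagation steps can be illustrated with a short sketch. It is not the pseudocode from the cited work: the class `PoseGraph` and its methods `insert` and `move` are hypothetical names, quaternions are assumed to be unit length in `(w, x, y, z)` order, and every anchor edge is treated as a strong dependency that records the child's pose relative to its parent.

```python
import math


def quat_mul(a, b):
    """Hamilton product of two quaternions in (w, x, y, z) order."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def quat_conj(q):
    """Conjugate; equals the inverse for a unit quaternion."""
    w, x, y, z = q
    return (w, -x, -y, -z)


def rotate(q, v):
    """Rotate 3-vector v by unit quaternion q via q * (0, v) * q^-1."""
    _, x, y, z = quat_mul(quat_mul(q, (0.0, *v)), quat_conj(q))
    return (x, y, z)


class PoseGraph:
    def __init__(self):
        self.pose = {}      # name -> (position, quaternion)
        self.level = {}     # name -> refinement step
        self.children = {}  # parent -> [(child, relative position, relative quaternion)]

    def insert(self, name, position, quat, level, anchor=None):
        """Add a node; if anchored, record its pose in the anchor's frame."""
        self.pose[name] = (position, quat)
        self.level[name] = level
        self.children[name] = []
        if anchor is not None:
            anchor_pos, anchor_quat = self.pose[anchor]
            inv = quat_conj(anchor_quat)
            offset = tuple(c - a for c, a in zip(position, anchor_pos))
            self.children[anchor].append((name, rotate(inv, offset), quat_mul(inv, quat)))

    def move(self, name, position, quat):
        """Set a new pose and push it down the strong-dependency subtree."""
        self.pose[name] = (position, quat)
        self._propagate(name)

    def _propagate(self, parent):
        parent_pos, parent_quat = self.pose[parent]
        for child, rel_pos, rel_quat in self.children[parent]:
            offset = rotate(parent_quat, rel_pos)
            new_pos = tuple(p + o for p, o in zip(parent_pos, offset))
            self.pose[child] = (new_pos, quat_mul(parent_quat, rel_quat))
            self._propagate(child)


identity = (1.0, 0.0, 0.0, 0.0)
g = PoseGraph()
g.insert("table", (0.0, 0.0, 0.0), identity, level=0)
g.insert("lamp", (1.0, 0.0, 0.5), identity, level=1, anchor="table")
g.insert("shade", (1.0, 0.0, 1.0), identity, level=2, anchor="lamp")

# Move the table and turn it 90 degrees about the vertical (z) axis.
half = math.sqrt(0.5)
g.move("table", (2.0, 0.0, 0.0), (half, 0.0, 0.0, half))
lamp_pos, lamp_quat = g.pose["lamp"]
shade_pos, _ = g.pose["shade"]
```

Because each child stores its pose in the parent's frame, moving the table carries the lamp and, through the lamp, the shade, without touching any node outside that subtree.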
3. Mathematical Objectives and Consistency Enforcement
PHiSSG enforces several objectives, deterministically maintained during graph update rather than through gradient-based learning:
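- Dependency Consistency ($\mathcal{L}_{\text{dep}}$): Minimizes deviation between actual and recorded relative transforms.
- Stability Constraint ($\mathcal{L}_{\text{stab}}$): Penalizes violation of physical support constraints.
- Semantic Coherence ($\mathcal{L}_{\text{sem}}$): Encourages compatible semantic embeddings among linked objects.
The total objective is a weighted sum: $\mathcal{L} = \lambda_{\text{dep}} \mathcal{L}_{\text{dep}} + \lambda_{\text{stab}} \mathcal{L}_{\text{stab}} + \lambda_{\text{sem}} \mathcal{L}_{\text{sem}}$, with scalar weights $\lambda_{\text{dep}}$, $\lambda_{\text{stab}}$, and $\lambda_{\text{sem}}$.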
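The exact functional forms of the three terms are not given here, so the following sketch shows only one plausible instantiation under stated assumptions: the dependency term is a squared error on translation offsets (orientation is omitted for brevity), the stability term is the squared horizontal distance from the child's center to an axis-aligned rectangular support region, and the semantic term is one minus cosine similarity. All function names, the rectangle encoding `(xmin, xmax, ymin, ymax)`, and the weights are hypothetical.

```python
import math


def dependency_loss(positions, strong_edges):
    """Squared deviation between actual and recorded parent-to-child offsets."""
    total = 0.0
    for parent, child, recorded in strong_edges:
        actual = [c - p for c, p in zip(positions[child], positions[parent])]
        total += sum((a - r) ** 2 for a, r in zip(actual, recorded))
    return total


def stability_loss(positions, on_edges, support):
    """Squared horizontal distance from each child center to its parent's support rectangle."""
    total = 0.0
    for parent, child in on_edges:
        x, y, _ = positions[child]
        xmin, xmax, ymin, ymax = support[parent]
        dx = max(xmin - x, 0.0, x - xmax)  # zero when x lies inside [xmin, xmax]
        dy = max(ymin - y, 0.0, y - ymax)
        total += dx * dx + dy * dy
    return total


def semantic_loss(embeddings, edges):
    """Sum of (1 - cosine similarity) over linked node pairs."""
    total = 0.0
    for i, j in edges:
        a, b = embeddings[i], embeddings[j]
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        total += 1.0 - dot / norm
    return total


def total_loss(dep, stab, sem, w_dep=1.0, w_stab=1.0, w_sem=1.0):
    """Weighted sum of the three terms."""
    return w_dep * dep + w_stab * stab + w_sem * sem


positions = {"table": (0.0, 0.0, 0.4), "lamp": (0.7, 0.0, 0.9)}
embeddings = {"table": [1.0, 0.0], "lamp": [0.0, 1.0]}
support = {"table": (-0.5, 0.5, -0.5, 0.5)}

dep = dependency_loss(positions, [("table", "lamp", (0.5, 0.0, 0.5))])
stab = stability_loss(positions, [("table", "lamp")], support)
sem = semantic_loss(embeddings, [("table", "lamp")])
loss = total_loss(dep, stab, sem, w_sem=0.5)
```

In this toy scene the lamp sits 0.2 m beyond both its recorded offset and the table edge, so the dependency and stability terms are each about 0.04, while the orthogonal embeddings give a semantic term of 1. Since the text describes these objectives as maintained deterministically rather than optimized by gradient descent, such functions would serve mainly as diagnostics.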
4. Role in Generative and Perceptual Systems
PHiSSG is integral to frameworks for both real-time spatial perception and generative scene composition:
- Spatial Perception (Hydra system): Implements a five-layer hierarchy (geometric mesh, objects/agents, places, rooms, building), enabling real-time construction, efficient inference via small treewidth, and global optimization including loop-closure corrections. Layers are built incrementally from sensor data (RGB, depth, odometry), semantic segmentation, object clustering, free-space topology derived from a generalized Voronoi diagram (GVD), and room classification via neural networks (Hughes et al., 2023). The structure supports efficient optimization and robust mapping in robotic systems.
- Generative Scene Synthesis (HiGS framework): PHiSSG acts as the persistent memory that spans progressive, user-driven synthesis rounds in 3D scene generation. Its hierarchical tagging and one-to-one node-object mapping ensure spatial, geometric, and semantic consistency as sub-scenes are merged, objects added or refined, and global style is maintained. Recursive update protocols guarantee that changes propagate coherently throughout the hierarchy, preserving both local plausibility and scene-wide constraints (Hong et al., 31 Oct 2025).
5. Ensuring Spatial, Semantic, and Hierarchical Coherence
PHiSSG guarantees consistency by:
- Maintaining a one-to-one correspondence between nodes and 3D object instances, preventing duplication or omission.
- Explicitly recording and recursively applying rigid-body transforms on “strong” parent-child dependencies, enabling accurate local-global alignment after layout changes.
- Enforcing stability, so no object violates physical constraints of support (e.g., a lamp stays atop a table).
- Encoding semantic coherence through semantic edges, with enforcement via vision-language models and embedding similarity.
- Utilizing the hierarchical index to restrict refinements to the correct level of abstraction and preserve generational context.
- Following each graph update with deterministic pose and stability propagation, ensuring end-to-end scene consistency (Hong et al., 31 Oct 2025).
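The stability step in the list above (keeping a lamp atop a table) can be sketched for the simplest case. This is an illustration under a strong assumption: the parent's support region is an axis-aligned rectangle `(xmin, xmax, ymin, ymax)`, for which the minimal horizontal shift is a per-axis clamp. The function name `stabilize` is hypothetical, and a general support polygon would instead need a projection onto its nearest edge.

```python
def stabilize(child_center, support):
    """Return (corrected_center, shift) for a child resting on a rectangular support.

    The shift is the smallest horizontal displacement that brings the child's
    center inside the rectangle; the height (z) is left unchanged.
    """
    x, y, z = child_center
    xmin, xmax, ymin, ymax = support
    new_x = min(max(x, xmin), xmax)  # clamp is the Euclidean projection onto a box
    new_y = min(max(y, ymin), ymax)
    return (new_x, new_y, z), (new_x - x, new_y - y)


table_top = (-0.5, 0.5, -0.5, 0.5)

# A lamp whose center overhangs the table edge by 0.2 m along x.
corrected, shift = stabilize((0.7, 0.1, 0.9), table_top)

# A lamp already inside the support region is left where it is.
unchanged, no_shift = stabilize((0.2, -0.3, 0.9), table_top)
```

Running the correction after pose propagation, rather than before, means that a child carried along by its parent is only nudged when the propagated pose actually leaves the support region.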
6. Empirical Performance and Implementation
Empirical results in the Hydra system demonstrate PHiSSG’s scalability and efficiency:
- Real-time 3D mapping with semantic labeling at 1 Hz on embedded hardware.
- Mesh/object/place updates in sub-100 ms per keyframe; room clustering in 5–15 ms.
- Reported accuracy for object detection (70–95% found/correct in small/medium scenes), place localization (5–20 cm average error), and room classification (45–55% on synthetic, 30% on real data).
- Lower memory usage compared to non-hierarchical ESDF-based methods, and improved loop-closure detection rates.
In the HiGS generative setting, PHiSSG facilitates user-guided, multi-step scene expansion, yielding state-of-the-art controllability and scene plausibility compared to single-stage frameworks (Hong et al., 31 Oct 2025).
7. Significance and Broader Implications
PHiSSG represents a key advance in bridging geometric structure with semantic understanding for both autonomous agents and generative models. Its progressive and hierarchical formulation allows dynamic expansion, efficient inference, robust optimization, and coherent scene manipulation at scale. The strict correspondence between objects, graph levels, and relational edges provides a precise scaffold for integrating multi-modal perceptual signals or user intent, suggesting broad applicability across robotics, virtual environment generation, and human-in-the-loop scene design (Hughes et al., 2023, Hong et al., 31 Oct 2025).