- The paper introduces a novel 3D pipeline that bypasses 2D limitations, achieving an HSS of 0.43 for accurate roof wireframe reconstruction.
- It employs a three-stage method—vertex candidate generation, refinement, and edge prediction—using PointNet-based architectures to address 3D misalignments.
- Experimental analyses and ablation studies demonstrate enhanced vertex and edge detection, offering practical efficiency and scalability for real-world applications.
Structured Semantic 3D Reconstruction (S23DR) Challenge 2025 — Winning Solution
The paper presents the methodology and results of the winning entry for the S23DR Challenge 2025, where the objective is accurate prediction of house roof 3D wireframes from sparse point clouds and semantic segmentations. The approach marks a decisive departure from 2D-driven methods, instead implementing a pipeline that operates natively in the 3D domain. The design leverages PointNet-based deep models to localize vertices and to predict edge connectivity directly from the 3D point cloud, circumventing the limitations of 2D-to-3D feature lifting inherent in baseline solutions.
The core task is wireframe reconstruction: recovering roof structure as a set of 3D vertices and their interconnecting edges. Evaluation is performed using the Hybrid Structure Score (HSS), the harmonic mean of the vertex F1 score and edge IoU, incentivizing balanced accuracy for detection and connectivity.
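The metric can be stated compactly; a minimal sketch (the function name is ours):

```python
def hss(vertex_f1: float, edge_iou: float) -> float:
    """Hybrid Structure Score: harmonic mean of vertex F1 and edge IoU."""
    if vertex_f1 + edge_iou == 0.0:
        return 0.0
    return 2.0 * vertex_f1 * edge_iou / (vertex_f1 + edge_iou)
```

Because the harmonic mean is dominated by the smaller term, a solution cannot compensate for poor edge connectivity with good vertex detection, or vice versa.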
Inputs for each instance are derived from the HoHo25k dataset, comprising:
- Sparse 3D point clouds and camera parameters (via COLMAP SfM).
- Metric depth maps (Metric3Dv2) and two sets of semantic segmentations (ADE20k, Gestalt).
- The absence of original RGB images poses strict constraints, increasing reliance on derived geometric and segmentation cues. The point clouds are often incomplete due to real-world occlusions, and the dataset exhibits non-trivial misalignment between 3D and 2D (projection) data because of inconsistent camera parameter sets.
Methodological Contributions
Critique of the Baseline
The baseline operates in a 2D-first paradigm: it detects semantic features (vertices and edges) in segmentation masks and lifts these to 3D either via point cloud correspondence or depth projection. This approach is sensitive to errors and misalignment between the 2D and 3D data representations, resulting in spurious and inaccurate reconstructions.
Three-Stage 3D Pipeline
To counteract these weaknesses, the authors propose a robust, three-stage direct-3D solution:
- 3D Vertex Candidate Generation
- Utilizing Gestalt segmentation classes, vertices are hypothesized by clustering the COLMAP points whose 2D projections fall inside vertex-class segmentation masks ("apex", "eave end point", etc.) across all views.
- To address the misalignment between modalities, binary dilation is iteratively applied to segmentation masks until clusters capture a sufficient number of 3D points.
- Clusters are merged based on spatial overlap criteria.
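The iterative dilation step above can be sketched as follows; `dilate4` is a NumPy-only stand-in for a standard binary dilation, and the `min_points`/`max_iters` values are illustrative, not the authors' settings:

```python
import numpy as np

def dilate4(mask: np.ndarray) -> np.ndarray:
    """One step of 4-connected binary dilation (NumPy-only stand-in
    for scipy.ndimage.binary_dilation)."""
    out = mask.copy()
    out[1:, :] |= mask[:-1, :]
    out[:-1, :] |= mask[1:, :]
    out[:, 1:] |= mask[:, :-1]
    out[:, :-1] |= mask[:, 1:]
    return out

def points_in_grown_mask(mask, uv, min_points=5, max_iters=10):
    """Dilate `mask` until it captures at least `min_points` of the
    projected 3D points `uv` ((N, 2) integer pixel coords, x then y)."""
    grown = mask.astype(bool).copy()
    for _ in range(max_iters):
        if grown[uv[:, 1], uv[:, 0]].sum() >= min_points:
            break
        grown = dilate4(grown)
    hit = grown[uv[:, 1], uv[:, 0]]
    return grown, np.flatnonzero(hit)
```

Growing the mask rather than trusting the raw projection is what absorbs the 2D/3D misalignment noted above: points that narrowly miss a vertex mask are still assigned to its cluster.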
- Vertex Refinement and Classification
- For each candidate, a 4 m³ patch is extracted and represented in an 11D feature space (relative coordinates, color, segmentation labels, candidate flags).
- An extended PointNet-like architecture, with multi-task output heads for classification, position regression, and confidence scoring, is used.
- The model outputs a refined position, a validity confidence, and a semantic label. Low-scoring candidates are pruned.
```python
# Vertex network forward pass (simplified)
def vertex_net(patch_points):
    features = pointwise_conv(patch_points)     # per-point feature extraction
    attended = channel_attention(features)      # channel-wise reweighting
    # Hybrid pooling: weighted mix of max- and mean-pooled features
    pooled = 0.7 * attended.max(dim=1).values + 0.3 * attended.mean(dim=1)
    shared = mlp_layers(pooled)
    position = position_head(shared)            # refined 3D vertex position
    confidence = confidence_head(shared)        # validity score used for pruning
    class_logits = class_head(shared)           # semantic vertex label
    return position, confidence, class_logits
```
- Edge Prediction
- For every vertex pair, a cylindrical region centered on the segment connecting the two vertices is defined, and all points falling inside it are collected.
- A secondary, lighter PointNet classifier consumes the 6D representation of these points (relative position and color), producing a binary score for the existence of an edge.
```python
# Edge existence classification (simplified)
def edge_classifier(cylinder_points):
    features = pointwise_conv(cylinder_points)  # per-point features from the 6D input
    pooled = features.max(dim=1).values         # global max pooling
    edge_logit = mlp_layers(pooled)
    return torch.sigmoid(edge_logit)            # probability that the edge exists
```
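The cylindrical neighborhood feeding this classifier can be gathered with a point-to-segment distance test; a NumPy sketch (the radius value is an assumption, not taken from the paper):

```python
import numpy as np

def cylinder_points(points, v0, v1, radius=0.3):
    """Return the points lying inside a cylinder of the given radius
    around the segment v0-v1 (all arrays in metric 3D coordinates)."""
    axis = v1 - v0
    length = np.linalg.norm(axis)
    axis = axis / length
    rel = points - v0
    t = rel @ axis                       # coordinate along the axis
    along = (t >= 0.0) & (t <= length)   # between the two end caps
    radial = rel - np.outer(t, axis)     # perpendicular offset
    close = np.linalg.norm(radial, axis=1) <= radius
    return points[along & close]
```

Running this test for every vertex pair is the source of the quadratic cost discussed in the efficiency section.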
This architecture ensures that both vertex localization and edge detection are purely 3D and learn robust geometric priors, mitigating upstream input noise and cross-modality errors.
Experimental Analysis and Ablation
An ablation study demonstrates the incremental gains from each system component:
| Configuration | Mean HSS | Mean F1 | Mean IoU |
|---|---|---|---|
| Baseline | 0.148 | 0.220 | 0.122 |
| + tuned parameters | 0.233 | 0.304 | 0.200 |
| 3D segmentation (no DL) | 0.257 | 0.345 | 0.217 |
| + vertex classification | 0.260 | 0.387 | 0.207 |
| + edge classification | 0.286 | 0.387 | 0.239 |
The final pipeline, with both vertex and edge classifiers enabled, achieves an HSS of 0.43 on the private leaderboard—demonstrating the value of properly modeling both geometric localization and connectivity.
Thresholding Effects
Performance is robust to the choice of vertex confidence threshold provided it stays below 0.7. The edge classifier threshold, by contrast, substantially affects IoU and HSS, underscoring the importance of calibrating edge scores for deployment.
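A generic way to calibrate the edge threshold is to sweep candidate values on a validation split and keep the one maximizing edge IoU; a sketch of that procedure (illustrative, not the authors' exact method):

```python
import numpy as np

def best_edge_threshold(scores, labels, thresholds=np.linspace(0.1, 0.9, 17)):
    """Sweep thresholds over predicted edge scores and ground-truth
    edge labels, returning the threshold with the highest IoU."""
    best_t, best_iou = thresholds[0], -1.0
    for t in thresholds:
        pred = scores >= t
        inter = np.logical_and(pred, labels).sum()
        union = np.logical_or(pred, labels).sum()
        iou = inter / union if union else 0.0
        if iou > best_iou:
            best_t, best_iou = t, iou
    return best_t, best_iou
```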
Efficiency and Practical Considerations
- Both neural networks are lightweight and were trained on a single A100 GPU within hours; inference per house is dominated by the quadratic number of edge checks but still remains practical (<10ms per pair on a T500 GPU).
- The approach scales linearly with the number of buildings, but edge prediction has inherent quadratic complexity with respect to vertex count.
- No data augmentation was used; further robustness could be achieved by integrating geometric and photometric transformations.
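As one example of the geometric transformations suggested above, a random rotation about the vertical axis with small positional jitter could be applied to training patches (an illustrative sketch, not part of the published pipeline; the jitter scale is our assumption):

```python
import numpy as np

def augment_patch(points, rng):
    """Rotate a point patch about the z (vertical) axis by a random
    angle and add small Gaussian jitter (~1 cm) to each coordinate."""
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])
    out = points @ rot.T
    return out + rng.normal(scale=0.01, size=out.shape)
```

Rotations about the vertical axis are a natural choice here because roof geometry has no preferred compass orientation, while the gravity direction is meaningful and should be preserved.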
Limitations and Future Directions
The chief limitation is the pipeline's reliance on the accuracy and completeness of the upstream sparse point cloud and semantic segmentation. Model failures are directly attributable to input artefacts such as missing geometry or segmentation misclassifications, for which the network is not designed to compensate. Improvements in point cloud density or segmentation reliability would result in corresponding gains.
Potential research avenues include:
- Employing more advanced 3D backbones incorporating self-attention or stronger local context aggregation.
- Joint, end-to-end learning of the whole wireframe as a graph from input points for improved structure consistency.
- Augmenting with simulated data or more aggressive augmentation to improve generalization.
Implications
The proposed approach clarifies several practical and theoretical issues for the field:
- Geometric Robustness: Direct 3D modeling sidesteps major failure modes associated with 2D-to-3D lifting under real-world misalignment and incomplete data.
- Efficiency: Effective wireframe prediction can be achieved without heavyweight models or multi-image, multi-scale fusion.
- Scalability: The modular pipeline lends itself to industrial GIS or roof analysis workflows, especially where camera or image data remain proprietary or unavailable.
- Extensibility: The research lays groundwork for more unified segmentation-connectivity tasks in structured 3D reconstruction, suggesting that further efficiency and accuracy gains are likely with more expressive neural architectures.
In summary, the winning solution exemplifies the advantages of 3D point-centric deep learning for structured geometric prediction and provides a practical implementation template for related vision-reconstruction problems. Future research in this direction will likely unify geometric, semantic, and structural inference for challenging real-world spatial AI tasks.