Self-Supervised 3D Scene Completion
- Self-supervised scene completion is a method that infers complete 3D geometry and semantic labels from partial, noisy data using proxy objectives and multi-view cues.
- It leverages implicit neural fields, explicit voxel representations, and object-level de-occlusion networks to predict occluded surfaces and recover amodal structures.
- Experimental results show competitive geometric reconstruction compared to supervised models, though challenges remain in precise semantic prediction and dynamic scene handling.
Self-supervised scene completion addresses the problem of inferring complete geometric and semantic 3D descriptions of real-world scenes from partial, noisy, or incomplete observations—most critically, in the absence of densely labeled 3D ground truth. This paradigm leverages intrinsic structure and multi-view consistency in data such as videos, RGB(-D) scan sequences, or multi-view images to supervise models through proxy objectives, sidestepping the cost and limitations of exhaustive human annotation or specialized sensors. Major advances include reconstructing unseen geometry, predicting semantic labels for occluded regions, recovering amodal object structures, and producing detailed 3D feature fields competitive with supervised alternatives.
1. Problem Scope and Formulation
Self-supervised scene completion formalizes several interrelated tasks within 3D perception:
- Geometric scene completion: Predicts occupancy or signed distance fields at each voxel or 3D point, such that occluded or unobserved surfaces are inferred based on partial measurements (e.g., RGB-D, monocular images, multi-view video).
- Semantic scene completion (SSC): Simultaneously estimates per-voxel occupancy and assigns semantic class labels (over a fixed set of categories) to the completed geometry.
- Amodal scene parsing and de-occlusion: Recovers hidden object extent and determines occlusion order, estimating both the full (amodal) masks and appearance of partially visible scene objects.
Formally, for a continuous 3D domain $\Omega \subset \mathbb{R}^3$, models estimate a mapping $f: \Omega \to [0, 1] \times \Delta^{C-1}$, $\mathbf{x} \mapsto (\sigma(\mathbf{x}), s(\mathbf{x}))$, where $\sigma(\mathbf{x})$ is the predicted occupancy/density and $s(\mathbf{x})$ is a categorical semantic probability over $C$ classes (or, in geometry-only approaches, a truncated signed distance field $d(\mathbf{x})$).
Supervision is indirect: training signals derive from reprojection losses, synthetic occlusion, multi-view photometric consistency, or completion of artificially masked regions. No dense 3D ground-truth labels are required, and commonly only video, posed image sequences, or unannotated RGB(-D) frames are used for supervision (Hayler et al., 2023, Han et al., 2024, Jevtić et al., 8 Jul 2025, Dai et al., 2019, Zhan et al., 2020).
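The formulation above can be made concrete with a toy implicit field: a small MLP with random weights stands in for a trained decoder. The function name `scene_field`, the class count, and the layer sizes are illustrative only, not taken from any cited model.

```python
import numpy as np

rng = np.random.default_rng(0)
C = 19  # number of semantic classes (hypothetical, e.g. a Cityscapes-style label set)

# Toy MLP weights standing in for a trained decoder.
W1, b1 = rng.standard_normal((3, 64)), np.zeros(64)
W2, b2 = rng.standard_normal((64, 1 + C)), np.zeros(1 + C)

def scene_field(x):
    """Map 3D query points x (N, 3) to (occupancy in [0, 1], class probs over C)."""
    h = np.maximum(x @ W1 + b1, 0.0)           # ReLU hidden layer
    out = h @ W2 + b2
    occ = 1.0 / (1.0 + np.exp(-out[:, 0]))     # sigmoid -> occupancy/density
    logits = out[:, 1:]
    sem = np.exp(logits - logits.max(1, keepdims=True))
    sem /= sem.sum(1, keepdims=True)           # softmax -> categorical semantics
    return occ, sem

occ, sem = scene_field(rng.standard_normal((5, 3)))
```

Any density-to-occupancy transfer (sigmoid here) and any class count can be substituted; the essential point is the signature: points in, occupancy plus a categorical distribution out.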
2. Foundational Approaches and Model Architectures
Implicit Neural Fields
Recent paradigms (S4C, SceneDINO, KDBTS/MVBTS) model the scene as an implicit field $f(\mathbf{x}) = (\ell(\mathbf{x}), \sigma(\mathbf{x}))$, where $\ell(\mathbf{x}) \in \mathbb{R}^{C}$ are semantic logits and $\sigma(\mathbf{x}) \geq 0$ is a volumetric density. Networks consist of a 2D feature extractor (e.g., ResNet-50 or a DINO-pretrained ViT) producing pixel-aligned features from a single or multiple images, followed by one or more MLP decoders that, for a query point $\mathbf{x}$, retrieve local features from the associated view(s), concatenate them with a positional encoding $\gamma(\mathbf{x})$, and output occupancy and semantics (Hayler et al., 2023, Jevtić et al., 8 Jul 2025, Han et al., 2024).
- In SceneDINO, the 2D encoder produces a pixel-aligned feature map; the 3D MLP maps a query point (via bilinear interpolation of the feature map at its projected image coordinates) to a 3D feature, with subsequent 3D feature distillation to obtain semantics (Jevtić et al., 8 Jul 2025).
- In S4C and KDBTS, field queries involve projecting to image coordinates, sampling features, and decoding density/semantics via two shallow MLPs (Hayler et al., 2023, Han et al., 2024).
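The query pipeline shared by these models, pinhole projection, bilinear feature sampling, and sinusoidal positional encoding, can be sketched in a simplified single-view form. The intrinsics, feature-map size, and encoding frequencies below are illustrative, not values from any cited paper.

```python
import numpy as np

def positional_encoding(x, n_freqs=4):
    """Sinusoidal encoding of 3D points (N, 3) -> (N, 3 * 2 * n_freqs)."""
    freqs = 2.0 ** np.arange(n_freqs) * np.pi
    ang = x[:, :, None] * freqs                    # (N, 3, n_freqs)
    enc = np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)
    return enc.reshape(x.shape[0], -1)

def bilinear_sample(feat, uv):
    """Sample a feature map (H, W, D) at continuous pixel coords uv (N, 2)."""
    H, W, _ = feat.shape
    u = np.clip(uv[:, 0], 0, W - 1.001)
    v = np.clip(uv[:, 1], 0, H - 1.001)
    u0, v0 = u.astype(int), v.astype(int)
    du, dv = (u - u0)[:, None], (v - v0)[:, None]
    return (feat[v0, u0] * (1 - du) * (1 - dv) + feat[v0, u0 + 1] * du * (1 - dv)
            + feat[v0 + 1, u0] * (1 - du) * dv + feat[v0 + 1, u0 + 1] * du * dv)

def query_field(x_world, feat, K):
    """Project points, gather pixel-aligned features, concat positional encoding."""
    uvw = x_world @ K.T                            # pinhole projection (camera frame)
    uv = uvw[:, :2] / uvw[:, 2:3]
    f_local = bilinear_sample(feat, uv)
    return np.concatenate([f_local, positional_encoding(x_world)], axis=1)

K = np.array([[100.0, 0, 64], [0, 100.0, 64], [0, 0, 1]])   # toy intrinsics
feat = np.random.default_rng(0).standard_normal((128, 128, 32))
x = np.array([[0.1, -0.2, 2.0], [0.0, 0.0, 3.0]])
q = query_field(x, feat, K)                        # per-point decoder input
```

The returned vectors would feed the shallow density/semantics MLPs described above; multi-view variants simply repeat the sampling per view and aggregate.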
Explicit Voxel Representations and Sparse Hierarchies
SG-NN adopts sparse 3D convolutions operating on a TSDF grid, explicitly representing only observed or predicted occupied voxels. The coarse-to-fine network structure enables memory-efficient, high-resolution inference, with each level predicting occupancy and distance, and subsequent levels refining regions flagged as likely occupied (Dai et al., 2019).
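The coarse-to-fine control flow can be sketched as follows, with trivial stand-ins for the learned per-level predictors; the real SG-NN uses sparse 3D convolutions rather than this dense toy TSDF, and `coarse_to_fine_complete` and `refine_fn` are hypothetical names.

```python
import numpy as np

def coarse_to_fine_complete(tsdf_coarse, refine_fn, tau=0.5):
    """Two-level sketch of an SG-NN-style hierarchy: flag likely-occupied coarse
    voxels, then evaluate the fine level only inside those voxels."""
    # Stand-in "coarse prediction": occupancy from proximity to the surface.
    occ_coarse = (np.abs(tsdf_coarse) < tau).astype(float)
    active = np.argwhere(occ_coarse > 0)           # sparse set of occupied voxels
    fine = {}
    for idx in map(tuple, active):
        # Subdivide each active coarse voxel into 2x2x2 children and refine.
        for child in np.ndindex(2, 2, 2):
            fine[idx + child] = refine_fn(idx, child)
    return occ_coarse, fine

# Toy TSDF with two near-surface voxels; "refinement" just marks children occupied.
tsdf = np.full((4, 4, 4), 1.0)
tsdf[1, 1, 1] = 0.1
tsdf[2, 2, 2] = -0.2
occ, fine = coarse_to_fine_complete(tsdf, lambda parent, child: 1.0)
```

The memory saving comes from the sparse dictionary: only children of active coarse voxels are ever materialized, which is what makes high-resolution inference tractable.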
Object-level De-occlusion Networks
PCNet-M and PCNet-C operate in the 2D image domain to recover object orders, amodal masks, and background inpaintings from only modal mask annotations, by artificially erasing visible regions and reconstructing them with partial completion losses (Zhan et al., 2020).
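The self-supervised pair construction behind partial completion can be illustrated with binary masks; `make_partial_completion_pair` is a hypothetical helper for illustration, not PCNet's actual API.

```python
import numpy as np

def make_partial_completion_pair(modal_mask, occluder_mask):
    """Build a PCNet-M-style self-supervised pair: artificially erase part of a
    visible (modal) mask with another object's mask. The network must restore
    the erased region; penalizing completion beyond the original modal mask is
    what keeps it from over-completing."""
    erased = modal_mask & ~occluder_mask    # network input: partially erased mask
    target = modal_mask                     # partial-completion target
    return erased, target

# Toy example: a 4x4 object partially covered by a synthetic occluder.
modal = np.zeros((8, 8), dtype=bool); modal[2:6, 2:6] = True
occluder = np.zeros((8, 8), dtype=bool); occluder[4:8, 4:8] = True
inp, tgt = make_partial_completion_pair(modal, occluder)
```

At inference, the same completion network is applied to real occlusions, where the true amodal extent is unknown; the synthetic erasures are what make training possible from modal annotations alone.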
3. Self-Supervised Learning Objectives
Self-supervision in scene completion is realized through several interlocking objectives:
- Multi-view photometric reconstruction: The core loss for implicit field models. Colors are rendered at target views from the reconstructed 3D field and compared against the observed images, enforcing image-level consistency (via L1 + SSIM) across views and thus propagating geometric and semantic constraints to occluded regions (Hayler et al., 2023, Han et al., 2024, Jevtić et al., 8 Jul 2025).
- Edge-aware depth and feature smoothness: Encourage spatial coherence in depth or feature predictions, penalizing irregularities that do not align with image edges (Hayler et al., 2023, Han et al., 2024, Jevtić et al., 8 Jul 2025).
- Semantic consistency and pseudo-label distillation: Use segmentation maps (from off-the-shelf 2D segmenters or unsupervised DINO) as pseudo-targets for rendered semantic predictions, matching softmax outputs at ray samples (Hayler et al., 2023, Jevtić et al., 8 Jul 2025).
- 3D feature distillation and clustering: Enforce consistency among 3D features reconstructed at surface points for unsupervised semantics; combine a STEGO-style contrastive correlation loss over feature pairs or neighboring samples, then cluster the features to obtain 3D pseudo-labels (Jevtić et al., 8 Jul 2025).
- Partial observation completion losses: In geometry-only models (SG-NN), match predicted TSDF and occupancy only where ground truth is partially available, masking out unobserved regions and employing multi-resolution targets (Dai et al., 2019).
- Object ordering and completion: For de-occlusion, train networks by simulating and reconstructing masked regions and enforcing regularization to avoid over-completion (Zhan et al., 2020).
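As an illustration of the first objective above, a widely used photometric term blends SSIM and L1 with a weight alpha near 0.85. The single-window SSIM below is a simplification of the usual local-window variant, and the function names are ours.

```python
import numpy as np

def ssim(a, b, c1=0.01**2, c2=0.03**2):
    """Global (single-window) SSIM between two images, a simplification of the
    local-window SSIM commonly used in monocular self-supervision."""
    mu_a, mu_b = a.mean(), b.mean()
    va, vb = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a**2 + mu_b**2 + c1) * (va + vb + c2))

def photometric_loss(rendered, target, alpha=0.85):
    """Alpha-blended photometric term: alpha * (1 - SSIM) / 2 + (1 - alpha) * L1."""
    l1 = np.abs(rendered - target).mean()
    return alpha * (1.0 - ssim(rendered, target)) / 2.0 + (1.0 - alpha) * l1

a = np.random.default_rng(0).random((16, 16))
same = photometric_loss(a, a)        # identical images: loss is (numerically) zero
diff = photometric_loss(a, 1.0 - a)  # inverted image: large loss
```

Because the loss is differentiable in the rendered colors, its gradients flow back through volume rendering into the density and feature fields, which is how image-space supervision shapes 3D geometry.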
4. Inference, Completion, and Semantics Extraction
At inference, implicit field models densely sample the predicted field within the view frustum (or the whole scene grid) to produce 3D occupancy and semantic labels for all voxels, including those never directly observed. Standard post-processing includes:
- Volume marching: For rendering and voxelization, aggregate query results across the samples falling in each voxel, assigning occupancy by thresholding the aggregated density and semantics via occupancy-weighted averaging (Hayler et al., 2023).
- Hole-filling: Nearest-neighbor/adjacent voxel heuristics mitigate isolated missing predictions (Hayler et al., 2023).
- Feature clustering: In unsupervised semantics, perform k-means (or similar) on 3D feature vectors, then assign clusters to semantic labels via Hungarian matching (Jevtić et al., 8 Jul 2025).
- Ordering-based amodal mask recovery: Propagate inferred occlusion graphs to generate amodal masks and inpaint content recursively, as in scene de-occlusion (Zhan et al., 2020).
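The feature clustering step above can be sketched end to end. Exhaustive permutation search stands in for Hungarian matching here (scipy's `linear_sum_assignment` is the scalable equivalent), the k-means initialization is a deliberately simple deterministic choice, and all data is synthetic.

```python
import numpy as np
from itertools import permutations

def kmeans(x, k, iters=20):
    """Plain k-means on feature vectors x (N, D) with a deterministic init."""
    centers = x[np.linspace(0, len(x) - 1, k).astype(int)]
    for _ in range(iters):
        assign = ((x[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = x[assign == j].mean(0)
    return assign

def match_clusters(cluster_ids, pseudo_labels, k):
    """Map cluster ids to semantic labels by maximizing overlap with pseudo-labels
    (brute force over permutations; fine for small k)."""
    best_score, best_perm = -1, None
    for perm in permutations(range(k)):
        score = sum(np.sum((cluster_ids == c) & (pseudo_labels == perm[c]))
                    for c in range(k))
        if score > best_score:
            best_score, best_perm = score, perm
    return np.array([best_perm[c] for c in cluster_ids])

# Two well-separated 3D "feature" blobs whose pseudo-labels use opposite ids.
rng = np.random.default_rng(0)
feats = np.concatenate([rng.normal(0, 0.1, (20, 3)), rng.normal(5, 0.1, (20, 3))])
pseudo = np.array([1] * 20 + [0] * 20)
labels = match_clusters(kmeans(feats, 2), pseudo, 2)
```

The matching step is what makes cluster indices, which are arbitrary, comparable to a fixed label set for evaluation or pseudo-labeling.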
5. Experimental Benchmarks and Quantitative Results
Self-supervised scene completion has attained competitive performance against fully supervised alternatives.
- Geometric completion, ℓ1 reconstruction error (lower is better): SG-NN on Matterport3D at 5 cm voxel resolution (Dai et al., 2019):

| Method (supervision) | all  | unobs | near-tar | near-pred |
|----------------------|------|-------|----------|-----------|
| 3D-EPN (sup)         | 0.31 | 0.28  | 0.45     | 1.12      |
| ScanComplete (sup)   | 0.20 | 0.15  | 0.51     | 0.74      |
| SG-NN (self-sup)     | 0.17 | 0.14  | 0.35     | 0.67      |
- Semantic Scene Completion, SSCBench-KITTI-360, 51.2 m range (Hayler et al., 2023, Jevtić et al., 8 Jul 2025):
| Method                 | IoU    | mIoU   |
|------------------------|--------|--------|
| SceneDINO (unsup)      | 37.60% | 8.00%  |
| S4C (self-sup, 2D seg) | 39.35% | 10.19% |
| VoxFormer-S (3D sup)   | —      | 13.81% |
| OccFormer (3D sup)     | —      | 19.23% |
| SSCNet (3D sup, depth) | —      | 38.76% |
- Amodal mask completion (KINS) (Zhan et al., 2020):
| Method                        | mask IoU (KINS) |
|-------------------------------|-----------------|
| PCNet-M + ordering (self-sup) | 94.8%           |
| UNet (supervised)             | 94.8%           |
| Convex-R baseline             | 90.8%           |
- Single-view density completion (KITTI-360, O_acc) (Han et al., 2024):
| Method              | O_acc  | O_prec | O_rec  |
|---------------------|--------|--------|--------|
| BTS                 | 94.47% | 58.73% | 84.24% |
| KDBTS (single-view) | 94.76% | 60.68% | 84.78% |
| MVBTS (multi-view)  | 94.91% | 61.73% | 85.78% |
- 2D–3D domain generalization (SceneDINO) (Jevtić et al., 8 Jul 2025):
| Target ← Source        | mIoU (2D U-Seg) |
|------------------------|-----------------|
| Cityscapes ← KITTI-360 | 22.8%           |
| BDD100K ← KITTI-360    | 22.1%           |
A consistent trend is that self-supervised and unsupervised models close a significant portion of the gap to fully supervised systems, particularly for occupancy prediction, while semantic completion remains challenging—mIoU values lag supervised baselines, but linear probing of learned 3D features suggests further improvements are tractable (Jevtić et al., 8 Jul 2025, Hayler et al., 2023).
6. Strengths, Limitations, and Prospective Directions
Strengths:
- Eliminates need for ground-truth 3D (or amodal) annotation; enables learning on large-scale real data with minimal labeling cost (Dai et al., 2019, Hayler et al., 2023, Jevtić et al., 8 Jul 2025).
- Models generalize well to novel or distant viewpoints and can fill completely occluded regions by leveraging geometry or semantics learned from data-wide priors (Hayler et al., 2023, Dai et al., 2019).
- Enables diverse downstream tasks: novel-view synthesis, 3D semantic segmentation, controllable scene recomposition, 2D–3D annotation conversion (Zhan et al., 2020, Jevtić et al., 8 Jul 2025).
Current limitations:
- Static-scene assumption is predominant; performance degrades with dynamic objects, yielding ambiguous completions (Han et al., 2024, Jevtić et al., 8 Jul 2025).
- Semantic completion is bottlenecked by quality of 2D segmenters or the discriminability of 2D pretrained features (e.g., DINO).
- Multi-view consistency losses require reasonably accurate camera pose estimation (typically available via SLAM).
- For SG-NN, only geometric completion is addressed; appearance and semantic inference remain future work (Dai et al., 2019).
- Resolution of semantic voxel predictions is often lower than geometric ones due to computational and signal constraints (Jevtić et al., 8 Jul 2025).
Future research directions:
- Improved 2D feature/semantic extractors (e.g., adopting masked-autoencoder pretraining in SceneDINO) (Jevtić et al., 8 Jul 2025).
- Explicit dynamic scene and motion modeling; handling transparency or specular surfaces beyond Lambertian assumptions (Hayler et al., 2023, Han et al., 2024).
- Large-scale unsupervised learning from Internet-scale videos and generalization to more complex scenes and sensor modalities (Jevtić et al., 8 Jul 2025).
- Joint completion and semantics with sparse generative architectures, and generalization to non-Euclidean representations (Dai et al., 2019).
Self-supervised scene completion thus constitutes a technically robust and rapidly advancing research area, synthesizing advances in neural fields, sparse generative modeling, self-supervised representation learning, and unsupervised semantic grouping to close the gap toward annotation-free 3D understanding and manipulation.