Semantic TSDF: 3D Geometric-Semantic Fusion

Updated 3 February 2026
  • Semantic TSDF is a volumetric representation that jointly encodes truncated signed distances and semantic class probabilities for 3D scenes.
  • It fuses multi-sensor data using learned confidence weights, enabling efficient denoising, inpainting, and watertight per-class mesh extraction.
  • Adaptive extensions improve large-scale mapping by balancing detail and completeness, resulting in photometrically consistent reconstructions with higher semantic accuracy.

A Semantic Truncated Signed Distance Function (Semantic TSDF) is a volumetric scene representation that extends the classical truncated signed distance field concept to jointly encode both geometric and semantic information. By associating every volumetric grid element (voxel) with not only a signed distance to the nearest surface but also a vector of semantic class probabilities, the semantic TSDF provides an efficient, implicit basis for 3D semantic reconstruction, enabling robust volumetric fusion, denoising, inpainting, and watertight per-class mesh extraction for complex, multi-class environments (Rozumnyi et al., 2019, Hu et al., 2022).

1. Mathematical Formulation and Data Structures

Let $S \subset \mathbb{R}^3$ denote the unknown true scene surface. For a spatial location $x \in \mathbb{R}^3$, the classical signed distance function is

$$\phi(x) = \pm \min_{y \in S} \|x - y\|_2$$

with the sign chosen so that $\phi(x) > 0$ outside the surface and $\phi(x) < 0$ inside.

The truncated signed distance function (TSDF) clamps this distance to $[-\tau, +\tau]$ (where $\tau$ is a small constant, e.g., a few voxel lengths):

$$f(x) = \operatorname{clamp}\left( \phi(x), -\tau, +\tau \right)$$

A semantic TSDF volume augments each voxel at location $x$ with:

  • $f(x) \in [-\tau, \tau]$: truncated signed distance
  • $w(x) \in \mathbb{R}_+$: fusion confidence weight
  • $p(x) = [p_1(x), \ldots, p_L(x)] \in [0,1]^L$: class probability vector, with $\sum_{l=1}^L p_l(x) = 1$
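
The per-voxel fields above can be laid out as dense arrays; a minimal NumPy sketch (the dict layout and initialization values are illustrative assumptions, not a published implementation):

```python
import numpy as np

def make_semantic_tsdf_volume(shape, num_classes, tau):
    """Allocate the per-voxel fields of a semantic TSDF volume."""
    return {
        # truncated signed distance f(x), initialized to +tau (free space)
        "f": np.full(shape, tau, dtype=np.float32),
        # fusion confidence weight w(x), zero until the first observation
        "w": np.zeros(shape, dtype=np.float32),
        # class probabilities p(x), initialized to a uniform prior
        "p": np.full(tuple(shape) + (num_classes,), 1.0 / num_classes,
                     dtype=np.float32),
    }
```

In practice, large scenes use sparse structures (voxel hashing, octrees) rather than a single dense grid.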

Fusion with multiple sensors or segmentation sources further incorporates their per-voxel confidence weights $c_i(x)$, yielding the following update rules for observation $i$ and voxel $x$:

$$\begin{align*} w_{\rm new}(x) &= w_{\rm old}(x) + c_i(x) \\ f_{\rm new}(x) &= \frac{ w_{\rm old}(x) f_{\rm old}(x) + c_i(x) f^{\rm (meas)}_i(x) }{ w_{\rm new}(x) } \\ p_{\rm new}(x) &= \frac{ w_{\rm old}(x) p_{\rm old}(x) + c_i(x) e_{\ell_i(x)} }{ w_{\rm new}(x) } \end{align*}$$

where $f^{\rm (meas)}_i(x)$ is the TSDF measured by observation $i$, $e_{\ell_i(x)}$ is the one-hot vector of the observed semantic class $\ell_i(x)$, and $L$ is the number of semantic classes (Rozumnyi et al., 2019).
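
These update rules form a confidence-weighted running average; a small NumPy sketch (function and variable names are ours):

```python
import numpy as np

def fuse_observation(f, w, p, f_meas, c, label, num_classes, eps=1e-8):
    """Apply one semantic-TSDF fusion step for observation i at voxel(s) x.

    f, w, f_meas, c: arrays of matching shape; p: (..., L) class probabilities;
    label: integer semantic class l_i(x) of the observation.
    """
    e = np.zeros(num_classes, dtype=np.float32)
    e[label] = 1.0                                  # one-hot vector e_{l_i(x)}
    w_new = w + c                                   # accumulate confidence
    f_new = (w * f + c * f_meas) / np.maximum(w_new, eps)
    p_new = (w[..., None] * p + c[..., None] * e) / np.maximum(w_new[..., None], eps)
    return f_new, w_new, p_new
```

Because both $f$ and $p$ are normalized by the same accumulated weight, repeated observations converge toward confidence-weighted means and the class vector remains a valid probability distribution.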

2. Fusion with Multiple Sensors and Confidence Estimation

Semantic TSDF frameworks account for multi-sensor heterogeneity by learning per-sensor, per-voxel confidences $c_i(x)$. These confidences are not fixed but predicted using a multilayer perceptron (MLP) that processes a 13-dimensional feature vector $v_i(x)$ extracted from the 2D sensor data around the back-projected voxel position. Features include aggregated depth patches, local intensity gradients (texture), and, for stereo, normalized cross-correlation statistics (Rozumnyi et al., 2019).

The MLP $g_{\theta_i}$ trained per sensor yields:

$$c_i(x) = g_{\theta_i}(v_i(x))$$

These confidences modulate the impact of each sensor's observation on the fused TSDF and semantic distributions, down-weighting unreliable measurements due to surface orientation, occlusion, or low texture.
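
A sketch of such a confidence predictor (the two-layer architecture and sigmoid output are our assumptions; only the 13-dimensional input comes from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_confidence_mlp(in_dim=13, hidden=32):
    """Random parameters theta_i for a tiny per-sensor MLP g_theta_i."""
    return {
        "W1": 0.1 * rng.standard_normal((in_dim, hidden)), "b1": np.zeros(hidden),
        "W2": 0.1 * rng.standard_normal((hidden, 1)), "b2": np.zeros(1),
    }

def confidence(theta, v):
    """c_i(x) = g_theta_i(v_i(x)); the sigmoid keeps c in (0, 1)."""
    h = np.maximum(v @ theta["W1"] + theta["b1"], 0.0)      # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(h @ theta["W2"] + theta["b2"])))
```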

3. Variational Inference and End-to-End Learning

Semantic TSDF approaches incorporate a global inference procedure where the goal is to predict, for every $x$, a semantic labeling $u(x) \in \Delta^{L+1}$ (the probability simplex over the $L$ semantic labels plus a free-space class). The optimization minimizes an energy functional:

$$\min_{u(x) \ge 0,\; \sum_\ell u_\ell(x) = 1} \int_\Omega \left[ \|W u(x)\|_2 + \sum_s (c_s \odot f_s)(x) \cdot u(x) \right] dx$$

where $W$ is a learned local convolutional operator imposing a regularization prior on semantic-label transitions (e.g., penalizing improbable class adjacencies), $c_s$ are the per-sensor confidences, and $f_s$ the per-sensor TSDFs.

This variational problem is solved efficiently by unrolling a fixed number of Chambolle–Pock-style primal-dual optimization iterations as differentiable network layers, ensuring end-to-end trainability (Rozumnyi et al., 2019).
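
One building block of such primal-dual layers is the projection that enforces the simplex constraints $u \ge 0$, $\sum_\ell u_\ell = 1$ after each primal step. A minimal NumPy sketch of the standard sort-based Euclidean projection (the dual update for the regularizer $W$ is omitted here):

```python
import numpy as np

def project_simplex(u):
    """Project each row of u onto the probability simplex {u >= 0, sum(u) = 1}."""
    s = np.sort(u, axis=-1)[..., ::-1]            # sort descending
    css = np.cumsum(s, axis=-1) - 1.0             # cumulative sums minus target
    idx = np.arange(1, u.shape[-1] + 1)
    rho = (s - css / idx > 0).sum(axis=-1)        # number of active coordinates
    theta = np.take_along_axis(css, rho[..., None] - 1, axis=-1) / rho[..., None]
    return np.maximum(u - theta, 0.0)             # shift and clip
```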

Learning proceeds by minimizing a cross-entropy loss between the predicted $u(x)$ and the ground truth, computed separately for occupied ("semantic") and free-space voxels.

4. Semantic Fusion, Completion, and Per-Class Surface Extraction

The semantic TSDF framework supports semantic fusion from multi-view segmentations by projecting each 2D segmentation into the 3D volume, incrementally updating the semantic class distribution $p(x)$ per voxel, and performing volumetric smoothing via the global variational formulation.

Crucially, this approach enables denoising (removal of outlier voxels), completion (closing of holes arising from missing depth data), and geometric regularization (suppression of local artifacts, as enforced by semantic priors and learned regularizers) (Rozumnyi et al., 2019). For each class $l$, the zero-isosurface of $f(x)$ restricted to label $l$ yields a watertight per-class mesh via Marching Cubes.
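
Per-class extraction can be implemented by masking the fused TSDF before running Marching Cubes; a sketch of the masking step in NumPy (the actual meshing would call, e.g., skimage.measure.marching_cubes at level 0 on each masked volume):

```python
import numpy as np

def per_class_tsdf(f, p, tau):
    """Split a fused TSDF into one masked TSDF per semantic class.

    f: (X, Y, Z) TSDF; p: (X, Y, Z, L) class probabilities.
    Voxels whose argmax label is not l are pushed to +tau (free space),
    so the zero level set of each returned volume bounds only class-l
    geometry, yielding a watertight per-class surface.
    """
    labels = p.argmax(axis=-1)
    return {l: np.where(labels == l, f, tau) for l in range(p.shape[-1])}
```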

5. Large-Scale and Adaptive Extensions

The semantic TSDF has been adapted for outdoor mapping from sparse LiDAR (as opposed to dense RGB-D) via adaptive selection of the TSDF truncation band $\varepsilon$ per voxel block, balancing detail preservation against completeness in the face of varying point densities (Hu et al., 2022).

Adaptive truncation determines $\varepsilon$ from local plane statistics:

$$\varepsilon = \max\left(\varepsilon_{\min}, \frac{k\, P_{\mathrm{flat}}}{n}\, \varepsilon_{\max}\right)$$

where $n$ is the number of points in the block and $P_{\mathrm{flat}}$ is a flatness score derived from the block covariance. This ensures robust fusion across heterogeneous environments, as verified by improved completeness and geometric fidelity in automotive-scale scenes (Hu et al., 2022).
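
A sketch of this rule (the exact form of $P_{\mathrm{flat}}$ is not spelled out above; here it is assumed to be the smallest covariance eigenvalue divided by the eigenvalue sum, so planar blocks get a tight band):

```python
import numpy as np

def adaptive_truncation(points, eps_min, eps_max, k=1.0):
    """Choose the TSDF truncation band for one voxel block of LiDAR points.

    points: (n, 3) array. Flat, densely sampled blocks get eps close to
    eps_min (fine detail); sparse or non-planar blocks get a wider band.
    The flatness score used here is an assumption for illustration.
    """
    n = len(points)
    cov = np.cov(points.T)                          # 3x3 block covariance
    eig = np.sort(np.linalg.eigvalsh(cov))
    p_flat = eig[0] / max(float(eig.sum()), 1e-12)  # assumed P_flat in [0, 1/3]
    return max(eps_min, k * p_flat / n * eps_max)
```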

Surface extraction, optimal image-patch selection for geometry texturing, and Markov random field–based per-face semantic fusion yield photometrically consistent, seam-reduced, and densely labeled reconstructions suitable for high-definition mapping and simulation content generation.

6. Experimental Results and Benchmarks

Empirical evaluation demonstrates that semantic TSDF methods yield improved semantic accuracy (SA) and true-positive rates over standard TSDF fusion, especially in multi-sensor and denoising/completion settings. For instance, fusing depth from Kinect and stereo with learned per-sensor weights and a variational prior achieves an SA of 0.786, versus 0.71–0.77 for single sensors, on the SUNCG dataset; the upper bound with ground-truth data is ≈0.80 (Rozumnyi et al., 2019). On ScanNet, geometric and semantic metrics improve substantially, with the fused approach reducing surface errors and increasing semantic completeness.

Large-scale, adaptive semantic TSDF has been validated on automotive datasets such as KITTI and on real vehicle captures, with experiments showing that a variable $\varepsilon$ recovers fine structure (road markings, façades) missed by fixed-band approaches, and that Markov random field fusion for texture and semantics yields fewer visible seams and more robust class labeling (Hu et al., 2022).

7. Limitations and Prospective Directions

Semantic TSDF representations rely on a trade-off between memory/storage and geometric detail, with reconstruction resolution ultimately limited by sensor noise, pose accuracy, and the chosen grid granularity. Adapting voxel size per class or region, leveraging more sophisticated semantic priors (e.g., CRF or neural regularizers), and integrating detected scene regularities (such as planar walls or floors, as in FAWN (Sokolova et al., 2024)), represent prospective directions.

Other promising extensions include semantic mesh-based HD map extraction, re-synthesis of training data for novel viewpoints, and optimizing the efficiency of large-scale graph-based label inference, especially for automotive and robotics scenarios where scalability and throughput are essential (Hu et al., 2022, Rozumnyi et al., 2019).
