Semantic TSDF: 3D Geometric-Semantic Fusion

Updated 3 February 2026
  • Semantic TSDF is a volumetric representation that jointly encodes truncated signed distances and semantic class probabilities for 3D scenes.
  • It fuses multi-sensor data using learned confidence weights, enabling efficient denoising, inpainting, and watertight per-class mesh extraction.
  • Adaptive extensions improve large-scale mapping by balancing detail and completeness, resulting in photometrically consistent reconstructions with higher semantic accuracy.

A Semantic Truncated Signed Distance Function (Semantic TSDF) is a volumetric scene representation that extends the classical truncated signed distance field concept to jointly encode both geometric and semantic information. By associating every volumetric grid element (voxel) with not only a signed distance to the nearest surface but also a vector of semantic class probabilities, the semantic TSDF provides an efficient, implicit basis for 3D semantic reconstruction, enabling robust volumetric fusion, denoising, inpainting, and watertight per-class mesh extraction for complex, multi-class environments (Rozumnyi et al., 2019, Hu et al., 2022).

1. Mathematical Formulation and Data Structures

Let $S \subset \mathbb{R}^3$ denote the unknown true scene surface. For a spatial location $x \in \mathbb{R}^3$, the classical signed distance function is

$$\phi(x) = \pm \min_{y \in S} \|x - y\|_2$$

with the sign chosen so that $\phi(x) > 0$ outside the surface and $\phi(x) < 0$ inside.

The truncated signed distance function (TSDF) clamps this distance to $[-\tau, +\tau]$ (where $\tau$ is a small constant, e.g., a few voxel lengths):

$$f(x) = \operatorname{clamp}\left( \phi(x), -\tau, +\tau \right)$$

A semantic TSDF volume augments each voxel at location $x$ with:

  • $f(x) \in [-\tau, \tau]$: truncated signed distance
  • $w(x) \in \mathbb{R}_+$: fusion confidence weight
  • $p(x) = [p_1(x), \ldots, p_L(x)] \in [0,1]^L$: class probability vector, with $\sum_{l=1}^L p_l(x) = 1$
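
The per-voxel fields above can be laid out as dense arrays; a minimal NumPy sketch (the dict layout and initialization values are illustrative assumptions, not a published implementation):

```python
import numpy as np

def make_semantic_tsdf_volume(shape, num_classes, tau):
    """Allocate the per-voxel fields of a semantic TSDF volume."""
    return {
        # truncated signed distance f(x), initialized to +tau (free space)
        "f": np.full(shape, tau, dtype=np.float32),
        # fusion confidence weight w(x), zero until the first observation
        "w": np.zeros(shape, dtype=np.float32),
        # class probabilities p(x), initialized to a uniform prior
        "p": np.full(tuple(shape) + (num_classes,), 1.0 / num_classes,
                     dtype=np.float32),
    }
```

In practice, large scenes use sparse structures (voxel hashing, octrees) rather than a single dense grid.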

Fusion with multiple sensors or segmentation sources further incorporates their per-voxel confidence weights $c_i(x)$, yielding the following update rules for observation $i$ and voxel $x$:

$$\begin{align*} w_{\rm new}(x) &= w_{\rm old}(x) + c_i(x) \\ f_{\rm new}(x) &= \frac{ w_{\rm old}(x) f_{\rm old}(x) + c_i(x) f^{\rm (meas)}_i(x) }{ w_{\rm new}(x) } \\ p_{\rm new}(x) &= \frac{ w_{\rm old}(x) p_{\rm old}(x) + c_i(x) e_{\ell_i(x)} }{ w_{\rm new}(x) } \end{align*}$$

where $f^{\rm (meas)}_i(x)$ is the TSDF measured by observation $i$, $e_{\ell_i(x)}$ is the one-hot vector of the observed semantic class $\ell_i(x)$, and $L$ is the number of semantic classes (Rozumnyi et al., 2019).
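
These update rules form a confidence-weighted running average; a small NumPy sketch (function and variable names are ours):

```python
import numpy as np

def fuse_observation(f, w, p, f_meas, c, label, num_classes, eps=1e-8):
    """Apply one semantic-TSDF fusion step for observation i at voxel(s) x.

    f, w, f_meas, c: arrays of matching shape; p: (..., L) class probabilities;
    label: integer semantic class l_i(x) of the observation.
    """
    e = np.zeros(num_classes, dtype=np.float32)
    e[label] = 1.0                                  # one-hot vector e_{l_i(x)}
    w_new = w + c                                   # accumulate confidence
    f_new = (w * f + c * f_meas) / np.maximum(w_new, eps)
    p_new = (w[..., None] * p + c[..., None] * e) / np.maximum(w_new[..., None], eps)
    return f_new, w_new, p_new
```

Because both $f$ and $p$ are normalized by the same accumulated weight, repeated observations converge toward confidence-weighted means and the class vector remains a valid probability distribution.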

2. Fusion with Multiple Sensors and Confidence Estimation

Semantic TSDF frameworks account for multi-sensor heterogeneity by learning per-sensor, per-voxel confidences $c_i(x)$. These confidences are not fixed but predicted using a multilayer perceptron (MLP) that processes a 13-dimensional feature vector $v_i(x)$ extracted from the 2D sensor data around the back-projected voxel position. Features include aggregated depth patches, local intensity gradients (texture), and, for stereo, normalized cross-correlation statistics (Rozumnyi et al., 2019).

The MLP $g_{\theta_i}$ trained per sensor yields:

$$c_i(x) = g_{\theta_i}(v_i(x))$$

These confidences modulate the impact of each sensor's observation on the fused TSDF and semantic distributions, down-weighting unreliable measurements due to surface orientation, occlusion, or low texture.
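
A sketch of such a confidence predictor (the two-layer architecture and sigmoid output are our assumptions; only the 13-dimensional input comes from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_confidence_mlp(in_dim=13, hidden=32):
    """Random parameters theta_i for a tiny per-sensor MLP g_theta_i."""
    return {
        "W1": 0.1 * rng.standard_normal((in_dim, hidden)), "b1": np.zeros(hidden),
        "W2": 0.1 * rng.standard_normal((hidden, 1)), "b2": np.zeros(1),
    }

def confidence(theta, v):
    """c_i(x) = g_theta_i(v_i(x)); the sigmoid keeps c in (0, 1)."""
    h = np.maximum(v @ theta["W1"] + theta["b1"], 0.0)      # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(h @ theta["W2"] + theta["b2"])))
```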

3. Variational Inference and End-to-End Learning

Semantic TSDF approaches incorporate a global inference procedure where the goal is to predict, for every $x$, a semantic labeling $u(x) \in \Delta^{L+1}$ (the probability simplex over the $L$ semantic labels plus a free-space class). The optimization minimizes an energy functional:

$$\min_{u(x) \ge 0,\; \sum_\ell u_\ell(x) = 1} \int_\Omega \left[ \|W u(x)\|_2 + \sum_s (c_s \odot f_s)(x) \cdot u(x) \right] dx$$

where $W$ is a learned local convolutional operator imposing a regularization prior on semantic-label transitions (e.g., penalizing improbable class adjacencies), $c_s$ are the per-sensor confidences, and $f_s$ the per-sensor TSDFs.

This variational problem is solved efficiently by unrolling a fixed number of Chambolle–Pock-style primal-dual optimization iterations as differentiable network layers, ensuring end-to-end trainability (Rozumnyi et al., 2019).
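
One building block of such primal-dual layers is the projection that enforces the simplex constraints $u \ge 0$, $\sum_\ell u_\ell = 1$ after each primal step. A minimal NumPy sketch of the standard sort-based Euclidean projection (the dual update for the regularizer $W$ is omitted here):

```python
import numpy as np

def project_simplex(u):
    """Project each row of u onto the probability simplex {u >= 0, sum(u) = 1}."""
    s = np.sort(u, axis=-1)[..., ::-1]            # sort descending
    css = np.cumsum(s, axis=-1) - 1.0             # cumulative sums minus target
    idx = np.arange(1, u.shape[-1] + 1)
    rho = (s - css / idx > 0).sum(axis=-1)        # number of active coordinates
    theta = np.take_along_axis(css, rho[..., None] - 1, axis=-1) / rho[..., None]
    return np.maximum(u - theta, 0.0)             # shift and clip
```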

Learning proceeds by minimizing a cross-entropy loss between the predicted $u(x)$ and the ground truth, computed separately for occupied ("semantic") and free-space voxels.

4. Semantic Fusion, Completion, and Per-Class Surface Extraction

The semantic TSDF framework supports semantic fusion from multi-view segmentations by projecting each 2D segmentation into the 3D volume, incrementally updating the semantic class distribution $p(x)$ per voxel, and performing volumetric smoothing via the global variational formulation.

Crucially, this approach enables denoising (removal of outlier voxels), completion (closing of holes arising from missing depth data), and geometric regularization (suppression of local artifacts, as enforced by semantic priors and learned regularizers) (Rozumnyi et al., 2019). For each class $l$, the zero-isosurface of $f(x)$ restricted to label $l$ yields a watertight per-class mesh via Marching Cubes.
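
Per-class extraction can be implemented by masking the fused TSDF before running Marching Cubes; a sketch of the masking step in NumPy (the actual meshing would call, e.g., skimage.measure.marching_cubes at level 0 on each masked volume):

```python
import numpy as np

def per_class_tsdf(f, p, tau):
    """Split a fused TSDF into one masked TSDF per semantic class.

    f: (X, Y, Z) TSDF; p: (X, Y, Z, L) class probabilities.
    Voxels whose argmax label is not l are pushed to +tau (free space),
    so the zero level set of each returned volume bounds only class-l
    geometry, yielding a watertight per-class surface.
    """
    labels = p.argmax(axis=-1)
    return {l: np.where(labels == l, f, tau) for l in range(p.shape[-1])}
```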

5. Large-Scale and Adaptive Extensions

The semantic TSDF has been adapted for outdoor mapping from sparse LiDAR (as opposed to dense RGB-D) via adaptive selection of the TSDF truncation band $\varepsilon$ per voxel block, balancing detail preservation against completeness in the face of varying point densities (Hu et al., 2022).

Adaptive truncation determines $\varepsilon$ from local plane statistics:

$$\varepsilon = \max\left(\varepsilon_{\min}, \frac{k\, P_{\mathrm{flat}}}{n}\, \varepsilon_{\max}\right)$$

where $n$ is the number of points in the block and $P_{\mathrm{flat}}$ is a flatness score derived from the block covariance. This ensures robust fusion across heterogeneous environments, as verified by improved completeness and geometric fidelity in automotive-scale scenes (Hu et al., 2022).
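
A sketch of this rule (the exact form of $P_{\mathrm{flat}}$ is not spelled out above; here it is assumed to be the smallest covariance eigenvalue divided by the eigenvalue sum, so planar blocks get a tight band):

```python
import numpy as np

def adaptive_truncation(points, eps_min, eps_max, k=1.0):
    """Choose the TSDF truncation band for one voxel block of LiDAR points.

    points: (n, 3) array. Flat, densely sampled blocks get eps close to
    eps_min (fine detail); sparse or non-planar blocks get a wider band.
    The flatness score used here is an assumption for illustration.
    """
    n = len(points)
    cov = np.cov(points.T)                          # 3x3 block covariance
    eig = np.sort(np.linalg.eigvalsh(cov))
    p_flat = eig[0] / max(float(eig.sum()), 1e-12)  # assumed P_flat in [0, 1/3]
    return max(eps_min, k * p_flat / n * eps_max)
```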

Surface extraction, optimal image-patch selection for geometry texturing, and Markov random field–based per-face semantic fusion yield photometrically consistent, seam-reduced, and densely labeled reconstructions suitable for high-definition mapping and simulation content generation.

6. Experimental Results and Benchmarks

Empirical evaluation demonstrates that semantic TSDF methods yield improved semantic accuracy (SA) and true-positive rates over standard TSDF fusion, especially in multi-sensor and denoising/completion settings. For instance, fusing depth from Kinect and stereo with learned per-sensor weights and a variational prior achieves an SA of 0.786, versus 0.71–0.77 for single sensors, on the SUNCG dataset; the upper bound with ground-truth data is ≈0.80 (Rozumnyi et al., 2019). On ScanNet, geometric and semantic metrics improve substantially, with the fused approach reducing surface errors and increasing semantic completeness.

Large-scale, adaptive semantic TSDF has been validated on automotive datasets such as KITTI and on real vehicle captures, with experiments showing that a variable $\varepsilon$ recovers fine structure (road markings, façades) missed by fixed-band approaches, and that Markov random field fusion for texture and semantics yields fewer visible seams and more robust class labeling (Hu et al., 2022).

7. Limitations and Prospective Directions

Semantic TSDF representations rely on a trade-off between memory/storage and geometric detail, with reconstruction resolution ultimately limited by sensor noise, pose accuracy, and the chosen grid granularity. Adapting voxel size per class or region, leveraging more sophisticated semantic priors (e.g., CRF or neural regularizers), and integrating detected scene regularities (such as planar walls or floors, as in FAWN (Sokolova et al., 2024)), represent prospective directions.

Other promising extensions include semantic mesh-based HD map extraction, re-synthesis of training data for novel viewpoints, and optimizing the efficiency of large-scale graph-based label inference, especially for automotive and robotics scenarios where scalability and throughput are essential (Hu et al., 2022, Rozumnyi et al., 2019).
