Deep Local Shapes Reconstruction
- DeepLS is a deep shape representation method that partitions 3D scenes into local voxels, each encoded by an independent latent code with a shared MLP decoder.
- The approach achieves high-fidelity surface reconstructions from partial data, outperforming global latent methods like DeepSDF in both efficiency and accuracy.
- DeepLS enables rapid scene encoding and scalable optimization, with quantitative results showing significant improvements in metrics such as Chamfer Distance and RMSE.
Deep Local Shapes (DeepLS) is a deep shape representation approach for high-fidelity 3D surface reconstruction that encodes local signed distance functions (SDFs) in a memory-efficient manner, enabling detailed reconstructions of complex scenes and objects. Unlike methods such as DeepSDF, which rely on a single global latent code per object, DeepLS partitions the scene into local regions, each represented by an independent latent code, and employs a shared multilayer perceptron (MLP) decoder. This decomposition enables scalable and efficient learning of local SDF priors for dense 3D reconstruction from partial observations and limited training data (Chabra et al., 2020).
1. Local SDF Representation
DeepLS models the SDF of a scene as a set of locally defined, continuous SDFs, each parameterized by a local latent code. Formally, let $f_\theta$ denote a shared MLP (decoder) with weights $\theta$. For a voxel (local region) $V_i$ of side length $L$ centered at $x_i$, the local latent code $z_i$ encodes the shape of the surface inside $V_i$. Given a query point $x \in \mathbb{R}^3$, DeepLS maps $x$ to the local coordinate frame:

$$\tilde{x} = x - x_i.$$

The local SDF in voxel $V_i$ is computed as

$$\mathrm{SDF}_i(x) = f_\theta(z_i,\, x - x_i).$$

The global SDF field is assembled from the local SDFs: each query point is evaluated in the voxel covering it,

$$\mathrm{SDF}(x) = f_\theta\big(z_{i(x)},\, x - x_{i(x)}\big),$$

where $i(x)$ denotes the index of the voxel containing $x$. The reconstructed surface is the zero level set:

$$\mathcal{S} = \{\, x \in \mathbb{R}^3 \,:\, \mathrm{SDF}(x) = 0 \,\}.$$
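The mapping from world coordinates to a voxel's local frame can be sketched as follows; the voxel side length `L` and the grid layout here are illustrative assumptions, not values fixed by the paper:

```python
import numpy as np

L = 0.04  # hypothetical voxel side length in meters (4 cm)

def voxel_index(x, L=L):
    """Integer grid coordinates of the voxel containing point x."""
    return np.floor(x / L).astype(int)

def voxel_center(idx, L=L):
    """World-space center x_i of the voxel with grid index idx."""
    return (idx + 0.5) * L

def to_local(x, idx, L=L):
    """Map a world point into the local frame of voxel idx: x - x_i."""
    return x - voxel_center(idx, L)

x = np.array([0.103, -0.021, 0.058])
idx = voxel_index(x)
x_local = to_local(x, idx)
# The local coordinate always lies within the voxel: |x_local| <= L/2.
assert np.all(np.abs(x_local) <= L / 2 + 1e-12)
```

The shared decoder then consumes `x_local` together with the code of voxel `idx`, so every voxel sees coordinates in the same small range regardless of its position in the scene.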
2. Network Architecture
The DeepLS decoder is a fully connected MLP with four layers, each hidden layer having 128 units and LeakyReLU activation functions. The input layer concatenates the 3D local point $\tilde{x}$ and the 128-dimensional local shape code $z_i$, resulting in 131 input dimensions. The final output passes through a $\tanh$ nonlinearity and is scaled to fit the SDF truncation range. In practice, the latent code is linearly embedded or concatenated at the network's first layer.
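A minimal NumPy sketch of such a decoder follows; the layer sizes, LeakyReLU activations, and $\tanh$ output scaling follow the description above, while the weight values and the truncation distance `TRUNC` are random/assumed placeholders that trained parameters would replace:

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(a, slope=0.01):
    return np.where(a > 0, a, slope * a)

# Four fully connected layers: 131 -> 128 -> 128 -> 128 -> 1.
sizes = [131, 128, 128, 128, 1]
params = [(rng.standard_normal((m, n)) * 0.05, np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

TRUNC = 0.1  # hypothetical SDF truncation distance

def decoder(z, x_local):
    """f_theta: concatenate code and local point, run the MLP,
    squash with tanh, and scale into the truncation range."""
    h = np.concatenate([z, x_local])  # 128 + 3 = 131 input dims
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:       # no activation on the output layer
            h = leaky_relu(h)
    return TRUNC * np.tanh(h[0])

z = rng.standard_normal(128)
s = decoder(z, np.array([0.01, -0.005, 0.0]))
assert -TRUNC <= s <= TRUNC
```

The $\tanh$ scaling guarantees every prediction stays inside the truncation band, which matches how truncated SDFs are typically stored.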
3. Scene Decomposition and Local Regions
DeepLS partitions the scene space into a regular, sparse grid of voxels (side length typically a few centimeters, up to $8$ cm). Latent codes are only allocated to voxels near the observed surface, determined via depth map rasterization or occupancy grid techniques. Each code $z_i$ is responsible for all sample points within an $L_\infty$ distance $R > L/2$ from the voxel center $x_i$, effectively extending the voxel's receptive field to ensure border consistency between adjacent local SDFs. This design enables spatial overlap and local shape sharing, simplifying the learning task for the decoder.
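The receptive-field test can be sketched as below; the extension factor of 1.5 over the voxel half-width is an illustrative assumption, not a value taken from the paper:

```python
import numpy as np

L = 0.04           # hypothetical voxel side length (4 cm)
R = 1.5 * (L / 2)  # extended receptive-field half-width (assumed factor)

def in_receptive_field(x, center, R=R):
    """True if x lies in the extended receptive field of the voxel
    centered at `center`, measured in the L-infinity norm."""
    return np.max(np.abs(x - center)) <= R

center = np.zeros(3)
near = np.array([0.025, 0.0, -0.01])  # outside the voxel, inside the field
far  = np.array([0.035, 0.0, 0.0])    # outside both
assert in_receptive_field(near, center) and not in_receptive_field(far, center)
```

Because `R` exceeds `L / 2`, points near a voxel border fall into the receptive fields of several neighboring voxels, so their codes are trained on shared samples.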
4. Training Objective
Given a dataset of training pairs $(x, s)$, with $s$ the ground-truth signed distance at $x$, each pair is associated with all receptive fields that contain it. For voxel $V_i$, define $\mathcal{X}_i$ as the subset of points within its receptive field. The loss function for training is

$$\mathcal{L}\big(\theta, \{z_i\}\big) = \sum_i \Big[ \sum_{(x, s) \in \mathcal{X}_i} \big\lVert f_\theta(z_i,\, x - x_i) - s \big\rVert + \sigma \lVert z_i \rVert_2^2 \Big],$$

an SDF regression term plus a Gaussian prior regularization on the codes ($\sigma$ typically set to $0.01$).
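The per-voxel objective can be sketched as follows; the toy decoder and the choice of an L1 data term are illustrative stand-ins for the learned $f_\theta$:

```python
import numpy as np

sigma = 0.01  # Gaussian-prior weight from the text

def voxel_loss(decoder, z, samples, sigma=sigma):
    """Per-voxel objective: L1 SDF regression over the receptive-field
    samples plus the Gaussian prior on the code, sigma * ||z||^2."""
    data = sum(abs(decoder(z, x_local) - s) for x_local, s in samples)
    return data + sigma * np.dot(z, z)

# Toy decoder: SDF of the plane z=0, ignoring the code (illustrative only).
toy = lambda z, x: float(x[2])
samples = [(np.array([0.0, 0.0, 0.01]), 0.01),
           (np.array([0.0, 0.0, -0.02]), -0.02)]
assert voxel_loss(toy, np.zeros(4), samples) == 0.0
```

During training this loss is minimized jointly over the shared decoder weights and all codes; the prior term keeps each code close to the origin of the latent space.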
5. Inference and Scene Encoding
To encode new observations, DeepLS fixes the learned decoder weights $\theta$ and optimizes only the local codes $z_i$:

$$\hat{z}_i = \arg\min_{z_i} \sum_{(x, s) \in \mathcal{X}_i} \big\lVert f_\theta(z_i,\, x - x_i) - s \big\rVert + \sigma \lVert z_i \rVert_2^2.$$

The optimization for each $z_i$ is independent and highly parallelizable. After convergence, the global SDF is evaluated by querying the shared decoder with the per-voxel codes, and the surface is extracted using the Marching Cubes algorithm in a narrow band near observed points.
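To make the per-voxel code optimization concrete, here is a toy version in which the frozen decoder is linear in the code, so plain gradient descent on the regularized squared error recovers it; the linear decoder, learning rate, and step count are all illustrative assumptions (the paper uses Adam with the learned MLP):

```python
import numpy as np

def infer_code(features, targets, sigma=0.01, lr=0.01, steps=500):
    """Optimize a single local code with the decoder frozen.
    Toy linear decoder f(z, x) = z . phi(x) and a squared-error data
    term, so the objective is convex and gradient descent converges."""
    z = np.zeros(features.shape[1])
    for _ in range(steps):
        resid = features @ z - targets          # f(z, x) - s per sample
        grad = features.T @ resid + sigma * z   # data gradient + code prior
        z -= lr * grad
    return z

rng = np.random.default_rng(1)
phi = rng.standard_normal((50, 8))  # phi(x) for 50 receptive-field samples
z_true = rng.standard_normal(8)
s = phi @ z_true                    # noise-free SDF observations
z_hat = infer_code(phi, s)
assert np.allclose(z_hat, z_true, atol=0.05)
```

Because each voxel's objective depends only on its own code, thousands of such optimizations can run in parallel as one batched problem, which is what makes scene encoding fast.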
6. Quantitative Evaluation
DeepLS achieves significant improvements in both efficiency and reconstruction fidelity compared to alternative methods. The following summarizes key results:
| Dataset / Task | Metric / Value | Reference |
|---|---|---|
| 3D Warehouse (object-level) | Chamfer Distance: DeepLS 0.03, DeepSDF 0.20 | (Chabra et al., 2020) |
| Stanford Bunny Efficiency | Full detail in 1 min (RMSE 0.03%); DeepSDF 8 days for same accuracy | (Chabra et al., 2020) |
| ICL-NUIM (synthetic scene) | Asymmetric Chamfer: TSDF fusion 5.42 mm; DeepLS 4.92 mm (higher completeness at fixed accuracy) | (Chabra et al., 2020) |
| 3D Scene Dataset (real scans) | Completion (error 7 mm): TSDF (84–91%); DeepLS (88–99%); Error: TSDF (10–14 mm), DeepLS (6–10 mm) | (Chabra et al., 2020) |
DeepLS uses roughly 50,000 decoder parameters and 312,000 total latent-code dimensions for the 3D Warehouse experiments. At inference, a shape can be encoded in approximately one minute for 10,000 local codes using parallel Adam optimization.
7. Implementation and Memory Aspects
Meshes are preprocessed by sampling points near the surface and uniformly in space, following the DeepSDF convention. Point sets from depth scans are augmented by sampling along estimated normals (positive/negative SDF) and free-space points along camera rays, weighted by inverse depth. DeepLS fits comfortably on modern GPUs: for 50,000 voxels with 128D codes, memory usage is roughly 25 MB. Training for 1,000 shapes requires about 12 hours on a single GPU. At test time, all local codes for an entire scene are typically optimized within one minute, leveraging parallel code inference. The extended receptive field, which shares training samples between neighboring voxels, ensures border consistency between overlapping local SDFs and eliminates the need for explicit blending mechanisms.
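The quoted memory figure follows from simple arithmetic, assuming the codes are stored as 32-bit floats:

```python
# Back-of-envelope memory for the local latent codes:
# 50,000 voxels x 128-dim codes x 4 bytes (float32).
voxels, code_dim, bytes_per_float = 50_000, 128, 4
total_mb = voxels * code_dim * bytes_per_float / 1e6
assert abs(total_mb - 25.6) < 1e-9  # roughly 25 MB
```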
By balancing a small shared decoder with a large set of independent local latent codes, DeepLS exhibits high reconstruction fidelity, rapid scene encoding, and broad generalization, combining the advantages of DeepSDF’s learned priors with the scalability and efficiency benefits of sparse local representations (Chabra et al., 2020).