Deep Local Shapes Reconstruction
- DeepLS is a deep shape representation method that partitions 3D scenes into local voxels, each encoded by an independent latent code with a shared MLP decoder.
- The approach achieves high-fidelity surface reconstructions from partial data, outperforming global latent methods like DeepSDF in both efficiency and accuracy.
- DeepLS enables rapid scene encoding and scalable optimization, with quantitative results showing significant improvements in metrics such as Chamfer Distance and RMSE.
Deep Local Shapes (DeepLS) is a deep shape representation approach for high-fidelity 3D surface reconstruction that encodes local signed distance functions (SDFs) in a memory-efficient manner, enabling detailed reconstructions of complex scenes and objects. Unlike methods such as DeepSDF, which rely on a single global latent code per object, DeepLS partitions the scene into local regions, each represented by an independent latent code, and employs a shared multilayer perceptron (MLP) decoder. This decomposition enables scalable and efficient learning of local SDF priors for dense 3D reconstruction from partial observations and limited training data (Chabra et al., 2020).
1. Local SDF Representation
DeepLS models the SDF of a scene as a set of locally defined, continuous SDFs, each parameterized by a local latent code. Formally, let $f_\theta$ denote a shared MLP (decoder) with weights $\theta$. For a voxel (local region) $V_i$ of side length $L$ centered at $x_i$, the local latent code $z_i$ encodes the shape of the surface inside $V_i$. Given a query point $x \in \mathbb{R}^3$, DeepLS maps $x$ to the local coordinate frame:

$$\tilde{x} = x - x_i.$$

The local SDF in voxel $V_i$ is computed as

$$\mathrm{SDF}_i(x) = f_\theta(z_i,\, x - x_i).$$

The global SDF field is assembled from the local SDFs: each query point is evaluated in the voxel covering it,

$$\mathrm{SDF}(x) = f_\theta\big(z_{i(x)},\, x - x_{i(x)}\big),$$

where $i(x)$ denotes the index of the voxel containing $x$. The reconstructed surface is the zero level set:

$$\mathcal{S} = \{\, x \in \mathbb{R}^3 \,:\, \mathrm{SDF}(x) = 0 \,\}.$$
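The mapping from world coordinates to a voxel's local frame can be sketched as follows; the voxel side length `L` and the grid layout here are illustrative assumptions, not values fixed by the paper:

```python
import numpy as np

L = 0.04  # hypothetical voxel side length in meters (4 cm)

def voxel_index(x, L=L):
    """Integer grid coordinates of the voxel containing point x."""
    return np.floor(x / L).astype(int)

def voxel_center(idx, L=L):
    """World-space center x_i of the voxel with grid index idx."""
    return (idx + 0.5) * L

def to_local(x, idx, L=L):
    """Map a world point into the local frame of voxel idx: x - x_i."""
    return x - voxel_center(idx, L)

x = np.array([0.103, -0.021, 0.058])
idx = voxel_index(x)
x_local = to_local(x, idx)
# The local coordinate always lies within the voxel: |x_local| <= L/2.
assert np.all(np.abs(x_local) <= L / 2 + 1e-12)
```

The shared decoder then consumes `x_local` together with the code of voxel `idx`, so every voxel sees coordinates in the same small range regardless of its position in the scene.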
2. Network Architecture
The DeepLS decoder is a fully connected MLP with four layers, each hidden layer having 128 units and LeakyReLU activation functions. The input layer concatenates the 3D local point $\tilde{x}$ and the 128-dimensional local shape code $z_i$, resulting in 131 input dimensions. The final output passes through a $\tanh$ nonlinearity and is scaled to fit the SDF truncation range. In practice, the latent code is linearly embedded or concatenated at the network's first layer.
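A minimal NumPy sketch of such a decoder follows; the layer sizes, LeakyReLU activations, and $\tanh$ output scaling follow the description above, while the weight values and the truncation distance `TRUNC` are random/assumed placeholders that trained parameters would replace:

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(a, slope=0.01):
    return np.where(a > 0, a, slope * a)

# Four fully connected layers: 131 -> 128 -> 128 -> 128 -> 1.
sizes = [131, 128, 128, 128, 1]
params = [(rng.standard_normal((m, n)) * 0.05, np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

TRUNC = 0.1  # hypothetical SDF truncation distance

def decoder(z, x_local):
    """f_theta: concatenate code and local point, run the MLP,
    squash with tanh, and scale into the truncation range."""
    h = np.concatenate([z, x_local])  # 128 + 3 = 131 input dims
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:       # no activation on the output layer
            h = leaky_relu(h)
    return TRUNC * np.tanh(h[0])

z = rng.standard_normal(128)
s = decoder(z, np.array([0.01, -0.005, 0.0]))
assert -TRUNC <= s <= TRUNC
```

The $\tanh$ scaling guarantees every prediction stays inside the truncation band, which matches how truncated SDFs are typically stored.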
3. Scene Decomposition and Local Regions
DeepLS partitions the scene space into a regular, sparse grid of voxels (side length typically a few centimeters, up to $8$ cm). Latent codes are only allocated to voxels near the observed surface, determined via depth map rasterization or occupancy grid techniques. Each code $z_i$ is responsible for all sample points within an $L_\infty$ distance $R > L/2$ from the voxel center $x_i$, effectively extending the voxel's receptive field to ensure border consistency between adjacent local SDFs. This design enables spatial overlap and local shape sharing, simplifying the learning task for the decoder.
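The receptive-field test can be sketched as below; the extension factor of 1.5 over the voxel half-width is an illustrative assumption, not a value taken from the paper:

```python
import numpy as np

L = 0.04           # hypothetical voxel side length (4 cm)
R = 1.5 * (L / 2)  # extended receptive-field half-width (assumed factor)

def in_receptive_field(x, center, R=R):
    """True if x lies in the extended receptive field of the voxel
    centered at `center`, measured in the L-infinity norm."""
    return np.max(np.abs(x - center)) <= R

center = np.zeros(3)
near = np.array([0.025, 0.0, -0.01])  # outside the voxel, inside the field
far  = np.array([0.035, 0.0, 0.0])    # outside both
assert in_receptive_field(near, center) and not in_receptive_field(far, center)
```

Because `R` exceeds `L / 2`, points near a voxel border fall into the receptive fields of several neighboring voxels, so their codes are trained on shared samples.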
4. Training Objective
Given a dataset of training pairs $(x, s)$, with $s$ the ground-truth signed distance at $x$, each pair is associated with all receptive fields that contain it. For voxel $V_i$, define $\mathcal{X}_i$ as the subset of points within its receptive field. The loss function for training is

$$\mathcal{L}\big(\theta, \{z_i\}\big) = \sum_i \Big[ \sum_{(x, s) \in \mathcal{X}_i} \big\lVert f_\theta(z_i,\, x - x_i) - s \big\rVert + \sigma \lVert z_i \rVert_2^2 \Big],$$

an SDF regression term plus a Gaussian prior regularization on the codes ($\sigma$ typically set to $0.01$).
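The per-voxel objective can be sketched as follows; the toy decoder and the choice of an L1 data term are illustrative stand-ins for the learned $f_\theta$:

```python
import numpy as np

sigma = 0.01  # Gaussian-prior weight from the text

def voxel_loss(decoder, z, samples, sigma=sigma):
    """Per-voxel objective: L1 SDF regression over the receptive-field
    samples plus the Gaussian prior on the code, sigma * ||z||^2."""
    data = sum(abs(decoder(z, x_local) - s) for x_local, s in samples)
    return data + sigma * np.dot(z, z)

# Toy decoder: SDF of the plane z=0, ignoring the code (illustrative only).
toy = lambda z, x: float(x[2])
samples = [(np.array([0.0, 0.0, 0.01]), 0.01),
           (np.array([0.0, 0.0, -0.02]), -0.02)]
assert voxel_loss(toy, np.zeros(4), samples) == 0.0
```

During training this loss is minimized jointly over the shared decoder weights and all codes; the prior term keeps each code close to the origin of the latent space.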
5. Inference and Scene Encoding
To encode new observations, DeepLS fixes the learned decoder weights $\theta$ and optimizes only the local codes $z_i$:

$$\hat{z}_i = \arg\min_{z_i} \sum_{(x, s) \in \mathcal{X}_i} \big\lVert f_\theta(z_i,\, x - x_i) - s \big\rVert + \sigma \lVert z_i \rVert_2^2.$$

The optimization for each $z_i$ is independent and highly parallelizable. After convergence, the global SDF is evaluated by querying the shared decoder with the per-voxel codes, and the surface is extracted using the Marching Cubes algorithm in a narrow band near observed points.
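To make the per-voxel code optimization concrete, here is a toy version in which the frozen decoder is linear in the code, so plain gradient descent on the regularized squared error recovers it; the linear decoder, learning rate, and step count are all illustrative assumptions (the paper uses Adam with the learned MLP):

```python
import numpy as np

def infer_code(features, targets, sigma=0.01, lr=0.01, steps=500):
    """Optimize a single local code with the decoder frozen.
    Toy linear decoder f(z, x) = z . phi(x) and a squared-error data
    term, so the objective is convex and gradient descent converges."""
    z = np.zeros(features.shape[1])
    for _ in range(steps):
        resid = features @ z - targets          # f(z, x) - s per sample
        grad = features.T @ resid + sigma * z   # data gradient + code prior
        z -= lr * grad
    return z

rng = np.random.default_rng(1)
phi = rng.standard_normal((50, 8))  # phi(x) for 50 receptive-field samples
z_true = rng.standard_normal(8)
s = phi @ z_true                    # noise-free SDF observations
z_hat = infer_code(phi, s)
assert np.allclose(z_hat, z_true, atol=0.05)
```

Because each voxel's objective depends only on its own code, thousands of such optimizations can run in parallel as one batched problem, which is what makes scene encoding fast.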
6. Quantitative Evaluation
DeepLS achieves significant improvements in both efficiency and reconstruction fidelity compared to alternative methods. The following summarizes key results:
| Dataset / Task | Metric / Value | Reference |
|---|---|---|
| 3D Warehouse (object-level) | Chamfer Distance: DeepLS 0.03, DeepSDF 0.20 | (Chabra et al., 2020) |
| Stanford Bunny Efficiency | Full detail in 1 min (RMSE 0.03%); DeepSDF 8 days for same accuracy | (Chabra et al., 2020) |
| ICL-NUIM (synthetic scene) | Asymmetric Chamfer: TSDF fusion 5.42 mm; DeepLS 4.92 mm (higher completeness at fixed accuracy) | (Chabra et al., 2020) |
| 3D Scene Dataset (real scans) | Completion (error 7 mm): TSDF (84–91%); DeepLS (88–99%); Error: TSDF (10–14 mm), DeepLS (6–10 mm) | (Chabra et al., 2020) |
DeepLS uses roughly 50,000 decoder parameters and 312,000 total latent-code dimensions for the 3D Warehouse experiments. At inference, a shape can be encoded in approximately one minute for 10,000 local codes using parallel Adam optimization.
7. Implementation and Memory Aspects
Meshes are preprocessed by sampling points near the surface and uniformly in space, following the DeepSDF convention. Point sets from depth scans are augmented by sampling along estimated normals (positive/negative SDF) and free-space points along camera rays, weighted by inverse depth. DeepLS fits comfortably on modern GPUs: for 50,000 voxels with 128D codes, memory usage is roughly 25 MB. Training for 1,000 shapes requires about 12 hours on a single GPU. At test time, all local codes for an entire scene are typically optimized within one minute, leveraging parallel code inference. The extended receptive field, which shares training samples between neighboring voxels, ensures border consistency between overlapping local SDFs and eliminates the need for explicit blending mechanisms.
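The quoted memory figure follows from simple arithmetic, assuming the codes are stored as 32-bit floats:

```python
# Back-of-envelope memory for the local latent codes:
# 50,000 voxels x 128-dim codes x 4 bytes (float32).
voxels, code_dim, bytes_per_float = 50_000, 128, 4
total_mb = voxels * code_dim * bytes_per_float / 1e6
assert abs(total_mb - 25.6) < 1e-9  # roughly 25 MB
```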
By balancing a small shared decoder with a large set of independent local latent codes, DeepLS exhibits high reconstruction fidelity, rapid scene encoding, and broad generalization, combining the advantages of DeepSDF’s learned priors with the scalability and efficiency benefits of sparse local representations (Chabra et al., 2020).