
HoloPart Pipeline for 3D Amodal Segmentation

Updated 5 February 2026
  • The paper introduces a two-stage HoloPart pipeline that first segments visible 3D parts and then infers occluded geometry using latent diffusion.
  • It leverages novel global and local attention mechanisms to ensure both fine local details and overall shape consistency in 3D reconstructions.
  • Empirical benchmarks on ABO and PartObjaverse-Tiny show significant improvements in Chamfer Distance, IoU, and F-Score compared to prior methods.

3D part amodal segmentation aims to decompose a 3D mesh into complete, semantically meaningful parts—even those regions that are occluded in the input. The HoloPart pipeline establishes a practical two-stage approach to this problem by leveraging surface-level part segmentation and diffusion-based generative modeling for part completion, introducing new attention mechanisms and rigorously benchmarking its performance against prior methods (Yang et al., 10 Apr 2025).

1. Conceptual Foundation and Problem Definition

3D part amodal segmentation generalizes conventional part segmentation by requiring the recovery of the entirety of each semantic part, including components hidden from view or missing in the observed data. Traditional segmentation methods only provide masks for visible segments, leaving occluded regions unaccounted for. HoloPart operationalizes the generation of complete parts by introducing a pipeline that first segments visible regions and then generatively "hallucinates" the missing geometry for each part through latent diffusion modeling. The central objectives are the inference of occluded 3D geometry, preservation of global shape consistency, and robust handling of diverse object topologies with limited supervision.

2. Pipeline Architecture

The HoloPart pipeline comprises two sequential stages:

  1. Surface Segmentation: An off-the-shelf 3D part segmentation model (e.g., SAMPart3D) is applied to an input mesh $M$ or point cloud $X$, resulting in a set of incomplete surface masks $\{s_i\}$, each corresponding to the visible fraction of semantic part $i$.
  2. Part Completion via Generative Diffusion: For each segment $s_i$, the pipeline conditions on the entire shape point cloud $X$ and a binary mask $M$ that identifies $s_i$. A latent diffusion model, pretrained on whole shapes and finetuned on part-whole pairs, predicts a complete 3D part $p_i$ by inferring its occluded geometry in a manner consistent with both local detail and global context. The completed semantic parts $\{p_i\}$ can then be reassembled into a full, amodal part segmentation.
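The two-stage flow can be sketched as a minimal pipeline. Both functions below are hypothetical stand-ins (not SAMPart3D or HoloPart's actual model): the segmenter just assigns random part labels, and the "completer" echoes the visible points where a real model would hallucinate occluded geometry.

```python
import numpy as np

def segment_parts(points, n_parts=3, seed=0):
    """Stand-in for an off-the-shelf segmenter such as SAMPart3D:
    returns one boolean surface mask per visible semantic part."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, n_parts, size=len(points))
    return [labels == i for i in range(n_parts)]

def complete_part(points, mask):
    """Stand-in for the diffusion-based completion stage: a real model
    would infer occluded geometry conditioned on the full shape and
    the binary mask for this segment."""
    return points[mask]  # placeholder: echo the visible fraction

# Toy input shape as a point cloud X (N x 3).
X = np.random.default_rng(1).normal(size=(1024, 3))

masks = segment_parts(X)                      # stage 1: surface masks {s_i}
parts = [complete_part(X, m) for m in masks]  # stage 2: completed parts {p_i}
```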

3. Model Components and Mechanisms

3.1 Variational Autoencoder (VAE) Pretraining

A VAE encoder-decoder is first trained on complete point clouds $X \in \mathbb{R}^{N\times 3}$. A set of anchor points $X_0$ is subsampled via Farthest Point Sampling (FPS), and an embedding $z = \mathbb{E}(X) \in \mathbb{R}^{m\times d}$ is constructed using cross-attention on positional embeddings. The decoder maps $z$, along with query points $q$, to occupancy logits via a hybrid of cross- and self-attention operations.
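FPS, used here to pick the anchor points $X_0$, greedily selects the point farthest from all points chosen so far. A straightforward numpy sketch (toy sizes, not the paper's implementation):

```python
import numpy as np

def farthest_point_sampling(points, m, seed=0):
    """Greedy FPS: repeatedly pick the point with the largest distance
    to the nearest already-selected anchor. points: (N, 3) array;
    returns m unique indices."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(points)))]
    # dist[i] = distance from point i to its nearest selected anchor
    dist = np.linalg.norm(points - points[selected[0]], axis=1)
    for _ in range(m - 1):
        idx = int(np.argmax(dist))
        selected.append(idx)
        dist = np.minimum(dist, np.linalg.norm(points - points[idx], axis=1))
    return np.array(selected)

X = np.random.default_rng(2).normal(size=(500, 3))
anchors = farthest_point_sampling(X, 64)
```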

3.2 Latent Diffusion and Conditioning

A linear interpolation diffusion process is defined in latent space:

$$z_t = \left(1 - \tfrac{t}{T}\right) z_0 + \tfrac{t}{T}\,\epsilon,$$

for $t \in \{0, \ldots, T-1\}$, with $\epsilon \sim \mathcal{N}(0, I)$. A DiT-based denoiser $v_\theta$ is trained to match the reverse vector field relative to an optional conditioning vector $g$. The objective for pretraining is

$$L_\mathrm{shape} = \mathbb{E}_{z_0, t, \epsilon}\left[\|v_\theta(z_t, t, g) - (\epsilon - z_0)\|_2^2\right].$$
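The linear schedule and its regression target can be verified numerically: $z_t$ interpolates between $z_0$ (at $t=0$) and $\epsilon$ (at $t=T$), and the denoiser's target $\epsilon - z_0$ is the constant velocity of that interpolation. A small numpy check (toy latent sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
z0 = rng.normal(size=(512, 64))   # clean latent (tokens x dim, toy sizes)
eps = rng.normal(size=z0.shape)   # Gaussian noise

def noisy_latent(t):
    """Linear interpolation schedule: z_t = (1 - t/T) z0 + (t/T) eps."""
    return (1 - t / T) * z0 + (t / T) * eps

# The regression target eps - z0 is the derivative of z_t w.r.t. t/T,
# so any two noisy latents differ by a multiple of it.
target = eps - z0
assert np.allclose(noisy_latent(600) - noisy_latent(400), (200 / T) * target)
```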

3.3 Context-Aware Part Completion

After pretraining, $v_\theta$ is finetuned for amodal part completion. Two attention mechanisms are crucial:

  • Global Shape Context Attention: For each incomplete segment $S$, anchor points $S_0$ and the full-shape point sample $X_\mathrm{full}$, along with a mask $M$, are combined via cross-attention to produce a global context embedding $c_o$.
  • Local Attention Module: Local geometric cues are encoded by subsampling $S$ to $S_0$ and applying cross-attention with the full incomplete segment $S$, producing a local context embedding $c_l$.
  • Both $c_o$ and $c_l$ are injected at separate cross-attention layers within each DiT diffusion block.
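Both embeddings come from the same cross-attention primitive, differing only in what the anchor tokens attend over. A minimal single-head sketch, with random projections standing in for learned weights and toy feature dimensions (not the model's actual sizes):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, d=32, seed=0):
    """Single-head cross-attention: query tokens attend over context
    tokens. Random projections stand in for learned weight matrices."""
    rng = np.random.default_rng(seed)
    Wq = rng.normal(size=(queries.shape[-1], d))
    Wk = rng.normal(size=(context.shape[-1], d))
    Wv = rng.normal(size=(context.shape[-1], d))
    Q, K, V = queries @ Wq, context @ Wk, context @ Wv
    A = softmax(Q @ K.T / np.sqrt(d))  # (n_queries, n_context)
    return A @ V

# Global context: segment anchors attend over the full-shape sample
# (mask folded into the features); local context: anchors attend over
# the incomplete segment itself.
S0     = np.random.default_rng(1).normal(size=(64, 4))    # anchor tokens
X_full = np.random.default_rng(2).normal(size=(2048, 4))  # shape + mask
c_o = cross_attention(S0, X_full)  # global context embedding
c_l = cross_attention(S0, S0)      # local context embedding
```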

The training objective for part completion is:

$$L_\mathrm{part} = \mathbb{E}_{z_0^p, t, \epsilon}\left[\|v_\theta(z_t^p, t, c_o, c_l) - (\epsilon - z_0^p)\|_2^2\right],$$

where $z_0^p = \mathbb{E}(K)$ encodes the complete part.

4. Data Preparation and Benchmarking

4.1 Datasets

  • ABO Dataset: Includes four categories (bed, table, lamp, chair) with ground-truth part annotations, comprising 20k training parts and around 1,000 parts for evaluation.
  • PartObjaverse-Tiny: Contains 200 curated shapes across eight categories, offering approximately 3,000 parts for evaluation and a training set with 160k parts.

4.2 Data Processing

Surface mask generation involves merging all parts into a watertight mesh, ray-casting to remove occluded faces, and reconstructing both the incomplete and full part meshes. Face-level part labels are assigned, and point clouds $S$ (incomplete segment) and $K$ (complete part) are sampled for training.

4.3 Baselines and Metrics

HoloPart is evaluated against PatchComplete, DiffComplete, and a finetuned VAE baseline. Metrics include $L_1$ Chamfer Distance (CD), Intersection over Union (IoU), F-Score, and reconstruction Success Rate. 500k points are sampled for CD, and voxel grids of $64^3$ are used for IoU and F-Score calculations.
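The metrics themselves are straightforward to implement. A sketch of $L_1$ CD between point sets and voxel-based IoU/F-Score, at toy sizes rather than the paper's 500k points and $64^3$ grids (the voxel F-Score formulation here is an assumption about how precision and recall are computed on occupancy grids):

```python
import numpy as np

def chamfer_l1(P, Q):
    """Symmetric L1 Chamfer Distance between two point sets."""
    d = np.abs(P[:, None, :] - Q[None, :, :]).sum(-1)  # pairwise L1
    return d.min(1).mean() + d.min(0).mean()

def voxel_iou_fscore(gt, pred):
    """IoU and F-Score on boolean occupancy grids."""
    inter = np.logical_and(gt, pred).sum()
    union = np.logical_or(gt, pred).sum()
    prec = inter / max(pred.sum(), 1)
    rec = inter / max(gt.sum(), 1)
    f = 2 * prec * rec / max(prec + rec, 1e-9)
    return inter / max(union, 1), f

P = np.random.default_rng(0).normal(size=(200, 3))
grid = np.zeros((8, 8, 8), dtype=bool)
grid[:4] = True
iou, f = voxel_iou_fscore(grid, grid)
```

A perfect reconstruction gives CD 0 and IoU/F-Score 1, which is why lower is better for the first column of the results table and higher for the others.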

4.4 Training Protocols

Diffusion is run for $T = 1000$ timesteps. Key hyperparameters include 512 latent tokens per part, 20,480 global shape tokens, 8 self-attention layers for context-aware blocks, 10 DiT layers in the part diffusion U-Net (hidden size 2048), a $1\mathrm{e}{-4}$ learning rate with the AdamW optimizer, and a guidance scale of 3.5 (ablation candidates: 1.5, 3.5, 5, 7.5). Training is conducted on 4 RTX 4090 GPUs (ABO, 2 days) and 8 A100 GPUs (Objaverse, 4 days).
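Collected in one place, the reported hyperparameters look as follows; the values are from the paper, but the dict layout and key names are just an illustrative convention:

```python
# HoloPart training hyperparameters as reported (key names are
# illustrative, not the authors' config schema).
config = {
    "diffusion_timesteps": 1000,
    "latent_tokens_per_part": 512,
    "global_shape_tokens": 20480,
    "context_self_attn_layers": 8,
    "dit_layers": 10,
    "hidden_size": 2048,
    "optimizer": "AdamW",
    "learning_rate": 1e-4,
    "guidance_scale": 3.5,  # chosen from {1.5, 3.5, 5, 7.5} by ablation
}
```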

5. Algorithmic Summary and Pseudocode

The HoloPart workflow can be summarized as follows:

  1. Pretraining: Train the VAE encoder/decoder $\mathbb{E}/\mathbb{D}$ on whole shapes, then train the diffusion model $v_\theta$ on latent variables $z = \mathbb{E}(X)$.
  2. Data Curation: Compile training tuples $(S, K, X, M)$ for each shape.
  3. Finetuning: For each tuple, extract anchor points, compute global and local context embeddings, encode the complete part, perform noise injection, and minimize the part diffusion objective.
  4. Inference: For each visible surface mask $s_i$, compute context, initialize with noise, iteratively denoise via reverse diffusion, and decode the latent to reconstruct the complete part mesh.
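The inference loop in step 4 reduces, for the linear schedule, to Euler integration of the reverse vector field: each step removes $1/T$ of the predicted noise direction. The sketch below substitutes an oracle for the trained denoiser $v_\theta(z_t, t, c_o, c_l)$, which under this schedule would ideally predict $\epsilon - z_0$; with that oracle, the loop recovers the clean latent exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
z0_true = rng.normal(size=(512, 64))  # latent of the (unknown) complete part
eps = rng.normal(size=z0_true.shape)

def v_oracle(z_t, t):
    """Stand-in for the trained denoiser v_theta: the ideal prediction
    under the linear schedule is the constant velocity eps - z0."""
    return eps - z0_true

# Euler integration of the reverse process: start from pure noise at
# t = T and take T steps of size 1/T against the velocity field.
z = eps.copy()
for t in range(T, 0, -1):
    z = z - (1.0 / T) * v_oracle(z, t)
```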

6. Empirical Results and Ablation Analyses

HoloPart demonstrates substantial improvements over existing approaches:

| Dataset | Chamfer Distance ↓ | IoU ↑ | F-Score ↑ |
|---|---|---|---|
| ABO (HoloPart) | 0.026 | 0.771 | 0.848 |
| ABO (best baseline) | 0.068 | 0.241 | 0.380 |
| PartObjaverse-Tiny (HoloPart) | 0.034 | 0.688 | 0.801 |
| PartObjaverse-Tiny (best baseline) | 0.133 | 0.142 | 0.239 |

Ablation studies indicate that removing global context attention increases Chamfer Distance by ~40% and reduces IoU/F-Score by 10–15%, whereas omitting local attention substantially degrades reconstruction of fine details. Experiments with guidance scales show optimal reconstruction fidelity and diversity at a value of 3.5, with lower values underfitting and higher scales causing sampling failures.

Qualitative assessments show that HoloPart can successfully reconstruct thin and complex structures (e.g., lamp supports, table legs, anatomical details) that competing methods fail to recover. Failure cases are attributable to noisy or incomplete input masks.

7. Applications and Significance

HoloPart's amodal part completion enables several downstream tasks:

  • Geometry Editing: Individual amodal parts can be carved, resized, or replaced within mesh editors.
  • Material Assignment: Completed parts can be individually textured.
  • Animation: Enables rigging and animating reconstructed parts (e.g., wheels, doors).
  • Part Super-resolution: The pipeline supports high-detail part synthesis by allocating token budget to parts as with whole shapes.

By combining pretrained 3D diffusion priors with context- and part-aware attention, HoloPart advances the state of the art in amodal 3D part segmentation, with evidence demonstrated on new benchmarks and comprehensive ablation studies. These contributions highlight the critical roles of global and local context encoding in the generative reconstruction of occluded 3D parts (Yang et al., 10 Apr 2025).
