HoloPart Pipeline for 3D Amodal Segmentation
- The paper introduces a two-stage HoloPart pipeline that first segments visible 3D parts and then infers occluded geometry using latent diffusion.
- It leverages novel global and local attention mechanisms to ensure both fine local details and overall shape consistency in 3D reconstructions.
- Empirical benchmarks on ABO and PartObjaverse-Tiny show significant improvements in Chamfer Distance, IoU, and F-Score compared to prior methods.
3D part amodal segmentation aims to decompose a 3D mesh into complete, semantically meaningful parts—even those regions that are occluded in the input. The HoloPart pipeline establishes a practical two-stage approach to this problem by leveraging surface-level part segmentation and diffusion-based generative modeling for part completion, introducing new attention mechanisms and rigorously benchmarking its performance against prior methods (Yang et al., 10 Apr 2025).
1. Conceptual Foundation and Problem Definition
3D part amodal segmentation generalizes conventional part segmentation by requiring the recovery of the entirety of each semantic part, including components hidden from view or missing in the observed data. Traditional segmentation methods only provide masks for visible segments, leaving occluded regions unaccounted for. HoloPart operationalizes the generation of complete parts by introducing a pipeline that first segments visible regions and then generatively "hallucinates" the missing geometry for each part through latent diffusion modeling. The central objectives are the inference of occluded 3D geometry, preservation of global shape consistency, and robust handling of diverse object topologies with limited supervision.
2. Pipeline Architecture
The HoloPart pipeline comprises two sequential stages:
- Surface Segmentation: An off-the-shelf 3D part segmentation model (e.g., SAMPart3D) is applied to the input mesh or point cloud, producing a set of incomplete surface masks, each covering the visible fraction of one semantic part.
- Part Completion via Generative Diffusion: For each segment, the pipeline conditions on a point cloud of the entire shape and a binary mask identifying that segment. A latent diffusion model, pretrained on whole shapes and finetuned on part-whole pairs, predicts the complete 3D part by inferring its occluded geometry in a manner consistent with both local detail and global context. The completed parts can then be reassembled into a full amodal part segmentation.
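The two-stage data flow can be sketched as follows. This is an illustrative skeleton only, not the paper's implementation: `segment_surface` stands in for a real segmenter such as SAMPart3D (here faked with a nearest-center split), and `complete_part` stands in for the conditional latent diffusion model.

```python
import numpy as np

def segment_surface(points: np.ndarray, n_parts: int) -> list:
    """Stage 1 stand-in: split the visible surface into per-part boolean masks.
    A real system would use an off-the-shelf segmenter; this toy version
    assigns each point to its nearest randomly chosen center."""
    centers = points[np.random.choice(len(points), n_parts, replace=False)]
    labels = np.argmin(((points[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    return [labels == k for k in range(n_parts)]

def complete_part(points: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Stage 2 stand-in: a real system runs conditional latent diffusion on
    (whole-shape cloud, part mask); this placeholder returns visible points."""
    return points[mask]

def holopart_pipeline(points: np.ndarray, n_parts: int) -> list:
    masks = segment_surface(points, n_parts)           # stage 1: visible masks
    return [complete_part(points, m) for m in masks]   # stage 2: per-part completion
```

The key design point the skeleton shows is that stage 2 always sees the whole-shape point cloud, not just the segment, so each completion can stay consistent with global context.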
3. Model Components and Mechanisms
3.1 Variational Autoencoder (VAE) Pretraining
A VAE encoder-decoder is first trained on complete point clouds. A set of anchor points is subsampled via Farthest Point Sampling (FPS), and a latent embedding is constructed using cross-attention over positional embeddings. The decoder maps this latent, along with query points, to occupancy logits via a hybrid of cross- and self-attention operations.
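Farthest Point Sampling, used here to pick the anchor points, follows the standard greedy algorithm; a minimal NumPy sketch (the paper's exact anchor counts and seeding are not assumed):

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, n_anchors: int) -> np.ndarray:
    """Greedy FPS: repeatedly pick the point farthest from the chosen set.
    Returns indices of the selected anchor points."""
    chosen = [0]  # arbitrary seed point
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(n_anchors - 1):
        idx = int(np.argmax(dist))  # farthest from the current anchor set
        chosen.append(idx)
        # Each point keeps its distance to the *nearest* chosen anchor.
        dist = np.minimum(dist, np.linalg.norm(points - points[idx], axis=1))
    return np.array(chosen)
```

FPS gives anchors that cover the surface evenly, which is why it is the common choice for building a fixed-size token set from an arbitrary point cloud.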
3.2 Latent Diffusion and Conditioning
A linear interpolation ("rectified flow") diffusion process is defined in latent space:

$$z_t = (1 - t)\,z_0 + t\,\epsilon$$

for $t \in [0, 1]$, with $\epsilon \sim \mathcal{N}(0, I)$. A DiT-based denoiser $v_\theta$ is trained to match the reverse vector field, optionally relative to a conditioning vector $c$. The objective for pretraining is

$$\mathcal{L}_{\text{pre}} = \mathbb{E}_{z_0,\,\epsilon,\,t}\left\| v_\theta(z_t, t, c) - (\epsilon - z_0) \right\|^2.$$
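The interpolation and velocity-matching objective can be sketched as follows, using the common rectified-flow convention in which the denoiser predicts the straight-line velocity from latent to noise; the denoiser passed in below is a stand-in, not the paper's DiT:

```python
import numpy as np

def noised_latent(z0, eps, t):
    """Linear-interpolation forward process: z_t = (1 - t) * z0 + t * eps."""
    return (1.0 - t) * z0 + t * eps

def flow_matching_loss(denoiser, z0, eps, t, cond=None):
    """Velocity-matching objective: the denoiser should predict eps - z0,
    the constant velocity of the straight path from z0 to eps."""
    zt = noised_latent(z0, eps, t)
    target = eps - z0
    pred = denoiser(zt, t, cond)
    return float(np.mean((pred - target) ** 2))
```

An oracle denoiser that returns the true velocity drives the loss to zero, which is the sanity check usually run on such objectives.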
3.3 Context-Aware Part Completion
After pretraining, the denoiser is finetuned for amodal part completion. Two attention mechanisms are crucial:
- Global Shape Context Attention: For each incomplete segment, its anchor points and a point sample of the full shape, together with a binary part mask, are combined via cross-attention to produce a global context embedding.
- Local Attention Module: Local geometric cues are encoded by subsampling the segment to a set of anchor points and applying cross-attention with the full incomplete segment, producing a local context embedding.
- Both embeddings are injected at separate cross-attention layers within each DiT diffusion block.
The training objective for part completion is

$$\mathcal{L}_{\text{part}} = \mathbb{E}_{z_0,\,\epsilon,\,t}\left\| v_\theta(z_t, t, g, \ell) - (\epsilon - z_0) \right\|^2,$$

where $z_0$ encodes the complete part, $z_t$ is its noised latent, and $g$ and $\ell$ are the global and local context embeddings injected into the denoiser $v_\theta$.
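A heavily simplified sketch of how two context embeddings might be injected into a single diffusion block: single-head attention with identity projections and no learned weights, purely to illustrate the dual-conditioning pattern (names and shapes are illustrative, not the paper's architecture):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, d):
    """Single-head scaled dot-product attention with identity Q/K/V
    projections, for illustration only."""
    scores = queries @ context.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ context

def dit_block_with_dual_context(tokens, global_ctx, local_ctx):
    """One simplified diffusion block: the part tokens attend first to the
    global shape-context embedding, then to the local-detail embedding,
    each via its own cross-attention layer with a residual connection."""
    d = tokens.shape[-1]
    tokens = tokens + cross_attention(tokens, global_ctx, d)  # global context
    tokens = tokens + cross_attention(tokens, local_ctx, d)   # local context
    return tokens
```

Keeping the two conditioning paths in separate attention layers, as described above, lets the model weight global consistency and local detail independently.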
4. Data Preparation and Benchmarking
4.1 Datasets
- ABO Dataset: Includes four categories (bed, table, lamp, chair) with ground-truth part annotations, comprising 20k training parts and around 1,000 parts for evaluation.
- PartObjaverse-Tiny: Contains 200 curated shapes across eight categories, offering approximately 3,000 parts for evaluation and a training set with 160k parts.
4.2 Data Processing
Surface mask generation involves combining all parts into a watertight mesh, ray-casting to remove occluded faces, and reconstructing both the incomplete and full part meshes. Face-level part labels are assigned, and point clouds of both the incomplete segment and the complete part are sampled for training.
4.3 Baselines and Metrics
HoloPart is evaluated against PatchComplete, DiffComplete, and a finetuned VAE baseline. Metrics include Chamfer Distance (CD), Intersection over Union (IoU), F-Score, and reconstruction Success Rate. 500k points are sampled per shape for CD, and voxelized occupancy grids are used for the IoU and F-Score calculations.
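The point-set metrics can be computed with a brute-force (O(N·M)) NumPy sketch; the 0.01 F-Score threshold below is an assumed illustrative value, not the paper's setting:

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    d = np.linalg.norm(a[:, None] - b[None], axis=-1)  # (N, M) pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def f_score(a, b, tau=0.01):
    """F-Score at threshold tau: harmonic mean of precision and recall, where
    a point counts as matched if some neighbor lies within distance tau."""
    d = np.linalg.norm(a[:, None] - b[None], axis=-1)
    precision = (d.min(axis=1) < tau).mean()
    recall = (d.min(axis=0) < tau).mean()
    return 2 * precision * recall / max(precision + recall, 1e-8)
```

For large clouds (e.g., the 500k CD samples), production implementations replace the dense distance matrix with a KD-tree nearest-neighbor query, but the definitions are the same.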
4.4 Training Protocols
Diffusion is run over a fixed timestep schedule. Key hyperparameters include 512 latent tokens per part, 20,480 global shape tokens, 8 self-attention layers for the context-aware blocks, 10 DiT layers in the part diffusion U-Net (hidden size 2048), the AdamW optimizer, and a guidance scale of 3.5 (ablation candidates: 1.5, 3.5, 5, 7.5). Training is conducted on 4 RTX 4090 GPUs (ABO, 2 days) and 8 A100 GPUs (Objaverse, 4 days).
5. Algorithmic Summary and Pseudocode
The HoloPart workflow can be summarized as follows:
- Pretraining: Train the VAE encoder/decoder on whole shapes, then train the diffusion model on the resulting latents.
- Data Curation: Compile training triples of incomplete segment, whole-shape context, and complete part for each shape.
- Finetuning: For each triple, extract anchor points, compute the global and local context embeddings, encode the complete part, inject noise, and minimize the part diffusion objective.
- Inference: For each visible surface mask, compute the context embeddings, initialize the latent with noise, iteratively denoise via reverse diffusion, and decode the latent into the complete part mesh.
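The inference step above can be sketched as Euler integration of the reverse flow, assuming the rectified-flow convention; the velocity function is a placeholder for the trained, context-conditioned denoiser:

```python
import numpy as np

def sample_part_latent(velocity_fn, shape, n_steps=50, seed=0):
    """Euler integration of the reverse flow: start from Gaussian noise at
    t = 1 and step toward t = 0, subtracting the predicted velocity."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(shape)  # z_1 ~ N(0, I)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * dt
        z = z - dt * velocity_fn(z, t)  # z_{t - dt} = z_t - dt * v_theta(z_t, t)
    return z
```

The resulting latent would then be passed through the VAE decoder (queried at occupancy sample points) to reconstruct the complete part mesh.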
6. Empirical Results and Ablation Analyses
HoloPart demonstrates substantial improvements over existing approaches:
| Dataset | Method | Chamfer Distance ↓ | IoU ↑ | F-Score ↑ |
|---|---|---|---|---|
| ABO | HoloPart | 0.026 | 0.771 | 0.848 |
| ABO | Best baseline | 0.068 | 0.241 | 0.380 |
| PartObjaverse-Tiny | HoloPart | 0.034 | 0.688 | 0.801 |
| PartObjaverse-Tiny | Best baseline | 0.133 | 0.142 | 0.239 |
Ablation studies indicate that removing global context attention increases Chamfer Distance by ~40% and reduces IoU/F-Score by 10–15%, whereas omitting local attention substantially degrades reconstruction of fine details. Experiments with guidance scales show optimal reconstruction fidelity and diversity at a value of 3.5, with lower values underfitting and higher scales causing sampling failures.
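The guidance-scale ablation corresponds to the standard classifier-free guidance combination, sketched below under the usual formulation (the paper's exact guidance variant is not confirmed here):

```python
import numpy as np

def guided_velocity(v_cond, v_uncond, scale=3.5):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one. scale = 1 recovers the plain
    conditional output; larger scales push harder toward the condition."""
    return v_uncond + scale * (v_cond - v_uncond)
```

This makes the reported trade-off intuitive: small scales barely use the part condition (under-constrained completions), while very large scales extrapolate far outside the model's training distribution, which is consistent with the sampling failures observed at high values.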
Qualitative assessments show that HoloPart can successfully reconstruct thin and complex structures (e.g., lamp supports, table legs, anatomical details) that competing methods fail to recover. Failure cases are attributable to noisy or incomplete input masks.
7. Applications and Significance
HoloPart's amodal part completion enables several downstream tasks:
- Geometry Editing: Individual amodal parts can be carved, resized, or replaced within mesh editors.
- Material Assignment: Completed parts can be individually textured.
- Animation: Enables rigging and animating reconstructed parts (e.g., wheels, doors).
- Part Super-resolution: Because each part receives the same token budget that a whole shape would, the pipeline supports high-detail synthesis at the part level.
By combining pretrained 3D diffusion priors with context- and part-aware attention, HoloPart advances the state of the art in amodal 3D part segmentation, with evidence demonstrated on new benchmarks and comprehensive ablation studies. These contributions highlight the critical roles of global and local context encoding in the generative reconstruction of occluded 3D parts (Yang et al., 10 Apr 2025).