HoloPart Pipeline for 3D Amodal Segmentation
- The paper introduces a two-stage HoloPart pipeline that first segments visible 3D parts and then infers occluded geometry using latent diffusion.
- It leverages novel global and local attention mechanisms to ensure both fine local details and overall shape consistency in 3D reconstructions.
- Empirical benchmarks on ABO and PartObjaverse-Tiny show significant improvements in Chamfer Distance, IoU, and F-Score compared to prior methods.
3D part amodal segmentation aims to decompose a 3D mesh into complete, semantically meaningful parts—even those regions that are occluded in the input. The HoloPart pipeline establishes a practical two-stage approach to this problem by leveraging surface-level part segmentation and diffusion-based generative modeling for part completion, introducing new attention mechanisms and rigorously benchmarking its performance against prior methods (Yang et al., 10 Apr 2025).
1. Conceptual Foundation and Problem Definition
3D part amodal segmentation generalizes conventional part segmentation by requiring the recovery of the entirety of each semantic part, including components hidden from view or missing in the observed data. Traditional segmentation methods only provide masks for visible segments, leaving occluded regions unaccounted for. HoloPart operationalizes the generation of complete parts by introducing a pipeline that first segments visible regions and then generatively "hallucinates" the missing geometry for each part through latent diffusion modeling. The central objectives are the inference of occluded 3D geometry, preservation of global shape consistency, and robust handling of diverse object topologies with limited supervision.
2. Pipeline Architecture
The HoloPart pipeline comprises two sequential stages:
- Surface Segmentation: An off-the-shelf 3D part segmentation model (e.g., SAMPart3D) is applied to the input mesh or point cloud, producing a set of incomplete surface masks, each covering the visible fraction of one semantic part.
- Part Completion via Generative Diffusion: For each segment, the pipeline conditions on a point cloud of the entire shape and a binary mask identifying that segment. A latent diffusion model, pretrained on whole shapes and finetuned on part-whole pairs, predicts the complete 3D part by inferring its occluded geometry in a manner consistent with both local detail and global context. The completed parts can then be reassembled into a full amodal part segmentation.
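The two-stage data flow can be sketched as follows. This is an illustrative skeleton only, not the paper's implementation: `segment_surface` stands in for a real segmenter such as SAMPart3D (here faked with a nearest-center split), and `complete_part` stands in for the conditional latent diffusion model.

```python
import numpy as np

def segment_surface(points: np.ndarray, n_parts: int) -> list:
    """Stage 1 stand-in: split the visible surface into per-part boolean masks.
    A real system would use an off-the-shelf segmenter; this toy version
    assigns each point to its nearest randomly chosen center."""
    centers = points[np.random.choice(len(points), n_parts, replace=False)]
    labels = np.argmin(((points[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    return [labels == k for k in range(n_parts)]

def complete_part(points: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Stage 2 stand-in: a real system runs conditional latent diffusion on
    (whole-shape cloud, part mask); this placeholder returns visible points."""
    return points[mask]

def holopart_pipeline(points: np.ndarray, n_parts: int) -> list:
    masks = segment_surface(points, n_parts)           # stage 1: visible masks
    return [complete_part(points, m) for m in masks]   # stage 2: per-part completion
```

The key design point the skeleton shows is that stage 2 always sees the whole-shape point cloud, not just the segment, so each completion can stay consistent with global context.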
3. Model Components and Mechanisms
3.1 Variational Autoencoder (VAE) Pretraining
A VAE encoder-decoder is first trained on complete point clouds. A set of anchor points is subsampled via Farthest Point Sampling (FPS), and a latent embedding is constructed using cross-attention over positional embeddings. The decoder maps this latent, along with query points, to occupancy logits via a hybrid of cross- and self-attention operations.
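Farthest Point Sampling, used here to pick the anchor points, follows the standard greedy algorithm; a minimal NumPy sketch (the paper's exact anchor counts and seeding are not assumed):

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, n_anchors: int) -> np.ndarray:
    """Greedy FPS: repeatedly pick the point farthest from the chosen set.
    Returns indices of the selected anchor points."""
    chosen = [0]  # arbitrary seed point
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(n_anchors - 1):
        idx = int(np.argmax(dist))  # farthest from the current anchor set
        chosen.append(idx)
        # Each point keeps its distance to the *nearest* chosen anchor.
        dist = np.minimum(dist, np.linalg.norm(points - points[idx], axis=1))
    return np.array(chosen)
```

FPS gives anchors that cover the surface evenly, which is why it is the common choice for building a fixed-size token set from an arbitrary point cloud.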
3.2 Latent Diffusion and Conditioning
A linear interpolation ("rectified flow") diffusion process is defined in latent space:

$$z_t = (1 - t)\,z_0 + t\,\epsilon$$

for $t \in [0, 1]$, with $\epsilon \sim \mathcal{N}(0, I)$. A DiT-based denoiser $v_\theta$ is trained to match the reverse vector field, optionally relative to a conditioning vector $c$. The objective for pretraining is

$$\mathcal{L}_{\text{pre}} = \mathbb{E}_{z_0,\,\epsilon,\,t}\left\| v_\theta(z_t, t, c) - (\epsilon - z_0) \right\|^2.$$
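The interpolation and velocity-matching objective can be sketched as follows, using the common rectified-flow convention in which the denoiser predicts the straight-line velocity from latent to noise; the denoiser passed in below is a stand-in, not the paper's DiT:

```python
import numpy as np

def noised_latent(z0, eps, t):
    """Linear-interpolation forward process: z_t = (1 - t) * z0 + t * eps."""
    return (1.0 - t) * z0 + t * eps

def flow_matching_loss(denoiser, z0, eps, t, cond=None):
    """Velocity-matching objective: the denoiser should predict eps - z0,
    the constant velocity of the straight path from z0 to eps."""
    zt = noised_latent(z0, eps, t)
    target = eps - z0
    pred = denoiser(zt, t, cond)
    return float(np.mean((pred - target) ** 2))
```

An oracle denoiser that returns the true velocity drives the loss to zero, which is the sanity check usually run on such objectives.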
3.3 Context-Aware Part Completion
After pretraining, the denoiser is finetuned for amodal part completion. Two attention mechanisms are crucial:
- Global Shape Context Attention: For each incomplete segment, its anchor points and a point sample of the full shape, together with a binary part mask, are combined via cross-attention to produce a global context embedding.
- Local Attention Module: Local geometric cues are encoded by subsampling the segment to a set of anchor points and applying cross-attention with the full incomplete segment, producing a local context embedding.
- Both embeddings are injected at separate cross-attention layers within each DiT diffusion block.
The training objective for part completion is

$$\mathcal{L}_{\text{part}} = \mathbb{E}_{z_0,\,\epsilon,\,t}\left\| v_\theta(z_t, t, g, \ell) - (\epsilon - z_0) \right\|^2,$$

where $z_0$ encodes the complete part, $z_t$ is its noised latent, and $g$ and $\ell$ are the global and local context embeddings injected into the denoiser $v_\theta$.
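A heavily simplified sketch of how two context embeddings might be injected into a single diffusion block: single-head attention with identity projections and no learned weights, purely to illustrate the dual-conditioning pattern (names and shapes are illustrative, not the paper's architecture):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, d):
    """Single-head scaled dot-product attention with identity Q/K/V
    projections, for illustration only."""
    scores = queries @ context.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ context

def dit_block_with_dual_context(tokens, global_ctx, local_ctx):
    """One simplified diffusion block: the part tokens attend first to the
    global shape-context embedding, then to the local-detail embedding,
    each via its own cross-attention layer with a residual connection."""
    d = tokens.shape[-1]
    tokens = tokens + cross_attention(tokens, global_ctx, d)  # global context
    tokens = tokens + cross_attention(tokens, local_ctx, d)   # local context
    return tokens
```

Keeping the two conditioning paths in separate attention layers, as described above, lets the model weight global consistency and local detail independently.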
4. Data Preparation and Benchmarking
4.1 Datasets
- ABO Dataset: Includes four categories (bed, table, lamp, chair) with ground-truth part annotations, comprising 20k training parts and around 1,000 parts for evaluation.
- PartObjaverse-Tiny: Contains 200 curated shapes across eight categories, offering approximately 3,000 parts for evaluation and a training set with 160k parts.
4.2 Data Processing
Surface mask generation involves combining all parts into a watertight mesh, ray-casting to remove occluded faces, and reconstructing both the incomplete and full part meshes. Face-level part labels are assigned, and point clouds of both the incomplete segment and the complete part are sampled for training.
4.3 Baselines and Metrics
HoloPart is evaluated against PatchComplete, DiffComplete, and a finetuned VAE baseline. Metrics include Chamfer Distance (CD), Intersection over Union (IoU), F-Score, and reconstruction Success Rate. 500k points are sampled per shape for CD, and voxelized occupancy grids are used for the IoU and F-Score calculations.
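The point-set metrics can be computed with a brute-force (O(N·M)) NumPy sketch; the 0.01 F-Score threshold below is an assumed illustrative value, not the paper's setting:

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    d = np.linalg.norm(a[:, None] - b[None], axis=-1)  # (N, M) pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def f_score(a, b, tau=0.01):
    """F-Score at threshold tau: harmonic mean of precision and recall, where
    a point counts as matched if some neighbor lies within distance tau."""
    d = np.linalg.norm(a[:, None] - b[None], axis=-1)
    precision = (d.min(axis=1) < tau).mean()
    recall = (d.min(axis=0) < tau).mean()
    return 2 * precision * recall / max(precision + recall, 1e-8)
```

For large clouds (e.g., the 500k CD samples), production implementations replace the dense distance matrix with a KD-tree nearest-neighbor query, but the definitions are the same.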
4.4 Training Protocols
Diffusion is run over a fixed timestep schedule. Key hyperparameters include 512 latent tokens per part, 20,480 global shape tokens, 8 self-attention layers for the context-aware blocks, 10 DiT layers in the part diffusion U-Net (hidden size 2048), the AdamW optimizer, and a guidance scale of 3.5 (ablation candidates: 1.5, 3.5, 5, 7.5). Training is conducted on 4 RTX 4090 GPUs (ABO, 2 days) and 8 A100 GPUs (Objaverse, 4 days).
5. Algorithmic Summary and Pseudocode
The HoloPart workflow can be summarized as follows:
- Pretraining: Train the VAE encoder/decoder on whole shapes, then train the diffusion model on the resulting latents.
- Data Curation: Compile training triples of incomplete segment, whole-shape context, and complete part for each shape.
- Finetuning: For each triple, extract anchor points, compute the global and local context embeddings, encode the complete part, inject noise, and minimize the part diffusion objective.
- Inference: For each visible surface mask, compute the context embeddings, initialize the latent with noise, iteratively denoise via reverse diffusion, and decode the latent into the complete part mesh.
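The inference step above can be sketched as Euler integration of the reverse flow, assuming the rectified-flow convention; the velocity function is a placeholder for the trained, context-conditioned denoiser:

```python
import numpy as np

def sample_part_latent(velocity_fn, shape, n_steps=50, seed=0):
    """Euler integration of the reverse flow: start from Gaussian noise at
    t = 1 and step toward t = 0, subtracting the predicted velocity."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(shape)  # z_1 ~ N(0, I)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * dt
        z = z - dt * velocity_fn(z, t)  # z_{t - dt} = z_t - dt * v_theta(z_t, t)
    return z
```

The resulting latent would then be passed through the VAE decoder (queried at occupancy sample points) to reconstruct the complete part mesh.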
6. Empirical Results and Ablation Analyses
HoloPart demonstrates substantial improvements over existing approaches:
| Dataset | Method | Chamfer Distance ↓ | IoU ↑ | F-Score ↑ |
|---|---|---|---|---|
| ABO | HoloPart | 0.026 | 0.771 | 0.848 |
| ABO | Best baseline | 0.068 | 0.241 | 0.380 |
| PartObjaverse-Tiny | HoloPart | 0.034 | 0.688 | 0.801 |
| PartObjaverse-Tiny | Best baseline | 0.133 | 0.142 | 0.239 |
Ablation studies indicate that removing global context attention increases Chamfer Distance by ~40% and reduces IoU/F-Score by 10–15%, whereas omitting local attention substantially degrades reconstruction of fine details. Experiments with guidance scales show optimal reconstruction fidelity and diversity at a value of 3.5, with lower values underfitting and higher scales causing sampling failures.
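The guidance-scale ablation corresponds to the standard classifier-free guidance combination, sketched below under the usual formulation (the paper's exact guidance variant is not confirmed here):

```python
import numpy as np

def guided_velocity(v_cond, v_uncond, scale=3.5):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one. scale = 1 recovers the plain
    conditional output; larger scales push harder toward the condition."""
    return v_uncond + scale * (v_cond - v_uncond)
```

This makes the reported trade-off intuitive: small scales barely use the part condition (under-constrained completions), while very large scales extrapolate far outside the model's training distribution, which is consistent with the sampling failures observed at high values.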
Qualitative assessments show that HoloPart can successfully reconstruct thin and complex structures (e.g., lamp supports, table legs, anatomical details) that competing methods fail to recover. Failure cases are attributable to noisy or incomplete input masks.
7. Applications and Significance
HoloPart's amodal part completion enables several downstream tasks:
- Geometry Editing: Individual amodal parts can be carved, resized, or replaced within mesh editors.
- Material Assignment: Completed parts can be individually textured.
- Animation: Enables rigging and animating reconstructed parts (e.g., wheels, doors).
- Part Super-resolution: Because each part receives the same token budget that a whole shape would, the pipeline supports high-detail synthesis at the part level.
By combining pretrained 3D diffusion priors with context- and part-aware attention, HoloPart advances the state of the art in amodal 3D part segmentation, with evidence demonstrated on new benchmarks and comprehensive ablation studies. These contributions highlight the critical roles of global and local context encoding in the generative reconstruction of occluded 3D parts (Yang et al., 10 Apr 2025).