AmodalGen3D: Generative Amodal 3D Reconstruction
- AmodalGen3D is a generative framework that combines 2D amodal completion priors with explicit multi-view stereo constraints for high-fidelity, occlusion-aware 3D reconstruction.
- It leverages specialized cross-attention modules to fuse view-specific features and geometry tokens, ensuring consistent synthesis of both visible and occluded object regions.
- Quantitative evaluations show improved FID, MMD, and CLIP-Score metrics, demonstrating its practical relevance for robotics, AR/VR, and embodied AI applications.
AmodalGen3D is a generative amodal 3D object reconstruction framework designed to infer complete, occlusion-free geometry and appearance from highly sparse, unposed, and partially occluded views. It fuses data-driven 2D amodal completion priors with explicit multi-view stereo (MVS) geometric constraints using specialized cross-attention modules, producing consistent reconstructions of both visible and unobserved object regions. Addressing the longstanding challenge of amodal reconstruction in scenarios where substantial surfaces are never directly observed, AmodalGen3D demonstrates state-of-the-art performance under real-world occlusion and view sparsity, with particular relevance to robotics, AR/VR, and embodied AI applications (Zhou et al., 26 Nov 2025).
1. Generative Model Formulation
AmodalGen3D adopts a conditional generative flow-matching framework for high-fidelity amodal object reconstruction from sparse inputs. Let $x_0$ denote the data latent (extracted from a pretrained SLAT Transformer), and let $\epsilon \sim \mathcal{N}(0, I)$ be Gaussian noise. Latent interpolation follows $x_t = (1 - t)\,x_0 + t\,\epsilon$. The vector field $v_\theta$ is learned such that time-reversed integration from $t = 1$ to $t = 0$ maps from noise to the data latent. Training minimizes the conditional flow-matching loss $\mathcal{L}_{\mathrm{CFM}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\big[\lVert v_\theta(x_t, t, c) - (\epsilon - x_0) \rVert_2^2\big]$, where $c$ denotes the conditioning signals. All input-view and partial-geometry conditioning is implicit in $v_\theta$'s computation graph. This mechanism enables the model to synthesize 3D geometry and appearance distributionally consistent with the supplied partial views, while hallucinating plausible content for unobserved or occluded regions (Zhou et al., 26 Nov 2025).
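A minimal NumPy sketch of the flow-matching mechanics described above, assuming the common rectified-flow convention (data at $t=0$, noise at $t=1$, target velocity $\epsilon - x_0$); function names are illustrative, not from the paper:

```python
import numpy as np

def interpolate(x0, eps, t):
    """Linear latent interpolation x_t = (1 - t) * x0 + t * eps."""
    return (1.0 - t) * x0 + t * eps

def cfm_loss(v_pred, x0, eps):
    """Conditional flow-matching loss: MSE against target velocity eps - x0."""
    target = eps - x0
    return np.mean((v_pred - target) ** 2)

def euler_sample(v_field, eps, n_steps=8):
    """Time-reversed Euler integration from noise (t=1) back to data (t=0)."""
    x = eps.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * dt
        x = x - dt * v_field(x, t)  # step against the learned velocity
    return x

# Toy check: with the exact (constant) velocity field v(x, t) = eps - x0,
# time-reversed integration recovers the data latent from pure noise.
rng = np.random.default_rng(0)
x0 = rng.normal(size=(4,))
eps = rng.normal(size=(4,))
x_rec = euler_sample(lambda x, t: eps - x0, eps, n_steps=8)
```

In the real model, `v_field` is the conditioned transformer and the latent lives in the pretrained SLAT space rather than a 4-vector.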
2. Architecture and Attention Mechanisms
AmodalGen3D integrates two core modules for view-feature fusion and geometry-guided inference:
- View-Wise Cross Attention (VW-CA): Unlike conventional cross-attention, which alternates focus across views and may introduce perspective bias, VW-CA attends to all inpainted 2D views in parallel. DINO features $F_i$ are extracted per view, and for each view $i$ a cross-attention output $A_i = \mathrm{CrossAttn}(Q, F_i)$ is computed. Weighted averaging, with each view's weight derived from its proportion of visible ($M_i^{\mathrm{vis}}$) and occluded ($M_i^{\mathrm{occ}}$) mask area, fuses these attention outputs into $A = \sum_i w_i A_i$. A final LayerNorm ensures stability.
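A minimal NumPy sketch of parallel per-view cross-attention with mask-proportion weighted fusion; the exact weighting formula is an assumption for illustration, and all function names are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    """Single-head cross-attention: queries attend to one view's features."""
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def view_wise_cross_attention(queries, view_feats, vis_ratio, occ_ratio):
    """Attend to every inpainted view in parallel, then fuse the per-view
    outputs with weights from each view's visible/occluded mask area.
    (Illustrative weighting; the paper's exact formula may differ.)"""
    outs = [cross_attention(queries, f, f) for f in view_feats]
    w = np.array(vis_ratio) / (np.array(vis_ratio) + np.array(occ_ratio))
    w = w / w.sum()                      # normalize view weights
    fused = sum(wi * oi for wi, oi in zip(w, outs))
    # LayerNorm over the feature dimension for stability
    mu = fused.mean(-1, keepdims=True)
    sig = fused.std(-1, keepdims=True)
    return (fused - mu) / (sig + 1e-6)

rng = np.random.default_rng(1)
queries = rng.normal(size=(5, 8))                       # 5 latent tokens
view_feats = [rng.normal(size=(6, 8)) for _ in range(3)]  # 3 views, 6 patches
out = view_wise_cross_attention(queries, view_feats,
                                vis_ratio=[0.9, 0.5, 0.7],
                                occ_ratio=[0.1, 0.5, 0.3])
```

Heavily occluded views (low visible ratio) contribute less to the fused output, which is the intent of the mask-weighted design.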
- Stereo-Conditioned Cross Attention (SCCA): To encode explicit 3D geometric context, an off-the-shelf MVS model generates per-view point clouds $P_i$. Points projecting into the object's visibility mask are voxelized, encoded by a sparse-VAE, patchified, and projected to create geometry tokens $G$. A gating vector $g = \mathrm{MLP}(G)$ is used to modulate the attention logits before the softmax. This structure-aware weighting enables the network to focus on spatially confident features when inferring occluded structure (Zhou et al., 26 Nov 2025).
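A toy NumPy sketch of gated geometry attention, assuming an additive per-token gate on the logits (one plausible reading of "modulate attention logits"; the paper's exact gating may differ):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_geometry_attention(q, geo_tokens, gate):
    """Cross-attention over MVS geometry tokens with an additive per-token
    gate on the logits. gate[j] near zero leaves token j untouched; a very
    negative gate[j] effectively masks out an unreliable token."""
    d = q.shape[-1]
    logits = q @ geo_tokens.T / np.sqrt(d) + gate[None, :]
    attn = softmax(logits)
    return attn @ geo_tokens, attn

rng = np.random.default_rng(2)
q = rng.normal(size=(4, 8))          # 4 latent queries
geo = rng.normal(size=(6, 8))        # 6 geometry tokens from the sparse-VAE
gate = np.zeros(6)
gate[3] = -1e9                       # suppress a spatially unreliable token
out, attn = gated_geometry_attention(q, geo, gate)
```

In the real module the gate is predicted by an MLP rather than set by hand, but the mechanism is the same: low-confidence geometry contributes negligible attention mass.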
3. Data Conditioning Pipeline
The conditioning sequence in AmodalGen3D involves:
- 2D Amodal Completion: Input images and masks are processed via a 2D amodal inpainting model (OAAC [Ao et al. 2025]), yielding occlusion-free estimates and associated visibility/occlusion masks for each view.
- View Feature Extraction: Completed images are encoded by DINO for use in VW-CA.
- MVS Geometry Extraction: The original (possibly occluded) images are fed to the MVS pipeline, producing partial point clouds strictly within visible regions. These are voxelized, encoded, and tokenized for SCCA. During inference on scene-level inputs, object-level point clouds can be extracted via segmentation and their masks (Zhou et al., 26 Nov 2025).
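The three conditioning stages above can be sketched as a toy orchestration in NumPy. Every stage here is a deliberately simplified stand-in for the real component (OAAC inpainter, DINO encoder, MVS network); all function names are placeholders, not the paper's API:

```python
import numpy as np

def amodal_inpaint(img, mask):
    """Stand-in for 2D amodal completion: fill occluded pixels with the
    mean of visible pixels, and return visibility/occlusion masks."""
    out = img.copy()
    out[mask == 0] = img[mask == 1].mean()
    return out, mask, 1 - mask

def encode_view(img):
    """Stand-in for the DINO encoder: flatten into a feature vector."""
    return img.reshape(-1)

def mvs_points(img, vis_mask):
    """Stand-in for MVS: one pseudo-point (x, y, value) per visible pixel,
    so geometry exists strictly within visible regions."""
    ys, xs = np.nonzero(vis_mask)
    return np.stack([xs, ys, img[ys, xs]], axis=-1)

def build_conditioning(images, masks):
    """Run the full conditioning pipeline over all input views."""
    views, point_clouds = [], []
    for img, m in zip(images, masks):
        completed, vis, occ = amodal_inpaint(img, m)   # 2D amodal completion
        views.append((encode_view(completed), vis.mean(), occ.mean()))
        point_clouds.append(mvs_points(img, vis))      # MVS on ORIGINAL views
    return views, point_clouds

img = np.arange(16.0).reshape(4, 4)
mask = np.zeros((4, 4), dtype=int)
mask[:, :2] = 1                       # left half visible, right half occluded
views, pcs = build_conditioning([img], [mask])
```

Note the asymmetry the pipeline requires: the completed images feed VW-CA, while the original (still-occluded) images feed the MVS branch so its point clouds stay within actually observed surfaces.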
4. Training Regimen and Loss Functions
AmodalGen3D is trained end-to-end using only the conditional flow-matching loss, with no adversarial or explicit reconstruction penalties. The transformer backbone and attention mechanisms are sufficient to capture the visible-to-unobserved region relationships necessary for high-fidelity amodal prediction.
Training details:
- Datasets: Meshes from 3D-FUTURE and ABO, converted to watertight manifolds. Occlusion patterns are simulated by diffusing face seeds until 20–60% of the surface is masked. Objects are rendered from about 90 views, with 1–4 views randomly sampled per batch.
- MVS Processing: The model generates per-view point clouds, which are normalized and voxelized before tokenization.
- Optimization: AdamW with default weight decay, batch size 32 across 8× NVIDIA RTX 6000 GPUs, 12 epochs (16 h runtime). Classifier-free guidance is employed with a conditioning dropout rate of 0.1.
- Augmentations: Erosion and distortion of masks to diversify occlusion patterns (Zhou et al., 26 Nov 2025).
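The occlusion simulation above (diffusing face seeds until 20–60% of the surface is masked) can be illustrated with a toy region-growing pass over a face-adjacency graph; this is a simplified stand-in, not the paper's exact diffusion procedure:

```python
import numpy as np
from collections import deque

def grow_occlusion(adjacency, target_frac, rng):
    """Grow an occlusion mask from a random seed face by breadth-first
    diffusion over the face-adjacency graph until target_frac of the
    faces are masked (toy stand-in for face-seed diffusion)."""
    n = len(adjacency)
    target = max(1, int(round(target_frac * n)))
    seed = int(rng.integers(n))
    masked, queue = {seed}, deque([seed])
    while queue and len(masked) < target:
        face = queue.popleft()
        for nb in adjacency[face]:
            if nb not in masked and len(masked) < target:
                masked.add(nb)
                queue.append(nb)
    return masked

# Toy mesh: a 1D strip of 100 faces, face i adjacent to i-1 and i+1.
n = 100
adj = [[j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)]
rng = np.random.default_rng(3)
frac = rng.uniform(0.2, 0.6)          # 20-60% of the surface, as in training
masked = grow_occlusion(adj, frac, rng)
```

On a real mesh, `adjacency` would come from shared edges between triangles, and the grown region defines which faces are hidden in the simulated occluded renders.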
5. Quantitative and Qualitative Evaluation
AmodalGen3D is evaluated primarily on the GSO test set under heavy occlusion and varying view counts. Key metrics:
- FID (Fréchet Inception Distance, ↓): Assesses distributional similarity to ground truth renders (8 views per object).
- KID (Kernel Inception Distance, ↓): Alternative to FID.
- CLIP-Score (↑): Mean cosine similarity between CLIP embeddings of generated and inpainted input images.
- MMD (Minimum Matching Distance, ↓) and COV (Coverage, ↑): Chamfer distance-based point cloud comparison.
Performance summary with 1, 2, and 4 views (selected metrics):

| #Views | FID ↓ | KID ↓ | CLIP ↑ | MMD ↓ | COV ↑ |
|:------:|:-----:|:-----:|:------:|:-----:|:-----:|
| 1 | 33.91 | 0.46% | 82.23% | 5.68‰ | 39.19% |
| 2 | 32.12 | 0.45% | 82.33% | 5.50‰ | 39.08% |
| 4 | 30.73 | 0.43% | 82.53% | 5.48‰ | 39.48% |
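The Chamfer-based point-cloud metrics (MMD and COV) can be sketched as follows, using the standard definitions from the point-cloud generation literature; this is an illustrative implementation, not the paper's evaluation code:

```python
import numpy as np

def chamfer(a, b):
    """Symmetric Chamfer distance between point sets a (N,3) and b (M,3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def mmd_cov(generated, references):
    """MMD (lower is better): for each reference shape, the Chamfer
    distance to its closest generated shape, averaged over references.
    COV (higher is better): fraction of references that are the nearest
    neighbor of at least one generated shape."""
    D = np.array([[chamfer(g, r) for r in references] for g in generated])
    mmd = D.min(axis=0).mean()
    cov = len(set(D.argmin(axis=1))) / len(references)
    return mmd, cov

rng = np.random.default_rng(4)
refs = [rng.normal(size=(32, 3)) for _ in range(3)]
gens = [r + 0.01 * rng.normal(size=(32, 3)) for r in refs]  # near-copies
mmd, cov = mmd_cov(gens, refs)
```

Since each generated cloud is a slightly perturbed copy of one reference, every reference is matched (COV = 1.0) and MMD is small; mode collapse would instead show as high COV loss even with low per-shape distance.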
AmodalGen3D outperforms or closely matches state-of-the-art baselines (TRELLIS, FreeSplatter, Amodal3R) across all metrics, and demonstrates monotonic improvement as additional views are provided. The system remains robust up to 20 input views (FID drops to 29.43) and is agnostic to the specific 2D completion model used (OAAC, pix2gestalt, Flux).
Ablation studies quantify module necessity:
- Gating-MLP removal degrades FID and MMD.
- VW-CA removal degrades FID.
- SCCA removal degrades FID and MMD.
- Removing both attention mechanisms degrades FID.
Qualitatively, AmodalGen3D produces coherent, watertight meshes and plausible hallucination of occluded regions, where prior methods yield incomplete or geometrically inconsistent results, both in synthetic datasets and real-world imagery (Mip-NeRF 360, COCO, Hypersim) (Zhou et al., 26 Nov 2025).
6. Applications and Comparative Context
AmodalGen3D’s ability to synthesize complete 3D object representations from sparsely sampled, heavily occluded views addresses a core problem in robotics (where objects are often only partially seen), AR/VR asset generation, and embodied AI environments.
Comparable generative amodal reconstruction systems include:
- Amodal3R: Utilizes mask-weighted and occlusion-aware cross-attention in a 3D latent diffusion backbone, operating directly in 3D latent space and outperforming two-stage 2D-to-3D pipelines in occlusion-aware scenarios (Wu et al., 17 Mar 2025).
- ARM: Incorporates stability and connectivity shape priors for physically grounded, amodal mesh reconstruction, directly targeting deformable and manipulation-critical robotic applications (Agnew et al., 2020).
AmodalGen3D distinguishes itself by its explicit fusion of 2D amodal priors and multi-view stereo geometry via cross-attention, and by its robustness to severe real-world occlusion, lack of camera pose information, and arbitrary input view selection (Zhou et al., 26 Nov 2025).
7. Limitations and Future Directions
The model’s reliance on learned generative priors can result in hallucinated geometry that may not correspond exactly to ground truth under extreme view sparsity or weak prior distributions. MVS-based geometric conditioning, while powerful, is susceptible to noise and misalignment, though the model leverages its learned multimodal prior for partial self-correction.
Identified directions for enhancement include integration of stronger 2D amodal completion methods, refinement of stereo geometric modules to better handle noisy or ambiguous scenes, and extension to full-scene or dynamic-object settings, with emphasis on advanced occlusion reasoning and multi-object coherence (Zhou et al., 26 Nov 2025).