
AmodalGen3D: Generative Amodal 3D Reconstruction

Updated 5 February 2026
  • AmodalGen3D is a generative framework that combines 2D amodal completion priors with explicit multi-view stereo constraints for high-fidelity, occlusion-aware 3D reconstruction.
  • It leverages specialized cross-attention modules to fuse view-specific features and geometry tokens, ensuring consistent synthesis of both visible and occluded object regions.
  • Quantitative evaluations show improved FID, MMD, and CLIP-Score metrics, demonstrating its practical relevance for robotics, AR/VR, and embodied AI applications.

AmodalGen3D is a generative amodal 3D object reconstruction framework designed to infer complete, occlusion-free geometry and appearance from highly sparse, unposed, and partially occluded views. It fuses data-driven 2D amodal completion priors with explicit multi-view stereo (MVS) geometric constraints using specialized cross-attention modules, producing consistent reconstructions of both visible and unobserved object regions. Addressing the longstanding challenge of amodal reconstruction in scenarios where substantial surfaces are never directly observed, AmodalGen3D demonstrates state-of-the-art performance under real-world occlusion and view sparsity, with particular relevance to robotics, AR/VR, and embodied AI applications (Zhou et al., 26 Nov 2025).

1. Generative Model Formulation

AmodalGen3D adopts a conditional generative flow-matching framework for high-fidelity amodal object reconstruction from sparse inputs. Let $z_0 \sim p_0(z_0)$ denote the data latent (extracted from a pretrained SLAT transformer) and $\epsilon \sim \mathcal{N}(0, I)$ be Gaussian noise. Latent interpolation follows

$$z(t) = (1-t)\, z_0 + t\, \epsilon, \qquad t \in [0,1].$$

The vector field $v_\theta(z,t)$ is learned such that time-reversed integration maps noise back to the data latent. Training minimizes the conditional flow-matching loss

$$\mathcal{L}_\mathrm{CFM}(\theta) = \mathbb{E}_{t,\, z_0,\, \epsilon} \left\| v_\theta(z(t), t) - (\epsilon - z_0) \right\|_2^2.$$

All input-view and partial-geometry conditioning enters implicitly through $v_\theta$'s computation graph. This mechanism enables the model to synthesize 3D geometry and appearance distributionally consistent with the supplied partial views, while hallucinating plausible content for unobserved or occluded regions (Zhou et al., 26 Nov 2025).
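The interpolation and regression target can be sketched numerically. In the paper the vector field is a large conditional transformer; the toy closed-form stand-in below merely illustrates that the loss vanishes at the optimal field for a fixed $(z_0, \epsilon)$ pair:

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_loss(v_theta, z0, eps, t):
    """Conditional flow-matching loss for a single (z0, eps, t) draw.

    z(t) = (1 - t) * z0 + t * eps interpolates between the data latent
    and Gaussian noise; the regression target for the vector field is
    the constant displacement (eps - z0).
    """
    z_t = (1.0 - t) * z0 + t * eps
    target = eps - z0
    return float(np.mean((v_theta(z_t, t) - target) ** 2))

# Toy stand-in for the learned field: for one known (z0, eps) pair the
# optimal vector field is exactly (eps - z0), driving the loss to zero.
z0 = rng.standard_normal(16)   # data latent (from the SLAT encoder in the paper)
eps = rng.standard_normal(16)  # Gaussian noise
oracle = lambda z, t: eps - z0
print(cfm_loss(oracle, z0, eps, t=0.3))  # → 0.0
```

In training, $t$, $z_0$, and $\epsilon$ are resampled per batch, so the expectation is taken over all three.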

2. Architecture and Attention Mechanisms

AmodalGen3D integrates two core modules for view-feature fusion and geometry-guided inference:

  • View-Wise Cross Attention (VW-CA): Unlike conventional cross-attention, which alternates focus across views and may introduce perspective bias, VW-CA attends to all $K$ inpainted 2D views $\{i^n\}_{n=1}^{K}$ in parallel. DINO features $c^n = \mathrm{DINO}(i^n) \in \mathbb{R}^{L \times D}$ are extracted per view.

For each view:

$$z^{n\prime} = \mathrm{CA}(z, c^n), \qquad \mathrm{CA}(Q, K, V) = \mathrm{softmax}\!\left(QK^\top/\sqrt{D}\right) V$$

Weighted averaging fuses these attention outputs, with each view weighted by its proportion of visible ($m_v^n$) versus occluded ($m_o^n$) mask area:

$$\tau_n = \frac{m_v^n}{m_v^n + m_o^n}, \qquad w_n = \frac{\tau_n}{\sum_{j=1}^{K}\tau_j}, \qquad z' = \frac{1}{K}\sum_{n=1}^{K} w_n\, z^{n\prime}$$

LayerNorm ensures stability.
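A minimal single-head NumPy sketch of this mask-weighted fusion; the single head, the token shapes, and treating $z$ directly as the query are simplifying assumptions (the paper uses multi-head attention over DINO features, followed by LayerNorm):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(z, c):
    """Single-head cross attention: latent tokens z (Lq, D) query one
    view's features c (L, D); here z serves as Q and c as both K and V."""
    d = z.shape[-1]
    return softmax(z @ c.T / np.sqrt(d)) @ c

def vwca_fuse(z, view_feats, m_vis, m_occ):
    """Fuse per-view attention outputs with weights proportional to each
    view's visible fraction tau_n = m_v / (m_v + m_o), following the
    fusion rule z' = (1/K) * sum_n w_n * z'^n stated above."""
    K = len(view_feats)
    tau = np.array([m_vis[n] / (m_vis[n] + m_occ[n]) for n in range(K)])
    w = tau / tau.sum()
    outs = [cross_attention(z, view_feats[n]) for n in range(K)]
    return sum(w[n] * outs[n] for n in range(K)) / K

# Two toy views: one fully visible, one half occluded.
z = np.zeros((4, 8))
feats = [np.ones((5, 8)), np.ones((5, 8))]
out = vwca_fuse(z, feats, m_vis=[100.0, 50.0], m_occ=[0.0, 50.0])
```

Views that are mostly occluded receive proportionally less weight, so heavily inpainted (less trustworthy) views contribute less to the fused latent.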

  • Stereo-Conditioned Cross Attention (SCCA): To encode explicit 3D geometric context, an off-the-shelf MVS model generates per-view point clouds $\{p^n\}$. Points projecting into the object’s visibility mask are voxelized at resolution $\phi = 64^3$, encoded by a sparse VAE, patchified, and projected to form geometry tokens $c_\mathrm{geo}$. An MLP gating vector $g = \sigma(\mathrm{MLP}_g(c_\mathrm{geo}))$ modulates the attention logits:

$$\alpha = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{D}} + \log(g + \epsilon)\right)$$

This structure-aware weighting enables the network to focus on spatially confident features when inferring occluded structure (Zhou et al., 26 Nov 2025).
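The gated logit bias can be illustrated in a few lines; the single linear gate `W_gate` below stands in for the paper's $\mathrm{MLP}_g$, and all shapes are assumptions for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def gated_attention(Q, K, c_geo, W_gate, eps=1e-6):
    """SCCA-style attention: a sigmoid gate computed from the geometry
    tokens adds a per-key bias log(g + eps) to the logits, so keys with
    low geometric confidence (g near 0) receive vanishing attention."""
    d = Q.shape[-1]
    g = sigmoid(c_geo @ W_gate)                 # one gate value per key
    return softmax(Q @ K.T / np.sqrt(d) + np.log(g + eps))

rng = np.random.default_rng(0)
Q, K = rng.standard_normal((3, 8)), rng.standard_normal((5, 8))
c_geo, W_gate = rng.standard_normal((5, 4)), rng.standard_normal(4)
alpha = gated_attention(Q, K, c_geo, W_gate)    # rows sum to 1
```

Because the bias enters inside the softmax as $\log(g+\epsilon)$, it multiplies the attention weights by roughly $g$ before renormalization, which is what lets spatially confident geometry tokens dominate.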

3. Data Conditioning Pipeline

The conditioning sequence in AmodalGen3D involves:

  • 2D Amodal Completion: Input images and masks are processed by a 2D amodal inpainting model (OAAC [Ao et al. 2025]), yielding occlusion-free estimates $\{i^n\}$ and associated visibility/occlusion masks for each view.
  • View Feature Extraction: Completed images are encoded by DINO for use in VW-CA.
  • MVS Geometry Extraction: The original (possibly occluded) images are fed to the MVS pipeline, producing partial point clouds strictly within visible regions. These are voxelized, encoded, and tokenized for SCCA. During inference on scene-level inputs, object-level point clouds can be extracted via segmentation and their masks (Zhou et al., 26 Nov 2025).
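The three stages can be summarized as pseudocode; every helper name here (`amodal_inpaint`, `dino_encode`, `mvs_points`, `voxelize_encode`, `in_mask`) is a hypothetical stand-in for OAAC, DINO, the MVS model, the sparse-VAE encoder, and mask filtering respectively:

```python
# Pseudocode sketch (not runnable): hypothetical stand-in names only.
def build_conditioning(images, masks):
    # 1) 2D amodal completion (OAAC): occlusion-free views + vis/occ masks
    completed, m_vis, m_occ = amodal_inpaint(images, masks)
    # 2) per-view DINO features, consumed by VW-CA
    view_feats = [dino_encode(img) for img in completed]
    # 3) MVS on the *original* occluded images; keep only points that
    #    project inside the visibility mask, then voxelize (64^3) and
    #    encode into geometry tokens for SCCA
    clouds = mvs_points(images)
    clouds = [p[in_mask(p, m)] for p, m in zip(clouds, m_vis)]
    c_geo = voxelize_encode(clouds, resolution=64)
    return view_feats, (m_vis, m_occ), c_geo
```

Note that the 2D completions feed only the appearance branch (VW-CA), while the geometry branch (SCCA) is built from the unmodified inputs, so hallucinated pixels never contaminate the stereo evidence.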

4. Training Regimen and Loss Functions

AmodalGen3D is trained end-to-end using only the conditional flow-matching loss $\mathcal{L}_\mathrm{CFM}$, with no adversarial or explicit reconstruction penalties. The transformer backbone and attention mechanisms suffice to capture the relationships between visible and unobserved regions necessary for high-fidelity amodal prediction.

Training details:

  • Datasets: Meshes from 3D-FUTURE and ABO, converted to watertight manifolds. Occlusion patterns are simulated by diffusing face seeds until 20–60% of the surface is masked. Objects are rendered from about 90 views, with 1–4 views randomly sampled per batch.
  • MVS Processing: The $\pi^3$ model generates per-view point clouds, normalized and voxelized at $64^3$ resolution.
  • Optimization: AdamW with learning rate $5\times 10^{-5}$, default weight decay, batch size 32 across 8 NVIDIA RTX 6000 GPUs, 12 epochs ($\sim$16 h runtime). Classifier-free guidance is trained with a condition-dropout rate of 0.1.
  • Augmentations: Erosion and distortion of masks to diversify occlusion patterns (Zhou et al., 26 Nov 2025).
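The classifier-free guidance setup mentioned above can be sketched as follows; the sampling-time extrapolation formula is the standard CFG recipe rather than something the source states explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)

def maybe_drop_condition(cond, null_cond, p_drop=0.1):
    """CFG training: with probability p_drop the conditioning is swapped
    for a null embedding, so a single network learns both the conditional
    and the unconditional vector fields."""
    return null_cond if rng.random() < p_drop else cond

def cfg_velocity(v_cond, v_uncond, scale):
    """Standard CFG sampling step: extrapolate the conditional prediction
    away from the unconditional one by the guidance scale."""
    return v_uncond + scale * (v_cond - v_uncond)

# scale = 1 recovers the plain conditional prediction; larger scales
# push samples toward regions favored by the conditioning.
v_c, v_u = np.ones(4), np.zeros(4)
guided = cfg_velocity(v_c, v_u, scale=2.0)
```
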

5. Quantitative and Qualitative Evaluation

AmodalGen3D is evaluated primarily on the GSO test set under heavy occlusion and varying view counts. Key metrics:

  • FID (Fréchet Inception Distance, ↓): Assesses distributional similarity to ground truth renders (8 views per object).
  • KID (Kernel Inception Distance, ↓): Alternative to FID.
  • CLIP-Score (↑): Mean cosine similarity between CLIP embeddings of generated and inpainted input images.
  • MMD (Minimum Matching Distance, ↓) and COV (Coverage, ↑): Chamfer distance-based point cloud comparison.
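The Chamfer-based MMD and COV metrics can be sketched following their common point-cloud-generation definitions (the paper's exact matching protocol may differ):

```python
import numpy as np

def chamfer(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def mmd_cov(generated, references):
    """MMD: for each reference cloud, the Chamfer distance to its closest
    generated cloud, averaged (lower is better). COV: the fraction of
    reference clouds that are the nearest neighbour of at least one
    generated cloud (higher is better)."""
    D = np.array([[chamfer(g, r) for r in references] for g in generated])
    mmd = D.min(axis=0).mean()
    cov = len(set(D.argmin(axis=1))) / len(references)
    return mmd, cov

# Sanity check: identical generated and reference sets score perfectly.
rng = np.random.default_rng(0)
clouds = [rng.standard_normal((20, 3)) for _ in range(3)]
mmd, cov = mmd_cov(clouds, clouds)
```
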

Performance summary with 1, 2, and 4 input views (selected metrics):

| #Views | FID ↓ | KID ↓ | CLIP ↑ | MMD ↓ | COV ↑ |
|:------:|:-----:|:-----:|:------:|:-----:|:-----:|
| 1 | 33.91 | 0.46% | 82.23% | 5.68‰ | 39.19% |
| 2 | 32.12 | 0.45% | 82.33% | 5.50‰ | 39.08% |
| 4 | 30.73 | 0.43% | 82.53% | 5.48‰ | 39.48% |

AmodalGen3D outperforms or closely matches state-of-the-art baselines (TRELLIS, FreeSplatter, Amodal3R) across all metrics and improves monotonically as additional views are provided. The system remains robust up to 20 input views (FID drops to 29.43) and is agnostic to the choice of 2D completion model (OAAC, pix2gestalt, Flux).

Ablation studies quantify module necessity:

  • Gating-MLP removal: +0.55 FID, +0.08 MMD
  • VW-CA removal: +6.42 FID
  • SCCA removal: +10.80 FID, +0.41 MMD
  • Both attention modules removed: +14.11 FID

Qualitatively, AmodalGen3D produces coherent, watertight meshes and plausible hallucination of occluded regions, where prior methods yield incomplete or geometrically inconsistent results, both in synthetic datasets and real-world imagery (Mip-NeRF 360, COCO, Hypersim) (Zhou et al., 26 Nov 2025).

6. Applications and Comparative Context

AmodalGen3D’s ability to synthesize complete 3D object representations from sparsely sampled, heavily occluded views addresses a core problem in robotics (where objects are often only partially seen), AR/VR asset generation, and embodied AI environments.

Comparable generative amodal reconstruction systems include:

  • Amodal3R: Utilizes mask-weighted and occlusion-aware cross-attention in a 3D latent diffusion backbone, operating directly in 3D latent space and outperforming two-stage 2D-to-3D pipelines in occlusion-aware scenarios (Wu et al., 17 Mar 2025).
  • ARM: Incorporates stability and connectivity shape priors for physically grounded, amodal mesh reconstruction, directly targeting deformable and manipulation-critical robotic applications (Agnew et al., 2020).

AmodalGen3D distinguishes itself by its explicit fusion of 2D amodal priors and multi-view stereo geometry via cross-attention, and by its robustness to severe real-world occlusion, lack of camera pose information, and arbitrary input view selection (Zhou et al., 26 Nov 2025).

7. Limitations and Future Directions

The model’s reliance on learned generative priors can result in hallucinated geometry that may not correspond exactly to ground truth under extreme view sparsity or weak prior distributions. MVS-based geometric conditioning, while powerful, is susceptible to noise and misalignment, though the model leverages its learned multimodal prior for partial self-correction.

Identified directions for enhancement include integration of stronger 2D amodal completion methods, refinement of stereo geometric modules to better handle noisy or ambiguous scenes, and extension to full-scene or dynamic-object settings, with emphasis on advanced occlusion reasoning and multi-object coherence (Zhou et al., 26 Nov 2025).
