Volumetric Diffusion Models
- Volumetric diffusion models are deep generative models that extend DDPMs to 3D data by applying a progressive, stochastic denoising process across voxel grids, meshes, and latent spaces.
- They employ specialized architectures such as 3D U-Nets, latent space diffusion, and primitive-based techniques to efficiently generate and refine high-resolution 3D structures in applications like medical imaging and computer vision.
- Despite challenges such as high computational demands and maintaining volumetric consistency, innovations such as patch-wise training and tailored conditioning strategies enable practical large-scale generation and inverse-problem applications.
Volumetric diffusion models are a class of deep generative models that extend denoising diffusion probabilistic models (DDPMs) and score-based generative models to three-dimensional (3D) data domains. They are increasingly critical in applications involving 3D generation, image-to-image translation, inverse problems, and high-fidelity scene synthesis across computer vision, graphics, and medical imaging. The hallmark of these models is propagating a stochastic, progressive denoising process through 3D volumetric representations—voxels, meshes, signed distance fields, or intermediate latent spaces—enabling plausible, high-resolution synthesis and reconstruction with robust data-driven priors.
1. Mathematical Formulation and Model Variants
Volumetric diffusion models generalize the standard DDPM framework to 3D tensors. The forward process corrupts a data tensor $x_0 \in \mathbb{R}^{C \times D \times H \times W}$ (where $C$ denotes channels, and $D$, $H$, $W$ are spatial dimensions) via a Markovian noise-injection scheme:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right),$$

with a user-defined variance schedule $\{\beta_t\}_{t=1}^{T}$. Marginally, with $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, one obtains:

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t)\,\mathbf{I}\right).$$

The reverse process is parameterized by a neural network, often a 3D U-Net (or derivative), that learns to denoise:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right),$$

with mean $\mu_\theta$ given by the DDPM noise-prediction formula. In the continuous-time formulation, the forward and reverse processes become stochastic differential equations (SDEs), with corresponding score-matching objectives.
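As a concrete illustration, the forward marginal $q(x_t \mid x_0)$ can be sampled in closed form as $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, \mathbf{I})$. The sketch below applies this to a toy voxel grid; it is a minimal NumPy illustration, and the grid size and the linear schedule values are assumptions for the example, not taken from any cited paper.

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form for a 3D volume.

    x0    : clean volume, shape (C, D, H, W)
    t     : integer timestep, 0-indexed
    betas : variance schedule, shape (T,)
    """
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]          # \bar{alpha}_t
    eps = rng.standard_normal(x0.shape)        # epsilon ~ N(0, I)
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return xt, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((1, 8, 8, 8))         # toy single-channel voxel grid
betas = np.linspace(1e-4, 0.02, 1000)          # a common linear schedule
xt, eps = forward_diffuse(x0, t=500, betas=betas, rng=rng)
print(xt.shape)  # (1, 8, 8, 8)
```

In training, a network would be asked to recover `eps` from `xt` and `t`; the same closed form is what makes per-step supervision cheap.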
Key architectural variants include:
- Direct volumetric DDPMs operating on voxel grids or label probability volumes (Ahn et al., 2024, Xing et al., 2023);
- Latent space DDPMs, where a compact volumetric or autoencoded representation is learned and diffusion is performed in that lower-dimensional space (Ntavelis et al., 2023, Zhu et al., 2023);
- Primitive-based diffusion, where diffusion is performed on a parameter set defining 3D geometric primitives (e.g., sets of oriented cubes), enabling highly efficient and expressive modeling (Chen et al., 2023).
2. Network Architectures and Conditioning Strategies
The canonical backbone for volumetric diffusion is a 3D U-Net, sometimes augmented by self- and cross-attention, hierarchical latent fusion, and explicit time-step embeddings. Memory and computational efficiency in 3D are major challenges, necessitating innovations such as:
- Patch-wise 3D training (enabling larger effective fields-of-view while controlling memory footprints) (Yoon et al., 2024);
- Lightweight "volumetric layers" (1D convs along the slice axis) grafted onto pretrained 2D backbones for hybrid 2D-3D approaches (Zhu et al., 2023, Zhu et al., 13 Jan 2025);
- Volumetric Conditioning Modules (VCM) enabling spatially controllable conditional generation and multimodal conditioning via asymmetric shallow 3D U-Nets "on top" of a frozen pretrained backbone (Ahn et al., 2024).
Conditioning information (e.g., segmentation masks, partial observations, keypoints) may be presented as additional channels, FiLM-style modulators, or concatenated as input to fusion modules, facilitating powerful, controllable conditional generative modeling in 3D.
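As one illustration of FiLM-style modulation, a condition embedding can be projected to per-channel scale and shift parameters that modulate an intermediate 3D feature map. The sketch below is a minimal NumPy illustration; the embedding size and the linear projections are assumptions for the example, not a specific published architecture.

```python
import numpy as np

def film_modulate(features, cond, W_gamma, W_beta):
    """FiLM-style conditioning: per-channel affine modulation of 3D features.

    features : feature volume, shape (C, D, H, W)
    cond     : condition embedding, shape (E,)
    W_gamma, W_beta : projection matrices, shape (C, E)
    """
    gamma = W_gamma @ cond                    # per-channel scale, shape (C,)
    beta = W_beta @ cond                      # per-channel shift, shape (C,)
    # broadcast (C,) over the three spatial axes
    return gamma[:, None, None, None] * features + beta[:, None, None, None]

rng = np.random.default_rng(1)
feats = rng.standard_normal((4, 8, 8, 8))
cond = rng.standard_normal(16)
out = film_modulate(feats, cond,
                    rng.standard_normal((4, 16)), rng.standard_normal((4, 16)))
print(out.shape)  # (4, 8, 8, 8)
```

Presenting conditions as extra input channels is even simpler (concatenation along the channel axis); FiLM-style modulation is preferred when the condition is a global vector rather than a spatial map.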
3. Applications Across Domains
Medical Imaging
Volumetric diffusion models underpin state-of-the-art synthesis, denoising, and super-resolution in MRI, CT, and PET data (Zhu et al., 2023, Yoon et al., 2024, Zhu et al., 13 Jan 2025, Choo et al., 2024). Approaches such as Make-A-Volume extend 2D latent diffusion to 3D by inserting identity-initialized 1D inter-slice convolutions for volumetric consistency (Zhu et al., 2023). Residual-based 3D patchwise diffusion models achieve leading PET/MR denoising performance with limited hardware and robust anatomical preservation (Yoon et al., 2024). The Score-Fusion framework leverages frozen 2D experts with a lightweight trainable 3D U-Net for efficient multi-modality 3D translation (Zhu et al., 13 Jan 2025).
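The identity-initialized inter-slice convolution idea can be illustrated in a few lines: a 1D kernel along the slice axis is initialized to a delta, so the grafted layer initially passes the pretrained 2D features through unchanged and only learns inter-slice mixing during fine-tuning. Below is a minimal NumPy sketch; the kernel size and volume shape are illustrative assumptions.

```python
import numpy as np

def slice_conv1d(volume, kernel):
    """Convolve a (D, H, W) volume with a 1D kernel along the slice axis D,
    using zero padding so the output keeps the same depth."""
    k = len(kernel)
    pad = k // 2
    padded = np.pad(volume, ((pad, pad), (0, 0), (0, 0)))
    out = np.zeros_like(volume)
    for i in range(k):
        out += kernel[i] * padded[i:i + volume.shape[0]]
    return out

vol = np.random.default_rng(2).standard_normal((6, 4, 4))
identity_kernel = np.array([0.0, 1.0, 0.0])   # delta: layer is a no-op at init
out = slice_conv1d(vol, identity_kernel)
print(np.allclose(out, vol))  # True
```

After initialization, the kernel weights are trained jointly with the rest of the model, letting each slice's features depend on its neighbors.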
3D Geometry Generation and Rendering
AutoDecoding latent 3D diffusion models generate articulated or static 3D assets via a 3D latent-space diffusion process, followed by a neural renderer producing view-consistent appearance (Ntavelis et al., 2023). Primitive-based approaches such as PrimDiffusion model humans as articulated clouds of volumetric primitives, allowing real-time rendering, animation, and downstream tasks (texture transfer, inpainting) without decoder inference (Chen et al., 2023).
Mesh and Spline Generation
Volumetric diffusion has been adapted for mesh generation at industrial standards. Neural Volumetric Mesh Generator uses a DDPM on 3D voxel grids, with a mesh extraction and regularization pipeline to yield artifact-free tetrahedral meshes from noise (Zheng et al., 2022). DDPM-Polycube learns to diffuse point clouds toward valid polycube structures for hexahedral meshing and volumetric spline construction, enabling out-of-distribution topology generalization critical for isogeometric analysis (Yu et al., 16 Mar 2025).
3D Perception and Inverse Problems
Volumetric diffusion processes are utilized in inverse problems with complex forward models, exemplified by Spectral Diffusion Posterior Sampling for spectral CT material decomposition, which alternates slice-wise diffusion prior steps with 3D model-based updates using compressed polychromatic forward operators (Jiang et al., 28 Mar 2025). Probabilistic scene perception (e.g., multi-view stereo, semantic scene completion) benefits from multi-step volumetric probability diffusion, which robustly refines 3D probability volumes under challenging conditions (Li et al., 2023).
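The alternation of prior steps and model-based updates can be sketched abstractly. Below is a hedged toy illustration of posterior-sampling-style reconstruction for a linear forward model $y = Ax$: each iteration applies a stand-in "prior" step (here a trivial shrinkage in place of a learned denoiser) followed by a gradient step toward data consistency. All operators, step sizes, and iteration counts are illustrative assumptions, not the compressed polychromatic operators of the cited work.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 32
A = rng.standard_normal((16, n)) / np.sqrt(n)   # toy linear forward operator
x_true = rng.standard_normal(n)
y = A @ x_true                                   # noiseless measurements

x = rng.standard_normal(n)                       # initialize from noise
for step in range(500):
    # "prior" step: stand-in for a learned denoiser (very mild shrinkage)
    x = 0.9999 * x
    # data-consistency step: gradient descent on ||A x - y||^2
    x = x - 0.3 * A.T @ (A @ x - y)

residual = float(np.linalg.norm(A @ x - y))
print(residual)
```

In the actual methods, the shrinkage step is replaced by a trained diffusion denoiser at the appropriate noise level, and the data-consistency step uses the physical (e.g., polychromatic CT) forward model.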
4. Supervision, Training Objectives, and Evaluation Protocols
Supervised targets include 2D/3D segmentations, radiance/density volumes, or multi-channel probability volumes, with objective terms tailored to each domain:
- Denoising-score-matching loss (simplified MSE between predicted and true noise);
- Perceptual, mask, and silhouette losses (e.g., pyramidal VGG-based losses for RGB, $\ell_1$/$\ell_2$ norms for mask supervision);
- Specialized patch-wise or volumetric regularization for data efficiency.
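In the simplified denoising objective, the network's noise prediction is compared to the true injected noise with an MSE term, optionally combined with auxiliary supervision such as a mask term. A minimal NumPy sketch of such a composite loss for one example (the $\ell_1$ mask term and its weight are illustrative assumptions):

```python
import numpy as np

def diffusion_loss(eps_pred, eps_true, mask_pred=None, mask_true=None, lam=0.1):
    """Simplified DDPM objective: MSE between predicted and true noise,
    plus an optional L1 mask-supervision term weighted by lam."""
    loss = np.mean((eps_pred - eps_true) ** 2)
    if mask_pred is not None:
        loss += lam * np.mean(np.abs(mask_pred - mask_true))
    return loss

rng = np.random.default_rng(4)
eps_true = rng.standard_normal((1, 8, 8, 8))
eps_pred = eps_true + 0.1 * rng.standard_normal((1, 8, 8, 8))
print(diffusion_loss(eps_pred, eps_true))
```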
For evaluation, domain-appropriate metrics are used:
- Medical imaging: MAE, PSNR, SSIM, Dice, Hausdorff distances;
- Geometry: Chamfer/Earth Mover's distance, mesh quality (aspect ratio, flips, self-intersections), minimum matching distance;
- Synthesis: FID, KID, LPIPS for realism and diversity;
- Inverse problems: RMSE, contrast ROI errors, inter-slice continuity.
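Two of the most common metrics above, PSNR for reconstruction fidelity and the Dice coefficient for segmentation overlap, apply directly to 3D arrays. A minimal NumPy sketch (the data range and toy volumes are illustrative):

```python
import numpy as np

def psnr(x, y, data_range=1.0):
    """Peak signal-to-noise ratio between two volumes (higher is better)."""
    mse = np.mean((x - y) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)

def dice(a, b):
    """Dice coefficient between two binary masks (1.0 = perfect overlap)."""
    a, b = a.astype(bool), b.astype(bool)
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

rng = np.random.default_rng(5)
ref = rng.random((16, 16, 16))
noisy = np.clip(ref + 0.01 * rng.standard_normal(ref.shape), 0, 1)
print(psnr(ref, noisy))              # mild noise -> high PSNR

mask_a = ref > 0.5
mask_b = np.roll(mask_a, 1, axis=0)  # slightly shifted mask
print(dice(mask_a, mask_b))
```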
Notable results include state-of-the-art FID/KID in 3D object and human generation (Ntavelis et al., 2023, Chen et al., 2023), volumetric Dice for condition-aligned medical synthesis with limited data (Ahn et al., 2024), and sub-percent contrast errors in spectral CT decomposition (Jiang et al., 28 Mar 2025).
5. Computational and Practical Considerations
Scaling volumetric diffusion is hampered by high memory and compute requirements. Effective strategies include:
- Patchwise 3D training for large volumes (Yoon et al., 2024);
- Hybrid 2D-3D architectures (Make-A-Volume, Score-Fusion), which restrict 3D learning to lightweight modules while leveraging robust pretrained 2D backbones (Zhu et al., 2023, Zhu et al., 13 Jan 2025);
- Compression of high-fidelity analytic forward operators (as in spectral CT) to a tractable number of energy bins (Jiang et al., 28 Mar 2025);
- Decoder-free rendering with volumetric primitives for real-time view synthesis (Chen et al., 2023).
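Patch-wise strategies tile a large volume into overlapping sub-volumes, process each independently, and blend the outputs back, so peak memory scales with patch size rather than volume size. A minimal NumPy sketch with uniform averaging in overlap regions (the patch and stride sizes are illustrative assumptions, and `fn` stands in for a per-patch model):

```python
import numpy as np

def process_patchwise(volume, fn, patch=16, stride=8):
    """Apply fn to overlapping cubic patches of a (D, H, W) volume and
    average the overlapping outputs back into a full-size result."""
    out = np.zeros_like(volume)
    weight = np.zeros_like(volume)

    def starts(n):
        s = list(range(0, n - patch + 1, stride))
        if s[-1] != n - patch:        # anchor a final patch at the boundary
            s.append(n - patch)
        return s

    D, H, W = volume.shape
    for i in starts(D):
        for j in starts(H):
            for k in starts(W):
                sl = (slice(i, i + patch), slice(j, j + patch), slice(k, k + patch))
                out[sl] += fn(volume[sl])
                weight[sl] += 1.0
    return out / weight

vol = np.random.default_rng(6).standard_normal((24, 24, 24))
result = process_patchwise(vol, lambda p: 2.0 * p)
print(np.allclose(result, 2.0 * vol))  # True: voxel-wise fn survives averaging
```

With a real denoiser, overlapping strides also suppress seam artifacts at patch borders, at the cost of redundant computation in the overlap regions.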
Empirical findings demonstrate that, with these measures, large-volume training and inference become practical on single high-memory GPUs (as in voxel spectral CT), and real-time synthesis rates are attainable for human generation (Chen et al., 2023, Jiang et al., 28 Mar 2025).
6. Challenges, Limitations, and Outlook
Fundamental challenges for volumetric diffusion models include:
- Memory and compute: 3D convolutions substantially increase resource demands, necessitating careful network and training design (Ahn et al., 2024, Zhu et al., 2023).
- Volumetric consistency: Slice-agnostic or pure 2D models often fail to preserve 3D coherence; inter-slice operations and explicit volumetric modules are required (Zhu et al., 2023, Choo et al., 2024).
- Data scarcity: Especially acute in 3D medical domains, addressed via parameter-efficient plug-in modules and hybrid approaches (Ahn et al., 2024).
- Conditioning and control: Complex spatial or multimodal control requires tailored fusion or conditioning modules, with plugin architectures (VCM, Score-Fusion) showing strong results and extensibility (Ahn et al., 2024, Zhu et al., 13 Jan 2025).
Broader implications suggest the paradigms developed for volumetric diffusion (such as modular 3D control, patchwise training, and hybrid ensembling of 2D experts) are transferable to diverse volumetric learning domains—dynamic imaging, inverse problems, and simulation. Open research directions include accelerated sampling (SDE/ODE solvers), joint physics-parameter estimation, extension to dynamic (4D) sequences, and further scaling and generalization of plug-and-play conditioning modules.
Primary references: (Ntavelis et al., 2023, Zhu et al., 2023, Jiang et al., 28 Mar 2025, Ahn et al., 2024, Xing et al., 2023, Yoon et al., 2024, Zheng et al., 2022, Yu et al., 16 Mar 2025, Li et al., 2023, Tang et al., 2023, Zhu et al., 13 Jan 2025, Choo et al., 2024, Chen et al., 2023, Taylor et al., 2015).