Wavelet-Conditioned ControlNet for 3D PET Denoising

Updated 18 January 2026

The paper introduces a novel 3D diffusion model that integrates explicit wavelet structural priors to preserve anatomical integrity in low-dose PET imaging.
It employs a lightweight ControlNet branch to inject noise-invariant, low-frequency information from wavelet decompositions during the reverse diffusion process.
Quantitative results show statistically significant gains in PSNR, SSIM, GMSD, and NMAE over a 3D DDPM baseline, both at the training dose (1/20) and at unseen doses (1/50 and 1/4).

Wavelet-Conditioned ControlNet (WCC-Net) is a fully 3D diffusion-based framework developed for denoising low-dose Positron Emission Tomography (PET) images. It addresses the critical challenge of preserving anatomical integrity in volumetric PET scans acquired at ultra-low radiotracer doses, where conventional methods are prone to oversmoothing or hallucinating fine structures due to high-amplitude, Poisson-like noise. WCC-Net injects explicit frequency-domain structural priors, derived from wavelet decompositions, into a pretrained and frozen diffusion backbone via a lightweight ControlNet-style branch. This design decouples anatomical information from stochastic noise, ensuring anatomical fidelity while retaining the generative expressiveness of the diffusion backbone (Jing et al., 11 Jan 2026).

1. Motivation and Conceptual Foundations

Conventional PET denoising models (CNNs, GANs, and standard diffusion models) face fundamental limitations in ultra-low signal-to-noise regimes: high-frequency content becomes noise-dominated, impeding the preservation of subtle anatomical details. Standard diffusion models conditioned only on noisy spatial data often fail to disentangle noise from structure, leading to either anatomical distortion or oversmoothed outputs.

WCC-Net introduces an explicit frequency-domain conditioning mechanism through wavelet subbands, most notably the noise-robust low-frequency $y_{LLL}$ component, which captures global anatomical uptake patterns. By integrating these priors into the denoising process, WCC-Net enforces structural consistency throughout the reverse diffusion trajectory (see Fig. 1 in (Jing et al., 11 Jan 2026)).

2. Diffusion Backbone Architecture

The backbone is a 3D conditional Denoising Diffusion Probabilistic Model (DDPM) built on a U-Net with an encoder–decoder structure and skip connections, operating on PET subvolumes of size $96^3$. The noise-prediction network $f_\theta(x_t, t, y)$ is pretrained to map 1/20-dose PET to normal-dose PET using a linear noise schedule $\beta_1,\dotsc,\beta_T$ over $T=1000$ timesteps. The U-Net weights $\theta$ are frozen during WCC-Net training; only the ControlNet branch parameters $\phi$ are updated, which preserves the backbone's original generative behavior and limits overfitting to the training noise distribution.

3. Wavelet Structural Priors

The wavelet prior is extracted using a 1-level 3D discrete wavelet transform (DWT) applied to the low-dose input $y \in \mathbb{R}^{H\times W\times D}$, yielding eight subbands:

$W(y) = \{y_{\alpha\beta\gamma}\mid \alpha,\beta,\gamma\in\{L,H\}\}$

where 'L' and 'H' indicate low- and high-pass filtering, respectively, along each spatial axis. The $y_{LLL}$ subband encodes coarse, low-frequency anatomical structure robust to noise, while other subbands (e.g., with one or two 'H's) capture directional and mid-frequency details. The subband $y_{HHH}$ predominantly reflects noise.

WCC-Net conditions primarily on $y_{LLL}$, offering stable, noise-invariant anatomical guidance. This approach separates stable anatomical structure from stochastic noise, as detailed in Section 5.3 of (Jing et al., 11 Jan 2026).
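The decomposition above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the source does not state which wavelet family is used, so the orthonormal Haar wavelet is assumed here, and the function name `haar_dwt3d` is hypothetical. The first letter of each key refers to axis 0, the second to axis 1, and the third to axis 2.

```python
import numpy as np

def haar_dwt3d(y):
    """Single-level 3D Haar DWT (assumed wavelet); returns eight subbands keyed 'LLL'..'HHH'."""
    if any(n % 2 for n in y.shape):
        raise ValueError("all three dimensions must be even")
    subbands = {"": np.asarray(y, dtype=np.float64)}
    for axis in range(3):
        split = {}
        for key, vol in subbands.items():
            even = np.take(vol, np.arange(0, vol.shape[axis], 2), axis=axis)
            odd = np.take(vol, np.arange(1, vol.shape[axis], 2), axis=axis)
            split[key + "L"] = (even + odd) / np.sqrt(2.0)  # low-pass along this axis
            split[key + "H"] = (even - odd) / np.sqrt(2.0)  # high-pass along this axis
        subbands = split
    return subbands

rng = np.random.default_rng(0)
y = rng.poisson(5.0, size=(8, 8, 8)).astype(np.float64)  # toy Poisson-noise volume
bands = haar_dwt3d(y)
c_wav = bands["LLL"]  # the low-frequency prior, at half resolution per axis
```

Because the Haar transform is orthonormal, the total energy of the eight subbands equals that of the input, and a constant volume produces a non-zero `LLL` band only.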

4. ControlNet Conditioning Mechanism

The ControlNet branch processes selected wavelet subbands $C_{\mathrm{wav}} = W_s(y)$ (default $s=\{\mathrm{LLL}\}$) via a 3D transposed convolution to match feature map dimensions; a 1-level DWT roughly halves each spatial dimension, so the subbands must be upsampled before they can be combined with backbone features. The branch mirrors the encoder layers of the frozen U-Net, but with trainable parameters $\phi$. At each encoder stage $\ell$, the wavelet features $z^{(\ell)}$ are injected into the backbone features $h^{(\ell)}$ via a zero-initialized $1\times1\times1$ convolution (ZeroConv):

$\tilde{h}^{(\ell)} = h^{(\ell)} + \mathrm{ZeroConv}^{(\ell)}(z^{(\ell)})$

This ensures the backbone operates unaltered at initialization; the ControlNet learns to modulate intermediate feature maps based on the wavelet prior, introducing frequency-conditioned structure during training. The overall block at resolution $\ell$ is:

$h^{(\ell+1)} = F^{(\ell)}\bigl(\tilde{h}^{(\ell)}\bigr)$

where $F^{(\ell)}$ is the $\ell$-th U-Net encoder block.
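The zero-initialized injection can be illustrated with plain NumPy, since a $1\times1\times1$ convolution is just a per-voxel linear mix of channels. The class `ZeroConv3d`, the helper `inject`, and the channel counts below are hypothetical names and values chosen for illustration; a real implementation would use a deep learning framework's 3D convolution layer.

```python
import numpy as np

class ZeroConv3d:
    """1x1x1 convolution (per-voxel channel mixing) with zero-initialised weights and bias."""

    def __init__(self, in_channels, out_channels):
        self.weight = np.zeros((out_channels, in_channels))
        self.bias = np.zeros(out_channels)

    def __call__(self, z):
        # z has shape (C_in, D, H, W); the output has shape (C_out, D, H, W).
        mixed = np.einsum("oc,cdhw->odhw", self.weight, z)
        return mixed + self.bias[:, None, None, None]

def inject(h, z, zero_conv):
    """Residual injection: h_tilde = h + ZeroConv(z)."""
    return h + zero_conv(z)

rng = np.random.default_rng(0)
h = rng.normal(size=(16, 6, 6, 6))  # backbone features at one encoder stage
z = rng.normal(size=(8, 6, 6, 6))   # control-branch features at the same stage
conv = ZeroConv3d(in_channels=8, out_channels=16)
h_tilde = inject(h, z, conv)        # identical to h while the weights are zero
```

At initialization `h_tilde` equals `h` exactly, so the frozen backbone's output is unchanged until training moves the ZeroConv weights away from zero.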

5. Diffusion Process, Training, and Losses

The forward diffusion process is parametrized as:

$q(x_t|x_{t-1}) = \mathcal{N}\big(x_t;\sqrt{1-\beta_t}x_{t-1},\beta_t I\big)$

with $\alpha_t = 1-\beta_t$ and cumulative noise schedule $\overline{\alpha}_t = \prod_{s=1}^t \alpha_s = \prod_{s=1}^t (1-\beta_s)$. The model learns to approximate the reverse transitions via:

$p_{\theta,\phi}(x_{t-1}|x_t, y, C_{\mathrm{wav}}) = \mathcal{N}\big(x_{t-1};\mu_{\theta,\phi}(x_t, t, y, C_{\mathrm{wav}}), \sigma_t^2 I\big)$

and generates denoised samples using the Euler-discretized update, which has the same form as the standard DDPM ancestral sampling step:

$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\overline{\alpha}_t}}\epsilon_{\theta,\phi}(x_t, t, y, C_{\mathrm{wav}})\right) + \sigma_t z$

where $z\sim\mathcal{N}(0, I)$ and $\epsilon_{\theta,\phi}$ denotes the noise predictor formed by the frozen backbone ($\theta$) together with the ControlNet branch ($\phi$). Learning is governed by the wavelet-conditioned MSE loss (only $\phi$ trainable, $\theta$ frozen):

$\mathcal{L}_{\mathrm{WCC}} = \mathbb{E}_{x_0,y,t,\epsilon}\|\epsilon - \epsilon_{\theta,\phi}(x_t, t, y, C_{\mathrm{wav}})\|_2^2$

The standard DDPM loss is used only for backbone pretraining.
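The relationship between the one-step transition $q(x_t|x_{t-1})$ and the cumulative schedule $\overline{\alpha}_t$ can be checked numerically. The sketch below assumes the common DDPM endpoints $\beta_1=10^{-4}$ and $\beta_T=0.02$; the source specifies only that the schedule is linear with $T=1000$. The helper names `q_sample` and `stepwise_stats` are hypothetical.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 2e-2, T)  # assumed endpoints; the source only says "linear"
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)      # cumulative product of (1 - beta_s)

def q_sample(x0, t, eps):
    """Draw x_t directly from x_0 using the closed form; t is 1-indexed."""
    a = alpha_bar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

def stepwise_stats(t):
    """Signal scale and noise variance of x_t given x_0, found by iterating q(x_s | x_{s-1})."""
    scale, var = 1.0, 0.0
    for s in range(t):
        scale = np.sqrt(alphas[s]) * scale  # mean is multiplied by sqrt(1 - beta_s)
        var = alphas[s] * var + betas[s]    # old variance is scaled, then beta_s is added
    return scale, var

scale_500, var_500 = stepwise_stats(500)
# The iterated values match sqrt(alpha_bar_t) and 1 - alpha_bar_t from the closed form.
```

Iterating the one-step Gaussian 500 times gives the same signal scale and noise variance as the closed form, which is why training can sample $x_t$ in a single step instead of simulating the whole chain.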

6. Quantitative and Qualitative Evaluation

Evaluation was conducted on Siemens Biograph Vision Quadra 18F-FDG whole-body PET data, using 1/20-dose scans for training, validation, and testing (297/20/60 split) and unseen 1/50- and 1/4-dose scans to test generalization. Patches of $96^3$ were cropped from $192\times288\times520$ volumes. Training used Adam (learning rate $1\times10^{-4}$, moment coefficients $\beta_1=0.9$, $\beta_2=0.999$, distinct from the diffusion schedule $\beta_t$), batch size 4, for 300k iterations.
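Patch extraction for paired training data might look like the following sketch. The source does not describe the cropping strategy, so uniform random cropping is an assumption, and `random_crop_pair` is a hypothetical helper. The demo uses a scaled-down volume to keep it light; the source crops $96^3$ patches from $192\times288\times520$ volumes.

```python
import numpy as np

def random_crop_pair(low_dose, normal_dose, size, rng):
    """Crop the same random cubic patch from a low-dose/normal-dose volume pair."""
    if low_dose.shape != normal_dose.shape:
        raise ValueError("paired volumes must have the same shape")
    if any(n < size for n in low_dose.shape):
        raise ValueError("patch size exceeds volume size")
    # Pick one start index per axis so that the patch fits inside the volume.
    starts = [int(rng.integers(0, n - size + 1)) for n in low_dose.shape]
    window = tuple(slice(s, s + size) for s in starts)
    return low_dose[window], normal_dose[window]

rng = np.random.default_rng(0)
full = rng.random((24, 36, 65))      # stand-in for a normal-dose volume
low = rng.poisson(full * 5.0) / 5.0  # stand-in for its noisy low-dose counterpart
y_patch, x0_patch = random_crop_pair(low, full, size=12, rng=rng)
```

Using a single window for both volumes keeps the low-dose input and the normal-dose target spatially aligned, which the paired loss requires.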

Results show that WCC-Net outperforms 3D DDPM, CNN, and GAN baselines in all tested regimes; the table below compares it with the 3D DDPM baseline:

Dose	Metric	3D DDPM	WCC-Net	Change
1/20 (seen)	PSNR [dB]	42.38±1.58	43.59±1.40	+1.21
1/20 (seen)	SSIM	0.976±0.008	0.984±0.005	+0.008
1/20 (seen)	GMSD	0.014±0.003	0.011±0.003	-0.003
1/20 (seen)	NMAE	0.117±0.018	0.111±0.014	-0.006
1/50 (unseen)	PSNR [dB]	39.52±1.85	40.75±1.80	+1.23
1/50 (unseen)	SSIM	0.964±0.010	0.976±0.011	+0.012
1/50 (unseen)	GMSD	0.019±0.005	0.014±0.006	-0.005
1/50 (unseen)	NMAE	0.151±0.031	0.132±0.027	-0.019
1/4 (unseen)	PSNR [dB]	44.78±1.53	45.24±1.35	+0.46
1/4 (unseen)	SSIM	0.981±0.003	0.992±0.003	+0.011
1/4 (unseen)	GMSD	0.010±0.004	0.007±0.003	-0.003
1/4 (unseen)	NMAE	0.089±0.010	0.084±0.013	-0.005

All improvements are statistically significant (paired Wilcoxon $p<0.01$). Qualitatively (see Figs. 2–4 in (Jing et al., 11 Jan 2026)), WCC-Net more accurately recovers thin cortical boundaries and small pathological lesions, suppresses oversmoothing, and tracks ground-truth intensity across anatomical profiles.
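Two of the reported metrics are simple to compute, as sketched below. The exact definitions used in the paper are not given in this summary, so both conventions are assumptions: PSNR takes the reference's value range as the peak, and NMAE normalizes the absolute error by the total absolute reference intensity. SSIM and GMSD are omitted because they need windowed or gradient-based computations.

```python
import numpy as np

def psnr(pred, ref, data_range=None):
    """Peak signal-to-noise ratio in dB; the peak defaults to the reference's value range."""
    if data_range is None:
        data_range = float(ref.max() - ref.min())
    mse = float(np.mean((pred - ref) ** 2))
    if mse == 0.0:
        return float("inf")
    return float(10.0 * np.log10(data_range ** 2 / mse))

def nmae(pred, ref):
    """Absolute error normalised by the total absolute reference intensity (assumed convention)."""
    return float(np.sum(np.abs(pred - ref)) / np.sum(np.abs(ref)))

ref = np.linspace(0.0, 1.0, 64).reshape(4, 4, 4)  # toy reference volume spanning [0, 1]
pred = ref + 0.1                                  # toy prediction with a constant offset
psnr_db = psnr(pred, ref)
nmae_value = nmae(pred, ref)
```

Higher PSNR and SSIM indicate better agreement with the reference, whereas lower GMSD and NMAE do, which is why the GMSD and NMAE changes in the table are negative.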

7. Generalization, Clinical Implications, and Algorithmic Summary

WCC-Net robustly generalizes to unseen noise regimes (e.g., 1/50- and 1/4-dose) due to its reliance on low-frequency wavelet priors, which remain stable across a wide range of noise levels. The architecture prevents overfitting to dose-specific artifacts by decoupling anatomical consistency (from the wavelet prior) and denoising (from the diffusion backbone). A plausible implication is improved safety and feasibility of clinical PET at ultra-low radiotracer doses, potentially enabling up to 95% dose reduction (the 1/20-dose setting) while maintaining diagnostic utility.

The training and inference workflows are captured in Algorithm 1 and Algorithm 2 of (Jing et al., 11 Jan 2026):

Algorithm 1: WCC-Net Training

For each minibatch of paired $(x_0, y)$:

- Sample timestep $t\sim\mathrm{Uniform}(1,T)$ and $\epsilon\sim\mathcal{N}(0,I)$
- Compute $x_t = \sqrt{\overline{\alpha}_t}x_0 + \sqrt{1-\overline{\alpha}_t}\epsilon$
- Extract wavelet prior $C_{\mathrm{wav}} = W_s(y)$
- Predict noise $\hat{\epsilon}$ with $f_{\theta,\phi}$
- Compute squared error loss and update $\phi$
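One training iteration can be illustrated with a deliberately tiny toy model in which the control branch is reduced to a single trainable scalar `phi`, so the gradient can be written by hand. Everything about the model is a stand-in: `frozen_backbone`, `c_feat`, the Gaussian toy noise, and the schedule endpoints are assumptions, and a plain gradient step replaces the Adam optimizer named in the source.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 2e-2, T)  # assumed endpoints; the source only says "linear"
alpha_bar = np.cumprod(1.0 - betas)

def frozen_backbone(x_t, t, y):
    """Hypothetical stand-in for the frozen noise predictor (theta is never updated)."""
    return x_t - y

def loss_and_grad(x0, y, c_feat, phi, t, eps):
    """Noise-prediction MSE and its analytic gradient with respect to the scalar phi."""
    a = alpha_bar[t - 1]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps       # closed-form forward sample
    eps_hat = frozen_backbone(x_t, t, y) + phi * c_feat  # control term vanishes at phi = 0
    residual = eps - eps_hat
    loss = float(np.mean(residual ** 2))
    grad_phi = float(-2.0 * np.mean(residual * c_feat))
    return loss, grad_phi

rng = np.random.default_rng(0)
x0 = rng.random((4, 4, 4))                    # normal-dose patch (toy size)
y = x0 + 0.3 * rng.standard_normal(x0.shape)  # low-dose patch (toy Gaussian noise)
c_feat = rng.standard_normal(x0.shape)        # stand-in for branch features from C_wav

phi, lr = 0.0, 0.05
t = int(rng.integers(1, T + 1))               # t drawn uniformly from {1, ..., T}
eps = rng.standard_normal(x0.shape)           # eps ~ N(0, I)
loss_before, grad = loss_and_grad(x0, y, c_feat, phi, t, eps)
phi = phi - lr * grad                         # update phi only; the backbone is untouched
loss_after, _ = loss_and_grad(x0, y, c_feat, phi, t, eps)
```

Starting from `phi = 0` mirrors the ZeroConv initialization: the first prediction is exactly the frozen backbone's output, and only the control parameter moves afterwards.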

Algorithm 2: WCC-Net Inference

For $t = T$ down to $1$:
- Extract $C_{\mathrm{wav}} = W_s(y)$
- Predict $\hat{\epsilon}$
- Update $x_{t-1}$ via Euler step

At termination, $x_0$ approximates the denoised PET volume.
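The inference loop can be sketched as follows. Several details are assumptions not stated in this summary: the chain starts from $x_T\sim\mathcal{N}(0,I)$, the variance is $\sigma_t^2=\beta_t$, no noise is added at the final step ($t=1$), and the schedule endpoints are the common DDPM defaults. The `oracle` noise predictor is a hypothetical stand-in that knows the target, used only to confirm that the update formula is implemented correctly.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 2e-2, T)  # assumed endpoints; the source only says "linear"
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def reverse_step(x_t, t, eps_hat, rng):
    """One update x_t -> x_{t-1}; t is 1-indexed and sigma_t^2 = beta_t is assumed."""
    a_t, ab_t, b_t = alphas[t - 1], alpha_bar[t - 1], betas[t - 1]
    mean = (x_t - (1.0 - a_t) / np.sqrt(1.0 - ab_t) * eps_hat) / np.sqrt(a_t)
    if t == 1:
        return mean  # common convention: no noise on the last step
    return mean + np.sqrt(b_t) * rng.standard_normal(x_t.shape)

def sample(eps_model, y, c_wav, shape, rng):
    """Run the reverse chain from Gaussian noise down to an estimate of x_0."""
    x = rng.standard_normal(shape)
    for t in range(T, 0, -1):
        x = reverse_step(x, t, eps_model(x, t, y, c_wav), rng)
    return x

rng = np.random.default_rng(0)
x0_true = rng.random((4, 4, 4))

def oracle(x_t, t, y, c_wav):
    """Stand-in predictor: the exact noise implied by x_t and the known target."""
    ab = alpha_bar[t - 1]
    return (x_t - np.sqrt(ab) * x0_true) / np.sqrt(1.0 - ab)

x_hat = sample(oracle, y=None, c_wav=None, shape=x0_true.shape, rng=rng)
```

With a perfect noise predictor the final step returns the target exactly, which makes the oracle a convenient sanity check before a trained network is substituted for it.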

WCC-Net thus offers anatomically consistent volumetric PET denoising in ultra-low SNR regimes through frequency-domain conditioning and a modular ControlNet architecture (Jing et al., 11 Jan 2026).

References (1)

3D Wavelet-Based Structural Priors for Controlled Diffusion in Whole-Body Low-Dose PET Denoising (2026)
