
Wavelet-Conditioned ControlNet for 3D PET Denoising

Updated 18 January 2026
  • The paper introduces a novel 3D diffusion model that integrates explicit wavelet structural priors to preserve anatomical integrity in low-dose PET imaging.
  • It employs a lightweight ControlNet branch to inject noise-invariant, low-frequency information from wavelet decompositions during the reverse diffusion process.
  • Quantitative results demonstrate significant improvements in PSNR, SSIM, and other metrics, ensuring robust performance across varied noise regimes.

Wavelet-Conditioned ControlNet (WCC-Net) is a fully 3D diffusion-based framework developed for denoising low-dose Positron Emission Tomography (PET) images. It addresses the critical challenge of preserving anatomical integrity in volumetric PET scans acquired at ultra-low radiotracer doses, where conventional methods are prone to oversmoothing or hallucinating fine structures due to high-amplitude, Poisson-like noise. WCC-Net injects explicit frequency-domain structural priors, derived from wavelet decompositions, into a pretrained and frozen diffusion backbone via a lightweight ControlNet-style branch. This design decouples anatomical information from stochastic noise, ensuring anatomical fidelity while retaining the generative expressiveness of the diffusion backbone (Jing et al., 11 Jan 2026).

1. Motivation and Conceptual Foundations

Conventional PET denoising models (CNNs, GANs, and standard diffusion models) face fundamental limitations in ultra-low signal-to-noise regimes: high-frequency content becomes noise-dominated, impeding the preservation of subtle anatomical details. Standard diffusion models conditioned only on noisy spatial data often fail to disentangle noise from structure, leading to either anatomical distortion or oversmoothed outputs.

WCC-Net introduces an explicit frequency-domain conditioning mechanism through wavelet subbands, most notably the noise-robust low-frequency $y_{LLL}$ component, which captures global anatomical uptake patterns. By integrating these priors into the denoising process, WCC-Net enforces structural consistency throughout the reverse diffusion trajectory (see Fig. 1 in (Jing et al., 11 Jan 2026)).

2. Diffusion Backbone Architecture

The backbone is a 3D conditional Denoising Diffusion Probabilistic Model (DDPM): a U-Net with encoder–decoder paths and skip connections, operating on PET subvolumes of size $96^3$. The noise-prediction network $f_\theta(x_t, t, y)$ is pretrained on 1/20-dose to normal-dose PET mappings using a linear noise schedule $(\beta_1,\dotsc,\beta_T)$ over $T=1000$ timesteps. The U-Net weights $\theta$ are frozen during WCC-Net training; only the ControlNet branch parameters $\phi$ are updated, preserving the backbone's original generative behavior and preventing overfitting to the training noise distribution.

3. Wavelet Structural Priors

The wavelet prior is extracted using a 1-level 3D discrete wavelet transform (DWT) applied to the low-dose input $y \in \mathbb{R}^{H\times W\times D}$, yielding eight subbands:

$$W(y) = \{y_{\alpha\beta\gamma} \mid \alpha,\beta,\gamma \in \{L,H\}\}$$

where L and H indicate low- and high-pass filtering, respectively, along each spatial axis. The $y_{LLL}$ subband encodes coarse, low-frequency anatomical structure that is robust to noise, while subbands with one or two H components capture directional and mid-frequency details. The $y_{HHH}$ subband predominantly reflects noise.

WCC-Net conditions primarily on $y_{LLL}$, providing stable, noise-invariant anatomical guidance. This choice separates stable anatomical structure from stochastic noise, as detailed in Section 5.3 of (Jing et al., 11 Jan 2026).
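The separable subband decomposition can be sketched with a 1-level 3D Haar DWT. This is an illustration only: the paper does not specify the wavelet family, so orthonormal Haar filters are assumed, and `haar_dwt3d` is a hypothetical helper, not the authors' code.

```python
import numpy as np

def haar_dwt3d(vol):
    """1-level 3D Haar DWT: returns a dict of 8 subbands keyed 'LLL'..'HHH'.

    Each axis is split into an orthonormal low-pass (average) and
    high-pass (difference) half, giving 2^3 = 8 subbands of half size.
    """
    def split(a, axis):
        a = np.moveaxis(a, axis, 0)
        lo = (a[0::2] + a[1::2]) / np.sqrt(2)  # low-pass along this axis
        hi = (a[0::2] - a[1::2]) / np.sqrt(2)  # high-pass along this axis
        return np.moveaxis(lo, 0, axis), np.moveaxis(hi, 0, axis)

    bands = {'': vol}
    for axis in range(3):               # filter axis by axis (separable DWT)
        new = {}
        for key, a in bands.items():
            lo, hi = split(a, axis)
            new[key + 'L'] = lo
            new[key + 'H'] = hi
        bands = new
    return bands

vol = np.random.rand(96, 96, 96)        # one PET subvolume, as in the paper
sub = haar_dwt3d(vol)
print(len(sub), sub['LLL'].shape)       # 8 subbands, each 48^3
```

Because the Haar filters are orthonormal, the subbands conserve the input energy, which is a quick sanity check on the decomposition.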

4. ControlNet Conditioning Mechanism

The ControlNet branch processes selected wavelet subbands $C_{\mathrm{wav}} = W_s(y)$ (default $s=\{LLL\}$) via a 3D transposed convolution to match feature-map dimensions. It mirrors the encoder layers of the frozen U-Net, but with trainable parameters $\phi$. At each encoder stage $\ell$, the wavelet features $z^{(\ell)}$ are injected into the backbone features $h^{(\ell)}$ via a zero-initialized $1\times1\times1$ convolution (ZeroConv):

$$\tilde{h}^{(\ell)} = h^{(\ell)} + \mathrm{ZeroConv}^{(\ell)}(z^{(\ell)})$$

This ensures the backbone operates unaltered at initialization; during training, the ControlNet learns to modulate intermediate feature maps based on the wavelet prior, introducing frequency-conditioned structure. The overall block at resolution $\ell$ is:

$$y^{(\ell+1)} = F^{(\ell)}\bigl(\tilde{h}^{(\ell)}\bigr)$$

where $F^{(\ell)}$ is the $\ell$-th U-Net encoder block.
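The identity-at-initialization property of ZeroConv can be verified in a minimal numpy sketch. A $1\times1\times1$ convolution is just a per-voxel linear map over channels; with zero weights and bias it contributes nothing, so $\tilde{h}^{(\ell)} = h^{(\ell)}$ at the start of training. Shapes and names here are illustrative, not the paper's implementation.

```python
import numpy as np

def zero_conv_1x1x1(z, weight, bias):
    """1x1x1 conv over a (C, D, H, W) feature map: a per-voxel channel mix."""
    out = np.einsum('oc,cdhw->odhw', weight, z)
    return out + bias[:, None, None, None]

C = 4
rng = np.random.default_rng(0)
h = rng.random((C, 6, 6, 6))             # frozen-backbone features h^(l)
z = rng.random((C, 6, 6, 6))             # ControlNet wavelet features z^(l)
W = np.zeros((C, C))                     # zero-initialized conv weight
b = np.zeros(C)                          # zero-initialized bias

h_tilde = h + zero_conv_1x1x1(z, W, b)   # injection step
print(np.allclose(h_tilde, h))           # True: backbone unchanged at init
```

As gradients flow during training, `W` and `b` move away from zero and the wavelet features begin to modulate the backbone, which is the mechanism the paper relies on.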

5. Diffusion Process, Training, and Losses

The forward diffusion process is parametrized as:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\, \sqrt{1-\beta_t}\,x_{t-1},\, \beta_t I\big)$$

with cumulative noise schedule $\overline{\alpha}_t = \prod_{s=1}^t (1-\beta_s)$. The model learns to approximate the reverse transitions via:

$$p_\theta(x_{t-1} \mid x_t, y, C_{\mathrm{wav}}) = \mathcal{N}\big(x_{t-1};\, \mu_\theta(x_t, t, y, C_{\mathrm{wav}}),\, \sigma_t^2 I\big)$$

and generates denoised samples using the Euler-discretized update:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\overline{\alpha}_t}}\,\epsilon_\theta(x_t, t, y, C_{\mathrm{wav}})\right) + \sigma_t z$$

where $z\sim\mathcal{N}(0, I)$ and $\alpha_t = 1-\beta_t$. Learning is governed by the wavelet-conditioned MSE loss (only $\phi$ trainable, $\theta$ frozen):

$$\mathcal{L}_{\mathrm{WCC}} = \mathbb{E}_{x_0,y,t,\epsilon}\big\|\epsilon - \epsilon_\phi(x_t, t, y, C_{\mathrm{wav}})\big\|_2^2$$

The standard DDPM loss is used only for backbone pretraining.
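The forward process and training loss above can be sketched in a few lines of numpy. The paper states a linear schedule over $T=1000$ steps but not its endpoints, so the common DDPM values $\beta_1=10^{-4}$, $\beta_T=0.02$ are assumed; `eps_phi` is a stand-in for the real frozen-backbone-plus-ControlNet network, and `c_wav` is a placeholder for $W_s(y)$.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # linear schedule; endpoints assumed
alpha_bar = np.cumprod(1.0 - betas)      # cumulative \bar{alpha}_t

def q_sample(x0, t, eps):
    """Closed-form draw x_t ~ q(x_t | x_0)."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def eps_phi(x_t, t, y, c_wav):
    """Stand-in for the conditioned noise predictor; a real model is a 3D U-Net."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
x0 = rng.random((8, 8, 8))                       # toy normal-dose target patch
y = x0 + 0.3 * rng.standard_normal(x0.shape)     # toy low-dose input
c_wav = y[::2, ::2, ::2]                         # placeholder for W_s(y)

t = int(rng.integers(1, T))                      # t ~ Uniform(1, T)
eps = rng.standard_normal(x0.shape)              # eps ~ N(0, I)
x_t = q_sample(x0, t, eps)
loss = np.mean((eps - eps_phi(x_t, t, y, c_wav)) ** 2)  # L_WCC sketch
```

Note that $\sqrt{\overline{\alpha}_t}^2 + \sqrt{1-\overline{\alpha}_t}^2 = 1$ for every $t$, so the noised sample keeps a fixed signal-plus-noise variance budget along the trajectory.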

6. Quantitative and Qualitative Evaluation

Evaluation was conducted on Siemens Biograph Vision Quadra 18F-FDG whole-body PET data, using 1/20-dose scans for training/validation/testing (297/20/60 splits) and unseen 1/50- and 1/4-dose scans for generalization. Patches of $96^3$ were cropped from $192\times288\times520$ volumes. Training used Adam (learning rate $1\times10^{-4}$, $\beta_1=0.9$, $\beta_2=0.999$), batch size 4, for 300k iterations.

Results demonstrate that WCC-Net outperforms 3D DDPM, CNN, and GAN baselines in all tested regimes:

| Dose | Metric | 3D DDPM | WCC-Net | Improvement |
|---|---|---|---|---|
| 1/20 (seen) | PSNR [dB] | 42.38 ± 1.58 | 43.59 ± 1.40 | +1.21 |
| | SSIM | 0.976 ± 0.008 | 0.984 ± 0.005 | +0.008 |
| | GMSD | 0.014 ± 0.003 | 0.011 ± 0.003 | −0.003 |
| | NMAE | 0.117 ± 0.018 | 0.111 ± 0.014 | −0.006 |
| 1/50 (unseen) | PSNR [dB] | 39.52 ± 1.85 | 40.75 ± 1.80 | +1.23 |
| | SSIM | 0.964 ± 0.010 | 0.976 ± 0.011 | +0.012 |
| | GMSD | 0.019 ± 0.005 | 0.014 ± 0.006 | −0.005 |
| | NMAE | 0.151 ± 0.031 | 0.132 ± 0.027 | −0.019 |
| 1/4 (unseen) | PSNR [dB] | 44.78 ± 1.53 | 45.24 ± 1.35 | +0.46 |
| | SSIM | 0.981 ± 0.003 | 0.992 ± 0.003 | +0.011 |
| | GMSD | 0.010 ± 0.004 | 0.007 ± 0.003 | −0.003 |
| | NMAE | 0.089 ± 0.010 | 0.084 ± 0.013 | −0.005 |

All improvements are statistically significant (paired Wilcoxon test, $p<0.01$). Qualitatively (see Figs. 2–4 in (Jing et al., 11 Jan 2026)), WCC-Net more accurately recovers thin cortical boundaries and small pathological lesions, suppresses oversmoothing, and tracks ground-truth intensity across anatomical profiles.

7. Generalization, Clinical Implications, and Algorithmic Summary

WCC-Net robustly generalizes to unseen noise regimes (e.g., 1/50- and 1/4-dose) due to its reliance on low-frequency wavelet priors, which remain stable across a wide range of noise levels. The architecture prevents overfitting to dose-specific artifacts by decoupling anatomical consistency (from the wavelet prior) and denoising (from the diffusion backbone). A plausible implication is improved safety and feasibility of clinical PET at ultra-low radiotracer doses—potentially enabling up to 95% dose reduction while maintaining diagnostic utility.

The training and inference workflows are captured in Algorithm 1 and Algorithm 2 of (Jing et al., 11 Jan 2026):

Algorithm 1: WCC-Net Training

For each minibatch of paired $(x_0, y)$:

  • Sample timestep $t\sim\mathrm{Uniform}(1,T)$ and noise $\epsilon\sim\mathcal{N}(0,I)$
  • Compute $x_t = \sqrt{\overline{\alpha}_t}\,x_0 + \sqrt{1-\overline{\alpha}_t}\,\epsilon$
  • Extract wavelet prior $C_{\mathrm{wav}} = W_s(y)$
  • Predict noise $\hat{\epsilon}$ with $f_{\theta,\phi}$
  • Compute the squared-error loss and update $\phi$

Algorithm 2: WCC-Net Inference

  • For $t=T$ down to $1$:
    • Extract $C_{\mathrm{wav}} = W_s(y)$
    • Predict $\hat{\epsilon}$
    • Update $x_{t-1}$ via the Euler step

At termination, $x_0$ approximates the denoised PET volume.
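Algorithm 2 amounts to the standard DDPM ancestral-sampling loop with the reverse-step formula from Section 5. The sketch below uses a short toy schedule and a dummy `eps_pred` standing in for $f_{\theta,\phi}(x_t, t, y, C_{\mathrm{wav}})$; schedule endpoints and the choice $\sigma_t=\sqrt{\beta_t}$ are common defaults assumed here, not stated in the source.

```python
import numpy as np

T = 50
betas = np.linspace(1e-4, 0.02, T)      # toy schedule; endpoints assumed
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def eps_pred(x_t, t):
    """Dummy noise predictor; the real model conditions on y and C_wav."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 8))      # x_T ~ N(0, I)
for t in range(T - 1, -1, -1):          # t = T ... 1 (0-indexed)
    eps_hat = eps_pred(x, t)
    # mean of p(x_{t-1} | x_t): the Euler-discretized reverse update
    mean = (x - (1 - alphas[t]) / np.sqrt(1 - alpha_bar[t]) * eps_hat) \
           / np.sqrt(alphas[t])
    sigma = np.sqrt(betas[t]) if t > 0 else 0.0   # no noise at the last step
    x = mean + sigma * rng.standard_normal(x.shape)
print(x.shape)                           # final x approximates x_0
```

Note the last step ($t=1$ in the paper's indexing) adds no noise, so the returned volume is the deterministic mean, i.e. the denoised PET estimate.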


WCC-Net is a significant development for volumetric PET denoising, providing anatomically consistent denoising performance in ultra-low SNR regimes through frequency-domain conditioning and a modular ControlNet architecture (Jing et al., 11 Jan 2026).
