Wavelet-Conditioned ControlNet for 3D PET Denoising
- The paper introduces a novel 3D diffusion model that integrates explicit wavelet structural priors to preserve anatomical integrity in low-dose PET imaging.
- It employs a lightweight ControlNet branch to inject noise-invariant, low-frequency information from wavelet decompositions during the reverse diffusion process.
- Quantitative results show gains over a 3D DDPM baseline in PSNR, SSIM, GMSD, and NMAE at the training dose (1/20) and at unseen 1/50 and 1/4 doses.
Wavelet-Conditioned ControlNet (WCC-Net) is a fully 3D diffusion-based framework developed for denoising low-dose Positron Emission Tomography (PET) images. It addresses the critical challenge of preserving anatomical integrity in volumetric PET scans acquired at ultra-low radiotracer doses, where conventional methods are prone to oversmoothing or hallucinating fine structures due to high-amplitude, Poisson-like noise. WCC-Net injects explicit frequency-domain structural priors, derived from wavelet decompositions, into a pretrained and frozen diffusion backbone via a lightweight ControlNet-style branch. This design decouples anatomical information from stochastic noise, ensuring anatomical fidelity while retaining the generative expressiveness of the diffusion backbone (Jing et al., 11 Jan 2026).
1. Motivation and Conceptual Foundations
Conventional PET denoising models (CNNs, GANs, and standard diffusion models) face fundamental limitations in ultra-low signal-to-noise regimes: high-frequency content becomes noise-dominated, impeding the preservation of subtle anatomical details. Standard diffusion models conditioned only on noisy spatial data often fail to disentangle noise from structure, leading to either anatomical distortion or oversmoothed outputs.
WCC-Net introduces an explicit frequency-domain conditioning mechanism through wavelet subbands, most notably the noise-robust low-frequency component, which captures global anatomical uptake patterns. By integrating these priors into the denoising process, WCC-Net enforces structural consistency throughout the reverse diffusion trajectory (see Fig. 1 in (Jing et al., 11 Jan 2026)).
2. Diffusion Backbone Architecture
The backbone is a 3D conditional Denoising Diffusion Probabilistic Model (DDPM) built on a U-Net with an encoder–decoder structure and skip connections, operating on 3D PET subvolumes. The noise-prediction network $\epsilon_\theta$ is pretrained on 1/20-dose to normal-dose PET mappings using a linear noise schedule over $T$ timesteps. The U-Net weights $\theta$ are frozen during WCC-Net training; only the ControlNet branch parameters are updated, which preserves the backbone's original generative behavior and limits overfitting to the noise distribution of the training data.
3. Wavelet Structural Priors
The wavelet prior is extracted using a 1-level 3D discrete wavelet transform (DWT) applied to the low-dose input $x_{\mathrm{LD}}$, yielding eight subbands:
$$\mathrm{DWT}(x_{\mathrm{LD}}) = \{\mathrm{LLL}, \mathrm{LLH}, \mathrm{LHL}, \mathrm{LHH}, \mathrm{HLL}, \mathrm{HLH}, \mathrm{HHL}, \mathrm{HHH}\},$$
where 'L' and 'H' indicate low- and high-pass filtering, respectively, along each spatial axis. The LLL subband encodes coarse, low-frequency anatomical structure that is robust to noise, while the mixed subbands (those with one or two 'H's) capture directional and mid-frequency details. The HHH subband predominantly reflects noise.
WCC-Net conditions primarily on LLL, which offers stable, noise-invariant anatomical guidance. This approach separates stable anatomical structure from stochastic noise, as detailed in Section 5.3 of (Jing et al., 11 Jan 2026).
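The 1-level decomposition described above can be sketched in a few lines of NumPy. This is only an illustration: the source does not say which wavelet family is used, so the orthonormal Haar wavelet is assumed here, and the function names `haar_split` and `haar_dwt3d` are hypothetical.

```python
import numpy as np

def haar_split(x, axis):
    """One-level Haar analysis along one axis: returns (low-pass, high-pass) halves."""
    x = np.moveaxis(x, axis, 0)
    even, odd = x[0::2], x[1::2]
    low = (even + odd) / np.sqrt(2.0)
    high = (even - odd) / np.sqrt(2.0)
    return np.moveaxis(low, 0, axis), np.moveaxis(high, 0, axis)

def haar_dwt3d(volume):
    """Return the eight subbands of a 1-level 3D Haar DWT, keyed 'LLL' ... 'HHH'.

    The i-th letter of each key is the filter applied along axis i.
    Assumption: Haar filters; the paper's wavelet family is not given in the text.
    """
    volume = np.asarray(volume, dtype=np.float64)
    if volume.ndim != 3 or any(s % 2 for s in volume.shape):
        raise ValueError("expected a 3D volume with even side lengths")
    bands = {"": volume}
    for axis in range(3):
        split = {}
        for name, data in bands.items():
            low, high = haar_split(data, axis)
            split[name + "L"] = low
            split[name + "H"] = high
        bands = split
    return bands

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic Poisson-like volume standing in for a low-dose PET patch.
    low_dose = rng.poisson(lam=5.0, size=(16, 16, 16)).astype(np.float64)
    bands = haar_dwt3d(low_dose)
    print(sorted(bands))        # the eight subband names
    print(bands["LLL"].shape)   # each subband has half the side length
```

Each subband has half the input resolution along every axis, which is why a conditioning branch fed with LLL needs an upsampling step before its features can be added to full-resolution backbone features.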
4. ControlNet Conditioning Mechanism
The ControlNet branch processes the selected wavelet subbands (LLL by default), first passing them through a 3D transposed convolution so that their spatial dimensions match the backbone feature maps. The branch mirrors the encoder layers of the frozen U-Net but has its own trainable parameters $\phi$. At each encoder stage $i$, the wavelet features $c_i$ are injected into the backbone features $h_i$ via a zero-initialized convolution (ZeroConv):
$$\tilde{h}_i = h_i + \mathrm{ZeroConv}_i(c_i).$$
Because the ZeroConv weights start at zero, the backbone operates unaltered at initialization; during training, the ControlNet learns to modulate intermediate feature maps based on the wavelet prior, introducing frequency-conditioned structure. The overall block at resolution $i$ is:
$$\tilde{h}_i = E_i(\tilde{h}_{i-1}) + \mathrm{ZeroConv}_i(c_i),$$
where $E_i$ is the $i$-th U-Net encoder block.
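To illustrate why zero initialization leaves the frozen backbone untouched at the start of training, here is a minimal NumPy sketch. It assumes the injection is additive and uses a 1x1x1 convolution written as a channel-mixing `einsum`; the class `ZeroConv3d`, the function `inject`, and the tensor shapes are hypothetical and not taken from the paper.

```python
import numpy as np

class ZeroConv3d:
    """1x1x1 convolution over channels whose weight and bias start at zero."""

    def __init__(self, in_channels, out_channels):
        self.weight = np.zeros((out_channels, in_channels))
        self.bias = np.zeros(out_channels)

    def __call__(self, features):
        # features: (in_channels, D, H, W) -> (out_channels, D, H, W)
        mixed = np.einsum("oc,cdhw->odhw", self.weight, features)
        return mixed + self.bias[:, None, None, None]

def inject(backbone_features, control_features, zero_conv):
    """Additive injection (assumed form): backbone + ZeroConv(control)."""
    return backbone_features + zero_conv(control_features)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h = rng.normal(size=(8, 4, 4, 4))   # backbone encoder features
    c = rng.normal(size=(2, 4, 4, 4))   # wavelet-branch features
    zero_conv = ZeroConv3d(in_channels=2, out_channels=8)

    # At initialization the control path contributes exactly zero.
    print(np.array_equal(inject(h, c, zero_conv), h))

    # Once training moves the weights away from zero, the prior modulates h.
    zero_conv.weight += 0.1
    print(np.array_equal(inject(h, c, zero_conv), h))
```

The design choice this demonstrates is that the first optimization step starts from the pretrained model's exact output, so the control branch can only move the result away from the backbone's behavior gradually.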
5. Diffusion Process, Training, and Losses
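The forward diffusion process is parametrized as:
$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\,I\right),$$
with cumulative noise schedule $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$. The model learns to approximate the reverse transitions $p(x_{t-1} \mid x_t, x_{\mathrm{LD}}, w)$, where $x_{\mathrm{LD}}$ is the low-dose input and $w$ its wavelet prior, through a noise-prediction network $\epsilon_{\theta,\phi}(x_t, t, x_{\mathrm{LD}}, w)$, and generates denoised samples with an Euler-discretized update of the reverse process. Learning is governed by the wavelet-conditioned MSE loss, in which only the ControlNet parameters $\phi$ are trainable and the backbone parameters $\theta$ are frozen:
$$\mathcal{L}(\phi) = \mathbb{E}_{x_0,\ \epsilon \sim \mathcal{N}(0, I),\ t}\left[\left\lVert \epsilon - \epsilon_{\theta,\phi}(x_t, t, x_{\mathrm{LD}}, w) \right\rVert_2^2\right].$$
The standard DDPM loss is used only for backbone pretraining.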
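The forward noising step and the noise-prediction loss can be sketched as follows. The number of timesteps and the endpoints of the linear schedule are not recoverable from this text, so the values below (the common DDPM defaults of 1000 steps, 1e-4 to 2e-2) are assumptions, as are the function names.

```python
import numpy as np

def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_s) for a linear beta schedule.

    Assumption: the schedule endpoints are illustrative, not the paper's.
    """
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, eps):
    """Closed-form forward sample x_t given x_0 and noise eps (t is 1-indexed)."""
    a = alpha_bar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

def noise_mse(eps_pred, eps):
    """Mean squared error between predicted and true noise."""
    return float(np.mean((eps_pred - eps) ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    alpha_bar = linear_alpha_bar(1000)
    x0 = rng.random((8, 8, 8))            # stand-in for a normal-dose patch
    eps = rng.normal(size=x0.shape)
    x_t = q_sample(x0, t=500, alpha_bar=alpha_bar, eps=eps)

    # A perfect noise predictor gives zero loss; an all-zero guess does not.
    print(noise_mse(eps, eps))
    print(noise_mse(np.zeros_like(eps), eps))
```

In WCC-Net training the prediction would come from the frozen backbone plus the trainable control branch, and only the control branch would receive gradient updates from this loss.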
6. Quantitative and Qualitative Evaluation
Evaluation was conducted on 18F-FDG whole-body PET data acquired on a Siemens Biograph Vision Quadra scanner, using 1/20-dose scans with a 297/20/60 training/validation/testing split and unseen 1/50- and 1/4-dose scans to test generalization. Fixed-size patches were cropped from the volumes. Training used the Adam optimizer with batch size 4 for 300k iterations.
Results demonstrate that WCC-Net outperforms 3D DDPM, CNN, and GAN baselines in all tested regimes:
| Dose | Metric | 3D DDPM | WCC-Net | Improvement |
|---|---|---|---|---|
| 1/20 (seen) | PSNR [dB] | 42.38±1.58 | 43.59±1.40 | +1.21 |
| | SSIM | 0.976±0.008 | 0.984±0.005 | +0.008 |
| | GMSD | 0.014±0.003 | 0.011±0.003 | –0.003 |
| | NMAE | 0.117±0.018 | 0.111±0.014 | –0.006 |
| 1/50 (unseen) | PSNR [dB] | 39.52±1.85 | 40.75±1.80 | +1.23 |
| | SSIM | 0.964±0.010 | 0.976±0.011 | +0.012 |
| | GMSD | 0.019±0.005 | 0.014±0.006 | –0.005 |
| | NMAE | 0.151±0.031 | 0.132±0.027 | –0.019 |
| 1/4 (unseen) | PSNR [dB] | 44.78±1.53 | 45.24±1.35 | +0.46 |
| | SSIM | 0.981±0.003 | 0.992±0.003 | +0.011 |
| | GMSD | 0.010±0.004 | 0.007±0.003 | –0.003 |
| | NMAE | 0.089±0.010 | 0.084±0.013 | –0.005 |
All improvements are statistically significant under a paired Wilcoxon test. For GMSD and NMAE, lower values are better, so negative differences indicate gains. Qualitatively (see Figs. 2–4 in (Jing et al., 11 Jan 2026)), WCC-Net more accurately recovers thin cortical boundaries and small pathological lesions, suppresses oversmoothing, and tracks ground-truth intensity across anatomical profiles.
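For readers who want to reproduce this kind of comparison, two of the reported metrics can be sketched as below. The exact definitions used in the paper are not given in this text, so both are assumptions: PSNR takes the reference volume's intensity range as the peak value, and NMAE normalizes the summed absolute error by the summed absolute reference intensity.

```python
import numpy as np

def psnr(reference, estimate, data_range=None):
    """Peak signal-to-noise ratio in dB.

    Assumption: the peak defaults to the reference's intensity range.
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    if data_range is None:
        data_range = reference.max() - reference.min()
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0.0:
        return float("inf")
    return float(10.0 * np.log10(data_range ** 2 / mse))

def nmae(reference, estimate):
    """Normalized mean absolute error (assumed normalization: sum of |reference|)."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    return float(np.abs(reference - estimate).sum() / np.abs(reference).sum())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    normal_dose = rng.random((8, 8, 8))
    denoised = normal_dose + 0.01 * rng.normal(size=normal_dose.shape)
    print(psnr(normal_dose, denoised), nmae(normal_dose, denoised))
```

Higher PSNR and lower NMAE indicate a closer match to the normal-dose reference.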
7. Generalization, Clinical Implications, and Algorithmic Summary
WCC-Net generalizes to unseen noise regimes (1/50- and 1/4-dose) because it relies on low-frequency wavelet priors, which remain stable across a wide range of noise levels. The architecture limits overfitting to dose-specific artifacts by decoupling anatomical consistency (supplied by the wavelet prior) from denoising (performed by the diffusion backbone). A plausible implication is improved safety and feasibility of clinical PET at ultra-low radiotracer doses, potentially enabling up to 95% dose reduction (corresponding to the 1/20-dose setting) while maintaining diagnostic utility.
The training and inference workflows are captured in Algorithm 1 and Algorithm 2 of (Jing et al., 11 Jan 2026):
Algorithm 1: WCC-Net Training
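For each minibatch of paired low-dose and normal-dose volumes $(x_{\mathrm{LD}}, x_0)$:
- Sample timestep $t \sim \mathrm{Uniform}\{1, \dots, T\}$ and noise $\epsilon \sim \mathcal{N}(0, I)$
- Compute $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$
- Extract wavelet prior $w = \mathrm{DWT}_{\mathrm{LLL}}(x_{\mathrm{LD}})$
- Predict noise with $\epsilon_{\theta,\phi}(x_t, t, x_{\mathrm{LD}}, w)$
- Compute squared error loss and update the ControlNet parameters $\phi$ (backbone parameters $\theta$ stay frozen)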
Algorithm 2: WCC-Net Inference
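- For $t = T$ to $1$:
  - Extract $w = \mathrm{DWT}_{\mathrm{LLL}}(x_{\mathrm{LD}})$
  - Predict $\hat{\epsilon} = \epsilon_{\theta,\phi}(x_t, t, x_{\mathrm{LD}}, w)$
  - Update $x_{t-1}$ from $x_t$ and $\hat{\epsilon}$ via Euler step
At termination, $x_0$ approximates the denoised PET volume.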
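The inference loop can be sketched as below, with several loud assumptions. The exact form of the paper's Euler update is not recoverable from this text, so a deterministic DDIM-style step is swapped in as a stand-in; the starting point is drawn from a standard normal; the LLL prior uses Haar filters; and `predict_noise` is a hypothetical callable standing in for the frozen backbone plus the trained control branch.

```python
import numpy as np

def haar_lll(volume):
    """LLL subband of a 1-level 3D Haar DWT (assumed wavelet): scaled 2x2x2 block sums."""
    d, h, w = volume.shape
    blocks = volume.reshape(d // 2, 2, h // 2, 2, w // 2, 2)
    return blocks.sum(axis=(1, 3, 5)) / 2.0 ** 1.5

def sample(predict_noise, low_dose, alpha_bar, rng):
    """Run the reverse loop from t = T down to 1 and return the final estimate.

    Assumption: deterministic DDIM-style update, not necessarily the paper's
    Euler step. predict_noise(x_t, t, low_dose, w) must return an array
    shaped like x_t.
    """
    x = rng.normal(size=low_dose.shape)           # assumed initial x_T
    for t in range(len(alpha_bar), 0, -1):
        w = haar_lll(low_dose)                    # wavelet prior
        eps_hat = predict_noise(x, t, low_dose, w)
        a_t = alpha_bar[t - 1]
        a_prev = alpha_bar[t - 2] if t > 1 else 1.0
        x0_hat = (x - np.sqrt(1.0 - a_t) * eps_hat) / np.sqrt(a_t)
        x = np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps_hat
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 2e-2, 50))
    target = rng.random((8, 8, 8))                # pretend normal-dose volume
    low_dose = target + 0.3 * rng.normal(size=target.shape)

    def oracle(x_t, t, ld, w):
        # Toy predictor that knows the target; a real model would be learned.
        a = alpha_bar[t - 1]
        return (x_t - np.sqrt(a) * target) / np.sqrt(1.0 - a)

    out = sample(oracle, low_dose, alpha_bar, rng)
    print(np.abs(out - target).max())
```

The oracle predictor exists only to show that the loop is internally consistent; it says nothing about the quality of a trained model.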
WCC-Net is a significant development for volumetric PET denoising, providing anatomically consistent denoising performance in ultra-low SNR regimes through frequency-domain conditioning and a modular ControlNet architecture (Jing et al., 11 Jan 2026).