Bijective Dual-Cycle VAE-GAN
- The paper introduces a dual-cycle constrained VAE-GAN that enforces near-invertible mappings for image-to-image translation between MRI modalities.
- It integrates a shared latent space with cycle-consistency, VAE reconstruction, and adversarial losses to synthesize high-resolution cine MRIs from tagged inputs.
- Quantitative results show improved SSIM, PSNR, and inception scores over VAE, VAE-GAN, and Pix2Pix baselines, indicating enhanced anatomical fidelity and supporting the goal of reduced acquisition time.
A bijective dual-cycle constrained variational autoencoder-generative adversarial network (VAE-GAN) is a deep generative modeling architecture designed for image-to-image translation between paired domains, specifically enforcing near-invertible mappings via cycle consistency and a shared latent space. It integrates a variational autoencoder backbone, adversarial training, and dual-cycle reconstruction losses to generate high-fidelity, anatomically consistent cine magnetic resonance images (MRIs) from lower-resolution tagged MRI inputs. This approach is particularly motivated by the clinical need to synthesize high-resolution cine MRIs from tagged images while reducing acquisition time and cost, as demonstrated in tongue motion MRI applications (Liu et al., 2021).
1. Architectural Overview
The dual-cycle constrained bijective VAE-GAN comprises two symmetric domain branches: one for the tagged MRI domain (T) and one for the cine MRI domain (C). Each branch contains:
- An encoder ($E_T$ or $E_C$), mirroring the Pix2Pix bottleneck design with strided convolutions, instance normalization, and Leaky-ReLU activations. For an input image $x$, it outputs a mean $\mu(x)$ and a standard deviation $\sigma(x)$ that parametrize the Gaussian approximate posterior $q(z \mid x)$ in a shared latent space $z$.
- A decoder ($G_T$ or $G_C$), implemented as the inverse of the encoder architecture, using fractional-stride convolutions, instance normalization, ReLU activations, and tanh or clipping at the output.
- A PatchGAN discriminator ($D_T$ or $D_C$), configured with a 70×70 window that classifies overlapping subregions as real or fake, in accordance with Pix2Pix practice.
Both branches operate on normalized 256×256-pixel images, and a single latent space is shared by the T and C domains to enforce code sharing. Inference requires only the $E_T \to G_C$ pathway. The model also supports bidirectional translation via the $T \to C$ and $C \to T$ mappings, which are regularized by dual cycle-consistency constraints to enforce approximate bijectivity.
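The 70×70 window quoted for the PatchGAN discriminator can be checked with a short receptive-field calculation. The sketch below assumes the usual Pix2Pix layer stack (five 4×4 convolutions with strides 2, 2, 2, 1, 1 and padding 1). The source does not list the layers, so that configuration and the helper names are assumptions made for illustration.

```python
# Assumed Pix2Pix-style PatchGAN stack: (kernel, stride, padding) per conv layer.
PATCHGAN_LAYERS = [(4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 1, 1), (4, 1, 1)]


def receptive_field(layers):
    """Input-pixel extent seen by one output unit of a stack of convolutions."""
    rf = 1
    # Walk from the last layer back to the input.
    for kernel, stride, _padding in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf


def output_size(input_size, layers):
    """Side length of the discriminator's real/fake score map."""
    size = input_size
    for kernel, stride, padding in layers:
        size = (size + 2 * padding - kernel) // stride + 1
    return size


if __name__ == "__main__":
    print(receptive_field(PATCHGAN_LAYERS))   # 70
    print(output_size(256, PATCHGAN_LAYERS))  # 30
```

Under these assumptions, each score in the output map judges one 70×70 patch of the 256×256 input, and the discriminator emits a 30×30 grid of such scores.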
2. Losses and Training Objectives
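The model jointly optimizes the following losses for each domain $d \in \{T, C\}$. Here $E_d$, $G_d$, and $D_d$ denote the encoder, decoder, and discriminator of domain $d$, $x_d$ is an image from that domain, $q_d(z \mid x_d)$ is the encoder's Gaussian posterior, and $E_d(x_d)$ is shorthand for a latent code sampled from it. The expressions are written in their standard forms:
- VAE Evidence Lower Bound (ELBO):
$$\mathcal{L}_{\text{VAE}}^{d} = \mathbb{E}_{z \sim q_d(z \mid x_d)}\big[\lVert G_{\bar d}(z) - x_{\bar d} \rVert_1\big] + \alpha\, \mathrm{KL}\big(q_d(z \mid x_d) \,\Vert\, \mathcal{N}(0, I)\big),$$
where $\bar d$ denotes the complementary domain, enforcing cross-domain L1 reconstruction and a KL penalty weighted by $\alpha$.
- Adversarial GAN Loss:
For instance, on the cine branch,
$$\mathcal{L}_{\text{GAN}}^{C} = \mathbb{E}_{x_C}\big[\log D_C(x_C)\big] + \mathbb{E}_{x_T}\big[\log\big(1 - D_C(G_C(E_T(x_T)))\big)\big],$$
and analogously for $\mathcal{L}_{\text{GAN}}^{T}$.
- Dual-Cycle Consistency Loss:
Enforces round-trip fidelity between the two domains,
$$\mathcal{L}_{\text{cyc}} = \mathbb{E}_{x_T}\big[\lVert G_T(E_C(G_C(E_T(x_T)))) - x_T \rVert_1\big] + \mathbb{E}_{x_C}\big[\lVert G_C(E_T(G_T(E_C(x_C)))) - x_C \rVert_1\big].$$
Training alternates between minimizing the total loss with respect to the encoders and decoders and maximizing it with respect to each discriminator, with $\beta$ and $\lambda$ weighting the cycle and adversarial terms:
$$\min_{E_T, E_C, G_T, G_C}\; \max_{D_T, D_C}\; \sum_{d \in \{T, C\}} \mathcal{L}_{\text{VAE}}^{d} + \beta\, \mathcal{L}_{\text{cyc}} + \lambda \sum_{d \in \{T, C\}} \mathcal{L}_{\text{GAN}}^{d}.$$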
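The way the three loss families above combine on the encoder/decoder side can be sketched in plain Python on flattened images. This is a minimal illustration under stated assumptions, not the authors' code. The function names are hypothetical, the KL term uses the standard closed form for a diagonal Gaussian against a standard normal, the adversarial term is the non-saturating-free `log(1 - D(.))` form, and the default weights `alpha`, `beta`, and `lam` mirror the KL, cycle, and GAN weights reported in the training section.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def l1(a, b):
    """Mean absolute difference between two flattened images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def generator_objective(recon, target, cycled, source, mu, log_var, d_fake,
                        alpha=1.0, beta=1.0, lam=0.5):
    """One-direction encoder/decoder loss: L1 + alpha*KL + beta*cycle + lam*GAN.

    recon   : synthesized image in the target domain
    target  : paired ground-truth image in the target domain
    cycled  : source image after the full round trip through both domains
    d_fake  : discriminator probability that `recon` is real
    """
    vae = l1(recon, target) + alpha * kl_to_standard_normal(mu, log_var)
    cycle = l1(cycled, source)
    # The generator side minimizes log(1 - D(G(.))); clamp to avoid log(0).
    gan = math.log(max(1.0 - d_fake, 1e-12))
    return vae + beta * cycle + lam * gan


if __name__ == "__main__":
    image = [0.1, 0.4, 0.9]
    loss = generator_objective(image, image, image, image,
                               mu=[0.0], log_var=[0.0], d_fake=0.5)
    print(loss)  # only the adversarial term is nonzero here
```

With perfect reconstruction, a perfect round trip, and a posterior equal to the prior, only the adversarial term remains, which makes the role of each weight easy to inspect in isolation.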
3. Bijectivity and Cycle Consistency
Approximate invertibility is achieved through three critical design choices:
- Cycle-consistency losses require that mapping an image from one domain to another and back reconstructs the original, acting as a strong regularizer against geometric or structural deviation.
- A single latent space shared between both domains tightly couples the two mappings, mitigating risks of mode-collapse or non-identifiable mappings.
- No further Jacobian-based or explicit invertibility regularization is imposed; the approximate bijection follows from reconstruction and cycle terms. A plausible implication is that exact invertibility is not strictly enforced, but in practice the cycle constraint is sufficient for empirical fidelity.
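A toy example makes the link between the cycle term and approximate invertibility concrete. In the sketch below, scalar functions stand in for the learned T→C and C→T translators (an assumption purely for illustration; all names are hypothetical). The dual-cycle L1 penalty is zero exactly when the two maps invert each other on the samples, and it grows as they drift apart.

```python
def cycle_l1(forward, backward, samples):
    """Mean |backward(forward(x)) - x| over the samples (one direction of the cycle)."""
    return sum(abs(backward(forward(x)) - x) for x in samples) / len(samples)


def dual_cycle_l1(t_to_c, c_to_t, t_samples, c_samples):
    """Sum of the T->C->T and C->T->C round-trip errors."""
    return (cycle_l1(t_to_c, c_to_t, t_samples)
            + cycle_l1(c_to_t, t_to_c, c_samples))


# Toy scalar "translators" standing in for the learned networks.
def t_to_c(x):
    return 2.0 * x + 1.0


def exact_c_to_t(y):
    return (y - 1.0) / 2.0          # true inverse of t_to_c


def biased_c_to_t(y):
    return (y - 1.0) / 2.0 + 0.25   # inverse with a constant error


if __name__ == "__main__":
    t_samples = [0.0, 0.5, 1.0]
    c_samples = [1.0, 2.0, 3.0]
    print(dual_cycle_l1(t_to_c, exact_c_to_t, t_samples, c_samples))   # 0.0
    print(dual_cycle_l1(t_to_c, biased_c_to_t, t_samples, c_samples))  # 0.75
```

The penalty only measures round-trip error on observed samples, which is why the resulting bijection is approximate rather than guaranteed.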
4. Training Regimen and Implementation Details
Training was conducted on a dataset of paired tongue MR slices from 20 healthy subjects, split as follows: 10 subjects (1,768 pairs) for training, 2 subjects (416 pairs) for validation, and 8 subjects (1,560 pairs) for held-out testing. Preprocessing involved resizing all images to 256×256 and intensity normalization; no augmentations were used.
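Because the split is made at the subject level, no subject contributes slices to more than one subset. A minimal sketch of such a partition follows. The source does not say how subjects were assigned, so the seeded random shuffle, the function name, and the subject identifiers are all assumptions.

```python
import random


def split_subjects(subject_ids, n_train=10, n_val=2, n_test=8, seed=0):
    """Partition subject IDs so that no subject appears in two subsets."""
    if n_train + n_val + n_test != len(subject_ids):
        raise ValueError("split sizes must sum to the number of subjects")
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)  # assumed assignment rule, seeded for repeatability
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }


if __name__ == "__main__":
    subjects = [f"S{i:02d}" for i in range(20)]  # hypothetical subject labels
    split = split_subjects(subjects)
    print({name: len(members) for name, members in split.items()})
```

Image pairs would then be gathered per subject within each subset, which keeps the held-out test set free of anatomy seen during training.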
Core hyperparameters, determined via validation grid search, included α (KL weight) = 1.0, β (cycle weight) = 1.0, and λ (GAN weight) = 0.5. The Adam optimizer (β₁ = 0.5, β₂ = 0.999) was used, with separate learning rates for the encoders/decoders and for the discriminators. Batch size was set to 1. Total training required approximately 4 hours on an NVIDIA V100 GPU, and inference on a single slice takes about 0.1 s.
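To show what the reported Adam settings do, the following sketch applies the standard Adam update to a single scalar parameter with β₁ = 0.5 and β₂ = 0.999. The function name is hypothetical, and the learning rate in the usage line is a placeholder chosen for the example, not a value taken from the source.

```python
import math


def adam_step(param, grad, state, lr, beta1=0.5, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter; `state` holds (m, v, t)."""
    m, v, t = state
    t += 1
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    param -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, (m, v, t)


if __name__ == "__main__":
    placeholder_lr = 1e-3  # assumed value for illustration only
    param, state = adam_step(1.0, grad=3.0, state=(0.0, 0.0, 0), lr=placeholder_lr)
    print(param, state[2])
```

A β₁ of 0.5 gives the momentum term a shorter memory than the common default of 0.9, a choice frequently used in GAN training to damp oscillation between generator and discriminator updates.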
5. Quantitative and Qualitative Performance
Performance was evaluated using mean L1 error, structural similarity index (SSIM), peak signal-to-noise ratio (PSNR), and inception score (IS), as summarized in the table below.
| Method | L1 (↓) | SSIM (↑) | PSNR (↑) | IS (↑) |
|---|---|---|---|---|
| VAE only | 148.8 ± 0.1 | 0.9486 ± 0.0014 | 32.68 ± 0.06 | 11.13 ± 0.11 |
| VAE-GAN [Larsen et al.] | 151.4 ± 0.2 | 0.9507 ± 0.0013 | 34.14 ± 0.05 | 11.68 ± 0.17 |
| Pix2Pix [Isola et al.] | 150.2 ± 0.2 | 0.9612 ± 0.0011 | 36.81 ± 0.07 | 13.77 ± 0.15 |
| Proposed | 149.6 ± 0.2 | 0.9746 ± 0.0015 | 38.72 ± 0.07 | 15.58 ± 0.12 |
In qualitative assessment, VAE-only outputs exhibit over-smoothing and loss of fine texture. VAE-GAN adds realism but can introduce anatomical distortion, and Pix2Pix enhances sharpness but may cause registration errors. The proposed dual-cycle bijective VAE-GAN preserves both high-frequency detail and anatomical structural consistency.
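For readers who want to reproduce this kind of comparison, the pixel-level metrics can be sketched in plain Python on flattened images. The definitions below are the textbook ones and the function names are hypothetical. The SSIM variant uses global image statistics rather than the usual sliding window, and the intensity range (`max_val`) is an assumption, so values will not match the table exactly.

```python
import math


def mean_abs_error(a, b):
    """Mean L1 distance between two flattened images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def psnr(a, b, max_val=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)


def global_ssim(a, b, max_val=255.0, k1=0.01, k2=0.03):
    """SSIM computed from whole-image statistics (no sliding window)."""
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    c1, c2 = (k1 * max_val) ** 2, (k2 * max_val) ** 2
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


if __name__ == "__main__":
    reference = [0.0, 64.0, 128.0, 254.0]
    synthesized = [1.0, 65.0, 129.0, 255.0]
    print(mean_abs_error(reference, synthesized))
    print(psnr(reference, synthesized))
    print(global_ssim(reference, synthesized))
```

Inception score is omitted because it requires a pretrained classifier rather than a closed-form expression.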
6. Practical Considerations, Limitations, and Future Directions
The architecture is effective in synthesizing cine MR images while preserving anatomical accuracy and image sharpness, suggesting that joint cycle-consistency with a shared latent space is synergistic for modality translation tasks. However, its present validation is limited to healthy tongue MRI data. Generalization to other organs, patient populations, or scanner types remains untested. The model does not incorporate explicit Jacobian or spectral invertibility penalties, so bijection is approximate rather than exact; methodological extensions such as normalizing flow modules could address this. Integrating the framework with segmentation or motion-tracking networks represents a practical avenue for fully automated organ motion analysis. These factors delimit the current scope but also point toward future research directions in leveraging cycle-constrained, bijective deep generative models in medical imaging (Liu et al., 2021).