Adversarial Symmetric VAE
- AS-VAE is a generative framework that unifies variational autoencoders and adversarial methods by replacing the traditional KL divergence with its symmetric variant.
- It employs adversarial training with a critic to approximate log-density ratios, enabling stable gradient updates and connections with GAN/ALI models.
- AS-VAE demonstrates robust performance on benchmarks like MNIST, CIFAR-10, and CelebA, achieving improved reconstruction and sample diversity.
The Adversarial Symmetric Variational Autoencoder (AS-VAE) is a generative modeling framework that unifies and generalizes variational autoencoders (VAEs) and adversarial learning techniques. AS-VAE replaces the traditional Kullback–Leibler divergence in VAEs with its symmetric variant, leading to an objective that enforces bidirectional matching between encoder and decoder joint distributions. The arising optimization problem admits an adversarial solution, providing gradient-stable training and naturally connecting to frameworks such as GANs and Adversarially Learned Inference (ALI) (Chen et al., 2017; Pu et al., 2017).
1. Symmetric KL Divergence and Model Formulation
AS-VAE is constructed upon the symmetric Kullback–Leibler divergence for probability densities $p$ and $q$:

$$\mathrm{KL}_s(p \,\|\, q) = \mathrm{KL}(p \,\|\, q) + \mathrm{KL}(q \,\|\, p).$$

For the encoder joint $q_\phi(x, z) = q(x)\, q_\phi(z \mid x)$ and the decoder joint $p_\theta(x, z) = p(z)\, p_\theta(x \mid z)$, the objective becomes the minimization of $\mathrm{KL}_s\!\left(q_\phi(x, z) \,\|\, p_\theta(x, z)\right)$.
The traditional VAE maximizes the evidence lower bound (ELBO):

$$\mathcal{L}_x(\theta, \phi) = \mathbb{E}_{q_\phi(x, z)}\big[\log p_\theta(x, z) - \log q_\phi(z \mid x)\big].$$

AS-VAE augments this with its dual, defined over the decoder joint:

$$\mathcal{L}_z(\theta, \phi) = \mathbb{E}_{p_\theta(x, z)}\big[\log q_\phi(x, z) - \log p_\theta(x \mid z)\big].$$

The sum forms a symmetric variational bound, equivalent (up to constants given by the entropies of $q(x)$ and $p(z)$) to the negative symmetric KL divergence between the encoder and decoder joints:

$$\mathcal{L}_x + \mathcal{L}_z = -\,\mathrm{KL}_s\!\left(q_\phi(x, z) \,\|\, p_\theta(x, z)\right) + \text{const}.$$

This formulation induces support matching between both marginals and conditionals.
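As a concrete numerical illustration of the symmetric divergence, the closed-form KL between two univariate Gaussians can be symmetrized; the helper functions below are illustrative sketches, not part of the AS-VAE model itself:

```python
import numpy as np

def kl_gauss(m1, s1, m2, s2):
    """Closed-form KL( N(m1, s1^2) || N(m2, s2^2) )."""
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

def sym_kl(m1, s1, m2, s2):
    """Symmetric KL: KL(p || q) + KL(q || p)."""
    return kl_gauss(m1, s1, m2, s2) + kl_gauss(m2, s2, m1, s1)

# Unlike the one-sided KL, the symmetric variant is invariant to
# swapping its arguments, and it vanishes only when the densities match.
a = sym_kl(0.0, 1.0, 2.0, 0.5)
b = sym_kl(2.0, 0.5, 0.0, 1.0)
```

Swapping the argument order leaves the value unchanged, which is exactly the bidirectional matching property the AS-VAE objective exploits.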
2. Adversarial Training and Likelihood-Free Optimization
To achieve likelihood-free optimization of the symmetric KL, AS-VAE introduces an adversarial critic $f_\psi(x, z)$, trained to distinguish samples from the two joints (with $\sigma$ denoting the sigmoid):

$$\max_\psi \; \mathbb{E}_{q_\phi(x, z)}\big[\log \sigma(f_\psi(x, z))\big] + \mathbb{E}_{p_\theta(x, z)}\big[\log\big(1 - \sigma(f_\psi(x, z))\big)\big].$$

At optimality, $f_{\psi^*}(x, z) = \log q_\phi(x, z) - \log p_\theta(x, z)$. The encoder and generator are then updated by substituting the critic for the intractable log-density ratio:

$$\min_{\theta, \phi} \; \mathbb{E}_{q_\phi(x, z)}\big[f_\psi(x, z)\big] - \mathbb{E}_{p_\theta(x, z)}\big[f_\psi(x, z)\big].$$

This adversarial procedure is directly analogous to the minimax objectives employed in GANs but specifically targets the joint distribution $q_\phi(x, z)$ versus $p_\theta(x, z)$. The approach can be specialized to recover ALI, GAN, or WGAN objectives as limiting cases, providing a principled unification (Chen et al., 2017).
In the dual-discriminator approach (Pu et al., 2017), two discriminators $\psi_1$ and $\psi_2$ are trained to approximate the log-density ratios of the encoder and decoder forms, each optimized with a GAN-style loss with respect to its own side.
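The optimal-critic identity underlying this ratio estimation, $\sigma(\log q - \log p) = q/(q + p)$, can be checked numerically; the two Gaussian densities below are arbitrary stand-ins for the encoder and decoder joints:

```python
import numpy as np

def log_normal(x, m, s):
    """Log-density of N(m, s^2) evaluated elementwise."""
    return -0.5 * np.log(2 * np.pi * s**2) - (x - m)**2 / (2 * s**2)

x = np.linspace(-3.0, 3.0, 101)
log_q = log_normal(x, 0.0, 1.0)   # stand-in for log q_phi(x, z)
log_p = log_normal(x, 1.0, 1.5)   # stand-in for log p_theta(x, z)

# Optimal critic value: the log-density ratio.
f_star = log_q - log_p
# Passing it through a sigmoid recovers the Bayes-optimal
# classification probability q / (q + p).
sigmoid = 1.0 / (1.0 + np.exp(-f_star))
ratio = np.exp(log_q) / (np.exp(log_q) + np.exp(log_p))
```

This is why training the critic as a binary classifier between the two joints yields, at optimality, exactly the log-ratio needed by the symmetric KL objective.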
3. Network Architectures and Implementation
AS-VAE employs deep convolutional architectures, with encoder/decoder/discriminator parameterizations varying by dataset. All models are implemented in TensorFlow, using Xavier initialization and the Adam optimizer with a fixed learning rate and batch size 64.
MNIST (28×28 grayscale)
- Encoder $q_\phi(z \mid x)$: sequential convolutions (5×5, 16 filters, stride 2, ReLU + BatchNorm; 5×5, 32 filters, stride 2, ReLU + BatchNorm), followed by fully connected layers producing the latent code.
- Decoder $p_\theta(x \mid z)$: FC layers (1024, then 3136 units, with BatchNorm + ReLU), reshaped and followed by stride-2 deconvolution stacks (5×5 kernels, 64→1 filters, final sigmoid).
- Discriminator $f_\psi(x, z)$: convolutional stack on $x$ (32, 64, 128 filters), an MLP on $z$, the two feature vectors concatenated, then an MLP producing a single logit.
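The layer sizes above can be sanity-checked with simple shape arithmetic, assuming TensorFlow-style 'same' padding (an assumption; the source does not state the padding mode):

```python
import math

def conv_out(size, stride):
    # With 'same' padding, a strided convolution produces
    # output_size = ceil(input_size / stride).
    return math.ceil(size / stride)

h = 28               # MNIST input height/width
h = conv_out(h, 2)   # after 5x5 conv, 16 filters, stride 2
h = conv_out(h, 2)   # after 5x5 conv, 32 filters, stride 2
flat = h * h * 32    # flattened feature count entering the FC layers

# Decoder side: a 3136-unit FC layer is consistent with reshaping to a
# 7x7x64 tensor before the stride-2 deconvolutions upsample to 28x28.
decoder_units = 7 * 7 * 64
```

The two stride-2 convolutions reduce 28×28 to 7×7, and 7·7·64 = 3136 matches the decoder's second FC layer, so the encoder and decoder shapes are mutually consistent.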
CIFAR-10 and CelebA
Analogous architectures with deeper and wider convolutional layers, Leaky ReLU non-linearities, auxiliary Gaussian noise, and pervasive batch normalization are used.
See the summary table for key elements:
| Module | MNIST Example | CIFAR-10/CelebA Example |
|---|---|---|
| Encoder | Conv→Conv→FC | Conv stack→FC→$z$ |
| Decoder | FC→FC→Deconv stack | FC→Deconv stack |
| Discriminator | Conv + MLP | Conv (x-path) + FC (z-path) + concat→FC |
4. Training Procedure
Training proceeds in alternating steps:
- Sample minibatches of $x$ from the data distribution $q(x)$ and $z$ from the prior $p(z)$.
- Sample latent codes via the encoder, and reconstructions via the decoder.
- Update the discriminator to maximize its classification objective on joint samples drawn from the encoder and decoder sides.
- Update generator/encoder parameters to maximize the symmetric variational bound, or its augmented version in sVAE-r, which adds a reconstruction term weighted by $\lambda$ to strengthen per-sample fidelity.
- Repeat until convergence (Chen et al., 2017; Pu et al., 2017).
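The alternation above can be sketched end-to-end on a deliberately tiny stand-in problem. Everything here is illustrative, not the paper's setup: the "data" is $q = \mathcal{N}(0,1)$, the "generator" is $p = \mathcal{N}(\theta,1)$ with a single learnable mean, and a linear critic suffices because the true log-ratio of two unit-variance Gaussians is linear in $x$:

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 2.0                # generator parameter (mean of p)
w = np.zeros(2)            # critic parameters [w1, w0], f(x) = w1*x + w0
lr_critic, lr_gen = 0.1, 0.1

def critic(x):
    return w[0] * x + w[1]

for step in range(200):
    # Critic update: logistic regression separating q (label 1)
    # from p (label 0); at its optimum f approximates log q - log p.
    for _ in range(5):
        xq = rng.normal(0.0, 1.0, 128)
        xp = rng.normal(theta, 1.0, 128)
        sq = 1.0 / (1.0 + np.exp(-critic(xq)))
        sp = 1.0 / (1.0 + np.exp(-critic(xp)))
        grad_w1 = np.mean((1.0 - sq) * xq) + np.mean((0.0 - sp) * xp)
        grad_w0 = np.mean(1.0 - sq) + np.mean(0.0 - sp)
        w += lr_critic * np.array([grad_w1, grad_w0])   # gradient ascent

    # Generator update: descend E_q[f] - E_p[f] with respect to theta.
    # With a linear critic, d/dtheta(-E_p[f]) = -w1, so the step is:
    theta += lr_gen * w[0]
```

After training, $\theta$ drifts toward the data mean (0), illustrating how the critic's log-ratio estimate supplies the generator's gradient in the symmetric-KL objective.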
5. Theoretical Insights and Connections
AS-VAE's symmetric KL divergence objective enforces bidirectional support matching, i.e., $p_\theta(x) \approx q(x)$ and $q_\phi(z) \approx p(z)$, directly addressing the "unrealistic-samples" artifact of standard VAEs. The adversarial ratio estimator provides stable gradients, sidestepping the vanishing-gradient pathology that binary cross-entropy induces in standard GAN training.
Crucially, the framework recovers ALI exactly under a sigmoidal critic, and GAN or WGAN when the encoder is omitted or additional constraints (e.g., Lipschitz continuity) are imposed. The optional reconstruction augmentation weight $\lambda$ improves cycle-consistency by strengthening per-sample reconstructions, a property not inherently present in vanilla ALI (Chen et al., 2017).
6. Empirical Evaluation
AS-VAE is evaluated on a range of benchmarks: Toy 2D Gaussian Mixture, MNIST, CelebA, and CIFAR-10. The main metrics include reconstruction MSE, log-likelihood (estimated via AIS/ELBO), and Inception Score (IS).
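The Inception Score used throughout is, in its standard formulation, the exponentiated mean KL between per-sample class posteriors $p(y \mid x)$ and the marginal $p(y)$; a minimal numpy sketch, with toy posterior matrices standing in for Inception-network outputs:

```python
import numpy as np

def inception_score(p_yx):
    """IS = exp( E_x [ KL( p(y|x) || p(y) ) ] ) for a matrix of
    per-sample class posteriors with shape (n_samples, n_classes)."""
    p_y = p_yx.mean(axis=0, keepdims=True)                        # marginal
    kl = np.sum(p_yx * (np.log(p_yx) - np.log(p_y)), axis=1)      # per-sample KL
    return float(np.exp(kl.mean()))

# Confident AND diverse predictions score near n_classes (here 4) ...
confident = np.full((4, 4), 1e-6) + np.eye(4) * (1.0 - 4e-6)
# ... while uniform, uninformative predictions score exactly 1.
uniform = np.full((4, 4), 0.25)
```

Higher IS thus rewards generators whose samples are both individually recognizable and collectively diverse, which is why it complements reconstruction MSE in the evaluations below.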
- MNIST: sVAE-r achieves an AIS log-likelihood of $-79.26$ nats and an IS of $9.12$, outperforming ALI, GAN, WGAN-GP, and DCGAN, and matching PixelRNN in likelihood estimates.
- CelebA (64×64): sVAE-r produces both sharp generated samples and high-fidelity reconstructions, whereas baselines such as ALICE trade off reconstruction against sample quality.
- CIFAR-10: Unsupervised IS reaches $6.96$ (ALI: $5.34$; DCGAN: $6.16$; WGAN-GP: $6.56$). RMSE and NLL are likewise competitive or superior.
- Toy 2D GMM: over 576 random trials, sVAE-r exceeds ALICE in both IS and MSE, evidencing robustness to architecture and hyperparameter choices (Chen et al., 2017; Pu et al., 2017).
Summary of selected experimental metrics:
| Dataset | Metric | AS-VAE/sVAE(-r) | Baselines |
|---|---|---|---|
| MNIST | IS | 9.12 | WGAN-GP: 8.45, GAN: 8.34 |
| MNIST | AIS log-likelihood | –79.26 nats | PixelRNN: –79.2 |
| CIFAR-10 | IS | 6.96 | ALI: 5.34 |
| CelebA | Reconstruction & generation | Strong on both | ALICE: trade-off |
AS-VAE achieves reconstruction errors closely matching or surpassing VAE/ELBO methods while generating sharper, more diverse samples than GAN/ALI baselines.
7. Significance and Conclusions
AS-VAE establishes a theoretical and algorithmic bridge between the variational-inference orientation of VAEs and the adversarial dynamics of GANs. Its use of the symmetric KL as a foundational divergence leads to balanced learned distributions, improved sample quality, robust reconstructions, and enhanced stability in training (Chen et al., 2017; Pu et al., 2017). Empirical evaluations corroborate its strong performance across generative modeling benchmarks. The architecture is extensible to diverse modalities (images, codes), and the framework inherently sidesteps several classical pathologies of deep generative models.