Doubly Stochastic Adversarial Autoencoders

Updated 20 February 2026
  • The paper introduces a doubly stochastic mechanism that replaces a fixed discriminator with random feature sampling to smooth gradients and mitigate mode collapse.
  • The methodology combines reconstruction loss with an adversarial penalty derived from stochastic random feature maps, bridging GAN and MMD approaches.
  • Experimental evaluations on MNIST show that DS-AAE improves sample diversity and latent space exploration compared to traditional AAEs and MMD-AEs.

A Doubly Stochastic Adversarial Autoencoder (DS-AAE) is a probabilistic autoencoder architecture in which the conventional adversary of an Adversarial Autoencoder (AAE) is replaced by a stochastic function sampled from a space of random frequency features. This introduces a second source of algorithmic randomness, which smooths gradients, regularizes the adversarial training, and mitigates mode collapse. DS-AAE interpolates between Maximum Mean Discrepancy Autoencoders (MMD-AE) and classical AAEs/GANs depending on the choice of random feature distribution and parameterizations (Azarafrooz, 2018).

1. Motivation and Background

Variational Autoencoders (VAEs) and AAEs both enforce a prescribed prior on the latent code. VAEs achieve this via a closed-form KL divergence penalty matching the aggregated posterior $Q_\theta(z) = \int Q_\theta(z|x)\,p_{\text{data}}(x)\,dx$ to the prior $p(z)$, but this can yield blurry samples and under-exploration of multimodal posteriors, since the KL term pushes each $q_\theta(z|x)$ toward $p(z)$ individually. AAEs replace the KL penalty with a GAN-style discriminator, yielding objectives of the form

$$\min_{\theta,\phi}\; \mathbb{E}_x\big[\ell_{\text{recon}}(x;\theta,\phi)\big] + \lambda \cdot \sup_D \Big[ \mathbb{E}_{z\sim p(z)}\log D(z) + \mathbb{E}_x \log\big(1-D(q_\theta(x))\big) \Big]$$

AAEs can generate sharper samples, but adversarial training may suffer from mode collapse: the discriminator can quickly distinguish “fake” codes, pushing the encoder to collapse to a few latent modes to fool the discriminator.
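To make the adversarial term concrete, the following sketch evaluates the GAN-style penalty for a toy, hypothetical linear-logistic discriminator (not the paper's network) on prior codes versus encoded codes:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def aae_adversarial_value(z_prior, z_fake, w, b):
    # E_{z~p(z)} log D(z) + E_x log(1 - D(q(x))) for a toy linear-logistic
    # discriminator D(z) = sigmoid(w.z + b); the AAE adversary maximizes this.
    d_real = sigmoid(z_prior @ w + b)
    d_fake = sigmoid(z_fake @ w + b)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# Well-separated code distributions: a discriminating direction scores higher
# than the uninformative discriminator (w = 0), whose value is 2*log(0.5).
z_prior = rng.normal(loc=2.0, size=(500, 2))
z_fake = rng.normal(loc=-2.0, size=(500, 2))
```

When the encoder succeeds in matching the prior, no direction $w$ separates the two samples and the supremum falls back toward $2\log 0.5$; a large gap is exactly the signal that lets the adversary push the encoder around.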

DS-AAE replaces the deterministic discriminator $D$ with a space of stochastic functions, injecting additional randomness. This mechanism:

  • Smooths gradients for the encoder,
  • Prevents overfitting of the adversary,
  • Encourages the generator to explore more latent modes,
  • Reduces mode collapse.

2. Model Components and Stochastic Adversary

The DS-AAE comprises an encoder $q_\theta(z|x)$, which deterministically maps input $x$ to latent code $z$; a decoder/generator $p_\phi(x|z)$; and an imposed prior $p(z)$ (commonly $\mathcal{N}(0,I)$).

Conventional AAEs use a single neural-network discriminator. DS-AAE instead defines a function class $\mathcal{A} = \{ f_\alpha(z) = \alpha^T \zeta(z) : \alpha \in \mathbb{R}^m,\ \|\alpha\| \leq R \}$, where $\zeta(z)$ is a "doubly stochastic gradient feature" constructed by sampling a random frequency $w$ from a measure $P(w)$ (often Gaussian, corresponding to RBF kernels) and defining a random feature map

$$\zeta(z;w) = \phi_w(z) - \phi_w(z')$$

with $\phi_w(\cdot)$ set, for example, to $\exp(iw^Tz)$, and $z'$ a code drawn from the prior $p(z)$. Each function $f_\alpha$ used during training depends on a fresh random draw $w \sim P(w)$, ensuring the stochasticity of the adversary.
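The feature construction can be sketched as follows. This uses the real-valued cosine/sine form of random Fourier features in place of the complex exponential $\exp(iw^Tz)$, with a Gaussian frequency measure; function names and sizes are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_frequencies(m, dim, sigma=1.0):
    # Draw random frequencies w ~ P(w); a Gaussian P(w) with scale 1/sigma
    # corresponds to the RBF kernel with bandwidth sigma.
    return rng.normal(scale=1.0 / sigma, size=(m, dim))

def phi(z, W):
    # Real-valued random Fourier features standing in for exp(i w^T z):
    # [cos(Wz), sin(Wz)] / sqrt(m), so phi(z) . phi(z') approximates k(z, z').
    proj = z @ W.T
    m = W.shape[0]
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=1) / np.sqrt(m)
```

With many frequencies, $\phi(z)^\top\phi(z')$ concentrates around the RBF kernel $\exp(-\|z-z'\|^2/2\sigma^2)$, which is what lets the stochastic adversary interpolate toward MMD.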

3. Doubly Stochastic Minimax Objective

DS-AAE is formulated as a doubly stochastic saddle-point problem $\min_{G=(\theta,\phi)} \max_{f \in \mathcal{A}} L(G, f)$ with objective

$$L(G, f) = \mathbb{E}_{x \sim p_{\text{data}}}\big[\ell_{\text{recon}}(x;G)\big] + \lambda \cdot D_{\text{stoch}}(Q_\theta, p; f)$$

where the stochastic divergence is

$$D_{\text{stoch}}(Q_\theta, p; f) = \mathbb{E}_{w\sim P(w)}\left[ \xi_w \right],\qquad \xi_w = \mathbb{E}_{z\sim p(z)}\,\phi_w(z) - \mathbb{E}_{z\sim Q_\theta}\,\phi_w(z)$$

Maximizing over $\alpha$ gives

$$\max_{\|\alpha\| \leq R} \alpha^T \zeta = R\,\|\zeta\|_2,\qquad \zeta = \mathbb{E}_{w\sim P(w)}\,\xi_w\,\phi_w(\cdot)$$

For a large number of random features $w$, this converges to the MMD divergence; if $\phi_w$ is parameterized by a deep network and $P(w)$ is degenerate, the formulation recovers the GAN divergence.
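The inner maximization has the closed form above by Cauchy–Schwarz: the best constrained adversary points along the witness vector. A quick numerical check with an arbitrary, illustrative $\zeta$:

```python
import numpy as np

rng = np.random.default_rng(0)
zeta = rng.normal(size=16)   # stand-in for the witness feature vector
R = 2.0

# The maximizer of alpha^T zeta subject to ||alpha|| <= R is
# alpha* = R * zeta / ||zeta||, attaining the value R * ||zeta||_2.
alpha_star = R * zeta / np.linalg.norm(zeta)
best = alpha_star @ zeta
```

Any other feasible $\alpha$ (for example a random direction rescaled to norm $R$) attains a strictly smaller inner product unless it is parallel to $\zeta$.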

The overall loss includes:

  • The usual reconstruction penalty $\ell_{\text{recon}}(x;G)$,
  • The adversarial penalty $\lambda \cdot D_{\text{stoch}}(\cdot)$,
  • A constraint $\|\alpha\| \leq R$ or $\ell_2$ regularization on $\alpha$.

4. Training Algorithm

Training in DS-AAE alternates between batch sampling of data and batch sampling of random features. The two sources of randomness are crucial for doubly stochastic optimization. The training loop proceeds as follows:

  1. Sample a minibatch of data $x_1, \dots, x_B \sim p_{\text{data}}$.
  2. Encode: $z_i = q_\theta(x_i)$.
  3. Sample prior codes $z'_1, \dots, z'_B \sim p(z)$.
  4. Sample $M$ random features $w_j \sim P(w)$ and form the feature maps $\phi_j(\cdot)$.
  5. Compute the stochastic gradient terms:
    • $\xi_j = \frac{1}{B} \sum_{i=1}^B \phi_j(z'_i) - \frac{1}{B} \sum_{i=1}^B \phi_j(z_i)$,
    • $\zeta_j = \xi_j \cdot \phi_j(\cdot)$.
  6. Update the adversary ($\alpha$) by ascending the objective, including the regularization term.
  7. Update the encoder and decoder parameters $(\theta, \phi)$ using gradients from the reconstruction loss and the adversarial penalty.

Increasing the number of feature samples $M$ or using control variates mitigates the variance introduced by random feature sampling.
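The loop above might be sketched as follows. This is a minimal, hypothetical setup (linear encoder/decoder, real-valued random Fourier features, and a closed-form $\|\zeta\|^2$-style adversarial term instead of explicit $\alpha$ updates); it illustrates the two sources of randomness per step, not the paper's networks or optimizer:

```python
import numpy as np

rng = np.random.default_rng(0)

dim_x, dim_z, B, M = 8, 2, 256, 64                 # toy sizes, not the paper's
Enc = rng.normal(scale=0.1, size=(dim_z, dim_x))   # encoder weights (theta)
Dec = rng.normal(scale=0.1, size=(dim_x, dim_z))   # decoder weights (phi)
lam = 1.0

def rff(Z, W):
    # Random Fourier features for a fresh draw of frequencies W.
    P = Z @ W.T
    return np.concatenate([np.cos(P), np.sin(P)], axis=1) / np.sqrt(W.shape[0])

def loss_one_step(x):
    z = x @ Enc.T                                   # step 2: encode
    z_prior = rng.normal(size=(B, dim_z))           # step 3: prior codes
    W = rng.normal(size=(M, dim_z))                 # step 4: fresh frequencies
    xi = rff(z_prior, W).mean(axis=0) - rff(z, W).mean(axis=0)  # step 5
    adv = xi @ xi                                   # squared witness norm
    recon = np.mean((x - z @ Dec.T) ** 2)           # reconstruction penalty
    # Steps 6-7 (adversary and encoder/decoder updates) would be carried out
    # with autodiff on this scalar in a real implementation.
    return recon + lam * adv
```

Because $W$ is redrawn inside every step, the adversarial signal the encoder sees changes each iteration even for identical data, which is the "doubly stochastic" ingredient.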

5. Theoretical Properties

The convergence of the algorithm leverages results from stochastic optimization in reproducing kernel Hilbert spaces (RKHS) [6]: when the adversary step size $\eta_\alpha$ is small, the iterates $\alpha_t$ remain in the RKHS defined by the kernel $k(x, x') = \int \phi_w(x)\,\phi_w(x')\,dP(w)$, and the minimax optimization converges to a stationary Nash point.

Introducing randomness via $w$ prevents the adversary from perfectly overfitting to the current generator, which compels the generator to explore the latent space more broadly, encouraging the discovery of additional latent modes and supporting more uniform coverage of the prior $p(z)$.

6. Experimental Evaluation

Experiments were conducted primarily on MNIST (28×28 images), with additional preliminary results on CIFAR-10. Network architectures employed three fully connected layers (1024 → 512 → 216) with ReLU activations and a final sigmoid for the decoder. The latent dimension was set to 6 for DS-AAE, with 20% input dropout and the Adam optimizer (learning rate 0.001). Minibatch and random-feature batch sizes were both set to 1000, with an RBF kernel ($\sigma = 1$).

Parzen-window log-likelihoods for 10,000 MNIST samples are summarized below:

| Model       | Parzen LL (mean ± std) |
|-------------|------------------------|
| GAN [3]     | 225 ± 2                |
| GMMN+AE [5] | 282 ± 2                |
| AAE [1]     | 340 ± 2                |
| MMD-AE [5]  | 228 ± 1.6              |
| DS-AAE      | 243.2 ± 1.7            |

DS-AAE samples display higher visual diversity, with more heterogeneous digit styles across generated panels compared to the relatively homogeneous samples produced by standard AAE or MMD-AE. Latent space interpolations are sharp and cover multiple Gaussian modes, indicating a reduction in mode collapse and improved coverage of the latent manifold.

7. Extensions and Open Questions

DS-AAE demonstrates that replacing a fixed discriminator with a stochastic function space regularizes minimax training and promotes exploration, yielding a continuous spectrum between GAN-based and MMD-based regularization. The doubly stochastic scheme (data plus random features) is integral to this effect.

However, DS-AAE is sensitive to the size of data and feature batches; using small batch sizes diminishes exploration. Its convergence may be slower than that of deterministic AAEs due to the additional randomness. Directions for further research include convolutional DS-AAEs, adaptive random feature sampling strategies, improved variance reduction, and establishing generalization bounds for the doubly stochastic adversary (Azarafrooz, 2018).
