MobileStyleGAN: Efficient Edge Image Synthesis
- MobileStyleGAN is a lightweight style-based generative adversarial network that uses wavelet-domain synthesis to reduce computational and memory overhead while maintaining high image quality.
- It replaces dense convolutions with depthwise separable modulated convolutions and uses a parameter-free inverse discrete wavelet transform for efficient upsampling.
- The architecture achieves up to 9.5× reduction in MACs and 3.5× fewer parameters, with inference speeds up to 27× faster on edge devices despite a slight FID degradation.
MobileStyleGAN is a lightweight style-based generative adversarial network (GAN) architecture designed for high-fidelity image synthesis with drastically reduced computational and memory requirements compared to StyleGAN2. MobileStyleGAN targets edge-device deployment by employing architectural optimizations that minimize the number of parameters and multiply-accumulate operations (MACs), while preserving perceptual image quality (Belousov, 2021).
1. Architectural Overview
MobileStyleGAN retains the essential two-stage architecture of StyleGAN2: an 8-layer mapping network that transforms standard latent codes $z \in \mathcal{Z}$ into intermediate latent codes $w \in \mathcal{W}$, and a synthesis network that generates images conditioned on $w$. The mapping network is left unchanged; all architectural simplifications and optimizations are focused on the synthesis network.
The primary innovation in MobileStyleGAN is the shift from direct pixel-space prediction to frequency-domain synthesis. Instead of generating pixel values at each stage, the network predicts discrete wavelet transform (DWT) coefficients (specifically, the four subbands LL, LH, HL, HH at each resolution) and applies an inverse DWT (IDWT) to upsample the representation. The Haar-wavelet IDWT used for upsampling is parameter-free and multiply-free: each 2×2 output block is a fixed $\pm\tfrac{1}{2}$-weighted combination of the four subband coefficients at that location,

$$\begin{pmatrix} x_{2i,2j} & x_{2i,2j+1} \\ x_{2i+1,2j} & x_{2i+1,2j+1} \end{pmatrix} = \frac{1}{2}\begin{pmatrix} \mathrm{LL}+\mathrm{LH}+\mathrm{HL}+\mathrm{HH} & \mathrm{LL}-\mathrm{LH}+\mathrm{HL}-\mathrm{HH} \\ \mathrm{LL}+\mathrm{LH}-\mathrm{HL}-\mathrm{HH} & \mathrm{LL}-\mathrm{LH}-\mathrm{HL}+\mathrm{HH} \end{pmatrix}_{i,j}$$

(one common sign convention; subband labeling varies across implementations). This frequency-domain approach allows image synthesis to operate predominantly at lower spatial resolutions and to concentrate computation on the multiscale detail bands.
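As a concrete illustration, the parameter-free Haar IDWT upsampling step can be sketched in NumPy. This is a minimal single-channel version; the function names and the subband sign convention are illustrative assumptions, not the paper's reference code.

```python
import numpy as np

def haar_idwt(ll, lh, hl, hh):
    """Inverse 2D Haar transform: reconstruct a (2H, 2W) image from four
    (H, W) subbands. Parameter-free: each 2x2 output block is a fixed
    +/- 1/2-weighted sum of the four coefficients (orthonormal Haar)."""
    h, w = ll.shape
    out = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    out[0::2, 0::2] = (ll + lh + hl + hh) / 2
    out[0::2, 1::2] = (ll - lh + hl - hh) / 2
    out[1::2, 0::2] = (ll + lh - hl - hh) / 2
    out[1::2, 1::2] = (ll - lh - hl + hh) / 2
    return out

def haar_dwt(x):
    """Forward counterpart, included only for a round-trip sanity check."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a - b + c - d) / 2
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, lh, hl, hh
```

Because the orthonormal Haar transform matrix is its own inverse up to transposition, `haar_idwt(*haar_dwt(x))` recovers `x` exactly; in the generator, the four subbands are predicted by the network rather than computed from an image.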
2. Synthesis Block Redesign and Depthwise Separable Modulated Convolutions
StyleGAN2's synthesis blocks use two dense 3×3 modulated convolutions and a learned 2× upsampling operation (typically a transposed convolution), summing per-resolution RGB predictions via skip connections. MobileStyleGAN departs from this design as follows:
- Each upsampling block applies a parameter-free IDWT to double resolution, followed by two depthwise separable modulated convolutions (DS-ModConv).
- In DS-ModConv, a 3×3 modulated convolution is factored into:
  - A depthwise 3×3 convolution.
  - A pointwise 1×1 convolution.

Both components are applied linearly (no intermediate nonlinearity). Modulation is pushed to the input activations, $\hat{x}_i = s_i \cdot x_i$, while demodulation rescales the output, $y_j \leftarrow d_j \cdot y_j$, with demodulation coefficients

$$d_j = \frac{1}{\sqrt{\sum_{i,k} \left(s_i \, w_{j,i,k}\right)^2 + \epsilon}},$$

where $s_i$ is the per-input-channel style scale predicted from $w$, $w_{j,i,k}$ are the weights of the effective convolution for output channel $j$, and $\epsilon$ is a small constant for numerical stability.
During inference, demodulation becomes a fixed pointwise scaling: the style-dependent scale $s_i$ inside $d_j$ is replaced with a learned constant ("demodulation fusion"), allowing the coefficient $d_j$ to be folded into the convolution weights $w_{j,i,k}$.
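A minimal NumPy sketch of one DS-ModConv layer under these assumptions may help; the tensor layout and function name are hypothetical, and the demodulation here uses the composed depthwise-pointwise kernel rather than the paper's reference implementation.

```python
import numpy as np

def ds_modconv(x, style, w_dw, w_pw, eps=1e-8):
    """Illustrative depthwise separable modulated convolution.
    x:     (Cin, H, W) input activations
    style: (Cin,) per-channel style scales s_i
    w_dw:  (Cin, 3, 3) depthwise kernels, one per input channel
    w_pw:  (Cout, Cin) pointwise (1x1) mixing weights
    The two convolutions are linear with no nonlinearity between them."""
    cin, h, w = x.shape
    # 1) Modulation is pushed to the input activations: x_i <- s_i * x_i.
    xm = x * style[:, None, None]
    # 2) Depthwise 3x3 convolution (zero padding), per-channel kernels.
    xp = np.pad(xm, ((0, 0), (1, 1), (1, 1)))
    y = np.zeros_like(xm)
    for di in range(3):
        for dj in range(3):
            y += w_dw[:, di, dj][:, None, None] * xp[:, di:di + h, dj:dj + w]
    # 3) Pointwise 1x1 convolution mixes channels.
    z = np.einsum('oc,chw->ohw', w_pw, y)
    # 4) Demodulation at the output: d_j = 1/sqrt(sum_{i,k}(s_i w_{j,i,k})^2 + eps),
    #    computed on the composed (depthwise * pointwise) effective kernel.
    w_eff = w_pw[:, :, None, None] * w_dw[None, :, :, :]       # (Cout, Cin, 3, 3)
    w_eff = w_eff * style[None, :, None, None]                 # modulated weights
    d = 1.0 / np.sqrt((w_eff ** 2).sum(axis=(1, 2, 3)) + eps)  # (Cout,)
    return z * d[:, None, None]
```

Demodulation fusion then amounts to evaluating `d` once with a learned constant in place of `style` and multiplying it into `w_pw` ahead of time, so step 4 disappears at inference.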
Skip-layer RGB heads at each resolution (as in StyleGAN2) are eliminated; instead, there is a single generator head at the final resolution. Auxiliary pixel- or wavelet-level losses are used at intermediate scales to stabilize optimization.
3. Quantitative Comparison with StyleGAN2
The following table summarizes the key quantitative differences in model size, computational cost, and FID for face synthesis at 1024×1024 resolution using FFHQ data (Belousov, 2021):
| Network | #Params (M) | MACs (GMAC) | FID |
|---|---|---|---|
| StyleGAN2 | 28.27 | 143.15 | 2.84 |
| MobileStyleGAN | 8.01 | 15.09 | 7.75 |
MobileStyleGAN achieves a 3.5× parameter reduction (28.27M → 8.01M) and a 9.5× reduction in MACs (143.15G → 15.09G). This is accomplished primarily through the replacement of dense convolutions with depthwise separable alternatives and the shift to wavelet-domain upsampling.
The per-layer multiply-accumulate count illustrates the efficiency gain for a 3×3 convolution over an $H \times W$ feature map with $C_{in}$ input and $C_{out}$ output channels:

Standard 3×3: $H \cdot W \cdot C_{in} \cdot C_{out} \cdot 3^2$

Depthwise separable: $H \cdot W \cdot (C_{in} \cdot 3^2 + C_{in} \cdot C_{out})$

The cost ratio is $\frac{1}{C_{out}} + \frac{1}{3^2}$; with $C_{out} \gg 9$, it approaches $1/9$, i.e. roughly a 9× per-layer reduction.
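The arithmetic above can be checked with a few lines of Python; the channel and spatial sizes plugged in here are illustrative, not taken from the paper's layer configuration.

```python
# Per-layer multiply-accumulate (MAC) counts for a 3x3 convolution
# over an H x W feature map.
def macs_standard(h, w, cin, cout, k=3):
    return h * w * cin * cout * k * k

def macs_depthwise_separable(h, w, cin, cout, k=3):
    # depthwise (cin * k * k) plus pointwise (cin * cout) per output pixel
    return h * w * (cin * k * k + cin * cout)

h = w = 64
cin = cout = 512  # illustrative channel count
ratio = macs_depthwise_separable(h, w, cin, cout) / macs_standard(h, w, cin, cout)
print(f"DS/standard MAC ratio: {ratio:.4f}")  # prints 0.1131, i.e. 1/9 + 1/512
```

The ratio is independent of the spatial size and approaches 1/9 as the output channel count grows, matching the roughly 9× per-layer reduction cited above.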
4. Image Fidelity and Capacity Trade-offs
While MobileStyleGAN introduces significant complexity reductions, there is an observed degradation in generative fidelity as measured by Fréchet Inception Distance (FID). On FFHQ at 1024×1024:
StyleGAN2: FID = 2.84
MobileStyleGAN: FID = 7.75
Despite this numerical increase, visual analysis indicates that MobileStyleGAN preserves high-frequency detail owing to its wavelet-based generation. The knowledge-distillation and GAN-loss-based training pipeline compensates for the decreased model capacity, yielding qualitatively comparable images within a 10× smaller computational budget (Belousov, 2021).
5. Complexity-Reduction Mechanisms
MobileStyleGAN's reduction in resource requirements is achieved via the following mechanisms:
- Wavelet (DWT/IDWT) representation: the network synthesizes images in the multiresolution, multiband wavelet domain, allowing computational effort to be concentrated on salient frequencies.
- Depthwise separable modulated convolutions: replacing dense operations with DS-ModConv yields an approximate 9× reduction in per-layer MACs.
- Demodulation fusion: freezes the per-sample demodulation at inference, permitting its incorporation into the model weights for reduced run-time overhead.
- Simplification of generator heads: multiple skip-layer prediction heads are replaced by a single final prediction layer augmented with intermediate auxiliary losses.
- No further pruning or quantization: the current architecture does not use weight pruning or quantization; these remain potential future extensions (Belousov, 2021).
6. On-Device Inference Performance and Deployment Considerations
MobileStyleGAN is explicitly designed for deployment on edge devices, yielding dramatic improvements in latency and memory footprint:
- On an Intel Core i5-8279U CPU (PyTorch):
  - StyleGAN2: 4.3 s / sample
  - MobileStyleGAN: 1.2 s / sample
- On the same CPU with OpenVINO optimization, MobileStyleGAN achieves 0.16 s / sample.
- Model download size and peak memory use decrease by 3.5×.
A trade-off is observed: while inference is up to 27× faster (depending on runtime environment), FID degrades to 7.75. Nonetheless, generated faces retain high perceptual quality suitable for many on-device use cases (Belousov, 2021).
7. Summary and Significance
MobileStyleGAN demonstrates that migrating synthesis to the wavelet domain, leveraging depthwise separable modulated convolutions with fixed demodulation at inference, and simplifying the generator heads collectively yield a style-based GAN that is 3.5× smaller and requires 9.5× less computation than StyleGAN2. Although there is a quantifiable increase in FID, visual quality remains robust, supporting high-fidelity face generation within the strict resource budgets of edge deployment scenarios (Belousov, 2021).