- The paper introduces AudioStyleGAN (ASGAN), a novel GAN that leverages StyleGAN-inspired architecture to outperform diffusion models in unconditional speech synthesis.
- It employs Fourier feature layers, styled convolutions, and anti-aliasing filters to achieve latent space disentanglement and enable zero-shot voice conversion.
- Evaluations on the Google Speech Commands dataset show improved synthesis quality, diversity, and faster inference compared to state-of-the-art methods.
Exploring AudioStyleGAN: Advancements in Unconditional Speech Synthesis
The paper presents AudioStyleGAN (ASGAN), a novel generative adversarial network (GAN) tailored for unconditional speech synthesis. This work proposes a shift away from the prevalent use of diffusion models in the field, arguing that with appropriate design and training strategies, GANs can match or exceed diffusion-based performance while training and sampling faster.
Model Architecture and Methodology
ASGAN is inspired by the StyleGAN family from image synthesis, adapting its principles to audio. The model follows the traditional GAN setup, comprising a generator and a discriminator. The generator uses a latent mapping network that transforms Gaussian noise into a disentangled latent space, from which a sequence of audio features is generated. It incorporates Fourier feature layers and styled convolutional blocks, adapting StyleGAN3's techniques to audio-specific challenges such as signal aliasing.
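The two key ingredients here, a mapping network and style-modulated convolutions, can be sketched in miniature. The shapes, layer counts, and activation below are illustrative assumptions, not the paper's actual configuration; the modulation/demodulation step follows the general StyleGAN recipe of scaling conv weights per input channel by a style vector, then renormalizing each output channel:

```python
import math
import random

def mapping_network(z, layers, slope=0.2):
    """Map a Gaussian noise vector z to a latent w via a tiny MLP.
    `layers` is a list of weight matrices; leaky-ReLU activation is an
    illustrative choice, not necessarily the paper's."""
    h = z
    for W in layers:
        pre = [sum(wij * hj for wij, hj in zip(row, h)) for row in W]
        h = [v if v > 0 else slope * v for v in pre]
    return h

def modulate_demodulate(weights, style, eps=1e-8):
    """Scale each input-channel weight by the style vector (modulation),
    then renormalize each output channel to unit L2 norm (demodulation),
    keeping activation variance roughly constant."""
    modulated = [[wij * sj for wij, sj in zip(row, style)] for row in weights]
    demod = []
    for row in modulated:
        sigma = math.sqrt(sum(v * v for v in row) + eps)
        demod.append([v / sigma for v in row])
    return demod

random.seed(0)
z = [random.gauss(0.0, 1.0) for _ in range(4)]
layers = [[[random.gauss(0.0, 0.5) for _ in range(4)] for _ in range(4)]]
w = mapping_network(z, layers)                    # style vector
conv_w = [[random.gauss(0.0, 0.5) for _ in range(4)] for _ in range(3)]
styled = modulate_demodulate(conv_w, w)
print([round(math.sqrt(sum(v * v for v in row)), 3) for row in styled])
```

After demodulation, every output-channel row of the styled weights has (approximately) unit norm, which is what stabilizes the styled convolutions as the latent code varies.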
Anti-aliasing filters integrated with the modulated convolutions suppress aliasing artifacts in the generated signal, while an adaptive discriminator update mechanism stabilizes the training process. Together, these components enable the generation of high-quality, diverse speech samples from noise vectors.
Evaluation and Results
ASGAN's performance is rigorously evaluated using an array of metrics, such as Inception Score, Fréchet Inception Distance, and various measures of latent space disentanglement. The model is tested on the Google Speech Commands dataset, demonstrating improved synthesis quality and diversity compared to current state-of-the-art models like SaShiMi and DiffWave. Notably, both ASGAN's mel-spectrogram and HuBERT feature variants surpass previous approaches in speech naturalness, as evidenced by subjective Mean Opinion Scores.
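Of the metrics above, Inception Score is the simplest to sketch: it rewards samples that a classifier labels confidently (quality) while the marginal label distribution stays broad (diversity). The toy probability vectors below stand in for a trained classifier's outputs, which is an assumption for illustration:

```python
import math

def inception_score(probs):
    """probs: per-sample class-probability vectors p(y|x).
    IS = exp(E_x[KL(p(y|x) || p(y))]), where p(y) is the marginal
    over all samples."""
    n, k = len(probs), len(probs[0])
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    kls = [sum(p[j] * math.log(p[j] / marginal[j]) for j in range(k) if p[j] > 0)
           for p in probs]
    return math.exp(sum(kls) / n)

# Confident and diverse predictions score high; uniform predictions score 1.
sharp = [[0.98, 0.01, 0.01], [0.01, 0.98, 0.01], [0.01, 0.01, 0.98]]
flat = [[1 / 3, 1 / 3, 1 / 3]] * 3
print(round(inception_score(sharp), 2))
print(round(inception_score(flat), 2))   # → 1.0
```

For speech, the "Inception" classifier is replaced by an audio classifier trained on the same command vocabulary, but the formula is unchanged.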
Furthermore, ASGAN exhibits considerable speed advantages. It generates speech samples significantly faster than autoregressive and diffusion models because its convolutional generator produces an utterance in a single forward pass, rather than through sequential or iterative sampling.
Latent Space Disentanglement: Unlocked Capabilities
ASGAN's design emphasizes disentanglement in the latent space. This property lets the model extend beyond unconditional synthesis to tasks such as zero-shot voice conversion and speech editing. Through linear manipulations in its latent space, ASGAN performs these tasks without any additional training, converting voices or editing content with considerable flexibility. The paper demonstrates these capabilities primarily through qualitative examples rather than comprehensive quantitative evaluation.
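A hedged sketch of the kind of linear latent arithmetic this enables: given latents for a source utterance and a reference speaker (the inversion step that produces them is assumed and not shown), voice conversion amounts to recombining coordinates. Which coordinates encode speaker identity is an illustrative assumption here; in a disentangled space, such axes could be located with simple linear probes:

```python
def convert_voice(w_content, w_speaker, speaker_dims):
    """Take speaker-related coordinates from the reference latent and all
    other coordinates from the source utterance's latent. The choice of
    `speaker_dims` is hypothetical, for illustration only."""
    return [w_speaker[i] if i in speaker_dims else w_content[i]
            for i in range(len(w_content))]

w_src = [0.2, -1.1, 0.5, 0.9]   # latent of the source utterance (content)
w_ref = [1.3, 0.4, -0.7, 0.0]   # latent of the reference speaker
print(convert_voice(w_src, w_ref, speaker_dims={0, 2}))  # → [1.3, -1.1, -0.7, 0.9]
```

The appeal of this formulation is that the generator itself is untouched: conversion and editing reduce to cheap vector operations in the latent space, which is only possible because that space is disentangled.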
Implications and Future Directions
The introduction of ASGAN underscores the potential of GANs to re-enter the forefront of speech synthesis research. The success of ASGAN may inspire further exploration into GAN-based approaches, potentially catalyzing advancements in other audio applications as well.
However, ASGAN's current design limits it to fixed-length utterances, suggesting the need for scalable solutions capable of generating coherent long-form speech. Future research may focus on extending ASGAN to variable-length generation and on applying it to larger, more diverse speech datasets. Additional work could also refine the model's task-transfer capabilities, providing a quantitative assessment of its performance on unseen tasks.
In conclusion, this research paper positions ASGAN as a compelling alternative to diffusion models in speech synthesis, highlighting its efficiency, quality, and versatility. The model represents a significant step towards more adaptable and resource-efficient speech synthesis solutions.