HiFi-GAN: High-Fidelity Neural Vocoder
- HiFi-GAN is a high-fidelity neural vocoder that uses a GAN-based non-autoregressive approach to convert mel-spectrograms into natural audio with a compact parameter footprint.
- Its generator architecture features convolutional upsampling and multi-receptive-field modules, while parallel discriminators capture both fine temporal details and long-range dependencies.
- Extended variants provide controllable speaking rate and pitch, making HiFi-GAN versatile for applications like text-to-speech, denoising, dereverberation, and hybrid modeling.
HiFi-GAN is a high-fidelity, computationally efficient, non-autoregressive neural vocoder framework based on generative adversarial networks (GANs). It is designed for mapping low-dimensional acoustic representations (typically mel-spectrograms) into high-quality, natural audio waveforms at high speed with a small parameter footprint. Originally proposed for speech synthesis and mel-spectrogram inversion, HiFi-GAN’s modular architecture and loss functions enable strong generalization, fast inference, and integration into a range of speech generation systems, including text-to-speech, denoising, dereverberation, and hybrid modeling pipelines. The framework has evolved with variants supporting controllable speaking rate, pitch, and improved time-frequency discrimination.
1. Generator and Discriminator Architectures
1.1 Generator Design
HiFi-GAN’s generator is a fully convolutional feed-forward network (no recurrences or large fully connected layers) that takes an acoustic feature sequence (typically an 80-band mel-spectrogram) and synthesizes the corresponding waveform. The key components are:
- Pre-Net: An initial 1D convolution projects the mel-spectrogram to a hidden representation.
- Upsampling Backbone: Four upsampling blocks, each consisting of transposed convolutions, sequentially upscale the time axis to match the audio sample rate. For example, upsampling factors may be $(8, 8, 2, 2)$, for a total upscaling of $256\times$, matching a mel-spectrogram hop size of 256 samples (Xin et al., 2022, Kong et al., 2020).
- Multi-Receptive-Field (MRF) Modules: After each upsampling, a stack of residual blocks (commonly three, with kernel sizes such as 3, 7, and 11), each with different dilation rates and kernel sizes, captures diverse temporal contexts. Each such block contains:
- A dilated convolution (with increasing dilations, e.g., $1, 3, 5$)
- Leaky ReLU activations and further convolutions for local mixing.
- Output Projection: A final convolution outputs the time-domain waveform.
This configuration gives the generator a large receptive field and the ability to model both local and long-range acoustic dependencies (Kong et al., 2020, Yoneyama et al., 2022).
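The upsampling arithmetic above can be sketched in plain Python (a minimal illustration assuming the common V1-style configuration with a hop size of 256; the factor values are examples, not the only valid choice):

```python
# Sketch of HiFi-GAN's upsampling arithmetic (illustrative values):
# each transposed-convolution block stretches the time axis by its factor,
# so the product of factors must equal the mel-spectrogram hop size.

UPSAMPLE_FACTORS = (8, 8, 2, 2)   # common V1-style configuration
HOP_SIZE = 256                    # waveform samples per mel frame

def output_length(num_mel_frames: int) -> int:
    """Number of waveform samples produced for a given number of mel frames."""
    length = num_mel_frames
    for factor in UPSAMPLE_FACTORS:
        length *= factor          # transposed conv with stride == factor
    return length

# The product of the upsampling factors recovers the hop size,
# so one mel frame maps to exactly HOP_SIZE waveform samples.
product = 1
for f in UPSAMPLE_FACTORS:
    product *= f
assert product == HOP_SIZE

# e.g. a 100-frame mel-spectrogram yields 25,600 samples (~1.16 s at 22.05 kHz)
assert output_length(100) == 25600
```

This is why the choice of upsampling factors is tied directly to the feature extraction settings: any mismatch between their product and the hop size would desynchronize features and waveform.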
1.2 Discriminators
HiFi-GAN employs two types of parallel discriminators in a GAN setting:
- Multi-Period Discriminator (MPD): Comprises multiple sub-discriminators (typically five, for periods $p \in \{2, 3, 5, 7, 11\}$), each reshaping the 1D waveform of length $T$ into a 2D tensor of shape $(\lceil T/p \rceil \times p)$ and applying 2D CNNs. This design allows explicit modeling of periodic structures, capturing sinusoidal and harmonic content critical for naturalness (Kong et al., 2020, Xin et al., 2022).
- Multi-Scale Discriminator (MSD): Three 1D convolutional networks process the waveform at original, half, and quarter sample rates to capture both fine temporal details and global structure.
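The MPD's period reshaping can be illustrated in plain Python (a simplified sketch; actual implementations pad the waveform tensor and convolve across the resulting 2D grid):

```python
def reshape_for_period(waveform, period):
    """Reshape a 1D waveform into a 2D (frames x period) grid, zero-padding
    the tail so that samples spaced `period` apart align in columns."""
    pad = (-len(waveform)) % period          # samples needed to fill the last row
    padded = list(waveform) + [0.0] * pad
    return [padded[i:i + period] for i in range(0, len(padded), period)]

# With period 3, every third sample falls in the same column, so a 2D
# convolution over this grid directly sees period-3 structure in the signal.
grid = reshape_for_period([1, 2, 3, 4, 5, 6, 7], period=3)
# grid == [[1, 2, 3], [4, 5, 6], [7, 0.0, 0.0]]
```

Using prime periods ensures the sub-discriminators inspect disjoint sets of periodicities rather than redundant multiples of one another.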
Some extensions augment this setup with additional time-frequency discriminators using representations such as multi-scale sub-band Constant-Q Transform (MS-SB-CQT) (Gu et al., 2023).
Discriminator Losses
Discriminators are trained using either the hinge loss or the least-squares GAN (LS-GAN) objective. With LS-GAN, each discriminator $D$ minimizes
$$\mathcal{L}_D = \mathbb{E}_{(x, s)}\big[(D(x) - 1)^2 + D(G(s))^2\big],$$
while the adversarial generator loss is
$$\mathcal{L}_{\mathrm{Adv}}(G; D) = \mathbb{E}_{s}\big[(D(G(s)) - 1)^2\big],$$
where $x$ is the reference waveform and $s$ its mel-spectrogram.
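A toy numerical sketch of the LS-GAN objectives, with scalar score lists standing in for the discriminator's per-timestep output maps:

```python
def lsgan_d_loss(d_real, d_fake):
    """Discriminator loss: push scores on real audio toward 1, fake toward 0."""
    n_r, n_f = len(d_real), len(d_fake)
    return (sum((r - 1.0) ** 2 for r in d_real) / n_r
            + sum(f ** 2 for f in d_fake) / n_f)

def lsgan_g_adv_loss(d_fake):
    """Adversarial generator loss: push scores on generated audio toward 1."""
    return sum((f - 1.0) ** 2 for f in d_fake) / len(d_fake)

# A perfect discriminator (real -> 1, fake -> 0) has zero loss ...
assert lsgan_d_loss([1.0, 1.0], [0.0, 0.0]) == 0.0
# ... but then maximally penalizes the generator.
assert lsgan_g_adv_loss([0.0, 0.0]) == 1.0
```

Compared with the saturating log-loss of the original GAN, the least-squares form keeps gradients informative even when the discriminator confidently rejects generated samples.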
2. Loss Functions and Training Protocols
HiFi-GAN’s generator and discriminators are optimized with a composite objective, combining adversarial losses and feature-level metrics to guide both fidelity and perceptual alignment:
- Adversarial Loss: From MPD and MSD; either hinge or LSGAN (Kong et al., 2020, Yoneyama et al., 2022).
- Feature-Matching Loss: L1 distance between discriminator feature maps for real and generated audio:
$$\mathcal{L}_{\mathrm{FM}}(G; D) = \mathbb{E}_{(x, s)}\Big[\sum_{i=1}^{T} \frac{1}{N_i} \big\lVert D^i(x) - D^i(G(s)) \big\rVert_1\Big],$$
where $T$ is the number of discriminator layers, $D^i$ denotes the features of the $i$-th layer, and $N_i$ their number.
This term stabilizes training and enforces perceptually salient structure.
- Mel-Spectrogram L1 Loss: L1 norm between generated and reference mel-spectrograms, ensuring correspondence at the acoustic feature level.
- Overall Generator Loss:
$$\mathcal{L}_G = \sum_{k} \big[\mathcal{L}_{\mathrm{Adv}}(G; D_k) + \lambda_{\mathrm{fm}} \mathcal{L}_{\mathrm{FM}}(G; D_k)\big] + \lambda_{\mathrm{mel}} \mathcal{L}_{\mathrm{Mel}}(G),$$
summed over all MPD and MSD sub-discriminators $D_k$.
Hyperparameters commonly used are $\lambda_{\mathrm{fm}} = 2$ and $\lambda_{\mathrm{mel}} = 45$ (Kong et al., 2020, Xin et al., 2022).
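The composite objective reduces to a weighted sum, sketched below with the commonly cited weights (the loss values themselves are placeholders, not real training numbers):

```python
LAMBDA_FM = 2.0    # feature-matching weight (Kong et al., 2020)
LAMBDA_MEL = 45.0  # mel-spectrogram weight (Kong et al., 2020)

def generator_loss(adv_losses, fm_losses, mel_loss):
    """Total generator loss: adversarial and feature-matching terms are summed
    over every MPD/MSD sub-discriminator; the mel loss is one global term."""
    return (sum(adv_losses)
            + LAMBDA_FM * sum(fm_losses)
            + LAMBDA_MEL * mel_loss)

# Placeholder values for, e.g., 8 sub-discriminators (5 MPD + 3 MSD):
adv = [0.5] * 8
fm = [0.1] * 8
total = generator_loss(adv, fm, mel_loss=0.2)
# 8*0.5 + 2*(8*0.1) + 45*0.2 = 4.0 + 1.6 + 9.0 = 14.6
```

The large mel weight reflects that the spectrogram term dominates early training, while the adversarial terms refine fine-grained waveform detail later on.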
The training protocol typically involves data augmentation, multi-stage adversarial schedules, and Adam/AdamW optimizers. HiFi-GAN is trained on paired audio–mel-spectrogram data and can generalize to unseen speakers, languages, and distortions with robust performance (Mascardi et al., 2020, Srivastava et al., 2023).
3. Architectural Variants and Controllability Extensions
3.1 Speaking-Rate Controllable HiFi-GAN
A differentiable, parameter-free “feature interpolation” layer enables speaking-rate control without modifying the core generator or training process (Xin et al., 2022). This interpolation can be inserted:
- After the mel-spectrogram input, or
- Between any upsampling blocks.
Two interpolation types are used:
- 1D Bandlimited Signal Resampling: Sinc interpolation with a windowed filter.
- 2D Linear (Image-style) Interpolation: Bilinear scaling on the temporal axis.
Let $\mathbf{h} \in \mathbb{R}^{C \times T}$ be a feature map with $T$ time steps and $\alpha > 0$ the rate scaling factor; the interpolated feature is $\mathbf{h}' = \mathrm{Interp}_{\alpha}(\mathbf{h}) \in \mathbb{R}^{C \times \lceil T/\alpha \rceil}$, obtained by resampling $\mathbf{h}$ along the time axis (so $\alpha > 1$ yields faster speech).
This modification allows smooth, real-time control of speech tempo at inference. Empirically, image-style interpolation of input mel-spectrograms yields the best fidelity (MCD ≈ 2.20 dB; MOS comparable to unmodified HiFi-GAN) and negligible overhead (Xin et al., 2022).
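The image-style variant amounts to linear resampling along the time axis. A minimal pure-Python sketch for a single feature channel (real systems interpolate multi-channel feature maps in one tensor operation, and the bandlimited variant uses a windowed sinc filter instead):

```python
def linear_time_interp(frames, rate):
    """Linearly resample a 1D feature sequence along time.
    rate > 1 shortens the sequence (faster speech); rate < 1 lengthens it."""
    src_len = len(frames)
    out_len = max(1, round(src_len / rate))
    out = []
    for i in range(out_len):
        # Map each output index back to a (fractional) source position.
        pos = i * (src_len - 1) / max(1, out_len - 1)
        lo = int(pos)
        hi = min(lo + 1, src_len - 1)
        frac = pos - lo
        out.append(frames[lo] * (1.0 - frac) + frames[hi] * frac)
    return out

# Halving the rate doubles the number of frames (slower speech):
slow = linear_time_interp([0.0, 1.0, 2.0, 3.0], rate=0.5)
assert len(slow) == 8
```

Because the operation is differentiable and parameter-free, it can be dropped between any two layers without retraining the surrounding network.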
3.2 Source-Filter HiFi-GAN for Pitch Control
This variant incorporates source-filter decomposition into the HiFi-GAN generator (Yoneyama et al., 2022). The architecture splits into a source-network (generating pitch-synchronous excitations from sine/F₀ signals) and a filter-network (a HiFi-GAN generator modified to hierarchically fuse source features). This yields robust F₀ control under both copy synthesis and extreme pitch shifting, outperforming vanilla HiFi-GAN and uSFGAN in MOS and objective error while matching or surpassing WORLD and hn-uSFGAN in inference speed.
Key regularization: a spectral loss between the reference and predicted excitation spectra, which anchors the source network's output to a physically meaningful excitation signal (Yoneyama et al., 2022).
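The sine-based excitation driving the source network can be sketched as follows (a simplified illustration; practical systems add noise components for unvoiced regions and upsample a frame-level F₀ contour to sample rate first):

```python
import math

def sine_excitation(f0_per_sample, sample_rate=22050):
    """Generate a sine excitation by accumulating instantaneous phase from a
    per-sample F0 contour; unvoiced samples (f0 == 0) emit silence here."""
    phase = 0.0
    out = []
    for f0 in f0_per_sample:
        if f0 <= 0.0:            # unvoiced: no periodic excitation
            out.append(0.0)
            continue
        phase += 2.0 * math.pi * f0 / sample_rate
        out.append(math.sin(phase))
    return out

# A constant 220 Hz contour produces one full period every
# sample_rate / 220 ~ 100 samples at 22.05 kHz.
exc = sine_excitation([220.0] * 1000)
assert len(exc) == 1000
```

Because pitch is injected explicitly through this excitation rather than learned implicitly, scaling the F₀ contour at inference shifts the output pitch directly.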
3.3 Time-Frequency Discriminator Innovations
The MS-SB-CQT discriminator, based on multi-scale constant-Q transforms and sub-band octaves, is integrated either as a sole or auxiliary discriminator to enhance harmonic modeling, pitch accuracy, and MOS, especially in singing voice synthesis (Gu et al., 2023).
4. Applications and Integrations
HiFi-GAN is deployed across diverse speech technology domains:
- Neural Vocoding: The canonical use case involves mel-spectrogram inversion in neural TTS pipelines (e.g., as in FastSpeech2, JETS), yielding audio with MOS close to natural speech (Lim et al., 2022).
- Denoising and Dereverberation: HiFi-GAN can operate as a waveform-enhancing model, outperforming prior GAN and neural methods on both speech quality (PESQ, STOI) and naturalness (Su et al., 2020).
- Hybrid HMM–Neural TTS: High-fidelity synthesis with small computational footprint is achieved by coupling HMM-based feature generation with HiFi-GAN waveform synthesis, suitable for resource-limited applications (e.g., Indic languages with 72 MB total footprint, DMOS ≈ 4.0) (Srivastava et al., 2023).
- Joint Acoustic-Generative Modeling: In systems like JETS, HiFi-GAN is integrated with upstream acoustic and alignment modules, enabling fully end-to-end text-to-waveform training with improved naturalness and lower error rates versus cascaded pipelines (Lim et al., 2022).
- Speaking Rate and Pitch Control: The feature interpolation and source-filter HiFi-GAN variants provide fine-grained, real-time prosody manipulation, facilitating accessible and expressive TTS (Xin et al., 2022, Yoneyama et al., 2022).
5. Experimental Benchmarks
Extensive empirical evaluation demonstrates HiFi-GAN’s competitiveness in both objective and subjective criteria:
| Metric | Original HiFi-GAN | HiFi-GAN + Interpolation | WSOLA (baseline) | HiFi-GAN + Source-Filter |
|---|---|---|---|---|
| MOS (TTS, LJS) | 4.36 ± 0.07 | ≈4.2 | <4.0 | ≤3.89 ± 0.05 |
| Inference RTF | 0.01 (V100 GPU) | ≈0.01 | ≫0.01 | 0.63 (EPYC, CPU) |
| Parameters | ≈3.8 M–14 M | Same | n/a | 9.7 M |
| MCD (dB) | – | 2.20 (best) | 2.37 | – |
Subjective tests consistently favor HiFi-GAN (or its modulated variants) over classical waveform modification and other neural vocoders for moderate speaking rate or pitch transformations (Xin et al., 2022, Yoneyama et al., 2022). Joint training with MS-SB-CQT discriminators further boosts MOS to 3.87 (seen singers) and 3.78 (unseen) (Gu et al., 2023).
6. Limitations and Future Directions
Recognized limitations include:
- Performance drops at extreme rate or pitch shifts for all vocoders, influenced by listener unfamiliarity and missing training modes (Xin et al., 2022, Yoneyama et al., 2022).
- Pitch control in original HiFi-GAN is limited without explicit source conditioning.
- Small models (HiFi-GAN V2, V3) slightly underperform V1 in MOS but offer dramatically smaller footprints and real-time CPU operation (Kong et al., 2020).
Future research is indicated in:
- Joint fine-tuning of time-warp/interpolation layers for extreme prosodic modulation (Xin et al., 2022).
- Expanding applicability of discriminators leveraging variable time-frequency resolutions (e.g., CQT/STFT joint) (Gu et al., 2023).
- Generalizing source-filter HiFi-GAN architectures across TTS and SVS with multilingual and expressive control.
- Quantization/pruning for further mobile deployment and robust alignment learning in end-to-end systems (Kong et al., 2020, Lim et al., 2022).
7. Context and Impact
HiFi-GAN marks a significant advance over prior GAN-based and neural vocoders by integrating effective periodicity modeling (via MPD), large receptive-field convolutions, and strong feature-matching constraints. The architecture achieves a balance of audio fidelity, inference speed (up to 1,000× real-time), and compactness (≤1 M parameters for V2), outperforming flow-based (WaveGlow) or autoregressive (WaveNet) neural vocoders in both MOS and efficiency. Its extensibility enables rapid progress in controllable and high-fidelity speech synthesis, as evidenced by wide adoption in modern TTS, SVS, and speech enhancement pipelines (Kong et al., 2020, Lim et al., 2022, Gu et al., 2023, Xin et al., 2022, Yoneyama et al., 2022, Srivastava et al., 2023).