Perceptually Weighted MR-STFT Loss
- The paper introduces a perceptually weighted MR-STFT loss that integrates a frequency-dependent masking filter to reduce reconstruction errors in critical spectral regions, achieving up to 0.24 MOS improvements.
- The methodology employs multi-resolution STFT settings and psychoacoustic principles to penalize errors in perceptually salient areas, such as the spectral valleys between formants.
- The integration within Parallel WaveGAN training demonstrates measurable gains in Log-Spectral Distance and subjective naturalness while preserving inference speed and model compactness.
Perceptually Weighted MR-STFT (Multi-Resolution Short-Time Fourier Transform) refers to a spectrogram-based loss criterion refined by a psychoacoustic weighting, enabling generative models in text-to-speech (TTS) systems to better align with human auditory perception. The method, introduced and evaluated in Parallel WaveGAN-based vocoders, integrates a frequency-dependent masking filter into a conventional multi-resolution STFT loss. This design penalizes errors in spectro-temporal regions most perceptually salient to listeners, particularly between formant frequencies where masking is minimal and the ear is most sensitive. The technique was developed to reduce auditory noise and enhance the perceived naturalness of generated speech, delivering measurable improvements in both objective and subjective metrics without increasing inference cost (Song et al., 2021).
1. Foundation of MR-STFT Loss
The Parallel WaveGAN generator is optimized using a combination of adversarial loss and MR-STFT loss. Let $x$ denote a real waveform and $\hat{x} = G(z, c)$ a generated waveform, conditioned on input acoustic features $c$ and noise $z$. Their respective STFT magnitudes, $|S(t, f)|$ and $|\hat{S}(t, f)|$, are computed at each time frame $t$ and frequency bin $f$.
The multi-resolution STFT loss is defined as an average of losses at $M$ different STFT resolutions:

$$\mathcal{L}_{\mathrm{aux}} = \frac{1}{M} \sum_{m=1}^{M} \mathcal{L}_{s}^{(m)}$$

where each $\mathcal{L}_{s}^{(m)} = \mathbb{E}_{x,\hat{x}}\left[ L_{\mathrm{sc}} + L_{\mathrm{mag}} \right]$ is an expectation of two terms:
- Spectral Convergence (SC):

$$L_{\mathrm{sc}}(x, \hat{x}) = \frac{\left\| \, |S| - |\hat{S}| \, \right\|_F}{\left\| \, |S| \, \right\|_F}$$

- Log-Magnitude (Mag):

$$L_{\mathrm{mag}}(x, \hat{x}) = \frac{1}{N} \left\| \, \log|S| - \log|\hat{S}| \, \right\|_1$$

where $\|\cdot\|_F$ is the Frobenius norm, $\|\cdot\|_1$ the $L_1$ norm, and $N$ the number of time-frequency bins. In practice, $M = 3$ is used, with FFT sizes of (512, 1024, 2048), window lengths of (240, 600, 1200) samples, and frame shifts of (50, 120, 240) samples. This approach ensures the generator is sensitive to errors at distinct time-frequency granularities.
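For concreteness, the three-resolution loss can be sketched in NumPy as follows; `stft_mag`, `stft_loss`, and `mr_stft_loss` are illustrative names, and the batch expectation is omitted:

```python
import numpy as np

def stft_mag(x, fft_size, win_len, hop):
    """Magnitude STFT: Hann-windowed frames, zero-padded to fft_size."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop : i * hop + win_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, n=fft_size, axis=1))  # (frames, bins)

def stft_loss(x, x_hat, fft_size, win_len, hop, eps=1e-7):
    """Spectral-convergence plus log-magnitude terms at one resolution."""
    S = stft_mag(x, fft_size, win_len, hop)
    S_hat = stft_mag(x_hat, fft_size, win_len, hop)
    sc = np.linalg.norm(S - S_hat) / (np.linalg.norm(S) + eps)
    mag = np.mean(np.abs(np.log(S + eps) - np.log(S_hat + eps)))
    return sc + mag

def mr_stft_loss(x, x_hat):
    """Average the single-resolution losses over the paper's three settings."""
    configs = [(512, 240, 50), (1024, 600, 120), (2048, 1200, 240)]
    return np.mean([stft_loss(x, x_hat, *cfg) for cfg in configs])
```

Because both terms vanish for identical waveforms, the loss is zero at the optimum and grows with spectral mismatch at any of the three granularities.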
2. Psychoacoustic Perceptual Weighting Design
A time-invariant masking filter $H(z)$ is derived by linear prediction (LP), targeting perceptually significant frequency regions, principally the “spectral valleys” between formants. The process is as follows:
- All training spectra are converted to line-spectral-frequency (LSF) vectors of a fixed LP order $p$.
- These LSF vectors are averaged across utterances.
- The mean LSF vector is converted back to LP coefficients $\{\alpha_m\}_{m=1}^{p}$.
- The resulting filter in the $z$-domain is

$$H(z) = 1 - \sum_{m=1}^{p} \alpha_m z^{-m}$$

The magnitude response $|H(e^{j\omega})|$ is mapped to the STFT frequency bins, generating a weight matrix $W$ in which every time frame shares the same spectral row. Linear normalization restricts all filter values to a range bounded away from zero, ensuring no frequencies are entirely suppressed.
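The mask construction can be sketched as below, assuming mean LP coefficients are already available; the floor `w_min=0.5` is an illustrative choice, not the paper's exact normalization range:

```python
import numpy as np

def perceptual_weight(lp_coeffs, fft_size, w_min=0.5):
    """Magnitude response of H(z) = 1 - sum_m alpha_m z^{-m} on rFFT bins,
    linearly normalized into [w_min, 1] so no band is fully suppressed.
    (w_min is an assumed floor for illustration.)"""
    a = np.concatenate(([1.0], -np.asarray(lp_coeffs, dtype=float)))
    mag = np.abs(np.fft.rfft(a, n=fft_size))          # |H(e^{jw})| per bin
    w = w_min + (1.0 - w_min) * (mag - mag.min()) / (mag.max() - mag.min())
    return w  # shape (fft_size // 2 + 1,), shared by every time frame
```

Because the weight is time-invariant, a single row vector suffices; broadcasting it against a `(frames, bins)` spectrogram reproduces the matrix $W$ described above.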
3. Perceptually Weighted MR-STFT Loss
The perceptual weight matrix $W$ is integrated into both the SC and Mag terms, yielding:
- Weighted SC:

$$L_{\mathrm{sc}}^{w}(x, \hat{x}) = \frac{\left\| \, W \odot \left( |S| - |\hat{S}| \right) \right\|_F}{\left\| \, |S| \, \right\|_F}$$

- Weighted Mag:

$$L_{\mathrm{mag}}^{w}(x, \hat{x}) = \frac{1}{N} \left\| \, W \odot \left( \log|S| - \log|\hat{S}| \right) \right\|_1$$

The weighted MR-STFT loss at each resolution becomes:

$$\mathcal{L}_{s}^{w,(m)} = \mathbb{E}_{x, \hat{x}}\!\left[ L_{\mathrm{sc}}^{w} + L_{\mathrm{mag}}^{w} \right]$$

and the overall perceptually weighted MR-STFT loss is

$$\mathcal{L}_{\mathrm{aux}}^{w} = \frac{1}{M} \sum_{m=1}^{M} \mathcal{L}_{s}^{w,(m)}$$

where $\odot$ denotes elementwise multiplication.
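Since every time frame shares one spectral row of the mask, applying the weight reduces to a broadcast multiply over a `(frames, bins)` magnitude array; `weighted_sc` and `weighted_mag` below are illustrative names for a NumPy sketch:

```python
import numpy as np

def weighted_sc(S, S_hat, w, eps=1e-7):
    """Weighted spectral convergence: scale each frequency bin by w
    before taking the Frobenius norm of the residual."""
    return np.linalg.norm((S - S_hat) * w) / (np.linalg.norm(S) + eps)

def weighted_mag(S, S_hat, w, eps=1e-7):
    """Weighted log-magnitude distance, averaged over all T-F bins."""
    return np.mean(np.abs(np.log(S + eps) - np.log(S_hat + eps)) * w)
```

With weights below one in heavily masked bands, errors there contribute less, steering the generator toward the perceptually exposed valleys.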
4. Integration into Parallel WaveGAN Training
The generator objective combines the weighted MR-STFT loss with the adversarial loss of the least-squares GAN (LSGAN) framework:
$$\mathcal{L}_{G} = \mathcal{L}_{\mathrm{aux}}^{w} + \lambda_{\mathrm{adv}} \, \mathcal{L}_{\mathrm{adv}}(G, D)$$

with $\lambda_{\mathrm{adv}} = 4.0$. The adversarial loss is:

$$\mathcal{L}_{\mathrm{adv}}(G, D) = \mathbb{E}_{z}\!\left[ \left( 1 - D(G(z, c)) \right)^2 \right]$$

where the discriminator $D$ is trained to distinguish real from generated waveforms, using the LSGAN objective:

$$\mathcal{L}_{D} = \mathbb{E}_{x}\!\left[ \left( 1 - D(x) \right)^2 \right] + \mathbb{E}_{z}\!\left[ D(G(z, c))^2 \right]$$
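The LSGAN terms can be sketched in NumPy, assuming `d_real` and `d_fake` hold discriminator outputs for a batch; function names and the `lam_adv` default are illustrative:

```python
import numpy as np

def generator_adv_loss(d_fake):
    """LSGAN generator term: push discriminator scores on fakes toward 1."""
    return np.mean((1.0 - d_fake) ** 2)

def discriminator_loss(d_real, d_fake):
    """LSGAN discriminator term: real scores toward 1, fake scores toward 0."""
    return np.mean((1.0 - d_real) ** 2) + np.mean(d_fake ** 2)

def generator_loss(weighted_mr_stft, d_fake, lam_adv=4.0):
    """Total generator objective: weighted MR-STFT plus scaled adversarial term."""
    return weighted_mr_stft + lam_adv * generator_adv_loss(d_fake)
```

The squared-error form replaces the usual cross-entropy GAN objective, which tends to stabilize training of the waveform discriminator.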
5. Implementation Protocols
The system employs the following configuration:
- STFT settings: FFT sizes 512, 1024, 2048; windows of 240/600/1200 samples; hop sizes of 50/120/240 samples.
- Mask filter: constructed from mean training-set LSFs via LP analysis, with the magnitude response linearly normalized so that no frequency band is fully suppressed.
- Model architecture: Generator/Discriminator with 30 residual blocks (dilations 1–8), totaling 1.83M parameters; inference at 50.6× real-time on NVIDIA V100.
- Optimization: RAdam optimizer ($\epsilon = 10^{-6}$); 400k steps; discriminator frozen for the first 100k steps. Batch size 8 with 1-second (24k-sample) utterances; generator learning rate $1 \times 10^{-4}$, discriminator $5 \times 10^{-5}$, both halved every 200k steps.
6. Quantitative and Subjective Evaluation
The perceptually weighted MR-STFT framework yields:
- Log-Spectral Distance (LSD): Consistent reduction (roughly 0.1–0.3 dB) in LSD across the frequency spectrum, notably in perceptually important bands.
- MOS Scores: For Korean female (KRF) and male (KRM) test utterances (mean ± 95% CI):
| Method | KRF MOS (±0.10) | KRM MOS (±0.10) |
|---|---|---|
| Parallel WaveGAN baseline | 4.02 | 4.11 |
| + Perceptual Weighting | 4.26 | 4.21 |
| Real speech (for reference) | ≈ 4.6 | ≈ 4.6 |
MOS gains of 0.24 (KRF) and 0.10 (KRM) are obtained with no alteration to model size or inference speed.
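As a reference point for the objective metric above, one common definition of LSD (the paper's exact variant may differ) takes the RMS log-spectral difference per frame and averages over frames:

```python
import numpy as np

def log_spectral_distance(S, S_hat, eps=1e-7):
    """LSD in dB: RMS over frequency of the log-spectral difference,
    then averaged over time frames. S, S_hat: (frames, bins) magnitudes."""
    d = 20.0 * (np.log10(S + eps) - np.log10(S_hat + eps))
    return np.mean(np.sqrt(np.mean(d ** 2, axis=1)))
```

Identical spectrograms give 0 dB, and a uniform factor-of-two magnitude error yields roughly 6 dB, which puts the reported sub-dB reductions in context.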
7. Significance and Conclusions
The perceptually weighted MR-STFT loss applies a psychoacoustically motivated, frequency-dependent mask to standard MR-STFT criteria in GAN-based speech synthesis. This methodology explicitly optimizes vocoder generators to reduce reconstruction errors in spectral regions where the ear is most sensitive, particularly the valleys between formant peaks. The result is objectively improved spectral fidelity and subjectively more natural speech, with approximately a 0.2 MOS increase, all while preserving computational throughput and model compactness. Quality approaches that of autoregressive WaveNet models with noise-shaping, yet retains the inference speed characteristic of parallel non-autoregressive frameworks (Song et al., 2021).