Complex Spectrogram Enhancement by Convolutional Neural Network with Multi-Metrics Learning
This paper presents a convolutional neural network (CNN) approach to improve speech enhancement techniques by addressing two major challenges: phase estimation and the limitations of single-objective functions. The authors propose a novel CNN model that enhances complex spectrograms by estimating clean real and imaginary (RI) spectrograms from noisy ones. The reconstruction of RI spectrograms allows for the synthesis of speech waveforms with improved phase information. Furthermore, the paper introduces a multi-metrics learning (MML) approach, optimizing both segmental signal-to-noise ratio (SSNR) and log-spectral distortion (LSD) through a unified objective function that incorporates log-power spectrogram (LPS) reconstruction.
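The RI representation described above can be illustrated with a short NumPy sketch. This is only a toy round trip under simplifying assumptions: it uses non-overlapping rectangular frames rather than the overlapping, windowed STFT a real system would use, and the helper names (frame_signal, to_ri, from_ri) and the frame length of 256 are hypothetical choices, not taken from the paper.

```python
import numpy as np

def frame_signal(x, frame_len):
    """Split x into non-overlapping frames (trailing samples are dropped)."""
    n_frames = len(x) // frame_len
    return x[: n_frames * frame_len].reshape(n_frames, frame_len)

def to_ri(x, frame_len=256):
    """Return an array of shape (2, n_frames, n_bins): real and imaginary channels."""
    spec = np.fft.rfft(frame_signal(x, frame_len), axis=1)
    return np.stack([spec.real, spec.imag])

def from_ri(ri, frame_len=256):
    """Rebuild the waveform directly from the RI channels."""
    spec = ri[0] + 1j * ri[1]
    return np.fft.irfft(spec, n=frame_len, axis=1).reshape(-1)

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)          # stand-in for a speech waveform
ri = to_ri(x)
x_rec = from_ri(ri)

# Magnitude and phase are both functions of the RI channels.
magnitude = np.sqrt(ri[0] ** 2 + ri[1] ** 2)
phase = np.arctan2(ri[1], ri[0])
```

Because the two channels jointly determine magnitude and phase, a model that predicts clean RI channels implicitly predicts an enhanced phase, and the waveform can be synthesized without borrowing the noisy phase.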
The research highlights that traditional deep learning methods often neglect phase processing, even though phase has been shown to be crucial for perceptual quality in speech enhancement, as demonstrated in studies by Roux and Paliwal. In contrast to earlier models that focus solely on magnitudes, this study leverages RI spectrograms to jointly enhance speech magnitude and phase.
The proposed CNN model treats the real and imaginary channels of a complex spectrogram much as an image model treats RGB channels, concentrating on local patterns to extract useful features. This contrasts with deep neural networks (DNNs), which fully connect all inputs; the CNN is reported to show superior feature extraction capabilities.
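To make the channel analogy concrete, the following is a minimal sketch of a single convolutional layer applied to a two-channel (real, imaginary) input. It is a naive NumPy loop for illustration only; the function name conv2d_valid, the tensor sizes, and the all-ones weights are hypothetical and do not reflect the paper's architecture.

```python
import numpy as np

def conv2d_valid(x, kernels):
    """Cross-correlate x of shape (in_ch, H, W) with kernels of shape
    (out_ch, in_ch, kh, kw); returns (out_ch, H - kh + 1, W - kw + 1)."""
    out_ch, in_ch, kh, kw = kernels.shape
    _, height, width = x.shape
    out = np.zeros((out_ch, height - kh + 1, width - kw + 1))
    for i in range(height - kh + 1):
        for j in range(width - kw + 1):
            patch = x[:, i:i + kh, j:j + kw]  # local patch across both RI channels
            out[:, i, j] = np.tensordot(kernels, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

# Two input channels (real, imaginary), 5 frames by 6 frequency bins.
x = np.ones((2, 5, 6))
kernels = np.ones((3, 2, 3, 3))        # 3 output feature maps, 3x3 receptive field
out = conv2d_valid(x, kernels)

conv_params = kernels.size             # weights shared across all positions
dense_params = x.size * out.size       # weights a fully connected layer would need
```

Each output value depends only on a small time-frequency neighbourhood spanning both channels, and the same weights are reused at every position. A fully connected layer mapping the same input to the same number of outputs would need far more weights, which is one common explanation for the difference in behaviour between the two model families.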
The multi-metrics learning approach optimizes multiple metrics simultaneously and avoids the trade-offs seen in typical multi-objective optimization problems, because the targets in MML are non-conflicting.
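A weighted-sum objective of this kind might look like the sketch below. It is an assumption-laden illustration: the summary does not give the exact formula, so the weight alpha, the natural-log LPS definition, the stabilizing constant eps, and the function names are all hypothetical.

```python
import numpy as np

def lps_from_ri(ri, eps=1e-8):
    """Log-power spectrogram computed from RI channels of shape (2, frames, bins).
    Natural log and eps are illustrative assumptions."""
    return np.log(ri[0] ** 2 + ri[1] ** 2 + eps)

def mml_loss(ri_est, ri_clean, alpha=0.5):
    """Weighted sum of an RI reconstruction term and an LPS reconstruction term.
    alpha is a hypothetical weight, not a value from the paper."""
    ri_term = np.mean((ri_est - ri_clean) ** 2)
    lps_term = np.mean((lps_from_ri(ri_est) - lps_from_ri(ri_clean)) ** 2)
    return ri_term + alpha * lps_term

rng = np.random.default_rng(0)
ri_clean = rng.standard_normal((2, 4, 129))
ri_est = ri_clean + 0.1                # a slightly wrong estimate

loss_perfect = mml_loss(ri_clean, ri_clean)
loss_ri_only = mml_loss(ri_est, ri_clean, alpha=0.0)
loss_both = mml_loss(ri_est, ri_clean, alpha=0.5)
```

Both terms reach zero at the same point, namely when the estimated RI spectrogram equals the clean one. That shared minimizer is what makes the targets non-conflicting: improving one term does not require sacrificing the other.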
The experimental section uses the TIMIT corpus to validate the findings, with results indicating that the CNN model with multi-metrics learning markedly improves the speech quality and intelligibility metrics LSD, SSNR, STOI, and PESQ. In particular, RI-CNN significantly outperforms traditional DNN models by providing better generalization and improved SSNR.
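For readers unfamiliar with the two metrics tied to the objective function, the sketch below computes SSNR and LSD using commonly used textbook definitions. The frame length, the non-overlapping framing, the clipping range of -10 to 35 dB, and the function names are assumptions for illustration; the paper's exact evaluation settings are not stated in this summary.

```python
import numpy as np

def _frames(x, frame_len):
    """Non-overlapping frames; trailing samples are dropped."""
    n = (len(x) // frame_len) * frame_len
    return x[:n].reshape(-1, frame_len)

def ssnr_db(clean, est, frame_len=256, lo=-10.0, hi=35.0, eps=1e-10):
    """Segmental SNR: per-frame SNR in dB, clipped to [lo, hi], then averaged."""
    c, e = _frames(clean, frame_len), _frames(est, frame_len)
    num = np.sum(c ** 2, axis=1)
    den = np.sum((c - e) ** 2, axis=1)
    snr = 10.0 * np.log10((num + eps) / (den + eps))
    return float(np.mean(np.clip(snr, lo, hi)))

def lsd_db(clean, est, frame_len=256, eps=1e-10):
    """Log-spectral distortion: per-frame RMS difference of log-power spectra (dB)."""
    def lps(x):
        spec = np.fft.rfft(_frames(x, frame_len), axis=1)
        return 10.0 * np.log10(np.abs(spec) ** 2 + eps)
    diff = lps(clean) - lps(est)
    return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=1))))

rng = np.random.default_rng(0)
clean = rng.standard_normal(1024)      # stand-in for a clean utterance
degraded = clean + 0.1 * rng.standard_normal(1024)

ssnr_same, lsd_same = ssnr_db(clean, clean), lsd_db(clean, clean)
ssnr_deg, lsd_deg = ssnr_db(clean, degraded), lsd_db(clean, degraded)
```

Higher SSNR and lower LSD indicate a better match to the clean signal. SSNR is computed on waveforms and LSD on log-power spectra, which is why reconstructing RI spectrograms and LPS lines up with these two metrics.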
A noteworthy part of the paper is the investigation into the impact of phase enhancement under different signal-to-noise ratio (SNR) conditions. The findings suggest that reusing the noisy phase causes greater degradation at low SNR levels, which underlines the need for effective phase processing when synthesizing enhanced speech.
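The effect can be reproduced in miniature with a toy experiment: combine the clean magnitude with the phase of a noisy mixture and measure how close the result is to the clean signal. This is not the paper's experiment. The two-tone test signal, white Gaussian noise, non-overlapping frames, and all function names are assumptions chosen to keep the sketch self-contained.

```python
import numpy as np

def frames_fft(x, frame_len=256):
    """FFT of non-overlapping frames; trailing samples are dropped."""
    n = (len(x) // frame_len) * frame_len
    return np.fft.rfft(x[:n].reshape(-1, frame_len), axis=1)

def snr_db(ref, est):
    return 10.0 * np.log10(np.sum(ref ** 2) / np.sum((ref - est) ** 2))

def clean_mag_noisy_phase(clean, noisy, frame_len=256):
    """Oracle magnitude paired with the noisy phase, then resynthesized."""
    s, y = frames_fft(clean, frame_len), frames_fft(noisy, frame_len)
    mixed = np.abs(s) * np.exp(1j * np.angle(y))
    return np.fft.irfft(mixed, n=frame_len, axis=1).reshape(-1)

rng = np.random.default_rng(0)
t = np.arange(4096) / 16000.0
# Synthetic two-tone signal standing in for speech.
clean = np.sin(2 * np.pi * 440.0 * t) + 0.5 * np.sin(2 * np.pi * 1320.0 * t)
noise = rng.standard_normal(clean.shape)

results = {}
for target_snr in (-5, 5, 15):
    # Scale the same noise realization to hit the requested input SNR.
    scale = np.sqrt(np.sum(clean ** 2) / (np.sum(noise ** 2) * 10 ** (target_snr / 10)))
    noisy = clean + scale * noise
    results[target_snr] = snr_db(clean, clean_mag_noisy_phase(clean, noisy))
```

Even with a perfect magnitude estimate, the reconstruction quality stored in results drops as the input SNR falls, because the noisy phase drifts further from the clean phase. This is the gap that magnitude-only enhancement cannot close.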
From a theoretical perspective, the research advances understanding of how complex spectrogram processing can overcome traditional constraints on phase estimation. Practically, it offers a strategy that speech enhancement systems can adopt in applications requiring robust speech processing.
Future directions noted by the authors include integrating STOI and PESQ into the objective function to form a more comprehensive multi-metrics learning framework and exploring alternative configurations for the objective function beyond weighted sums.
In summary, this paper makes a substantial contribution to the field of speech enhancement by proposing an innovative CNN approach that effectively incorporates phase information and multi-metrics optimization, which may support future advances in intelligent audio systems.