Complex spectrogram enhancement by convolutional neural network with multi-metrics learning

Published 27 Apr 2017 in stat.ML, cs.LG, and cs.SD | (1704.08504v2)

Abstract: This paper aims to address two issues existing in the current speech enhancement methods: 1) the difficulty of phase estimations; 2) a single objective function cannot consider multiple metrics simultaneously. To solve the first problem, we propose a novel convolutional neural network (CNN) model for complex spectrogram enhancement, namely estimating clean real and imaginary (RI) spectrograms from noisy ones. The reconstructed RI spectrograms are directly used to synthesize enhanced speech waveforms. In addition, since log-power spectrogram (LPS) can be represented as a function of RI spectrograms, its reconstruction is also considered as another target. Thus a unified objective function, which combines these two targets (reconstruction of RI spectrograms and LPS), is equivalent to simultaneously optimizing two commonly used objective metrics: segmental signal-to-noise ratio (SSNR) and log-spectral distortion (LSD). Therefore, the learning process is called multi-metrics learning (MML). Experimental results confirm the effectiveness of the proposed CNN with RI spectrograms and MML in terms of improved standardized evaluation metrics on a speech enhancement task.

Citations (167)

Summary

This paper presents a convolutional neural network (CNN) approach to improve speech enhancement techniques by addressing two major challenges: phase estimation and the limitations of single-objective functions. The authors propose a novel CNN model that enhances complex spectrograms by estimating clean real and imaginary (RI) spectrograms from noisy ones. The reconstruction of RI spectrograms allows for the synthesis of speech waveforms with improved phase information. Furthermore, the paper introduces a multi-metrics learning (MML) approach, optimizing both segmental signal-to-noise ratio (SSNR) and log-spectral distortion (LSD) through a unified objective function that incorporates log-power spectrogram (LPS) reconstruction.
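The relationship between the complex spectrogram, its RI channels, and the LPS can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' pipeline: the helper names (`stft`, `ri_channels`, `lps_from_ri`), the 512-sample Hann window, the 256-sample hop, and the random test signal are all assumptions made for the example.

```python
import numpy as np

def stft(x, frame_len=512, hop=256):
    """Minimal STFT: Hann-windowed frames followed by a one-sided FFT.

    Returns a complex array of shape (frames, bins). Window and hop sizes
    are illustrative assumptions, not values taken from the paper.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)

def ri_channels(spec):
    """Stack real and imaginary parts as two channels: (2, frames, bins)."""
    return np.stack([spec.real, spec.imag])

def lps_from_ri(ri, eps=1e-12):
    """Log-power spectrogram expressed as a function of the RI channels."""
    real, imag = ri
    return np.log(real ** 2 + imag ** 2 + eps)

rng = np.random.default_rng(0)
signal = rng.standard_normal(16000)  # stand-in for one second of audio

spec = stft(signal)
ri = ri_channels(spec)
lps = lps_from_ri(ri)

# Recombining the two channels recovers the complex spectrogram,
# so magnitude and phase are both carried by the RI representation.
recovered = ri[0] + 1j * ri[1]
```

Because the LPS is computed from the same two channels the network would output, a single RI estimate is enough to evaluate both reconstruction targets.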

The research highlights that traditional deep learning methods often neglect phase processing, even though phase has been shown to be crucial for perceptual quality in speech enhancement, as demonstrated in studies by Roux and Paliwal. In contrast to earlier models that focus solely on magnitudes, this study uses RI spectrograms to enhance speech magnitude and phase jointly.

The proposed CNN model treats the real and imaginary channels of complex spectrograms much like the RGB channels in image processing, concentrating on local patterns to extract useful features. This contrasts with deep neural networks (DNNs), which fully connect all inputs; the summary credits the CNN with superior feature extraction capabilities.
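The channel analogy can be made concrete with a single convolutional layer applied to a two-channel RI input. The sketch below is a plain NumPy illustration under assumed sizes (a 5 x 20 time-frequency patch, four 3 x 3 filters); the `conv2d` helper is hypothetical and does not reproduce the architecture used in the paper.

```python
import numpy as np

def conv2d(x, kernels, bias):
    """Valid 2-D cross-correlation over a multi-channel input.

    x:       (in_ch, height, width), here in_ch = 2 for real and imaginary
    kernels: (out_ch, in_ch, kh, kw)
    bias:    (out_ch,)
    """
    out_ch, _, kh, kw = kernels.shape
    _, h, w = x.shape
    out = np.zeros((out_ch, h - kh + 1, w - kw + 1))
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            patch = x[:, i : i + kh, j : j + kw]  # local RI neighbourhood
            # Each filter mixes both channels over the same local window.
            out[:, i, j] = np.tensordot(kernels, patch, axes=3) + bias
    return out

rng = np.random.default_rng(0)
ri_patch = rng.standard_normal((2, 5, 20))    # (channels, frames, bins)
kernels = rng.standard_normal((4, 2, 3, 3))   # 4 filters, sizes are assumptions
bias = rng.standard_normal(4)

features = conv2d(ri_patch, kernels, bias)    # shape (4, 3, 18)

# Parameter count of the shared local filters versus a fully connected
# layer mapping the same input to the same number of outputs.
conv_params = kernels.size + bias.size
dense_params = ri_patch.size * features.size + features.size
```

In this toy setting the convolutional layer has 76 parameters, while an equivalent fully connected mapping needs 43,416, which illustrates the locality and weight sharing that the channel analogy relies on.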

The multi-metrics learning approach optimizes multiple metrics simultaneously without the trade-offs seen in typical multi-objective optimization problems, because the targets in MML are non-conflicting: a perfect RI estimate also yields a perfect LPS.
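A weighted-sum objective of this kind might look as follows. This is a sketch under assumptions: the function name `mml_loss`, the mean-squared-error form of each term, and the weight `alpha` are illustrative choices, and the exact formulation in the paper may differ.

```python
import numpy as np

def mml_loss(ri_est, ri_clean, alpha=0.5, eps=1e-12):
    """Weighted sum of an RI reconstruction term and an LPS reconstruction term.

    ri_est, ri_clean: arrays of shape (2, frames, bins) holding the real and
    imaginary channels. alpha and eps are hypothetical hyperparameters.
    """
    # RI term: the target the summary associates with SSNR.
    ri_term = np.mean((ri_est - ri_clean) ** 2)

    # LPS term: computed from the same RI estimate, associated with LSD.
    lps_est = np.log(ri_est[0] ** 2 + ri_est[1] ** 2 + eps)
    lps_clean = np.log(ri_clean[0] ** 2 + ri_clean[1] ** 2 + eps)
    lps_term = np.mean((lps_est - lps_clean) ** 2)

    return ri_term + alpha * lps_term

rng = np.random.default_rng(0)
ri_clean = rng.standard_normal((2, 10, 257))
ri_noisy = ri_clean + 0.3 * rng.standard_normal(ri_clean.shape)

loss_perfect = mml_loss(ri_clean, ri_clean)      # both terms vanish together
loss_noisy = mml_loss(ri_noisy, ri_clean)
loss_ri_only = mml_loss(ri_noisy, ri_clean, alpha=0.0)
```

Both terms reach zero at the same point, the clean RI spectrogram, which is the sense in which the two targets do not conflict.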

The experiments use the TIMIT corpus to validate the findings. Results indicate that the CNN model with multi-metrics learning markedly improves the speech quality and intelligibility metrics LSD, SSNR, STOI, and PESQ. In particular, RI-CNN significantly outperforms traditional DNN models, generalizing better and achieving higher SSNR.

A noteworthy part of the paper is the investigation into the impact of phase enhancement under different signal-to-noise ratio (SNR) conditions. The findings suggest that reusing the noisy phase causes greater degradation at low SNR levels, which underlines the need for effective phase processing when synthesizing enhanced speech.
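The effect of reusing the noisy phase can be illustrated with a toy experiment: pair the clean magnitude with the phase of the noisy signal and measure how far the result is from the clean spectrogram. Everything here is an assumption made for illustration (the synthetic two-tone signal, white Gaussian noise, STFT settings, and the helper names); it is not the evaluation protocol of the paper.

```python
import numpy as np

def stft(x, frame_len=512, hop=256):
    """Minimal STFT: Hann-windowed frames followed by a one-sided FFT."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)

def noisy_phase_error(clean, noise, snr_db):
    """Relative spectral error when clean magnitude is paired with noisy phase."""
    # Scale the noise so that the mixture has the requested global SNR.
    gain = np.sqrt(np.sum(clean ** 2) / (np.sum(noise ** 2) * 10 ** (snr_db / 10)))
    clean_spec = stft(clean)
    noisy_spec = stft(clean + gain * noise)
    # Ideal magnitude, but phase copied from the noisy mixture.
    hybrid = np.abs(clean_spec) * np.exp(1j * np.angle(noisy_spec))
    return np.sum(np.abs(hybrid - clean_spec) ** 2) / np.sum(np.abs(clean_spec) ** 2)

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000
clean = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 1300 * t)
noise = rng.standard_normal(t.size)

errors = {snr: noisy_phase_error(clean, noise, snr) for snr in (-5, 5, 15)}
```

With the same noise realization scaled to each SNR, the error grows as the SNR drops, even though the magnitude is perfect in every case. The remaining distortion comes entirely from the phase.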

From a theoretical perspective, the research advances understanding of how complex spectrogram processing can overcome traditional constraints on phase estimation. Practically, it offers an enhancement strategy that can be adopted in applications requiring robust speech processing.

Future directions noted by the authors include integrating STOI and PESQ into the objective function to form a more comprehensive multi-metrics learning framework and exploring alternative configurations for the objective function beyond weighted sums.

In summary, this paper contributes to the field of speech enhancement by proposing a CNN approach that incorporates phase information and multi-metrics optimization, which may support future work on intelligent audio systems.