A Fully Convolutional Neural Network for Speech Enhancement

Published 22 Sep 2016 in cs.LG | (1609.07132v1)

Abstract: In hearing aids, the presence of babble noise degrades hearing intelligibility of human speech greatly. However, removing the babble without creating artifacts in human speech is a challenging task in a low SNR environment. Here, we sought to solve the problem by finding a "mapping" between noisy speech spectra and clean speech spectra via supervised learning. Specifically, we propose using fully Convolutional Neural Networks, which consist of lesser number of parameters than fully connected networks. The proposed network, Redundant Convolutional Encoder Decoder (R-CED), demonstrates that a convolutional network can be 12 times smaller than a recurrent network and yet achieves better performance, which shows its applicability for an embedded system: the hearing aids.


Citations (352)


Summary

The paper introduces the R-CED network that learns a direct mapping from noisy to clean speech spectra using a fully convolutional architecture.
It demonstrates that bypassing pooling layers and using skip connections significantly enhances denoising performance in harsh babble noise conditions.
Experimental evaluations on the TIMIT corpus show improved SDR, STOI, and PESQ scores while reducing model size for embedded hearing aid applications.


This paper addresses the problem of enhancing speech quality in environments dominated by babble noise, particularly for applications in hearing aids. The authors propose a method utilizing Convolutional Neural Networks (CNNs) for speech denoising, circumventing the common challenge of model size faced by fully connected and recurrent networks. Specifically, the Redundant Convolutional Encoder Decoder (R-CED) network is introduced as a lightweight, effective solution for embedded systems such as hearing aids.

Problem Statement

The task is to learn a mapping function $f$ that transforms noisy speech spectra into clean spectra with minimal artifacts, by minimizing the $\ell_2$ distance between the denoised output and the original clean spectra. Traditional methods focus on modeling the noise, which often fails under babble noise because babble is itself a mixture of overlapping speech signals and is therefore hard to separate from the target speech. The paper instead aims to learn the mapping directly with a CNN-based approach.
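The objective described above can be illustrated with a small sketch. This is a minimal example, not the authors' training code: the function name `l2_loss` and the toy frames are assumptions made for illustration, and each frame is represented as a plain list of spectral magnitudes.

```python
def l2_loss(clean_frames, denoised_frames):
    """Sum of squared l2 distances between clean and denoised spectral frames."""
    if len(clean_frames) != len(denoised_frames):
        raise ValueError("frame counts differ")
    total = 0.0
    for y, y_hat in zip(clean_frames, denoised_frames):
        if len(y) != len(y_hat):
            raise ValueError("frame sizes differ")
        # Squared l2 distance for one frame: sum over frequency bins.
        total += sum((a - b) ** 2 for a, b in zip(y, y_hat))
    return total


# Toy data (hypothetical): two frames with two frequency bins each.
clean = [[1.0, 2.0], [0.0, 1.0]]
estimate = [[1.0, 1.0], [0.5, 1.0]]
loss = l2_loss(clean, estimate)
print(loss)
```

A learned mapping $f$ would produce `estimate` from the noisy frames, and training would adjust the parameters of $f$ to drive this quantity down.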

Figure 1: Speech Enhancement Using a CNN.

Convolutional Network Architectures

Convolutional Encoder-Decoder Network (CED)

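The CED network uses symmetric encoder and decoder stages. The encoder repeats convolution, batch normalization, max-pooling, and ReLU activation, compressing the features; the decoder repeats convolution, batch normalization, and up-sampling, reconstructing them. Notably, the final layer, which is a softmax layer in a typical classification CED, is replaced by a convolution layer, which makes the network fully convolutional.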

Redundant CED Network (R-CED)

The proposed R-CED network avoids pooling layers (and therefore needs no up-sampling layers), using only repetitions of convolution, batch normalization, and ReLU activation. Instead of compressing the input, the encoder projects the spectra into a higher-dimensional, redundant representation, akin to kernel methods, and the decoder compresses it back. The number of filters is kept symmetric: it increases gradually along the encoder and decreases gradually along the decoder.
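To make the model-size argument concrete, the sketch below counts parameters for a stack of 1-D convolutions whose filter counts rise and then fall, and compares it with a pair of fully connected layers. All sizes here (the filter schedule, the kernel length of 9, the 129-bin frame, the 1024-unit hidden layer) are hypothetical values chosen for illustration and are not the configurations reported in the paper; batch normalization parameters are ignored.

```python
def conv1d_params(in_channels, out_channels, kernel_size):
    """Weights plus one bias per output channel for a 1-D convolution."""
    return in_channels * out_channels * kernel_size + out_channels


def conv_stack_params(filters, kernel_size, in_channels=1):
    """Total parameters of consecutive 1-D convolutions with the given filter counts."""
    total = 0
    prev = in_channels
    for f in filters:
        total += conv1d_params(prev, f, kernel_size)
        prev = f
    return total


def dense_params(n_in, n_out):
    """Weights plus biases for one fully connected layer."""
    return n_in * n_out + n_out


# Hypothetical symmetric schedule: filters grow along the encoder,
# shrink along the decoder, and end in a single-filter output layer.
filters = [8, 16, 32, 16, 8, 1]
conv_total = conv_stack_params(filters, kernel_size=9)

# Hypothetical fully connected baseline: 129 bins -> 1024 hidden -> 129 bins.
n_bins = 129
dense_total = dense_params(n_bins, 1024) + dense_params(1024, n_bins)

print(conv_total, dense_total)
```

Because convolution weights are shared across frequency positions, the convolutional count depends on the filter counts and kernel length rather than on the frame size, which is the property that keeps such networks small.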

Figure 2: Modified Convolutional Encoder-Decoder Network (CED).

Experimental Setup

Experiments were performed using the TIMIT corpus with 27 noise types mixed with the speech at 0 dB SNR. The evaluation metrics were Signal-to-Distortion Ratio (SDR), Short-Time Objective Intelligibility (STOI), and Perceptual Evaluation of Speech Quality (PESQ). All three are objective measures: SDR quantifies signal fidelity, while STOI and PESQ estimate perceived intelligibility and perceived quality, respectively.
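Mixing at a target SNR amounts to scaling the noise so that the speech-to-noise power ratio hits the desired value before adding the two signals. The following is a minimal sketch of that step under simple assumptions (equal-length signals, power measured over the whole clip); the helper names and the synthetic sinusoids standing in for speech and noise are illustrative, not taken from the paper.

```python
import math


def power(signal):
    """Mean squared amplitude of a signal."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise to the target SNR (in dB) and add it to the speech.

    Returns the noisy mixture and the scaled noise.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    # Solve power(speech) / (gain^2 * power(noise)) = 10^(snr_db / 10) for gain.
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    scaled = [gain * n for n in noise]
    noisy = [s + n for s, n in zip(speech, scaled)]
    return noisy, scaled


def ratio_db(reference, interference):
    """Power ratio of two signals in decibels."""
    return 10 * math.log10(power(reference) / power(interference))


# Synthetic stand-ins for a speech clip and a noise clip (hypothetical data).
speech = [math.sin(0.1 * i) for i in range(200)]
noise = [0.3 * math.cos(0.37 * i) for i in range(200)]

noisy, scaled_noise = mix_at_snr(speech, noise, snr_db=0.0)
measured_snr = ratio_db(speech, scaled_noise)
print(round(measured_snr, 6))
```

At 0 dB the scaled noise carries the same power as the speech, which is why this condition is considered harsh for denoising.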

Results

Three primary tests were conducted: comparing CNN against FNN and RNN, evaluating CED versus R-CED architectures, and optimizing R-CED performance.

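Test 1: CNNs matched or exceeded the performance of the FNN and RNN baselines with a drastically reduced parameter count (the abstract reports a convolutional network 12 times smaller than the recurrent one), making them suitable for embedded systems.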
Test 2: R-CED outperformed CED in terms of denoising capability, particularly when skip connections were implemented.
Test 3: Performance improved as network size and depth increased, with the CR-CED (Cascaded R-CED) architecture showing the most promise among the tested configurations.
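The idea behind the skip (bypass) connections in Test 2 can be sketched with a single pooling-free block. This is a toy illustration only: it assumes the skip path is combined by element-wise addition, omits batch normalization, and uses a hand-picked kernel rather than learned weights; none of the names below come from the paper.

```python
def conv1d_same(x, kernel):
    """1-D convolution (cross-correlation) with zero padding.

    With an odd-length kernel and no pooling, the output length
    equals the input length.
    """
    k = len(kernel)
    if k % 2 == 0:
        raise ValueError("odd kernel length expected")
    pad = k // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(kernel[j] * padded[i + j] for j in range(k))
        for i in range(len(x))
    ]


def relu(x):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]


def block_with_skip(x, kernel):
    """Convolution followed by ReLU, with the input added back to the output."""
    y = relu(conv1d_same(x, kernel))
    return [a + b for a, b in zip(y, x)]


# Toy input and an identity kernel (hypothetical values).
x = [1.0, -2.0, 3.0, 0.5]
identity_kernel = [0.0, 1.0, 0.0]
out = block_with_skip(x, identity_kernel)
print(out)
```

The skip path lets the input pass through unchanged alongside the transformed features, so the block only has to learn a correction to its input.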

Figure 3: Noisy Spectrogram.

Conclusion

The work validates CNNs for speech denoising in resource-constrained settings, demonstrating a much smaller model without sacrificing performance. The R-CED structure with bypass connections provides effective speech enhancement suitable for real-time applications in hearing aids. Future work will focus on reducing the computational cost of R-CED, which remains vital for deployment in constrained environments.