Audio Spectrogram Representations for Processing with Convolutional Neural Networks

Published 29 Jun 2017 in cs.SD, cs.LG, cs.MM, and cs.NE (arXiv:1706.09559v1)

Abstract: One of the decisions that arise when designing a neural network for any application is how the data should be represented in order to be presented to, and possibly generated by, the network. For audio, the choice is less obvious than it is for visual images, and a variety of representations have been used for different applications, including the raw digitized sample stream, hand-crafted features, machine-discovered features, MFCCs and variants that include deltas, and a variety of spectral representations. This paper reviews some of these representations and the issues that arise, focusing particularly on spectrograms as a representation for generating audio with neural networks for style transfer.
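To make the distinction between the raw sample stream and a spectral representation concrete, here is a minimal sketch of how a log-magnitude spectrogram can be computed from digitized samples. It is not the paper's method: the frame size (`n_fft=256`), hop size (`hop=128`), periodic Hann window, and `log1p` compression are illustrative assumptions, and the helper names are hypothetical. A naive DFT is used so the block runs with only the standard library; in practice an FFT (for example `numpy.fft.rfft`) would be used. Note that keeping only magnitudes discards phase, which matters when a network is asked to generate audio rather than just classify it.

```python
import cmath
import math


def hann(n_fft):
    # Periodic Hann window (assumed choice), tapering each frame to reduce spectral leakage.
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n_fft) for i in range(n_fft)]


def stft_magnitude(samples, n_fft=256, hop=128):
    """Hypothetical helper: magnitude STFT as a list of frames.

    Each frame holds n_fft // 2 + 1 magnitudes (non-negative frequency bins).
    Uses a naive O(n_fft^2) DFT per frame for clarity, not speed.
    """
    window = hann(n_fft)
    frames = []
    # Slide a window of n_fft samples forward by `hop` samples; drop the incomplete tail.
    for start in range(0, len(samples) - n_fft + 1, hop):
        seg = [samples[start + i] * window[i] for i in range(n_fft)]
        mags = []
        for k in range(n_fft // 2 + 1):
            acc = sum(seg[i] * cmath.exp(-2j * math.pi * k * i / n_fft)
                      for i in range(n_fft))
            mags.append(abs(acc))  # magnitude only: phase is discarded here
        frames.append(mags)
    return frames


def log_spectrogram(samples, n_fft=256, hop=128):
    # log(1 + |X|) compresses the dynamic range and stays finite for silent bins.
    return [[math.log1p(m) for m in frame]
            for frame in stft_magnitude(samples, n_fft, hop)]


if __name__ == "__main__":
    sr = 8000  # assumed sample rate in Hz
    tone = [math.sin(2 * math.pi * 1000 * t / sr) for t in range(1024)]
    spec = log_spectrogram(tone)
    print(len(spec), "frames x", len(spec[0]), "bins")
```

For a 1024-sample input with these settings there are (1024 - 256) // 128 + 1 = 7 frames of 129 bins each, and a 1000 Hz tone sampled at 8000 Hz peaks in bin 1000 / 8000 * 256 = 32. The resulting time-frequency grid is what lets image-style convolutional networks be applied to audio.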