
Instabilities in Convnets for Raw Audio

Published 11 Sep 2023 in cs.LG, cs.SD, and eess.AS | arXiv:2309.05855v4

Abstract: What makes waveform-based deep learning so hard? Despite numerous attempts at training convolutional neural networks (convnets) for filterbank design, they often fail to outperform hand-crafted baselines. These baselines are linear time-invariant systems: as such, they can be approximated by convnets with wide receptive fields. Yet, in practice, gradient-based optimization leads to suboptimal approximations. In our article, we approach this phenomenon from the perspective of initialization. We present a theory of large deviations for the energy response of FIR filterbanks with random Gaussian weights. We find that deviations worsen for large filters and locally periodic input signals, which are both typical for audio signal processing applications. Numerical simulations align with our theory and suggest that the condition number of a convolutional layer follows a logarithmic scaling law between the number and length of the filters, which is reminiscent of discrete wavelet bases.


Summary

  • The paper shows that, at random Gaussian initialization, the energy response of convolutional filters suffers large deviations, destabilizing waveform-based learning.
  • Numerical simulations indicate that a convolutional layer's condition number follows a logarithmic scaling law in the number and length of its filters, reminiscent of discrete wavelet bases.
  • The results point to initialization as a key lever for closing the performance gap between learned convnet frontends and hand-crafted filterbanks.

The paper "Instabilities in Convnets for Raw Audio" (arXiv:2309.05855) explores the complexities of training convolutional neural networks (CNNs) specifically for raw audio applications. The core issue addressed is why CNNs often struggle to outperform hand-crafted baselines in tasks such as filterbank design, even though these baselines are linear time-invariant systems that could theoretically be emulated by CNNs with wide receptive fields.
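To see why the baselines are in principle emulable, note that any FIR filterbank is exactly a one-layer convnet with fixed weights. The sketch below (illustrative only; the Gabor design and all parameter values are our assumptions, not the paper's) builds a small hand-crafted filterbank and applies it to a waveform as a convolutional layer followed by a pointwise energy nonlinearity:

```python
import numpy as np

def gabor_filterbank(num_filters=8, length=128, fs=16000):
    """Hand-crafted (fixed) Gabor filters: Gaussian-windowed complex
    sinusoids on a geometric frequency scale. Illustrative design."""
    t = np.arange(length) - length // 2
    freqs = 200.0 * (2.0 ** np.linspace(0, 5, num_filters))  # 200 Hz .. 6.4 kHz
    sigma = length / 6.0
    window = np.exp(-0.5 * (t / sigma) ** 2)
    return window * np.exp(2j * np.pi * freqs[:, None] * t / fs)

def conv_layer(x, filters):
    """One 'convolutional layer' with fixed weights: valid convolution
    with every FIR filter, then pointwise squared magnitude (energy)."""
    return np.stack([np.abs(np.convolve(x, h, mode="valid")) ** 2
                     for h in filters])

fs = 16000
x = np.sin(2 * np.pi * 440.0 * np.arange(fs) / fs)  # 1 s of a 440 Hz tone
features = conv_layer(x, gabor_filterbank(fs=fs))
print(features.shape)  # (num_filters, len(x) - length + 1)
```

A learned frontend replaces the fixed Gabor weights with trainable ones; the paper's point is that gradient descent from a random Gaussian start tends not to recover such well-conditioned designs.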

The authors approach this problem from the perspective of initialization and present a theoretical framework for understanding large deviations in the energy response of finite impulse response (FIR) filterbanks with random Gaussian weights. They identify that these deviations become more problematic for larger filters and for input signals that are locally periodic—both common characteristics in audio signal processing. Their findings suggest that the condition number of a convolutional layer exhibits a logarithmic scaling relationship with both the number and length of filters, echoing properties seen in discrete wavelet bases.
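The flavor of their numerical simulations can be reproduced in a few lines. The sketch below (our own toy setup, not the authors' code; the filter counts, lengths, and fan-in variance scaling are assumptions) draws i.i.d. Gaussian FIR filterbanks and measures the condition number of the energy response, i.e. the ratio of the largest to the smallest value of A(ω) = Σ_j |ĥ_j(ω)|²; conditioning visibly degrades as filters get longer:

```python
import numpy as np

def condition_number(filters, n_fft=4096):
    """Condition number of an FIR filterbank's energy response
    A(w) = sum_j |h_j^(w)|^2 (ratio of its frame bounds)."""
    H = np.fft.rfft(filters, n=n_fft, axis=1)
    energy = (np.abs(H) ** 2).sum(axis=0)
    return energy.max() / energy.min()

rng = np.random.default_rng(0)
J = 40                           # number of filters (assumed)
medians = {}
for T in (16, 64, 256, 1024):    # filter lengths (assumed)
    # i.i.d. Gaussian weights with fan-in variance scaling, as is
    # typical for the random initialization of conv layers
    trials = [condition_number(rng.normal(0.0, (J * T) ** -0.5, size=(J, T)))
              for _ in range(20)]
    medians[T] = float(np.median(trials))
    print(f"T = {T:4d}  median condition number ~ {medians[T]:.2f}")
```

With the filter count J fixed, longer filters expose more independent frequency bins, so the minimum of the energy response drifts lower and the condition number grows, consistent with the deviations the paper analyzes for large filters.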

For context, modeling raw audio waveforms has been an active research area, with various approaches explored in the literature:

  1. WaveNet: A probabilistic, autoregressive model for raw audio that has shown remarkable performance in text-to-speech applications (Oord et al., 2016).
  2. SaShiMi: A multi-scale architecture leveraging state-space models, which addresses instabilities in autoregressive generation for raw audio (Goel et al., 2022).
  3. Very Deep CNNs: These aim to use extensive convolutional layers to capture higher-level features directly from time-domain waveforms (Dai et al., 2016).

In addition to these, techniques like FloWaveNet have been proposed to mitigate the inefficiencies of ancestral sampling in models like WaveNet, providing more efficient real-time audio synthesis by leveraging flow-based generative models (Kim et al., 2018).

In summary, while there are ongoing advancements in modeling raw audio with neural networks, the paper highlights inherent initialization-related instabilities as a critical challenge for CNN-based approaches in this domain. This insight helps frame future research directions aimed at creating more robust and effective models for raw audio processing.
