
TimbreTron: A WaveNet(CycleGAN(CQT(Audio))) Pipeline for Musical Timbre Transfer

Published 22 Nov 2018 in cs.SD, cs.LG, eess.AS, and stat.ML | (1811.09620v3)

Abstract: In this work, we address the problem of musical timbre transfer, where the goal is to manipulate the timbre of a sound sample from one instrument to match another instrument while preserving other musical content, such as pitch, rhythm, and loudness. In principle, one could apply image-based style transfer techniques to a time-frequency representation of an audio signal, but this depends on having a representation that allows independent manipulation of timbre as well as high-quality waveform generation. We introduce TimbreTron, a method for musical timbre transfer which applies "image" domain style transfer to a time-frequency representation of the audio signal, and then produces a high-quality waveform using a conditional WaveNet synthesizer. We show that the Constant Q Transform (CQT) representation is particularly well-suited to convolutional architectures due to its approximate pitch equivariance. Based on human perceptual evaluations, we confirmed that TimbreTron recognizably transferred the timbre while otherwise preserving the musical content, for both monophonic and polyphonic samples.

Citations (93)

Summary

  • The paper pioneers a method that leverages CQT spectrograms and a CycleGAN architecture to transfer musical timbre with high fidelity.
  • It employs a conditional WaveNet synthesizer to accurately reconstruct audio waveforms, addressing challenges in phase prediction.
  • Evaluations show superior performance over STFT-based methods by preserving pitch integrity and minimizing artifacts in diverse audio samples.

An Overview of TimbreTron: A Novel Approach to Musical Timbre Transfer

The paper "TimbreTron: A WaveNet(CycleGAN(CQT(Audio))) Pipeline for Musical Timbre Transfer" presents a methodology for high-quality musical timbre transfer: transforming the timbre of a recording made with one instrument to match another instrument while preserving musical content such as pitch, rhythm, and loudness. TimbreTron adapts image-based style transfer techniques to audio signals, combining recent advances in generative adversarial networks and autoregressive waveform modeling.

Central to TimbreTron is its use of the Constant Q Transform (CQT), which offers two advantages. First, the CQT's approximate pitch equivariance suits convolutional neural network architectures, since a pitch shift becomes (approximately) a translation along the frequency axis. Second, the CQT provides high frequency resolution at low frequencies and high temporal resolution at high frequencies, a combination the fixed-resolution Short Time Fourier Transform (STFT) cannot offer. Together, these properties make the CQT well suited to high-fidelity timbre transfer.
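The pitch equivariance follows from the CQT's geometrically spaced bins: with B bins per octave, center frequencies satisfy f_k = f_min · 2^(k/B), so a pitch shift of m bins scales every frequency by 2^(m/B) and translates energy along the bin axis by exactly m bins. A minimal, self-contained sketch of this property (the `f_min` and `bins_per_octave` values below are illustrative, not taken from the paper):

```python
import math

def cqt_center_freq(k, f_min=32.70, bins_per_octave=12):
    """Center frequency of CQT bin k (f_min = C1 here, illustrative)."""
    return f_min * 2.0 ** (k / bins_per_octave)

def nearest_bin(freq, f_min=32.70, bins_per_octave=12):
    """Map a frequency back to the index of the nearest CQT bin."""
    return round(bins_per_octave * math.log2(freq / f_min))

# Shift a tone up by m bins (m semitones when bins_per_octave=12):
k, m = 24, 7                               # start at bin 24, shift up a fifth
shifted = cqt_center_freq(k) * 2.0 ** (m / 12)
print(nearest_bin(shifted))                # -> 31, i.e. k + m: a pure translation
```

By contrast, the STFT's linearly spaced bins turn a pitch shift into a multiplicative stretch of the frequency axis, which a translation-equivariant convolution cannot model as naturally.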

TimbreTron's workflow comprises three principal steps. First, it converts the waveform into a CQT spectrogram. Second, it treats the spectrogram as an image and applies a CycleGAN for timbre transfer; the CycleGAN is adapted with modifications such as replacing deconvolution with nearest-neighbor interpolation and employing a full-spectrogram discriminator, enabling transfer between unpaired datasets of recordings of different instruments. Finally, a conditional WaveNet synthesizer converts the modified spectrogram back into an audio waveform, since the magnitude CQT discards phase information and is therefore difficult to invert directly.
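The three-stage composition mirrors the paper's title, WaveNet(CycleGAN(CQT(Audio))). A structural sketch with stub stages (the function names and return values are placeholders for exposition, not the paper's implementation):

```python
# Structural sketch of the TimbreTron pipeline: each stage is a stub
# that records its name, so only the composition order is modeled.

trace = []

def cqt(waveform):
    """Stage 1: time-frequency analysis of the raw waveform (stub)."""
    trace.append("cqt")
    return {"spectrogram": waveform}

def cyclegan_transfer(spec):
    """Stage 2: image-style timbre transfer on the spectrogram (stub)."""
    trace.append("cyclegan")
    return spec

def wavenet_synthesize(spec):
    """Stage 3: conditional waveform synthesis; needed because the
    magnitude CQT discards phase, so direct inversion is inaccurate."""
    trace.append("wavenet")
    return spec

def timbretron(waveform):
    return wavenet_synthesize(cyclegan_transfer(cqt(waveform)))

timbretron([0.0, 0.1, -0.1])
print(trace)   # -> ['cqt', 'cyclegan', 'wavenet']
```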

The human evaluations conducted as part of the study confirm that TimbreTron recognizably transfers timbre while preserving the remaining musical content, for both monophonic and polyphonic samples. The study also contrasts the CQT with the STFT, finding the former markedly better at preserving pitch and at avoiding artifacts such as random pitch permutations.

In the broader context of AI and audio processing, TimbreTron highlights the potential of training on unpaired data, a common scenario for music recordings. The system's ability to generalize across synthetic and real-world datasets suggests applications in adaptive music synthesis and augmentation, digital music libraries, and tools that let musicians and composers explore new timbres creatively. Future work could refine phase prediction techniques further and optimize computational efficiency for real-time applications.

In summary, TimbreTron represents a significant advance in applying AI to the intricate domain of musical timbre, showcasing how state-of-the-art machine learning components can be integrated to accomplish complex audio manipulation tasks. The implications of this work extend beyond academic curiosity, pointing to transformative potential in music technology and cognitive computing.
