
StemGen: A music generation model that listens

Published 14 Dec 2023 in cs.SD, cs.LG, and eess.AS | (2312.08723v2)

Abstract: End-to-end generation of musical audio using deep learning techniques has seen an explosion of activity recently. However, most models concentrate on generating fully mixed music in response to abstract conditioning information. In this work, we present an alternative paradigm for producing music generation models that can listen and respond to musical context. We describe how such a model can be constructed using a non-autoregressive, transformer-based model architecture and present a number of novel architectural and sampling improvements. We train the described architecture on both an open-source and a proprietary dataset. We evaluate the produced models using standard quality metrics and a new approach based on music information retrieval descriptors. The resulting model reaches the audio quality of state-of-the-art text-conditioned models, as well as exhibiting strong musical coherence with its context.


Summary

  • The paper introduces a non-autoregressive transformer that listens to existing musical contexts to generate coherent target stems.
  • It applies novel iterative decoding and multi-source classifier-free guidance techniques, significantly lowering FAD and MIRDD scores.
  • Extensive experiments on the Slakh and proprietary datasets validate its efficiency and potential for creative collaboration in music production.

StemGen: A Non-Autoregressive Approach to Contextual Music Generation

Introduction

The paper "StemGen: A music generation model that listens" proposes an alternative paradigm for deep-learning music generation. Traditional models generate fully mixed compositions from abstract conditioning such as text prompts or style categories. StemGen instead trains a model that listens to existing musical context and constructs an appropriate response. The approach employs a non-autoregressive transformer architecture, refining previous methodologies and introducing enhancements for high-fidelity music generation.

Modelling Approach

StemGen’s framework leverages datasets containing separated musical stems, constructing training pairs from an audio context and a target stem. The objective is to model the conditional distribution $p(\mathbf{t} \mid \mathbf{a})$, where $\mathbf{a}$ is the context-mix and $\mathbf{t}$ is the target stem.

Figure 1: Schematic diagram of the StemGen training paradigm.
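The pairing scheme can be sketched as follows. This is a minimal illustration of the idea, not the paper's data pipeline: the `stems` list, the uniform choice of target, and the random subset mix are all assumptions made for the example.

```python
import random
import numpy as np

def make_training_pair(stems):
    """Build one (context-mix, target-stem) training pair from a list
    of equal-length mono stem waveforms (1-D numpy arrays).

    One stem is held out as the generation target; a random non-empty
    subset of the remaining stems is summed to form the audio context.
    """
    target = random.randrange(len(stems))
    others = [i for i in range(len(stems)) if i != target]
    context_ids = random.sample(others, random.randint(1, len(others)))
    context_mix = np.sum([stems[i] for i in context_ids], axis=0)
    return context_mix, stems[target]

# Three synthetic "stems" of 4 samples each
stems = [np.ones(4) * i for i in range(1, 4)]
context, target = make_training_pair(stems)
```

At training time each such pair supplies $\mathbf{a}$ (the context-mix) and $\mathbf{t}$ (the target stem) for the conditional model.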

The model operates on sequences of discrete tokens derived from audio waveforms, sidestepping direct waveform modelling. Tokens are produced by a residual vector quantizer (RVQ), and working in this discrete domain allows the model to borrow techniques developed for LLMs.
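The core mechanism of residual vector quantization, each level quantizing the residual left by the previous one, can be shown in a small numpy sketch. The random codebooks and dimensions here are purely illustrative; real neural codecs learn the codebooks jointly with an encoder.

```python
import numpy as np

def rvq_encode(frame, codebooks):
    """Residual vector quantization of one embedding frame.

    `codebooks` is a list of (K, D) arrays. Each level quantizes the
    residual left by the previous level, yielding Q token indices
    whose codewords sum to an approximation of the input frame.
    """
    residual = frame.copy()
    tokens = []
    for cb in codebooks:
        # Nearest codeword to the current residual (squared L2 distance).
        dists = np.sum((cb - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))
        tokens.append(idx)
        residual = residual - cb[idx]
    return tokens

def rvq_decode(tokens, codebooks):
    """Sum the selected codewords across all Q levels."""
    return sum(cb[i] for i, cb in zip(tokens, codebooks))

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(16, 8)) for _ in range(4)]  # Q = 4 levels
frame = rng.normal(size=8)
tokens = rvq_encode(frame, codebooks)
recon = rvq_decode(tokens, codebooks)
```

Each audio frame thus becomes a stack of Q integer tokens, which is the sequence format the transformer consumes.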

The presented token-combination method captures multiple audio channels effectively: embeddings are produced for each audio channel and combined, along with embeddings of ancillary non-audio conditioning, before being fed to the model.

Figure 2: Schematic showing overall architecture of the StemGen model during training.
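A minimal sketch of this combination step is given below. The summation of per-level codebook embeddings and the additive merging of channels and conditioning are a plausible simplification of the scheme in Figures 2 and 4, not the paper's exact implementation; all names and shapes are assumed.

```python
import numpy as np

def combine_embeddings(context_tokens, target_tokens, cond_vec, codebooks):
    """Combine per-channel token embeddings into one model input.

    Each channel's Q token levels (shape (T, Q)) are embedded with
    per-level codebooks E_1..E_Q and summed per timestep; the two
    channel embeddings and a non-audio conditioning vector (e.g. an
    instrument-category embedding) are then summed into a single
    (T, D) sequence of input embeddings.
    """
    def embed(tokens):
        T, Q = tokens.shape
        out = np.zeros((T, codebooks[0].shape[1]))
        for q in range(Q):
            out += codebooks[q][tokens[:, q]]  # lookup level-q codewords
        return out

    return embed(context_tokens) + embed(target_tokens) + cond_vec

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(16, 8)) for _ in range(4)]  # Q = 4, D = 8
ctx = rng.integers(0, 16, size=(5, 4))   # T = 5 timesteps
tgt = rng.integers(0, 16, size=(5, 4))
cond = rng.normal(size=8)                # broadcast over the 5 timesteps
x = combine_embeddings(ctx, tgt, cond, codebooks)
```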

Novel Sampling Techniques

The paper highlights two core sampling approaches:

  • Causally Biased Iterative Decoding: This technique introduces a "fuzzy causality" that encourages earlier sequence elements to be sampled first. Positions are ranked by $\rho(x_n) = w_c\, c(x_n) + w_s (1 - n/N) + w_r X$, where $c(x_n)$ is the model's confidence at position $n$, $N$ is the sequence length, and $X$ is a random variable; the weights balance model confidence, sequence progression, and randomness.
  • Multi-Source Classifier-Free Guidance: Extends classifier-free guidance by applying it independently over multiple conditioning sources, such as the audio context and instrument metadata. This enhances the model's ability to generate musically coherent pieces in response to context-mixes.

    Figure 3: Example of iterative decoding for a sequence of a single token level over 8 iterations, showcasing sequences with and without causal-bias.
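The two sampling techniques above can be sketched together. The ranking function follows the $\rho$ formula as stated; the multi-source guidance combination is one plausible reading of the paper's description (independent guidance scales per dropped condition), and all weights and names here are illustrative assumptions.

```python
import numpy as np

def rank_scores(confidences, w_c=1.0, w_s=0.5, w_r=0.1, rng=None):
    """Score positions for causally biased iterative decoding.

    rho(x_n) = w_c * c(x_n) + w_s * (1 - n/N) + w_r * X, where c is
    the model's per-position confidence and X is uniform noise.
    Higher-scoring positions are committed first, so the (1 - n/N)
    term biases decoding toward the start of the sequence.
    """
    rng = rng or np.random.default_rng()
    N = len(confidences)
    n = np.arange(N)
    return w_c * np.asarray(confidences) + w_s * (1 - n / N) + w_r * rng.uniform(size=N)

def multi_source_cfg(l_audio_only, l_meta_only, l_uncond, g_audio=1.5, g_meta=1.5):
    """Multi-source classifier-free guidance over logits.

    Combines logits from runs with each condition dropped, applying
    an independent guidance weight to the audio-context and metadata
    directions.
    """
    return (l_uncond
            + g_audio * (l_audio_only - l_uncond)   # push toward audio context
            + g_meta * (l_meta_only - l_uncond))    # push toward metadata

scores = rank_scores(np.array([0.9, 0.2, 0.8, 0.1]), rng=np.random.default_rng(0))
guided = multi_source_cfg(np.ones(3), np.zeros(3), np.zeros(3))
```

With these weights, an early, high-confidence position (index 0 above) always outranks later ones, reproducing the causal bias shown in Figure 3.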

Experimental Setup and Results

Experiments were conducted on the Slakh dataset and an internal dataset of proprietary music, using a single NVIDIA A100 GPU. The non-autoregressive model demonstrated sound performance across several metrics, including Fréchet Audio Distance (FAD) and a novel Music Information Retrieval Descriptor Distance (MIRDD). Compared to baseline models, the proposed StemGen architecture evidenced competitive performance in audio quality and musical coherence.

Figure 4: Schematic showing how the $Q$ RVQ levels of both the context-mix and the target stem are converted into continuous embeddings using codebooks $E_{1 \hdots Q}$.

Ablation studies affirmed the impact of the novel decoding improvements, primarily showing that guidance over multiple sources significantly enhanced model outcomes, with a notable reduction in FAD and MIRDD scores.

Conclusion

StemGen marks a new direction in music generation, offering a framework in which models produce music that aligns with pre-existing material. Its architecture provides an efficient tool potentially valuable to music producers, opening avenues for creative collaboration with AI. While the results place StemGen on par with state-of-the-art text-conditioned models, further refinement and larger training datasets could extend its scope and scalability.
