
StemGen: A music generation model that listens

Published 14 Dec 2023 in cs.SD, cs.LG, and eess.AS | (2312.08723v2)

Abstract: End-to-end generation of musical audio using deep learning techniques has seen an explosion of activity recently. However, most models concentrate on generating fully mixed music in response to abstract conditioning information. In this work, we present an alternative paradigm for producing music generation models that can listen and respond to musical context. We describe how such a model can be constructed using a non-autoregressive, transformer-based model architecture and present a number of novel architectural and sampling improvements. We train the described architecture on both an open-source and a proprietary dataset. We evaluate the produced models using standard quality metrics and a new approach based on music information retrieval descriptors. The resulting model reaches the audio quality of state-of-the-art text-conditioned models, as well as exhibiting strong musical coherence with its context.


Summary

  • The paper introduces a non-autoregressive transformer that listens to existing musical contexts to generate coherent target stems.
  • It applies novel iterative decoding and multi-source classifier-free guidance techniques, significantly lowering FAD and MIRDD scores.
  • Extensive experiments on the Slakh and proprietary datasets validate its efficiency and potential for creative collaboration in music production.

StemGen: A Non-Autoregressive Approach to Contextual Music Generation

Introduction

The paper "StemGen: A music generation model that listens" proposes an alternative paradigm for deep-learning music generation. Traditional models generate fully mixed compositions from abstract conditioning such as text prompts or style categories. StemGen instead trains a model that listens to existing musical context and constructs an appropriate response. The approach employs a non-autoregressive transformer architecture, refining previous methodologies and introducing enhancements for high-fidelity music generation.

Modelling Approach

StemGen’s framework leverages datasets containing separated musical stems, constructing training pairs from an audio context and a target stem. The objective is to model the conditional distribution $p(\mathbf{t} \mid \mathbf{a})$, where $\mathbf{a}$ is the context-mix and $\mathbf{t}$ is the target stem.

Figure 1: Schematic diagram of the StemGen training paradigm.
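The pairing scheme can be sketched as follows. This is a minimal illustration of the idea, not the paper's data pipeline: the `stems` list, the uniform choice of target, and the random subset mix are all assumptions made for the example.

```python
import random
import numpy as np

def make_training_pair(stems):
    """Build one (context-mix, target-stem) training pair from a list
    of equal-length mono stem waveforms (1-D numpy arrays).

    One stem is held out as the generation target; a random non-empty
    subset of the remaining stems is summed to form the audio context.
    """
    target = random.randrange(len(stems))
    others = [i for i in range(len(stems)) if i != target]
    context_ids = random.sample(others, random.randint(1, len(others)))
    context_mix = np.sum([stems[i] for i in context_ids], axis=0)
    return context_mix, stems[target]

# Three synthetic "stems" of 4 samples each
stems = [np.ones(4) * i for i in range(1, 4)]
context, target = make_training_pair(stems)
```

At training time each such pair supplies $\mathbf{a}$ (the context-mix) and $\mathbf{t}$ (the target stem) for the conditional model.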

The model operates on sequences of discrete tokens derived from audio waveforms, sidestepping direct waveform modelling. Tokens are produced by a residual vector quantizer (RVQ), and working in this discrete domain allows the model to borrow techniques developed for LLMs.
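The core mechanism of residual vector quantization, each level quantizing the residual left by the previous one, can be shown in a small numpy sketch. The random codebooks and dimensions here are purely illustrative; real neural codecs learn the codebooks jointly with an encoder.

```python
import numpy as np

def rvq_encode(frame, codebooks):
    """Residual vector quantization of one embedding frame.

    `codebooks` is a list of (K, D) arrays. Each level quantizes the
    residual left by the previous level, yielding Q token indices
    whose codewords sum to an approximation of the input frame.
    """
    residual = frame.copy()
    tokens = []
    for cb in codebooks:
        # Nearest codeword to the current residual (squared L2 distance).
        dists = np.sum((cb - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))
        tokens.append(idx)
        residual = residual - cb[idx]
    return tokens

def rvq_decode(tokens, codebooks):
    """Sum the selected codewords across all Q levels."""
    return sum(cb[i] for i, cb in zip(tokens, codebooks))

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(16, 8)) for _ in range(4)]  # Q = 4 levels
frame = rng.normal(size=8)
tokens = rvq_encode(frame, codebooks)
recon = rvq_decode(tokens, codebooks)
```

Each audio frame thus becomes a stack of Q integer tokens, which is the sequence format the transformer consumes.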

The presented token-combination method captures multiple audio channels effectively: embeddings are produced for each audio channel and combined, along with embeddings of ancillary non-audio conditioning, before being fed to the model.

Figure 2: Schematic showing overall architecture of the StemGen model during training.
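A minimal sketch of this combination step is given below. The summation of per-level codebook embeddings and the additive merging of channels and conditioning are a plausible simplification of the scheme in Figures 2 and 4, not the paper's exact implementation; all names and shapes are assumed.

```python
import numpy as np

def combine_embeddings(context_tokens, target_tokens, cond_vec, codebooks):
    """Combine per-channel token embeddings into one model input.

    Each channel's Q token levels (shape (T, Q)) are embedded with
    per-level codebooks E_1..E_Q and summed per timestep; the two
    channel embeddings and a non-audio conditioning vector (e.g. an
    instrument-category embedding) are then summed into a single
    (T, D) sequence of input embeddings.
    """
    def embed(tokens):
        T, Q = tokens.shape
        out = np.zeros((T, codebooks[0].shape[1]))
        for q in range(Q):
            out += codebooks[q][tokens[:, q]]  # lookup level-q codewords
        return out

    return embed(context_tokens) + embed(target_tokens) + cond_vec

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(16, 8)) for _ in range(4)]  # Q = 4, D = 8
ctx = rng.integers(0, 16, size=(5, 4))   # T = 5 timesteps
tgt = rng.integers(0, 16, size=(5, 4))
cond = rng.normal(size=8)                # broadcast over the 5 timesteps
x = combine_embeddings(ctx, tgt, cond, codebooks)
```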

Novel Sampling Techniques

The paper highlights two core sampling approaches:

  • Causally Biased Iterative Decoding: This technique introduces a "fuzzy causality" that encourages earlier sequence elements to be sampled first. Positions are ranked by $\rho(x_n) = w_c\, c(x_n) + w_s (1 - n/N) + w_r X$, where $c(x_n)$ is the model's confidence at position $n$, $N$ is the sequence length, and $X$ is a random variable; the weights balance model confidence, sequence progression, and randomness.
  • Multi-Source Classifier-Free Guidance: Extends classifier-free guidance by applying it independently over multiple conditioning sources, such as the audio context and instrument metadata. This enhances the model's ability to generate musically coherent pieces in response to context-mixes.

    Figure 3: Example of iterative decoding for a sequence of a single token level over 8 iterations, showcasing sequences with and without causal-bias.
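The two sampling techniques above can be sketched together. The ranking function follows the $\rho$ formula as stated; the multi-source guidance combination is one plausible reading of the paper's description (independent guidance scales per dropped condition), and all weights and names here are illustrative assumptions.

```python
import numpy as np

def rank_scores(confidences, w_c=1.0, w_s=0.5, w_r=0.1, rng=None):
    """Score positions for causally biased iterative decoding.

    rho(x_n) = w_c * c(x_n) + w_s * (1 - n/N) + w_r * X, where c is
    the model's per-position confidence and X is uniform noise.
    Higher-scoring positions are committed first, so the (1 - n/N)
    term biases decoding toward the start of the sequence.
    """
    rng = rng or np.random.default_rng()
    N = len(confidences)
    n = np.arange(N)
    return w_c * np.asarray(confidences) + w_s * (1 - n / N) + w_r * rng.uniform(size=N)

def multi_source_cfg(l_audio_only, l_meta_only, l_uncond, g_audio=1.5, g_meta=1.5):
    """Multi-source classifier-free guidance over logits.

    Combines logits from runs with each condition dropped, applying
    an independent guidance weight to the audio-context and metadata
    directions.
    """
    return (l_uncond
            + g_audio * (l_audio_only - l_uncond)   # push toward audio context
            + g_meta * (l_meta_only - l_uncond))    # push toward metadata

scores = rank_scores(np.array([0.9, 0.2, 0.8, 0.1]), rng=np.random.default_rng(0))
guided = multi_source_cfg(np.ones(3), np.zeros(3), np.zeros(3))
```

With these weights, an early, high-confidence position (index 0 above) always outranks later ones, reproducing the causal bias shown in Figure 3.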

Experimental Setup and Results

Experiments were conducted on the Slakh dataset and an internal dataset of proprietary music, using a single NVIDIA A100 GPU. The non-autoregressive model demonstrated sound performance across several metrics, including Fréchet Audio Distance (FAD) and a novel Music Information Retrieval Descriptor Distance (MIRDD). Compared to baseline models, the proposed StemGen architecture evidenced competitive performance in audio quality and musical coherence.

Figure 4: Schematic showing how the $Q$ RVQ levels of both the context-mix and the target stem are converted into continuous embeddings using codebooks $E_{1 \hdots Q}$.

Ablation studies affirmed the impact of the novel decoding improvements, primarily showing that guidance over multiple sources significantly enhanced model outcomes, with a notable reduction in FAD and MIRDD scores.

Conclusion

StemGen marks a new direction in music generation, offering a framework in which models produce music that aligns with pre-existing material. Its architecture provides an efficient tool potentially valuable to music producers, opening avenues for creative collaboration with AI. While the results place StemGen on par with state-of-the-art text-conditioned models, further refinement and larger training datasets could extend its scope and scalability.
