Mustango: Toward Controllable Text-to-Music Generation

Published 14 Nov 2023 in eess.AS (arXiv:2311.08355v3)

Abstract: The quality of text-to-music models has reached new heights due to recent advancements in diffusion models. The controllability of various musical aspects, however, has barely been explored. In this paper, we propose Mustango: a music-domain-knowledge-inspired text-to-music system based on diffusion. Mustango aims to control the generated music not only with general text captions, but with richer captions that can include specific instructions related to chords, beats, tempo, and key. At the core of Mustango is MuNet, a Music-Domain-Knowledge-Informed UNet guidance module that steers the generated music to include the music-specific conditions, which we predict from the text prompt, as well as the general text embedding, during the reverse diffusion process. To overcome the limited availability of open datasets of music with text captions, we propose a novel data augmentation method that includes altering the harmonic, rhythmic, and dynamic aspects of music audio and using state-of-the-art Music Information Retrieval methods to extract the music features, which are then appended to the existing descriptions in text format. We release the resulting MusicBench dataset, which contains over 52K instances and includes music-theory-based descriptions in the caption text. Through extensive experiments, we show that the quality of the music generated by Mustango is state-of-the-art, and that its controllability through music-specific text prompts greatly outperforms other models such as MusicGen and AudioLDM2.

Citations (32)

Summary

  • The paper demonstrates that incorporating a Music-Domain-Knowledge-Informed UNet (MuNet) significantly enhances control over musical attributes such as tempo, key, and chord progression.
  • It introduces an innovative data augmentation strategy that expands the dataset tenfold to over 52,000 instances with detailed music-theoretical captions.
  • Mustango outperforms state-of-the-art models in objective evaluations and subjective listening tests, highlighting its potential for precise, high-quality music synthesis.

Overview of "Mustango: Toward Controllable Text-to-Music Generation"

The paper, "Mustango: Toward Controllable Text-to-Music Generation," introduces a novel text-to-music system named Mustango that leverages music domain knowledge and diffusion models to enhance control over text-prompted music generation. Mustango targets a current challenge within text-to-music models: improving the controllability of musical attributes such as tempo, key, and chord progression while maintaining audio quality.

Central to Mustango's architecture is MuNet, a Music-Domain-Knowledge-Informed UNet guidance module, which facilitates the incorporation of specific musical instructions derived from text prompts during the reverse diffusion process. This approach distinguishes Mustango from other models by enabling the generation of music that more accurately aligns with detailed textual instructions. These conditions include chord sequences, beat structures, and tempo settings, going beyond simple style or mood descriptions typically handled by existing systems.
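The guidance idea can be illustrated with a schematic sketch. This is not Mustango's actual implementation: the fused conditioning vector, the toy noise predictor, and the simplified update rule below are hypothetical stand-ins for the trained MuNet, meant only to show how predicted music-feature conditions and a text embedding might jointly steer a reverse-diffusion step via classifier-free-style guidance.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_conditions(text_emb, chord_emb, beat_emb):
    # Hypothetical fusion: concatenate the text embedding with the
    # music-feature embeddings into one conditioning vector.
    # Shapes and fusion strategy are illustrative, not from the paper.
    return np.concatenate([text_emb, chord_emb, beat_emb])

def denoise_step(x_t, cond, guidance_scale=3.0):
    # Stand-in for a UNet noise prediction; a real system would call a
    # trained network here. We fake it with a deterministic function.
    def eps(x, c):
        return 0.1 * x + (0.01 * c.sum() if c is not None else 0.0)

    # Classifier-free-style guidance: amplify the difference between
    # conditional and unconditional predictions by `guidance_scale`.
    e_uncond = eps(x_t, None)
    e_cond = eps(x_t, cond)
    e = e_uncond + guidance_scale * (e_cond - e_uncond)
    # One simplified reverse-diffusion update (schematic, not DDPM-exact).
    return x_t - e

x = rng.standard_normal(8)  # toy latent
cond = encode_conditions(rng.standard_normal(4),
                         rng.standard_normal(4),
                         rng.standard_normal(4))
for _ in range(10):  # iterate the reverse process
    x = denoise_step(x, cond)
print(x.shape)  # (8,)
```

The key point the sketch captures is that the music-specific conditions enter at every denoising step, so the guidance can continuously pull the sample toward the requested chords, beats, and tempo rather than conditioning only once at the start.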

Data Augmentation and MusicBench Dataset

To address the limited availability of open datasets featuring comprehensive music captions, the authors propose an innovative data augmentation strategy. The process first alters musical attributes such as harmony, tempo, and dynamics to create audio variants, then applies state-of-the-art Music Information Retrieval (MIR) techniques to extract these features and append them, verbalized as text, to the existing descriptions. The resulting MusicBench dataset is tenfold larger than its precursor, MusicCaps, containing over 52,000 instances enriched with detailed music-theoretical descriptions in the text captions.
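The caption-enrichment step of this pipeline can be sketched as a simple templating function. The template sentences and the function name below are illustrative assumptions, not the paper's exact wording; the point is that extracted MIR features (tempo, key, chords) are turned into natural-language sentences and appended to the original caption.

```python
def augment_caption(caption, tempo_bpm, key, chords):
    # Hypothetical caption enrichment mirroring the paper's idea:
    # verbalize extracted MIR features and append them to the
    # existing description. Templates here are illustrative.
    parts = [caption.rstrip(".") + "."]
    parts.append(f"The tempo of this song is {tempo_bpm} beats per minute.")
    parts.append(f"The song is in the key of {key}.")
    parts.append("The chord progression is " + ", ".join(chords) + ".")
    return " ".join(parts)

enriched = augment_caption(
    "A mellow acoustic guitar piece",
    tempo_bpm=92,
    key="G major",
    chords=["G", "Em", "C", "D"],
)
print(enriched)
```

Because the appended sentences follow a predictable structure, a model trained on such captions can learn a reliable mapping from music-theory phrases to the corresponding audio attributes.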

Experimental Evaluation

Mustango's performance was rigorously evaluated against state-of-the-art text-to-music generation models like MusicGen and AudioLDM2, as well as against variations of the predecessor model, Tango. The evaluation employed both objective metrics (such as Fréchet Distance (FD), Fréchet Audio Distance (FAD), and Kullback-Leibler Divergence (KL)) and subjective listening tests conducted with both general listeners and music experts.
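The Fréchet-style metrics fit a Gaussian to embeddings of real and generated audio and measure the distance between the two distributions. The sketch below implements the standard Fréchet distance formula on synthetic embeddings; it is generic metric code, not the paper's evaluation pipeline, and the choice of embedding model (FAD commonly uses VGGish features) is left out.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu1, cov1, mu2, cov2):
    # Standard Fréchet distance between two Gaussians:
    # ||mu1 - mu2||^2 + Tr(cov1 + cov2 - 2 * (cov1 @ cov2)^(1/2))
    diff = mu1 - mu2
    covmean = sqrtm(cov1 @ cov2)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    return float(diff @ diff + np.trace(cov1 + cov2 - 2.0 * covmean))

def stats(x):
    # Fit a Gaussian to a set of embeddings (rows = samples).
    return x.mean(axis=0), np.cov(x, rowvar=False)

rng = np.random.default_rng(0)
real = rng.standard_normal((200, 4))   # stand-in "real audio" embeddings
fake = real + 0.5                      # shifted "generated" embeddings
fad = frechet_distance(*stats(real), *stats(fake))
print(round(fad, 3))  # 1.0: covariances match, so only the mean shift counts
```

Lower values indicate that the generated audio's embedding distribution is closer to that of real audio, which is why lower FAD and FD scores are reported as better.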

Results indicate that Mustango outperforms its competitors not only in audio quality, as demonstrated by lower FAD and KL scores, but also in its ability to follow complex musical instructions in prompts. The subjective listening studies corroborate these findings, highlighting Mustango's superior musical quality and control over specific musical elements, as perceived by human evaluators.

Implications and Future Work

Mustango represents a notable stride toward highly controllable music generation, contributing to both theoretical and practical aspects of AI in music. The ability to effectively control musical elements through detailed text instructions opens new possibilities for music creation, offering musicians, sound designers, and producers a powerful tool for composing music that meets precise artistic requirements. Moreover, the open release of the MusicBench dataset provides a valuable resource for further research in this domain.

Future developments could include expanding the system to handle longer pieces of music, facilitating real-time interactive applications, and exploring control over more nuanced aspects of musical composition. The research also invites exploration into culturally diverse music datasets, potentially enhancing the model’s capability to generate a wider array of global music styles.

In conclusion, Mustango sets a practical precedent for integrating domain-specific knowledge into diffusion models, demonstrating that careful incorporation of musical structure into learning processes can significantly enhance the fidelity of AI-generated music.
