
InspireMusic: Integrating Super Resolution and Large Language Model for High-Fidelity Long-Form Music Generation

Published 28 Feb 2025 in cs.SD, cs.AI, cs.CL, and eess.AS | (2503.00084v1)

Abstract: We introduce InspireMusic, a framework that integrates super resolution and an LLM for high-fidelity long-form music generation. The unified framework generates high-fidelity music, songs, and audio, incorporating an autoregressive transformer with a super-resolution flow-matching model. This framework enables controllable generation of high-fidelity long-form music at a higher sampling rate from both text and audio prompts. Our model differs from previous approaches in that we utilize an audio tokenizer with one codebook containing richer semantic information, thereby reducing training costs and enhancing efficiency. This combination enables high-quality audio generation with long-form coherence of up to $8$ minutes. An autoregressive transformer model based on Qwen 2.5 predicts audio tokens, and a super-resolution flow-matching model then generates high-sampling-rate audio with fine-grained details learned from an acoustic codec model. Comprehensive experiments show that the InspireMusic-1.5B-Long model performs comparably to recent top-tier open-source systems, including MusicGen and Stable Audio 2.0, on subjective and objective evaluations. The code and pre-trained models are released at https://github.com/FunAudioLLM/InspireMusic.

Summary

  • The paper introduces InspireMusic, a novel framework combining audio tokenization, an autoregressive transformer, and super-resolution flow-matching to generate high-fidelity music up to 8 minutes.
  • InspireMusic employs WavTokenizer for efficient sequence processing, an AR transformer for long-term coherence, and a super-resolution flow-matching (SRFM) model to reconstruct fine-grained audio from coarse tokens.
  • Evaluations demonstrate that the InspireMusic-1.5B-Long model outperforms existing models like MusicGen and Stable Audio 2.0 in subjective quality and musicality metrics, achieving higher Comparative Mean Opinion Scores.

This paper introduces InspireMusic, a novel framework for high-fidelity long-form music generation that combines super-resolution techniques with an LLM. The system comprises three primary components: audio tokenizers, an autoregressive transformer, and a super-resolution flow-matching model. The framework is designed to generate controllable, high-fidelity audio with long-form coherence, achieving up to 8 minutes of continuous music.
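The three-stage pipeline can be sketched end to end as follows. This is a minimal illustration, not the released implementation: every function here is a hypothetical stand-in for the corresponding component (WavTokenizer, the Qwen 2.5-based AR transformer, and the SRFM upsampler), and the token ids and waveforms are placeholders.

```python
import numpy as np

# Hypothetical stand-ins for the three InspireMusic stages (illustration only).

def tokenize_audio(waveform_24k, token_rate=75, sr=24_000):
    """WavTokenizer stand-in: map 24 kHz audio to discrete tokens at a 75 Hz rate."""
    n_tokens = int(len(waveform_24k) / sr * token_rate)
    return np.random.randint(0, 4096, size=n_tokens)  # single-codebook token ids

def ar_generate(prompt_tokens, n_new=750):
    """AR transformer stand-in: extend the sequence via next-token prediction."""
    continuation = np.random.randint(0, 4096, size=n_new)
    return np.concatenate([prompt_tokens, continuation])

def srfm_upsample(coarse_tokens, out_sr=48_000, token_rate=75):
    """SRFM stand-in: map coarse 24 kHz tokens to fine-grained 48 kHz audio."""
    duration_s = len(coarse_tokens) / token_rate
    return np.zeros(int(duration_s * out_sr))  # placeholder high-res waveform

prompt = np.zeros(24_000)          # 1 s of 24 kHz prompt audio
tokens = tokenize_audio(prompt)    # 75 coarse tokens
tokens = ar_generate(tokens)       # 75 + 750 tokens ~= 11 s of music
audio_48k = srfm_upsample(tokens)  # high-resolution output waveform
```

The key design point this sketch mirrors is the division of labor: the AR stage works on a short, coarse token sequence (cheap long-range modeling), while fidelity is recovered afterwards by the super-resolution stage.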

The paper highlights the limitations of existing music generation models: those that excel at capturing long-form musical structure often struggle with audio fidelity, while others offer high-quality audio but lack global coherence. InspireMusic aims to bridge this gap by integrating these different generative paradigms.

Key elements of the InspireMusic framework include:

  • Audio Tokenization: The framework employs WavTokenizer, which compresses 24 kHz audio into discrete tokens at a 75 Hz token rate using a single codebook at 0.9 kbps bandwidth. WavTokenizer captures global musical structure and enables efficient training and inference for the autoregressive model. It uses a vector quantization (VQ) approach, broader contextual windows, improved attention networks, and a multi-scale discriminator, along with an inverse fast Fourier transform (FFT) in the decoder.
  • Autoregressive Transformer: The core of InspireMusic is an AR transformer that uses the Qwen 2.5 model series as its backbone LLM. The model predicts the next audio token in a sequence, conditioned on preceding tokens, to generate long coherent sequences. It is trained with a next-token prediction objective, conditioned on inputs such as text descriptions ($s_t$), timestamps for time start ($s_{ts}$) and time end ($s_{te}$), music structure ($s_s$), label ($s_l$), and audio tokens ($s_a$), represented as $S = \{s_t^1, s_t^2, \cdots, s_t^m, s_{ts}, s_{te}, s_s, s_l, s_a^1, s_a^2, \cdots, s_a^n\}$, where $T = m + n + 4$. The input dimension sizes of the 0.5B and 1.5B models are $896$ and $1536$, respectively.
  • Super-Resolution Flow-Matching: An SRFM model enhances low-resolution coarse audio tokens into high-resolution, fine-grained audio outputs by learning optimal transformation paths between distributions. Unlike iterative methods, SRFM uses flow-matching techniques to directly model the mapping from coarse audio tokens, derived from low-sampling-rate waveforms, to fine-grained latent audio features extracted from higher-sampling-rate ($48$ kHz) audio via a $150$ Hz Hifi-Codec model.

    For the $150$ Hz Hifi-Codec model, given a single-channel audio sequence $X$ of duration $D$ as input, an encoder network $E$ transforms the raw audio into hidden features $H$; a group residual quantization layer $Q$, with a codebook size of $4$ and a codebook dimension of $C$, quantizes these features; and a decoder $G$ reconstructs the audio signal from the compressed latent features. In this study, $H = 1024$ and $C = 1024$.

  • Model Variants: The paper details several variants of InspireMusic, including InspireMusic-0.5B, InspireMusic-1.5B, and InspireMusic-1.5B-Long, each tailored for different performance levels and composition lengths.
  • Training Procedure: The training process involves multiple stages, including training audio tokenizers, the autoregressive transformer model, and the flow-matching model. The autoregressive transformer model undergoes pre-training on large-scale audio-text paired datasets, followed by fine-tuning on curated datasets with human-labeled text captions. The SRFM model trains using paired low- and high-resolution audio tokens to learn the upscaling transformation.
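To make the input layout of the AR transformer concrete, the conditioning sequence $S$ described above can be assembled as below. The special-token ids (9001–9004) are invented for illustration; only the layout and the length relation $T = m + n + 4$ come from the paper.

```python
def build_sequence(text_tokens, time_start, time_end, structure, label, audio_tokens):
    """Assemble S = {s_t^1..s_t^m, s_ts, s_te, s_s, s_l, s_a^1..s_a^n}.

    Total length is T = m + n + 4, the 4 being the conditioning tokens."""
    return (list(text_tokens)
            + [time_start, time_end, structure, label]
            + list(audio_tokens))

# At a 75 Hz token rate, 8 minutes of audio is 75 * 480 = 36_000 audio tokens.
m, n = 32, 75 * 480
S = build_sequence(range(m), 9001, 9002, 9003, 9004, range(n))
assert len(S) == m + n + 4
```

This arithmetic also shows why the single-codebook 75 Hz tokenizer matters: 8 minutes of music stays within a 36k-token context, which is tractable for an LLM backbone.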
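The SRFM training objective can be illustrated with a generic conditional flow-matching target. This is a sketch of the standard linear-interpolation (rectified-flow-style) formulation, assuming the model regresses a velocity field; the paper's exact path and parameterization may differ, and the "latent feature" here is a random placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_target(x0, x1, t):
    """Linear interpolation path between noise x0 and data x1, plus its velocity.

    x0: noise sample, x1: high-resolution latent feature, t in [0, 1]."""
    x_t = (1.0 - t) * x0 + t * x1  # point on the path at time t
    v_target = x1 - x0             # constant velocity along the linear path
    return x_t, v_target

# One training "step": a real model would regress v_pred(x_t, t, coarse_tokens)
# to v_target; here the prediction is a zero placeholder.
x1 = rng.standard_normal(256)  # stand-in for a 48 kHz Hifi-Codec latent
x0 = rng.standard_normal(256)  # noise sample
t = rng.uniform()
x_t, v_target = flow_matching_target(x0, x1, t)
loss = np.mean((v_target - 0.0) ** 2)  # MSE against a zero "prediction"
```

At inference, integrating the learned velocity field from $t=0$ to $t=1$, conditioned on the coarse tokens, yields the fine-grained high-resolution latents in a small number of steps, which is the advantage over iterative diffusion-style sampling noted above.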

The models were evaluated using both objective and subjective metrics. Objective metrics included Fréchet distance (FD), KL divergence, and the CLAP (Contrastive Language-Audio Pre-training) score. Subjective evaluations used the Comparative Mean Opinion Score (CMOS) from professional music raters, covering audio-text alignment, audio quality, musicality, and overall performance.
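The CLAP score is typically computed as the cosine similarity between a text embedding and an audio embedding from a pretrained CLAP model. A minimal sketch with placeholder embeddings (the real metric would use a CLAP checkpoint to produce `text_emb` and `audio_emb`):

```python
import numpy as np

def clap_score(text_emb, audio_emb):
    """Cosine similarity between L2-normalized text and audio embeddings."""
    text_emb = text_emb / np.linalg.norm(text_emb)
    audio_emb = audio_emb / np.linalg.norm(audio_emb)
    return float(text_emb @ audio_emb)

rng = np.random.default_rng(0)
t_emb = rng.standard_normal(512)  # placeholder CLAP text embedding
a_emb = rng.standard_normal(512)  # placeholder CLAP audio embedding
score = clap_score(t_emb, a_emb)  # in [-1, 1]; higher = better alignment
```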

The paper includes results from text-to-music and music continuation tasks, demonstrating that the InspireMusic-1.5B-Long model outperforms MusicGen and Stable Audio 2.0 across several evaluation dimensions. For example, in subjective evaluations on the text-to-music task, InspireMusic-1.5B-Long achieves a CMOS score 7% higher than Stable Audio 2.0 and 14% higher than InspireMusic-0.5B.

Ablation studies were conducted to assess the contribution of each component, revealing that removing the SRFM model results in a notable drop in audio fidelity. Evaluations also explored the impact of different Classifier-Free Guidance (CFG) values and audio generation lengths on model performance.
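Classifier-free guidance combines conditional and unconditional model predictions with a guidance weight. A minimal sketch of the standard formulation (generic, not the paper's exact code; the logit values are made up):

```python
import numpy as np

def cfg_combine(logits_uncond, logits_cond, w):
    """Classifier-free guidance: push predictions toward the condition.

    w = 1 recovers the conditional model; larger w strengthens guidance
    at the risk of reduced diversity and artifacts."""
    return logits_uncond + w * (logits_cond - logits_uncond)

uncond = np.array([0.0, 1.0, 0.0])  # made-up unconditional logits
cond = np.array([0.0, 0.0, 2.0])    # made-up conditional logits
guided = cfg_combine(uncond, cond, w=3.0)
```

Sweeping `w` trades text adherence against naturalness, which is what the CFG ablation above measures.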
