End-to-end controllable song generation with multi-condition inputs

Establish an end-to-end controllable song generation approach that jointly conditions on textual style descriptions, lyrics, and reference audio to guide the music synthesis process.

Background

The paper reviews limitations of prior music generation systems, noting that many models provide only coarse control or lack robust alignment between text and audio, and that maintaining long-range coherence is challenging. Within this context, the authors identify the specific difficulty of building a single system that can be jointly guided by multiple inputs (style descriptions, lyrics, and reference audio) while remaining controllable end-to-end.

The authors propose HeartMuLa to address this challenge, integrating HeartCLAP, HeartTranscriptor, and HeartCodec to support multi-conditional generation. The open challenge is stated explicitly in the introduction as the motivation for their framework.

References

"Furthermore, end-to-end controllable song generation jointly guided by style descriptions, lyrics, and reference audio remains an open challenge."

HeartMuLa: A Family of Open Sourced Music Foundation Models (2601.10547 - Yang et al., 15 Jan 2026), Section 1 (Introduction)