ACE-Step v1.5: Modular Music Generation Model
- ACE-Step v1.5 is an open-source hybrid music foundation model that integrates a planning LM with a diffusion transformer for high-quality song synthesis.
- It utilizes intrinsic reinforcement learning and chain-of-thought blueprinting to ensure controlled, multilingual music generation with low latency.
- The model supports real-time editing and LoRA-based personalization, offering a modular workflow for commercial-grade music creation on consumer hardware.
ACE-Step v1.5 is an open-source music foundation model that combines efficient large-scale generation, fine-grained controllability, and robust alignment mechanisms, yielding a high-fidelity pipeline capable of commercial-grade results on consumer-grade hardware. The system pairs a planning-focused language model (LM) with a Diffusion Transformer (DiT) acoustic renderer, connected through a compressed latent representation and aligned via intrinsic reinforcement learning. ACE-Step v1.5 achieves sub-10 s full-song synthesis in under 4 GB of VRAM, and supports multilingual prompt handling, editing, and lightweight user personalization through LoRA adapters. The design systematically decouples high-level musical reasoning from low-level audio synthesis, establishing a modular pipeline for generative music workflows (Gong et al., 31 Jan 2026, Gong et al., 28 May 2025).
1. Hybrid Model Architecture and Component Specialization
At the core of ACE-Step v1.5 lies a hybrid architecture that separates high-level planning from audio rendering. The LM component is a 1.7B-parameter Qwen model trained on ChatML-formatted supervision. It serves as an "omni-capable planner" with four operational modes:
- Planner Mode: Maps free-form prompts to structured YAML song blueprints, including musical structure, metadata, and lyrics.
- Listener Mode: Decodes latent audio codes to confirm or transcribe metadata and lyrics.
- Co-Pilot/Refiner Modes: Expand or canonicalize blueprints, optimizing user-provided structures.
Song plans are emitted as multi-step chain-of-thought (CoT) reasoning: each reasoning step is explicitly enumerated and leads to a final YAML-formatted specification (e.g., bpm, key, structure, and lyrics annotated with rationales).
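To make the Planner Mode output concrete, the sketch below builds a blueprint as a plain dictionary and checks the required fields. The field names and values are illustrative assumptions, not the model's exact YAML schema:

```python
# Hypothetical shape of a Planner-Mode song blueprint; field names and
# contents are illustrative, not ACE-Step's exact schema.
blueprint = {
    "bpm": 96,
    "key": "A minor",
    "structure": ["intro", "verse", "chorus", "verse", "chorus", "outro"],
    "lyrics": {
        "verse": "City lights are fading slow...",
        "chorus": "Hold on to the night...",
    },
    "rationale": "Mid-tempo ballad implied by the prompt's mood words.",
}

def validate_blueprint(bp: dict) -> bool:
    """Check that the core planning fields are present and well-typed."""
    required = {"bpm": int, "key": str, "structure": list, "lyrics": dict}
    return all(isinstance(bp.get(k), t) for k, t in required.items())

print(validate_blueprint(blueprint))  # True
```

In the real system each field would additionally carry the CoT rationale that justified it; here a single `rationale` string stands in for that trace.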
The DiT component is a 2B-parameter hybrid-attention Transformer that performs acoustic denoising over finite scalar-quantized (FSQ) VAE codes, translating high-level blueprints into acoustic latents that the VAE decodes to waveforms. Odd-numbered layers implement sliding-window local attention, while even-numbered layers use global grouped-query attention (GQA). Text, timbre, and lyric-code embeddings are injected via cross-attention.
Compression is achieved through a patchified FSQ VAE: continuous 25 Hz latents are quantized to 5 Hz discrete tokens (codebook ≈64K) and concatenated with noised target latents and control masks at a combined token rate of 12.5 Hz. The forward diffusion process follows $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ and a noise schedule $\{\beta_t\}$ over $T$ steps, where $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$. The DiT network $\epsilon_\theta$ learns to predict the noise, minimizing the per-step MSE $\mathcal{L} = \mathbb{E}\,\|\epsilon - \epsilon_\theta(x_t, t)\|^2$.
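A minimal, dependency-free sketch of this forward process and noise-prediction objective follows. The linear β schedule and list-based "tensors" are simplifications for illustration; ACE-Step's actual schedule and latent shapes differ:

```python
import math
import random

def add_noise(x0, t, T, beta_min=1e-4, beta_max=0.02, rng=random):
    """DDPM-style forward step: x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps
    with abar_t the cumulative product of (1 - beta_s) under a linear
    beta schedule. Returns the noised latent and the noise target."""
    betas = [beta_min + (beta_max - beta_min) * i / (T - 1) for i in range(T)]
    abar = 1.0
    for i in range(t + 1):
        abar *= 1.0 - betas[i]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(abar) * a + math.sqrt(1.0 - abar) * e
          for a, e in zip(x0, eps)]
    return xt, eps

def mse(pred, target):
    """Per-step noise-prediction loss against the sampled eps."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)
```

A training step would call `add_noise` on a clean latent, run the DiT on `xt`, and minimize `mse(prediction, eps)`.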
<!-- Table: Model Component Summary -->

| Component | Role | Key Features |
|-----------|------|--------------|
| LM (Qwen, 1.7B) | Song planning, metadata, lyrics | CoT YAML, multi-task supervision |
| DiT (2B) | Audio synthesis | Hybrid attention, FSQ VAE, patchified tokens |
Dynamic-Shift Distillation in the DiT improves sample trajectory diversity by learning transitions over multiple diffusion steps. A ConvNeXt-based discriminator in latent space sharpens acoustic textures. No classifier-free guidance is needed at inference.
2. Intrinsic Reinforcement Learning for Alignment
ACE-Step v1.5 uses only intrinsic model mechanisms for alignment, discarding external reward models or human preference annotations. Two distinct methods are implemented:
- DiffusionNTF for DiT: The reward signal is the Attention Alignment Score (AAS), computed by dynamic time warping (DTW) over cross-attention maps between lyric tokens and acoustic frames. AAS integrates coverage (lyric-to-frame alignment), monotonicity (discouraging non-causal jumps), and path confidence (attention density). The DiT policy, parameterized by $\theta$, is updated to maximize the expected AAS reward $\mathbb{E}[R_{\mathrm{AAS}}]$.
- GRPO for LM: The LM is rewarded for planning blueprints and reconstructing captions using a PMI-based reward $R_{\mathrm{PMI}} = \log p(y \mid x) - \log p(y)$, incentivizing specific, non-generic outputs. The total LM reward sums the per-task reward components. Group Relative Policy Optimization (GRPO) with grouped updates ensures consistent behavior across multiple languages and modalities.
These strategies enable the system to maintain strong semantic and acoustic alignment, particularly for complex tasks such as lyric-conditioned music generation and structure transfer (Gong et al., 31 Jan 2026).
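Both intrinsic rewards can be sketched in a few lines of Python. The AAS here is a toy version: a DTW-style dynamic program scoring the best monotonic path through a small attention map, without the full coverage and confidence terms; the PMI reward takes placeholder log-probabilities rather than real model scores:

```python
def aas(attn):
    """Toy Attention Alignment Score: score of the best monotonic path
    through a (lyric tokens x audio frames) cross-attention map, found by
    a DTW-style dynamic program and normalized by path length. The real
    AAS adds coverage and path-confidence terms, omitted here."""
    n, m = len(attn), len(attn[0])
    best = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            cands = []
            if i > 0:
                cands.append(best[i - 1][j])
            if j > 0:
                cands.append(best[i][j - 1])
            if i > 0 and j > 0:
                cands.append(best[i - 1][j - 1])
            prev = max(cands) if cands else 0.0
            best[i][j] = prev + attn[i][j]
    return best[n - 1][m - 1] / (n + m - 1)

def pmi_reward(logp_cond, logp_prior):
    """PMI-style reward log p(y|x) - log p(y): captions that are likely
    given the audio but unlikely a priori (i.e., specific) score high."""
    return logp_cond - logp_prior
```

A diagonal (well-aligned) attention map scores higher under `aas` than a scattered one, and a generic caption whose conditional and prior probabilities are similar earns a near-zero PMI reward.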
3. Chain-of-Thought Guidance and Multilingual Generalization
The LM implements explicit CoT-style blueprint generation in Planner Mode: the output reasoning trace is required before the final YAML is emitted. Each CoT step corresponds to a specific musical or lyric attribute, keeping planning grounded in the prompt. Supervision applies joint cross-entropy loss over both CoT steps and final structured fields.
During training, prompt fidelity is enforced by the cross-entropy between requested metadata tokens and LM outputs. For the DiT, AAS is further used as a penalty when generated audio deviates from prompt-specified attributes such as tempo. Intrinsic RL aggregates reward signals per language group, encouraging robust multilingual performance.
To further support over 50 languages, stochastic romanization is used: 50% of non-Roman tokens are phonemized into Latin form during training, improving handling of rare tokens without inflating the vocabulary.
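A minimal sketch of stochastic romanization, assuming a lookup-table romanizer (the hypothetical `ROMAN` dict stands in for a real grapheme-to-phoneme model):

```python
import random

# Hypothetical romanization table; a real system would use a G2P model.
ROMAN = {"音": "on", "楽": "gaku", "歌": "uta"}

def stochastic_romanize(tokens, p=0.5, rng=None):
    """With probability p, replace a non-Roman token by its Latin form,
    so training exposes both scripts for the same word. Tokens without
    a romanization pass through unchanged."""
    rng = rng or random.Random(0)
    return [ROMAN.get(t, t) if t in ROMAN and rng.random() < p else t
            for t in tokens]
```

At p = 0.5 roughly half of the romanizable tokens in each batch appear in Latin form, which is the mechanism the paper credits with improving rare-token handling without vocabulary growth.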
4. Personalization and User Adaptation with LoRA
ACE-Step v1.5 enables user-specific style adaptation through parameter-efficient Low-Rank Adapters (LoRA) integrated in both the LM and the DiT. The adapter decomposition is $W' = W + \frac{\alpha}{r} BA$ with $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and rank $r \ll \min(d, k)$. Typical hyperparameters are:
- A small rank $r$ and scaling factor $\alpha$, with a conservative learning rate
- Training requires only 5–10 short user reference tracks
- Optimization uses AdamW with no weight decay
This adaptation method preserves the core model weights, permitting rapid, data-efficient user personalization for stylistic imitation or canonicalization (Gong et al., 31 Jan 2026).
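The LoRA forward pass reduces to one extra low-rank matrix product. The sketch below uses plain nested lists for clarity (real adapters operate on framework tensors):

```python
def matvec(M, x):
    """Dense matrix-vector product over nested lists."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x): the frozen base weight W plus a
    rank-r update B @ A scaled by alpha/r, per the standard LoRA
    parameterization. Only A and B are trained."""
    base = matvec(W, x)                  # frozen path
    delta = matvec(B, matvec(A, x))      # A: r x k, B: d x r
    s = alpha / r
    return [b + s * d for b, d in zip(base, delta)]
```

Because `W` never changes, adapters for different users can be swapped in and out without touching the 1.7B/2B base weights.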
5. Controllability, Editing, and Real-Time Operations
ACE-Step v1.5 provides a suite of precise editing operations:
- Cover Generation: By masking timbre positions and denoising over melody codes plus stochastic perturbation, new covers are generated while retaining melodic structure.
- Repainting: Segment-specific regeneration is achieved by masking latent patch indices over arbitrary intervals and re-sampling only those segments via DiT.
- Vocal-to-BGM Conversion: Automated stem extraction isolates the vocal codes; these are replaced with noise and the masked regions are re-denoised, so the DiT produces an instrumental background track.
- Mask-based Inference: All operations rely on fine-grained masking in the latent space, enabling surgical edits, stem isolation, and partial regeneration.
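All four operations share the same masked-latent pattern, sketched below. The `denoise` callable stands in for the DiT sampler (a hypothetical signature), and latents are flattened to a 1-D list for brevity:

```python
import random

def repaint(latents, mask, denoise, rng=None):
    """Mask-based editing sketch: re-noise only the masked latent
    positions and re-denoise them while the unmasked context stays
    fixed. `denoise(noised, mask)` stands in for the DiT sampler."""
    rng = rng or random.Random(0)
    noised = [rng.gauss(0.0, 1.0) if m else z
              for z, m in zip(latents, mask)]
    out = denoise(noised, mask)
    # Hard-keep the context: only masked slots may change.
    return [o if m else z for z, o, m in zip(latents, out, mask)]
```

Cover generation, repainting, and vocal-to-BGM differ only in which positions the mask selects (timbre codes, a time interval, or an extracted vocal stem).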
Usage is streamlined with a command-line inference API supporting prompts in >50 languages, optional reference audio for timbre cloning, and arbitrary user-provided YAML blueprints. The song specification output can be seamlessly integrated into music production workflows (e.g., DAWs) (Gong et al., 31 Jan 2026, Gong et al., 28 May 2025).
6. Performance, Evaluation, and Comparative Metrics
ACE-Step v1.5 achieves high efficiency in both resource utilization and quality metrics:
- Latency: Sub-2s full-song generation on A100 GPU, <10s on RTX 3090; VRAM footprint <4GB (total: DiT + LM + VAE).
- AudioBox-Aesthetics metrics: CE (content enjoyment), CU (content usefulness), PC (production complexity), PQ (production quality); e.g., CU = 8.09, PQ = 8.35 (tied for 2nd among open-source systems).
- SongEval: coherence, musicality, memorability, clarity of structure, and naturalness; Coherence = 4.72 (tied best).
- Style/Lyric Alignment via ACE-Caption/Reward: 39.1/26.3, a significant increase over prior versions.
- Human evaluation (from previous versions): Musicality, expression, and sound quality metrics consistently match or exceed leading open- and closed-source systems (Gong et al., 28 May 2025).
Through direct comparison with other models (Yue, SongGen, Suno v3, Udio v1), ACE-Step v1.5 exhibits superior or competitive results in musical coherence, lyric intelligibility, and controllability, while substantially improving generation speed.
7. Training, Data, and Advanced Techniques
Model training is performed on 1.8M songs spanning 19 languages (100K hours), with annotations derived from self-supervised and LLM-based sources (e.g., Whisper large-v3, Qwen-Omni, All-In-One music understanding). Grapheme-to-phoneme conversion, BPE tokenization, and variable-length sampling provide robust linguistic and musical coverage.
Semantic alignment is enforced by minimizing REPA (Representation Alignment) losses between DiT features and MERT/mHuBERT speech/music representations. Addition of this loss accelerates convergence, improves lyric/melody harmony, and supports in-the-wild robustness.
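A minimal REPA-style alignment term can be written as one minus the cosine similarity between a DiT hidden state and a frozen SSL (MERT/mHuBERT) feature, averaged over frames. The real loss projects features through a learned head first, omitted in this sketch:

```python
import math

def repa_loss(dit_feats, ssl_feats):
    """REPA-style alignment sketch: mean (1 - cosine similarity) between
    per-frame DiT features and frozen SSL teacher features. The learned
    projection head used in practice is omitted for brevity."""
    def cos(u, v):
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return sum(a * b for a, b in zip(u, v)) / (nu * nv)
    return sum(1.0 - cos(u, v)
               for u, v in zip(dit_feats, ssl_feats)) / len(dit_feats)
```

The loss is zero when the two feature streams point the same way and grows as they diverge, which is what pulls DiT representations toward the music-semantic space of the teacher.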
Key v1.5 refinements include DCAE compression (f8c8: 8× downsampling, 8 channels), AdaLN-single for parameter reduction, 1D convolutions in the FFN, logit-normal time sampling for better noise-regime coverage, and staged dropout in the conditioning stacks. A curriculum on the mHuBERT loss weight preserves instrumental fidelity during fine-tuning.
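Logit-normal time sampling is a one-liner: draw a Gaussian and squash it through a sigmoid, so sampled timesteps concentrate at mid-range noise levels. A sketch, with μ and σ as free parameters:

```python
import math
import random

def logit_normal_t(rng, mu=0.0, sigma=1.0):
    """Sample a diffusion timestep t in (0, 1) as sigmoid(N(mu, sigma)).
    Relative to uniform sampling, mass concentrates around t = 0.5,
    i.e., the mid-range noise levels the refinement targets."""
    z = rng.gauss(mu, sigma)
    return 1.0 / (1.0 + math.exp(-z))
```

Shifting μ biases training toward cleaner (μ < 0) or noisier (μ > 0) regimes; the specific μ, σ used by ACE-Step are not given here.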
Optimization relies on AdamW, ZeRO stage 2 DDP, and smart batching for handling variable-length audio segments (5–240s). Together, these approaches yield ~15% better human-rated coherence, 20% lower FAD error at DCAE output, and 10% end-to-end speedup (Gong et al., 28 May 2025).
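"Smart batching" of 5–240 s clips can be sketched as greedy length bucketing: sort clips by duration and pack batches under a total-seconds budget so that similarly sized clips share a batch and padding waste stays low. This is an assumed simplification, not the paper's exact scheduler:

```python
def smart_batches(durations, max_seconds):
    """Greedy length-bucketing sketch: visit clips shortest-first and
    pack indices into batches whose total duration stays under budget.
    A single clip longer than the budget still gets its own batch."""
    order = sorted(range(len(durations)), key=lambda i: durations[i])
    batches, cur, cur_sum = [], [], 0.0
    for i in order:
        if cur and cur_sum + durations[i] > max_seconds:
            batches.append(cur)
            cur, cur_sum = [], 0.0
        cur.append(i)
        cur_sum += durations[i]
    if cur:
        batches.append(cur)
    return batches
```

Because batch members have similar lengths, padding-to-longest costs little, which is the usual motivation for this kind of scheme alongside ZeRO-sharded data parallelism.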
ACE-Step v1.5 represents an architectural shift in open-source music generation, offering high performance, controllability, and robust multilingual capabilities. Its blend of structured planning, efficient latent diffusion, and intrinsic RL positions it as a modular foundation platform for music AI research and professional creative workflows (Gong et al., 31 Jan 2026, Gong et al., 28 May 2025).