SongEcho: Towards Cover Song Generation via Instance-Adaptive Element-wise Linear Modulation

Published 23 Feb 2026 in cs.SD | (2602.19976v1)

Abstract: Cover songs constitute a vital aspect of musical culture, preserving the core melody of an original composition while reinterpreting it to infuse novel emotional depth and thematic emphasis. Although prior research has explored the reinterpretation of instrumental music through melody-conditioned text-to-music models, the task of cover song generation remains largely unaddressed. In this work, we formulate cover song generation as a conditional generation task that simultaneously generates new vocals and accompaniment conditioned on the original vocal melody and text prompts. To this end, we present SongEcho, which leverages Instance-Adaptive Element-wise Linear Modulation (IA-EiLM), a framework that enables controllable generation by improving both the conditioning injection mechanism and the conditional representation. To enhance the conditioning injection mechanism, we extend Feature-wise Linear Modulation (FiLM) to Element-wise Linear Modulation (EiLM), which facilitates precise temporal alignment in melody control. For conditional representations, we propose Instance-Adaptive Condition Refinement (IACR), which refines conditioning features by interacting with the hidden states of the generative model, yielding instance-adaptive conditioning. Additionally, to address the scarcity of large-scale, open-source full-song datasets, we construct Suno70k, a high-quality AI song dataset enriched with comprehensive annotations. Experimental results across multiple datasets demonstrate that our approach generates superior cover songs compared to existing methods, while requiring fewer than 30% of the trainable parameters. The code, dataset, and demos are available at https://github.com/lsfhuihuiff/SongEcho_ICLR2026.


Summary

The paper introduces IA-EiLM, an instance-adaptive modulation framework that enables precise temporal alignment of melody conditions in cover song generation.
It integrates a Diffusion Transformer backbone with a custom melody encoder to achieve significant improvements in pitch accuracy and overall audio fidelity.
The approach offers parameter-efficient modulation and robust melody control, setting a new benchmark for controllable cover song synthesis.

SongEcho: Instance-Adaptive Modulation for Controllable Cover Song Generation

Problem Formulation and Motivation

Cover song generation, the simultaneous synthesis of new vocals and coherent accompaniment conditioned on the original vocal melody and textual prompts, remains an open challenge distinct from traditional Singing Voice Synthesis (SVS) and Singing Voice Conversion (SVC). Existing approaches either focus on a single track or fail to achieve precise temporal melody control, lyric synchronization, and natural instrumental coherence. The principal bottleneck lies in robust condition injection, the mechanism by which melody control is integrated into generative backbones.

IA-EiLM: Instance-Adaptive Element-wise Linear Modulation

SongEcho introduces IA-EiLM, comprising Element-wise Linear Modulation (EiLM) and Instance-Adaptive Condition Refinement (IACR). Conventional condition injection schemes are either computationally redundant (cross-attention) or inflexible in how they modulate features (element-wise addition). EiLM extends FiLM, which is well established in vision and speech, to permit element-wise, temporally aligned modulation:

$h^m_i = \text{EiLM}(h_i|c) = \gamma_i \odot h_i + \beta_i,$

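where $h_i$ is the hidden state, $c$ is the melody condition, $\odot$ denotes element-wise multiplication, $(\gamma_i,\beta_i) = f_i(c)$, and $f_i$ is a linear projector whose output matches the dimensionality of $h_i$.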

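Unlike FiLM, which applies a single scale and shift per feature channel across all positions, EiLM assigns a scale and shift to every element of the hidden state. The melody condition can therefore be aligned with the hidden states frame by frame, obviating the need for sequence partitioning or indirect alignment mechanisms.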
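The contrast between FiLM and EiLM can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the shapes `T`, `D`, `C`, the random weights, and the single shared projector (`W`, `b`) are assumptions chosen for illustration, and the index $i$ from the equation is dropped.

```python
import numpy as np

def film(h, c_global, W, b):
    """FiLM: one (gamma, beta) pair per feature channel, shared by all frames.

    h: (T, D) hidden states, c_global: (C,) condition, W: (C, 2D), b: (2D,)
    """
    gamma, beta = np.split(c_global @ W + b, 2)    # each of shape (D,)
    return gamma * h + beta                        # broadcast over the T frames

def eilm(h, c, W, b):
    """EiLM: one (gamma, beta) pair per element of h.

    h: (T, D) hidden states, c: (T, C) time-aligned condition
    """
    gamma, beta = np.split(c @ W + b, 2, axis=-1)  # each of shape (T, D)
    return gamma * h + beta                        # gamma ⊙ h + beta

# Hypothetical sizes: 6 frames, 4 hidden channels, 3 condition channels.
rng = np.random.default_rng(0)
T, D, C = 6, 4, 3
h = rng.normal(size=(T, D))
W = rng.normal(size=(C, 2 * D))
b = rng.normal(size=2 * D)

c_global = rng.normal(size=C)          # one condition vector for the whole clip
c = rng.normal(size=(T, C))            # one condition vector per frame
c_const = np.tile(c_global, (T, 1))    # the same vector repeated on every frame

film_out = film(h, c_global, W, b)
eilm_out = eilm(h, c, W, b)
```

When the condition is identical on every frame (`c_const`), the EiLM sketch reduces to FiLM; with a frame-varying condition, each time step receives its own scale and shift.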

Figure 1: EiLM enables direct temporal alignment and flexible modulation, contrasting with the rigidity and complexity of prior addition and cross-attention-based mechanisms.

IACR further refines the conditional representation by facilitating active interaction between melody conditions and hidden states, using a gating mechanism inspired by WaveNet. This enables dynamic adaptation, resolving the underconstrained optimization problem inherent in static conditional injection. The refined conditional vector

$c_i = \tanh(h'_i) \odot \tanh(m'_i),$

where $h'_i$ and $m'_i$ are linear projections of the hidden state and the melody feature, respectively, supports instance-tailored modulation parameters $(\gamma_{h,m}, \beta_{h,m}) = F(m,h)$.
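As a rough sketch of the refinement step, the gated product above can be written in a few lines of NumPy. The projection matrices `W_h` and `W_m`, the gate width `G`, the omission of biases, and all sizes are assumptions for illustration; the gating follows the tanh-tanh form of the equation as given here.

```python
import numpy as np

def iacr(h, m, W_h, W_m):
    """Refine the melody condition by gating it against the hidden state.

    h: (T, D) hidden states, m: (T, M) melody features
    W_h: (D, G), W_m: (M, G) linear projections (biases omitted for brevity)
    Returns c: (T, G), with c = tanh(h') ⊙ tanh(m').
    """
    h_proj = h @ W_h    # h'
    m_proj = m @ W_m    # m'
    return np.tanh(h_proj) * np.tanh(m_proj)

# Hypothetical sizes: 5 frames, hidden width 8, melody width 4, gate width 6.
rng = np.random.default_rng(1)
T, D, M, G = 5, 8, 4, 6
W_h = rng.normal(scale=0.5, size=(D, G))
W_m = rng.normal(scale=0.5, size=(M, G))

m = rng.normal(size=(T, M))      # the same melody features in both cases
h_a = rng.normal(size=(T, D))    # hidden states of one instance
h_b = rng.normal(size=(T, D))    # hidden states of another instance

c_a = iacr(h_a, m, W_h, W_m)
c_b = iacr(h_b, m, W_h, W_m)
```

The same melody yields different refined conditions `c_a` and `c_b` because the hidden states differ, which is the sense in which the condition is instance-adaptive. A further linear layer (not shown) would map `c` to the modulation parameters.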

SongEcho Architecture and Integration

SongEcho extends the ACE-Step full-song generation model, which employs a Diffusion Transformer (DiT) backbone. IA-EiLM is injected before each FFN layer in the transformer blocks, where local feature transformation helps retain the melody; global self-attention, by contrast, dilutes melodic signals. The melody encoder, built from 1D convolutions, aligns extracted pitch sequences with the hidden states. Except for the IA-EiLM modules and the melody encoder, all model parameters are frozen, which keeps the number of trainable parameters small.
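The placement described above can be sketched with a toy transformer block in NumPy. Everything here is a stand-in rather than ACE-Step code: the single-head `token_mixing`, the omission of layer normalization, the residual layout, the sizes, and in particular the identity initialization of the modulation (`W_out` zero, scale one, shift zero) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, G = 8, 16, 12   # frames, hidden width, gate width (all hypothetical)

# Frozen backbone weights: stand-ins for a pretrained DiT block.
frozen = {
    "W_attn": rng.normal(scale=0.1, size=(D, D)),
    "W_ffn1": rng.normal(scale=0.1, size=(D, 4 * D)),
    "W_ffn2": rng.normal(scale=0.1, size=(4 * D, D)),
}
# Trainable IA-EiLM weights. Assumption: initialized so gamma = 1, beta = 0.
trainable = {
    "W_h": rng.normal(scale=0.1, size=(D, G)),
    "W_m": rng.normal(scale=0.1, size=(D, G)),
    "W_out": np.zeros((G, 2 * D)),
    "b_out": np.concatenate([np.ones(D), np.zeros(D)]),
}

def token_mixing(x, p):
    """Simplified single-head self-attention (global mixing across frames)."""
    scores = (x @ p["W_attn"]) @ x.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def ffn(x, p):
    """Position-wise feed-forward network (local, per-frame transformation)."""
    return np.maximum(x @ p["W_ffn1"], 0.0) @ p["W_ffn2"]

def ia_eilm(h, m, q):
    """Gate the melody against h, then apply an element-wise scale and shift."""
    c = np.tanh(h @ q["W_h"]) * np.tanh(m @ q["W_m"])
    gamma, beta = np.split(c @ q["W_out"] + q["b_out"], 2, axis=-1)
    return gamma * h + beta

def block(x, m, p, q=None):
    x = x + token_mixing(x, p)                        # frozen attention path
    ffn_in = x if q is None else ia_eilm(x, m, q)     # modulation before the FFN
    return x + ffn(ffn_in, p)                         # frozen FFN path

x = rng.normal(size=(T, D))
m = rng.normal(size=(T, D))   # melody features, already aligned to T frames
out_plain = block(x, m, frozen)
out_mod = block(x, m, frozen, trainable)

n_frozen = sum(v.size for v in frozen.values())
n_trainable = sum(v.size for v in trainable.values())
```

With the assumed identity initialization, `out_mod` equals `out_plain`, so the added modules leave the frozen backbone's behavior unchanged until training moves `W_out` away from zero. Only the entries of `trainable` would receive gradients.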

Figure 2: SongEcho pipeline showing DiT backbone, pitch extraction, melody encoding, and instance-adaptive conditioning through IA-EiLM modules integrated per transformer block.

Suno70k Dataset and Evaluation Protocol

Addressing data scarcity, Suno70k is constructed from a filtered subset of the Suno.ai Music Generation dataset. Rigorous curation involves metadata deduplication, quality assessment with SongEval, enhanced tagging via Qwen2-audio, and lyrics alignment consistent with ACE-Step requirements, yielding ~3,000 hours of high-quality AI-generated songs.

Melody control is evaluated objectively with Raw Pitch Accuracy (RPA), Raw Chroma Accuracy (RCA), and Overall Accuracy (OA). Audio similarity is measured with $\text{FD}_{\text{openl3}}$ and $\text{KL}_{\text{passt}}$, text-audio alignment with the CLAP score, and vocal content fidelity with Phoneme Error Rate (PER).
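For readers unfamiliar with the melody metrics, the sketch below computes RPA, RCA, and OA from two frame-aligned F0 sequences using the common melody-extraction definitions. The 50-cent tolerance, the convention that 0 Hz marks an unvoiced frame, and the toy sequences are assumptions; the paper's exact evaluation code may differ.

```python
import numpy as np

def to_cents(f_hz, base_hz=10.0):
    """Convert Hz to cents above base_hz; unvoiced (0 Hz) frames stay at 0."""
    f_hz = np.asarray(f_hz, dtype=float)
    cents = np.zeros_like(f_hz)
    voiced = f_hz > 0
    cents[voiced] = 1200.0 * np.log2(f_hz[voiced] / base_hz)
    return cents

def melody_metrics(ref_hz, est_hz, tol_cents=50.0):
    """Return (RPA, RCA, OA) for frame-aligned reference and estimated F0."""
    ref_hz = np.asarray(ref_hz, dtype=float)
    est_hz = np.asarray(est_hz, dtype=float)
    ref_voiced, est_voiced = ref_hz > 0, est_hz > 0
    n_voiced = ref_voiced.sum()
    if n_voiced == 0:
        raise ValueError("reference contains no voiced frames")

    both = ref_voiced & est_voiced
    diff = np.abs(to_cents(est_hz) - to_cents(ref_hz))
    # Fold the difference to the nearest octave so octave errors are ignored.
    chroma_diff = np.abs(diff - 1200.0 * np.round(diff / 1200.0))

    pitch_ok = both & (diff < tol_cents)
    chroma_ok = both & (chroma_diff < tol_cents)
    unvoiced_ok = ~ref_voiced & ~est_voiced

    rpa = pitch_ok.sum() / n_voiced
    rca = chroma_ok.sum() / n_voiced
    oa = (pitch_ok.sum() + unvoiced_ok.sum()) / len(ref_hz)
    return float(rpa), float(rca), float(oa)

# Toy example: exact match, octave error, unvoiced match, small error, missed frame.
ref = [220.0, 220.0, 0.0, 440.0, 330.0]
est = [220.0, 440.0, 0.0, 445.0, 0.0]
rpa, rca, oa = melody_metrics(ref, est)
print(rpa, rca, oa)  # 0.5 0.75 0.6
```

In the toy example the octave error counts against RPA but not RCA, and the correctly detected silent frame contributes only to OA.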

Empirical Results and Strong Numerical Claims

SongEcho demonstrates markedly superior performance against ACE-Step+SA ControlNet and ACE-Step+MuseControlLite baselines:

RPA: 0.7080 (about 14 percentage points above the nearest baseline).
RCA: 0.7339 (about 9 percentage points above the nearest baseline).
$\text{FD}_{\text{openl3}}$: 42.06 (more than 40% lower than the baseline values).
Trainable parameters: 49.1M (3.07% of ACE-Step+SA ControlNet).

Subjective evaluation via Mean Opinion Score confirms higher melody fidelity, text adherence, audio quality, and overall preference among both musician and non-musician listeners. Ablation studies validate the necessity of IACR and element-wise modulation, and show that melody metrics are sensitive to module placement and training data scale.

Qualitative Analysis, Limitations, and Future Directions

SongEcho achieves high-quality cover generation with precise melody control, maintaining vocal-accompaniment coherence and minimizing lyric-melody misalignment. The model is robust to tag-melody conflicts and supports flexible inpainting/outpainting with simple masking. Global tempo and key can be manipulated via post-processing of the F0 sequence.
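The F0 post-processing mentioned above can be sketched as follows. This is an illustrative guess at what such post-processing might look like, not the authors' procedure: the function names, the 0 Hz convention for unvoiced frames, and the nearest-frame resampling are all assumptions.

```python
import numpy as np

def transpose_f0(f0_hz, semitones):
    """Shift the key: scale voiced frames by 2^(semitones/12), keep 0 Hz as is."""
    f0 = np.asarray(f0_hz, dtype=float)
    return np.where(f0 > 0, f0 * 2.0 ** (semitones / 12.0), 0.0)

def change_tempo(f0_hz, rate):
    """Change the tempo by resampling frames; rate > 1 is faster (shorter).

    Nearest-frame lookup is used instead of interpolation so that voiced and
    unvoiced frames are never blended into spurious intermediate pitches.
    """
    f0 = np.asarray(f0_hz, dtype=float)
    n_out = max(1, int(round(len(f0) / rate)))
    idx = np.minimum((np.arange(n_out) * rate).astype(int), len(f0) - 1)
    return f0[idx]

# Toy F0 contour in Hz; 0 marks unvoiced frames.
f0 = np.array([0.0, 220.0, 220.0, 440.0, 0.0, 330.0, 330.0, 0.0])

up_octave = transpose_f0(f0, 12)   # one octave up: voiced values double
faster = change_tempo(f0, 2.0)     # half as many frames
slower = change_tempo(f0, 0.5)     # twice as many frames
```

The modified contour would then be passed to the melody encoder in place of the original, so the generated cover follows the transposed or time-scaled melody.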

Figure 3: Visualization of F0 contour, word-level, and phoneme-level timestamps demonstrates accurate implicit alignment between melody and transcribed lyrics.

Figure 4: Example transcript output using Whisper, highlighting segmentation errors present in external datasets versus the clean alignment in Suno70k.

The scope for expressive, micro-level cover reinterpretation is currently constrained by the foundation model (ACE-Step), especially in vocal timbre manipulation. Incorporating speaker encoders or paired original-cover datasets could permit finer-grained adaptation, potentially capturing the emotional and technical nuances typical of professional human covers. The advances in conditional normalization applied here may also be useful for controllable generation tasks in music, speech, and beyond.

Figure 5: Attention map visualization of MuseControlLite under full-audio conditioning shows that it relies on copying the conditioning audio and therefore cannot provide the flexible melody control that SongEcho targets.

Conclusion

SongEcho establishes a parameter-efficient, instance-adaptive modulation framework for cover song generation, advancing both condition injection and conditional representation with IA-EiLM. The method yields consistent improvements across melody control, audio quality, and text-music coherence, validated by objective and subjective metrics. The Suno70k dataset, curated from AI-generated songs, helps mitigate longstanding copyright and data access issues for full-song AI generation. Broader implications include the extensibility of instance-adaptive modulation to other conditional generative domains and practical approaches to high-fidelity musical reinterpretation in automated systems.
