S²Voice: Style-Aware Singing Conversion
- S²Voice is a style-aware autoregressive system that enhances singing voice conversion with fine-grained style modeling and robust timbre conditioning.
- Its two-stage architecture integrates an autoregressive transformer and a flow-matching model, using FiLM-style conditioning, cross-attention, and global speaker embeddings.
- The system leverages large-scale curated training data and advanced strategies like supervised fine-tuning and direct preference optimization to achieve state-of-the-art SVCC results.
SVoice is a style-aware autoregressive modeling system with enhanced conditioning, designed for high-fidelity singing voice conversion. It was the winning entry in both the in-domain and zero-shot tracks of the Singing Voice Conversion Challenge (SVCC) 2025, and it extends the established two-stage Vevo baseline with fine-grained style modeling, robust timbre conditioning, large-scale curated training data, and advanced training methodologies including supervised fine-tuning and direct preference optimization (Wang et al., 20 Jan 2026).
1. System Architecture and Methodological Advances
SVoice is structured as a two-stage system. The first stage employs an autoregressive LLM (AR LLM) transformer that maps compressed content tokens (from a Whisper-based tokenizer with duration reduction) to a sequence of “content + style” tokens. The model is optimized via next-token cross-entropy. The second stage is a flow-matching transformer that reconstructs mel-spectrograms from these tokens, conditioned on a timbre reference, using a mean-squared vector-field loss.
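The second-stage objective can be illustrated with a minimal sketch, assuming the standard linear (rectified) conditional flow-matching path; the paper does not specify its exact probability path, and `predict_vf` is a hypothetical stand-in for the flow-matching transformer:

```python
import numpy as np

def flow_matching_loss(predict_vf, x0, x1, t):
    """Mean-squared vector-field loss, assuming a linear (rectified)
    conditional flow-matching path. `predict_vf` is a hypothetical
    stand-in for the second-stage flow-matching transformer."""
    x_t = (1.0 - t) * x0 + t * x1   # point on the straight noise->data path
    target = x1 - x0                # constant velocity of that path
    pred = predict_vf(x_t, t)
    return np.mean((pred - target) ** 2)

# toy check: a predictor that already outputs the true field has zero loss
x0, x1 = np.zeros(4), np.ones(4)
loss = flow_matching_loss(lambda x_t, t: x1 - x0, x0, x1, t=0.3)
```

In practice the predictor is also conditioned on the content-plus-style tokens and the timbre reference; those inputs are omitted here for brevity.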
Major architectural innovations introduced by SVoice include:
- FiLM-style Layer-Norm Conditioning: A global style encoder produces style embeddings that, through FiLM-like modulation, inject a style-dependent scale and shift into the layer normalization in each transformer block. For each layer $\ell$, the FiLM-LN is parameterized as
  $$\mathrm{FiLM\text{-}LN}_{\ell}(h) = \gamma_{\ell}(s) \odot \mathrm{LN}(h) + \beta_{\ell}(s),$$
  where $s$ is the global style embedding and $\gamma_{\ell}$, $\beta_{\ell}$ are learned per-layer projections.
- Style-Aware Cross-Attention: Each AR transformer block includes a cross-attention mechanism in which style embeddings serve as queries and the model’s hidden states as keys/values, enabling time-local, fine-grained style conditioning:
  $$c_{\ell} = \mathrm{softmax}\!\left(\frac{Q_s K_h^{\top}}{\sqrt{d}}\right) V_h, \qquad Q_s = W_Q s, \quad K_h = W_K h, \quad V_h = W_V h.$$
  The resulting context vectors modulate the AR state per layer.
- Global Speaker Embedding: A pre-trained speaker verification network extracts a latent embedding $e_{\mathrm{spk}} = \mathrm{SV}(x_{\mathrm{ref}})$ from the reference speaker waveform, which is provided as global conditioning to the flow-matching vector field $v_{\theta}(x_t, t \mid c, e_{\mathrm{spk}})$. This preserves singer timbre while decoupling style and content.
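A minimal NumPy sketch of the first two conditioning mechanisms, with hypothetical dimensions and randomly initialized projection matrices (none of these shapes or parameter names come from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(h, eps=1e-5):
    # zero-mean / unit-variance normalization over the feature axis
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)

def film_layer_norm(h, style, W_gamma, W_beta):
    """FiLM-LN: the style embedding predicts a per-layer scale and shift
    applied to the normalized hidden states."""
    gamma = style @ W_gamma   # style-dependent scale, shape (d_model,)
    beta = style @ W_beta     # style-dependent shift, shape (d_model,)
    return gamma * layer_norm(h) + beta

def style_cross_attention(style, h, W_q, W_k, W_v):
    """Cross-attention with style embeddings as queries and the AR hidden
    states as keys/values, i.e. time-local style conditioning."""
    Q, K, V = style @ W_q, h @ W_k, h @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)   # softmax over the time axis
    return w @ V                         # context vectors per style query

d_style, d_model, T = 8, 16, 10
h = rng.normal(size=(T, d_model))        # AR hidden states
s = rng.normal(size=(d_style,))          # global style embedding
out = film_layer_norm(h, s,
                      rng.normal(size=(d_style, d_model)),
                      rng.normal(size=(d_style, d_model)))
ctx = style_cross_attention(s[None, :], h,
                            rng.normal(size=(d_style, d_model)),
                            rng.normal(size=(d_model, d_model)),
                            rng.normal(size=(d_model, d_model)))
```

Both mechanisms keep the content pathway untouched: FiLM-LN rescales activations globally, while the cross-attention output varies per time step.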
2. Data Curation and Preprocessing Pipeline
SVoice’s data pipeline is fully automated and designed to maximize coverage, quality, and diversity:
- Web Harvesting: Crawls publicly available singing tracks to amass a large corpus.
- Vocal Separation: Utilizes a pre-trained band-split RoPE transformer (BS-RoFormer) to separate vocals from mixtures, ensuring high-quality vocal stems.
- ASR Fusion and Transcript Refinement: Applies multiple ASR systems (Whisper, Paraformer), computes per-token confidence and agreement, and uses an LLM (Qwen-3) with music-aware prompting to normalize lyrics.
- Quality Filtering: Employs DNSMOS P.835 for perceptual speech quality, and imposes energy, pitch stability, and noise ratio thresholds.
- Deduplication and Balancing: Balances for gender, style, and language, resulting in a dataset of approximately 500 hours covering English, Chinese, pop, rock, classical, and other genres (Wang et al., 20 Jan 2026).
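The filtering stage can be sketched as a simple predicate over precomputed per-clip metrics; every threshold value below is an illustrative placeholder, not a setting reported by the paper:

```python
def passes_quality_filters(metrics,
                           dnsmos_min=3.0,       # hypothetical DNSMOS P.835 floor
                           energy_min_db=-40.0,  # hypothetical energy floor (dBFS)
                           pitch_cv_max=0.3,     # hypothetical pitch-stability bound
                           noise_ratio_max=0.2): # hypothetical noise-ratio ceiling
    """Keep a clip only if every precomputed metric clears its threshold.
    `metrics` holds per-clip scores computed upstream (DNSMOS overall
    quality, mean energy, pitch coefficient of variation, noise ratio)."""
    return (metrics["dnsmos_ovrl"] >= dnsmos_min
            and metrics["energy_db"] >= energy_min_db
            and metrics["pitch_cv"] <= pitch_cv_max
            and metrics["noise_ratio"] <= noise_ratio_max)

clip = {"dnsmos_ovrl": 3.4, "energy_db": -22.0,
        "pitch_cv": 0.12, "noise_ratio": 0.05}
keep = passes_quality_filters(clip)
```

Hard thresholds of this kind are cheap to apply at corpus scale; borderline clips can be routed to a second, stricter pass rather than discarded outright.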
3. Training Strategies
SVoice integrates both supervised and preference-based training regimes.
- Supervised Fine-Tuning (SFT):
- AR LLM is optimized on next-token negative log likelihood with a learning rate of .
- Flow-matching model uses mean-squared vector-field loss with a learning rate of .
- Default batch sizes from Vevo (e.g., 64 sequences); no specific curriculum learning.
- Direct Preference Optimization (DPO):
- Only AR LLM parameters are updated.
- Human-annotated bad-case preference pairs are used.
- The loss is the standard DPO objective,
  $$\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right],$$
  where $y_w$ and $y_l$ are the preferred and dispreferred outputs and $\pi_{\mathrm{ref}}$ is the frozen SFT policy.
- Learning rate for DPO is , with AdamW optimizer and same weight decay as SFT.
This multi-stage strategy aims to combine high average quality with robustness against specific classes of errors and instability.
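Since DPO operates on preference pairs against a frozen reference policy, the per-pair objective can be sketched as follows (β = 0.1 is an illustrative default, not the paper's value):

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one human-annotated preference pair.
    logp_*     : sequence log-probs under the AR LLM being updated
    ref_logp_* : log-probs under the frozen SFT reference model
    beta       : KL-strength hyperparameter (illustrative value)"""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log(sigmoid(margin))

# when policy and reference agree exactly, the loss sits at log 2
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
```

The loss falls as the policy raises the preferred sequence's likelihood relative to the reference faster than the dispreferred one's, which matches the stated goal of suppressing specific bad-case behaviors without retraining the whole stack.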
4. Evaluation Protocols and Ablation Studies
Evaluation uses Mean Opinion Score (MOS) on a 5-point scale for naturalness, and AB/XAB subjective protocols for style similarity and singer similarity (reported as percent agreement with the reference). The following summarizes principal results on SVCC 2025 tasks (Wang et al., 20 Jan 2026):
| System | Task | Naturalness (MOS) | Style Similarity (%) | Singer Similarity (%) |
|---|---|---|---|---|
| GT | 1 (in-domain) | 3.90 ± 0.15 | 79 ± 3 | 63 ± 4 |
| Vevo | 1 (in-domain) | 3.10 ± 0.12 | 30 ± 5 | 42 ± 5 |
| SVoice | 1 (in-domain) | 3.30 ± 0.10 | 59 ± 4 | 57 ± 4 |
| GT | 2 (zero-shot) | 4.10 ± 0.15 | 78 ± 3 | 60 ± 4 |
| Vevo | 2 (zero-shot) | 3.20 ± 0.12 | 32 ± 5 | 52 ± 5 |
| SVoice | 2 (zero-shot) | 3.75 ± 0.11 | 70 ± 3 | 59 ± 4 |
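If the ± intervals in the table are 95% confidence half-widths over listener ratings (the text does not state this), they can be recomputed from raw scores as:

```python
import numpy as np

def mos_with_ci(ratings, z=1.96):
    """Mean opinion score with a normal-approximation CI half-width
    (z = 1.96 corresponds to a 95% interval)."""
    r = np.asarray(ratings, dtype=float)
    half = z * r.std(ddof=1) / np.sqrt(len(r))
    return r.mean(), half

# hypothetical listener ratings for one system/task cell
mean, half = mos_with_ci([4, 3, 4, 5, 3, 4, 4, 3])
```

The half-width shrinks with the square root of the number of ratings, which is why subjective evaluations at this scale typically pool many listeners per sample.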
Component ablations (on zero-shot task) show that adding FiLM-style conditioning and cross-attention steadily increases style similarity, while global speaker embedding offers substantial timbre preservation gains. DPO slightly reduces average MOS/style metrics but enhances system stability.
| Variant | Naturalness (MOS) | Style Sim. | Singer Sim. |
|---|---|---|---|
| SFT Only | 3.50 ± 0.12 | 62 ± 4 | 52 ± 5 |
| + FiLM | 3.62 ± 0.11 | 65 ± 4 | 54 ± 4 |
| + Cross-Attention | 3.68 ± 0.11 | 68 ± 3 | 56 ± 4 |
| + Global Spk. Emb. | 3.75 ± 0.11 | 70 ± 3 | 59 ± 4 |
| + DPO | 3.72 ± 0.11 | 69 ± 3 | 58 ± 4 |
5. Comparative Perspective: SingIt!/AutoVC Baseline
A related approach is articulated in "SingIt!" (Eliav et al., 2024), which, while using the shorthand S²Voice in some contexts, follows a zero-shot auto-encoder paradigm. The system processes a song mix with Spleeter to extract vocals, encodes singer style via Resemblyzer embeddings, and models content with a convolutional-BLSTM stack. The decoder receives concatenated content and style, producing a preliminary spectrogram, which is refined by a CNN-based post-net and reconstructed into time-domain audio with Griffin-Lim. Training loss combines reconstruction MSE and latent consistency at a high weight. "SingIt!" is trained on the NHSS corpus (100 songs, 10 singers) and, in subjective tests, yields moderate results for word clarity, melody fidelity, and target style impression.
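The SingIt!/AutoVC-style objective described above can be sketched as a reconstruction term plus a latent-consistency term; `lam` here is a placeholder for the "high weight" the text mentions, not a value from the paper:

```python
import numpy as np

def mse(a, b):
    return np.mean((a - b) ** 2)

def singit_loss(mel, mel_post, content, content_recon, lam=10.0):
    """Reconstruction MSE on the post-net spectrogram plus a heavily
    weighted latent-consistency term: re-encoding the converted output
    should reproduce the original content code, which discourages the
    decoder from leaking source-singer style into the content pathway."""
    return mse(mel_post, mel) + lam * mse(content_recon, content)

# perfect reconstruction and consistent latents give zero loss
mel = np.ones((5, 3))
z = np.zeros(4)
loss = singit_loss(mel, mel, z, z)
```

The consistency term is what makes the zero-shot auto-encoder paradigm work without paired data: the bottleneck is trained so that content survives a round trip through the decoder.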
This provides context for SVoice’s advances in conditional modeling, large-scale training, and data curation, which collectively yield higher style and singer similarity on the SVCC benchmarks.
6. Audio Samples and Practical Usage
Demonstrations of the SVoice system, including style conversion between pop and rock references and zero-shot conversion on previously unseen singers, are accessible at https://honee-w.github.io/SVC-Challenge-Demo/. These examples highlight the system's ability to both imprint fine-grained stylistic detail and preserve singer timbre under diverse input conditions (Wang et al., 20 Jan 2026).
7. Impact and Significance in Singing Voice Conversion
SVoice introduces a state-of-the-art, style-aware, and robust modeling pipeline for singing voice conversion, with strong empirical evidence for improved naturalness, style transfer, and timbre preservation. Innovations include global and local style modulation, explicit speaker disentanglement during acoustic modeling, and the construction of a comprehensive, balanced singing voice corpus. The system’s architecture and methodologies provide a foundation for subsequent research on high-quality, low-resource, and zero-shot singing style conversion across linguistic and musical contexts (Wang et al., 20 Jan 2026, Eliav et al., 2024).