Continuous Speech Synthesis using per-token Latent Diffusion

Published 21 Oct 2024 in eess.AS | (2410.16048v1)

Abstract: The success of autoregressive transformer models with discrete tokens has inspired quantization-based approaches for continuous modalities, though these often limit reconstruction quality. We therefore introduce SALAD, a per-token latent diffusion model for zero-shot text-to-speech, that operates on continuous representations. SALAD builds upon the recently proposed expressive diffusion head for image generation, and extends it to generate variable-length outputs. Our approach utilizes semantic tokens for providing contextual information and determining the stopping condition. We suggest three continuous variants for our method, extending popular discrete speech synthesis techniques. Additionally, we implement discrete baselines for each variant and conduct a comparative analysis of discrete versus continuous speech modeling techniques. Our results demonstrate that both continuous and discrete approaches are highly competent, and that SALAD achieves a superior intelligibility score while obtaining speech quality and speaker similarity on par with the ground-truth audio.

Abstract PDF HTML Upgrade to Chat

Summary

The paper presents SALAD, a model that uses per-token latent diffusion to perform zero-shot text-to-speech synthesis with improved intelligibility.
The method integrates semantic and acoustic token prediction in both two-stage (T2S and S2A) and end-to-end (T2A) frameworks, ensuring high quality synthesis.
Future work is aimed at refining multimodal integration and adaptive diffusion strategies to further boost performance in continuous speech synthesis.

Continuous Speech Synthesis using Per-Token Latent Diffusion

Introduction

The paper introduces SALAD, a model innovatively designed for zero-shot text-to-speech (TTS) synthesis leveraging per-token latent diffusion in continuous modalities. This synthesis method aims to outperform traditional autoregressive (AR) modeling techniques that rely on discrete representations. By integrating semantic tokens with a continuous diffusion model, SALAD addresses critical challenges in speech synthesis, including variable output length and enhanced intelligibility without sacrificing audio quality or speaker similarity.

Figure 1: Continuous vs. discrete modeling.

Methodology

Per-Token Latent Diffusion Model

SALAD employs per-token latent diffusion which adapts continuous diffusion models like those used in image generation to synthesize speech. Unlike conventional diffusion models that concurrently denoise complete sequences, this model denoises independently, enabling variable-length speech generation. The diffusion head refines predictions through contextual token vectors derived from semantic information and speaker prompts.

Figure 2: Training process of the diffusion head.

SALAD Variants

Semantic to Acoustic (S2A): Operates through two sub-models: a text-to-semantic (T2S) and a semantic-to-acoustic (S2A) model. The T2S model predicts semantic tokens from text, which guides the S2A model to predict acoustic tokens, employing either AR or non-AR techniques powered by MaskGIT.
Text to Acoustic (T2A): An end-to-end model bypassing separate semantic predictions, integrating semantic tokens as an auxiliary for contextual relevance and an inferred stopping condition.
Figure 3: Synthesis using Semantic-to-Acoustic models.

Figure 4: Synthesis using Text-to-Acoustic models.

Experimental Evaluation

Objective Metrics

Evaluations on the LibriSpeech test-clean dataset highlight that SALAD's continuous approaches significantly enhance intelligibility scores while maintaining competitive audio quality. Particularly, the T2A model achieves top intelligibility scores with speaker similarity marginally behind its discrete counterparts.

Subjective Listening Tests

Subjective assessments reveal parity between continuous T2A and high-quality audio outputs, suggesting SALAD's ability to replicate natural speech qualities effectively. The speaker similarity tests reaffirm the model's capability to convincingly mimic target speaker prompts.

Figure 5: MOS Listening Test (1-5 scale).

Discussion and Future Prospects

The study underscores the effectiveness of continuous latent diffusion in overcoming limitations posed by discrete token quantization, specifically in preserving reconstruction fidelity. Future advances should explore refined multimodal models that harmoniously blend perceptual tasks with generative capabilities without extensive compression, potentially through further optimization of continuous tokens and adaptive diffusion strategies.

Conclusion

The paper provides compelling evidence for continuous per-token diffusion's potential in advancing speech synthesis. By bridging semantic tokens with a diffusion-based approach, SALAD not only enhances traditional speech synthesis metrics but also sets a foundational methodology for subsequent multimodal and TTS advancements. Future work could extend these findings into comprehensive frameworks that integrate more nuanced perceptual insights within speech models.

Figure 6: CFG scale.

Figure 7: VQ/VAE reconstruction.

This analysis conveys detailed insights into the research and its practical implications, emphasizing continuous token diffusion's benefits in speech synthesis and laying the groundwork for further AI advancements in multimodal systems.