
SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline

Published 25 May 2025 in eess.AS, cs.AI, and cs.SD (arXiv:2505.19314v3)

Abstract: Target Speech Extraction (TSE) aims to isolate a target speaker's voice from a mixture of multiple speakers by leveraging speaker-specific cues, typically provided as auxiliary audio (a.k.a. cue audio). Although recent advancements in TSE have primarily employed discriminative models that offer high perceptual quality, these models often introduce unwanted artifacts, reduce naturalness, and are sensitive to discrepancies between training and testing environments. On the other hand, generative models for TSE lag in perceptual quality and intelligibility. To address these challenges, we present SoloSpeech, a novel cascaded generative pipeline that integrates compression, extraction, reconstruction, and correction processes. SoloSpeech features a speaker-embedding-free target extractor that utilizes conditional information from the cue audio's latent space, aligning it with the mixture audio's latent space to prevent mismatches. Evaluated on the widely-used Libri2Mix dataset, SoloSpeech achieves the new state-of-the-art intelligibility and quality in target speech extraction while demonstrating exceptional generalization on out-of-domain data and real-world scenarios.

Summary

  • The paper introduces a cascaded generative pipeline combining audio compression, extraction, and correction to achieve superior target speech extraction.
  • It leverages a variational autoencoder for effective compression and a latent diffusion model for accurate extraction, yielding high SI-SNR and DNSMOS scores.
  • The results confirm robust generalization on both in-domain and real-world datasets, setting a new benchmark in audio processing.

Introduction

Target Speech Extraction (TSE) aims to isolate a desired speaker's voice from a mixture of multiple speakers and background noise by leveraging speaker-specific cues. Traditional TSE methods, primarily built on discriminative models, often yield high perceptual quality but are susceptible to unwanted artifacts and to mismatches between training and testing environments. Generative models, while more robust in unseen conditions, have historically lagged in audio quality and intelligibility. The paper proposes SoloSpeech, a generative pipeline integrating compression, extraction, reconstruction, and correction processes; evaluated on the Libri2Mix dataset, it sets a new standard for intelligibility and quality.

SoloSpeech Pipeline

SoloSpeech comprises three key components: a generative audio compressor, a generative target extractor, and a generative corrector.

The audio compressor employs a time-frequency domain variational autoencoder (VAE) to translate audio waveforms into latent representations, ensuring effective compression and reconstruction of speech signals (Figure 1).

Figure 1: The audio compressor architecture.
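To make the VAE's role concrete, the sketch below shows the encode-and-sample step that maps a feature frame to a latent Gaussian and draws a latent via the reparameterization trick. This is an illustrative stand-in, not the paper's T-F architecture; the projection matrices `w_mu` and `w_logvar`, the frame size, and the latent size are all hypothetical.

```python
import numpy as np

def encode(x, w_mu, w_logvar):
    # Linear "encoder" mapping a feature frame to the mean and
    # log-variance of a latent Gaussian (a toy stand-in for the
    # paper's T-F VAE encoder).
    return x @ w_mu, x @ w_logvar

def reparameterize(mu, logvar, rng):
    # z = mu + sigma * eps: the reparameterization trick that makes
    # sampling differentiable during VAE training.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

rng = np.random.default_rng(0)
frame = rng.standard_normal((1, 64))          # one spectral frame (hypothetical size)
w_mu = rng.standard_normal((64, 8)) * 0.1     # 64-dim frame -> 8-dim latent
w_logvar = rng.standard_normal((64, 8)) * 0.1
mu, logvar = encode(frame, w_mu, w_logvar)
z = reparameterize(mu, logvar, rng)
print(z.shape)  # (1, 8)
```

A matching linear decoder would map `z` back to the frame space; the reconstruction and KL terms of the VAE loss are omitted here for brevity.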

The target extractor utilizes a latent diffusion model to predict the latent representation of the target signal. It operates without speaker embeddings, fusing the mixture and cue audio in a shared latent space using cross-attention mechanisms (Figure 2).

Figure 2: Architectures of the target extractor (a), Diffusion Transformer backbone (b) and uDiT block (c).
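The fusion idea can be sketched with single-head cross-attention, where mixture latents form the queries and cue latents supply keys and values. This is a minimal NumPy illustration of the mechanism, not the paper's uDiT block; the projection matrices and dimensions are hypothetical.

```python
import numpy as np

def cross_attention(q_feats, kv_feats, wq, wk, wv):
    # Mixture latent frames (queries) attend to cue latent frames
    # (keys/values), so each mixture frame gathers cue information
    # without any explicit speaker embedding.
    q = q_feats @ wq
    k = kv_feats @ wk
    v = kv_feats @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Numerically stable softmax over the cue frames.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 16
mixture = rng.standard_normal((10, d))  # 10 mixture latent frames (hypothetical)
cue = rng.standard_normal((6, d))       # 6 cue latent frames (hypothetical)
wq, wk, wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
fused = cross_attention(mixture, cue, wq, wk, wv)
print(fused.shape)  # (10, 16)
```

In the actual model this fusion happens inside a Diffusion Transformer backbone conditioned on the diffusion timestep; only the attention arithmetic is shown here.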

The corrector refines the extracted audio by addressing artifacts introduced by the target extractor. Inspired by recent advances in generative error correction models, SoloSpeech integrates conditional features effectively, significantly enhancing overall signal quality and intelligibility (Figure 3).

Figure 3: Overall pipeline of SoloSpeech.
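The data flow through the cascade can be summarized in a few lines. The sketch below shows only the compress -> extract -> reconstruct -> correct ordering; `DummyCodec` and the lambda extractor/corrector are trivial placeholders for the paper's trained models.

```python
import numpy as np

class DummyCodec:
    """Identity stand-in for the trained VAE audio compressor."""
    def encode(self, x):
        return x
    def decode(self, z):
        return z

def solospeech_pipeline(mixture, cue, codec, extractor, corrector):
    # Sketch of the cascade; only the data flow is illustrated.
    z_mix = codec.encode(mixture)    # compress mixture to latents
    z_cue = codec.encode(cue)        # compress cue audio to latents
    z_tgt = extractor(z_mix, z_cue)  # latent diffusion extraction
    rough = codec.decode(z_tgt)      # reconstruct a waveform
    return corrector(rough)          # correct residual artifacts

rng = np.random.default_rng(0)
mixture = rng.standard_normal(16000)  # 1 s of "audio" at 16 kHz
cue = rng.standard_normal(16000)
out = solospeech_pipeline(
    mixture, cue,
    codec=DummyCodec(),
    extractor=lambda z_mix, z_cue: z_mix,  # trivial placeholder
    corrector=lambda x: x,                 # trivial placeholder
)
print(out.shape)  # (16000,)
```

The cascade design means each stage can be trained and replaced independently, which is what the paper's modularity claim rests on.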

Experimental Results

SoloSpeech was evaluated both on in-domain (Libri2Mix) and out-of-domain datasets, including real-world scenarios such as CHiME-5 and RealSEP. It consistently outperformed existing methods in terms of perceptual quality, naturalness, and intelligibility, evidenced by improved metrics such as PESQ, ESTOI, SI-SNR, and lower WER.
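Among the reported metrics, SI-SNR has a simple closed form: project the estimate onto the target and compare the energy of that projection to the residual. A minimal NumPy implementation (the metric is standard; the signals below are synthetic):

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB, a standard separation/TSE metric."""
    # Zero-mean both signals so a DC offset does not affect the score.
    target = target - target.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the target direction.
    s_target = (estimate @ target) / (target @ target + eps) * target
    e_noise = estimate - s_target
    return 10 * np.log10((s_target @ s_target) / (e_noise @ e_noise + eps))

t = np.sin(np.linspace(0, 100, 16000))  # synthetic "clean" signal
print(si_snr(t, t) > 60)        # near-perfect estimate -> very high SI-SNR
print(si_snr(2 * t, t) > 60)    # rescaling the estimate does not hurt it
```

Scale invariance is why the metric rewards recovering the target's waveform shape rather than its absolute level.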

In-Domain Evaluation

On the Libri2Mix dataset, SoloSpeech surpassed conventional methods, yielding the highest SI-SNR and DNSMOS scores, indicative of superior audio clarity and speech naturalness. The pipeline demonstrated resilience in maintaining high intelligibility across various environmental conditions.

Out-of-Domain and Real-World Application

SoloSpeech exhibited robust generalization on out-of-domain data, significantly outperforming established methods in unseen conditions. On real-world datasets it remained robust to challenging acoustic scenarios, including expressive speech and moving sound sources (Figure 4).

Figure 4: Comparison of the spectrograms of the ground truth, audio extracted by SoloSpeech, and by USEF-TSE. Rows: Sample I--V. Columns: (a,d,g,j,m) Ground truth, (b,e,h,k,n) audio extracted by SoloSpeech, (c,f,i,l,o) audio extracted by USEF-TSE.

Implications and Future Directions

SoloSpeech sets a new benchmark for generative models in TSE, combining high intelligibility and perceptual quality with robust generalization capabilities. The modular design of SoloSpeech components offers scalability, encouraging further exploration of generative architectures for audio processing tasks. Future research could focus on optimizing computational efficiency and addressing challenges in environments with significant reverberation and dynamic sound sources (Figure 5).

Figure 5: Diagrams of Fast-GeCo corrector (a) and SoloSpeech corrector (b).

Conclusion

SoloSpeech's cascaded generative pipeline marks a significant advancement in target speech extraction by combining high intelligibility and perceptual quality with robust generalization capabilities. Its strategic integration of compression, extraction, and correction components underscores the potential of generative models to surpass traditional discriminative approaches, paving the way for future innovations in AI-driven audio processing.
