Unsupervised Speech Decomposition via Triple Information Bottleneck

Published 23 Apr 2020 in eess.AS, cs.LG, and cs.SD | (arXiv:2004.11284v6)

Abstract: Speech information can be roughly decomposed into four components: language content, timbre, pitch, and rhythm. Obtaining disentangled representations of these components is useful in many speech analysis and generation applications. Recently, state-of-the-art voice conversion systems have led to speech representations that can disentangle speaker-dependent and independent information. However, these systems can only disentangle timbre, while information about pitch, rhythm and content is still mixed together. Further disentangling the remaining speech components is an under-determined problem in the absence of explicit annotations for each component, which are difficult and expensive to obtain. In this paper, we propose SpeechSplit, which can blindly decompose speech into its four components by introducing three carefully designed information bottlenecks. SpeechSplit is among the first algorithms that can separately perform style transfer on timbre, pitch and rhythm without text labels. Our code is publicly available at https://github.com/auspicious3000/SpeechSplit.

Summary

  • The paper introduces SpeechSplit, a method that decomposes speech into content, timbre, pitch, and rhythm via a triple information bottleneck.
  • It employs three specialized encoders to isolate language, pitch, and rhythm features, surpassing conventional timbre-focused models.
  • Empirical results and human evaluations highlight its potential for expressive TTS, emotional speech synthesis, and advanced voice conversion.

Unsupervised Speech Decomposition via Triple Information Bottleneck: An Expert Review

This paper introduces SpeechSplit, an innovative unsupervised speech decomposition framework designed to disentangle speech into its primary components: content, timbre, pitch, and rhythm. This methodology addresses a significant limitation in existing voice conversion systems, which primarily focus on timbre, leaving other components intermingled. The authors achieve decomposition using a novel triple information bottleneck mechanism, facilitating style transfer on each component without requiring text labels.

Problem Context and Motivation

Speech is inherently complex, composed of entwined elements like language content, timbre, pitch, and rhythm. Traditional voice conversion systems have made strides in separating speaker-independent and dependent features but remain limited to timbre disentanglement. Solving the problem of unsupervised decomposition could significantly enhance tasks such as prosody modification and emotional speech synthesis, where separate control over individual speech elements is beneficial.

Methodology: SpeechSplit Framework

SpeechSplit utilizes an encoder-decoder architecture incorporating three distinct encoders, each designed to target specific speech components. These encoders create an information bottleneck through:

  • Content Encoder: Focused on language content, employing random resampling to obscure rhythm information.
  • Rhythm Encoder: Aims to capture rhythm directly from the speech signal.
  • Pitch Encoder: Analyzes normalized pitch contours to isolate pitch features without text transcription.
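The pitch normalization the pitch encoder relies on can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it z-normalizes log-F0 over voiced frames, which removes the speaker's characteristic pitch level and range (timbre-related cues) while preserving the contour shape.

```python
import numpy as np

def normalize_pitch(f0, voiced):
    """Z-normalize log-F0 over voiced frames; unvoiced frames stay 0.

    f0     : array of F0 values in Hz (0 where unvoiced)
    voiced : boolean mask of voiced frames
    Removes speaker-dependent pitch level/range, keeping contour shape.
    """
    log_f0 = np.zeros_like(f0, dtype=float)
    log_f0[voiced] = np.log(f0[voiced])
    mu = log_f0[voiced].mean()
    sigma = log_f0[voiced].std()
    norm = np.zeros_like(log_f0)
    norm[voiced] = (log_f0[voiced] - mu) / max(sigma, 1e-8)
    return norm
```

After this step, two speakers uttering the same sentence with the same intonation pattern produce nearly identical contours, which is what lets the pitch encoder pass pitch without leaking speaker identity.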

The key innovation lies in how SpeechSplit enforces these information bottlenecks. Random resampling corrupts the rhythm information in the content encoder's input, so the rhythm encoder becomes the only reliable channel through which rhythm can reach the decoder; combined with tight bottleneck dimensions, this drives each encoder to specialize in a single component.
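The random resampling operation can be sketched as below. This is a hedged illustration of the general idea, assuming linear interpolation along time; the segment-length and stretch-factor ranges are hypothetical placeholders, not the values from the paper's code.

```python
import numpy as np

def random_resample(mel, seg_range=(19, 32), factor_range=(0.5, 1.5), rng=None):
    """Split a mel-spectrogram into random-length time segments and
    stretch/compress each by a random factor (linear interpolation).

    mel : (T, n_mels) array. Returns a time-warped copy whose rhythm
    no longer matches the input, while content is largely preserved.
    """
    if rng is None:
        rng = np.random.default_rng()
    T = mel.shape[0]
    out, start = [], 0
    while start < T:
        seg_len = int(rng.integers(seg_range[0], seg_range[1] + 1))
        seg = mel[start:start + seg_len]
        factor = rng.uniform(*factor_range)
        new_len = max(1, int(round(len(seg) * factor)))
        # Linear interpolation along the time axis.
        idx = np.linspace(0, len(seg) - 1, new_len)
        lo = np.floor(idx).astype(int)
        hi = np.ceil(idx).astype(int)
        w = (idx - lo)[:, None]
        out.append(seg[lo] * (1 - w) + seg[hi] * w)
        start += seg_len
    return np.concatenate(out, axis=0)
```

Because the warping is random at every training step, the content encoder cannot rely on durations in its input, while the rhythm encoder, which sees the unmodified signal, can.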

Theoretical Foundation and Assumptions

The authors present a theoretical framework built on information theory principles, suggesting that encoders prioritize passing information that cannot be sourced elsewhere in the pipeline. Key assumptions include the independence of the speech components and strict constraints on bottleneck dimensions, ensuring each encoder specializes in distinct speech features.

Empirical Results

The paper provides substantial empirical evidence demonstrating the effective disentanglement capabilities of SpeechSplit. In tests with parallel speech pairs, the system achieved targeted conversions for rhythm, pitch, and timbre individually and in combination, significantly outperforming conventional models like AutoVC, which is limited to timbre conversion. Subjective evaluations show that SpeechSplit can independently modify each component reliably, verified by human listening tests and objective metrics for pitch accuracy such as gross pitch error (GPE), voicing decision error (VDE), and F0 frame error (FFE).
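The three pitch metrics mentioned above are standard in pitch-tracking evaluation and can be computed as sketched below; this is an illustrative implementation with a conventional 20% deviation tolerance, not code from the paper.

```python
import numpy as np

def pitch_metrics(f0_ref, f0_est, tol=0.2):
    """Compute GPE, VDE, and FFE between two pitch tracks.

    f0_ref, f0_est : arrays of F0 in Hz, 0 where unvoiced.
    GPE: among frames both tracks call voiced, fraction whose pitch
         deviates from the reference by more than `tol` (relative).
    VDE: fraction of frames where the voicing decisions disagree.
    FFE: fraction of frames with either error type.
    """
    v_ref = f0_ref > 0
    v_est = f0_est > 0
    both = v_ref & v_est
    gross = both & (np.abs(f0_est - f0_ref) > tol * f0_ref)
    gpe = gross.sum() / max(both.sum(), 1)
    vde = (v_ref != v_est).mean()
    ffe = ((v_ref != v_est) | gross).mean()
    return gpe, vde, ffe
```

Identical tracks score (0, 0, 0) on all three; lower is better throughout.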

Implications and Future Directions

The implications of this work are twofold:

  • Practical Applications: SpeechSplit’s ability to convert individual speech attributes independently opens new possibilities in expressive text-to-speech systems, emotional speech synthesis, and low-resource language processing.
  • Theoretical Insights: The findings offer a new perspective on neural network information processing, highlighting that under constraint, networks favor the transmission of information uniquely unavailable through other channels.

Future work could refine bottleneck designs using advanced information-theoretic approaches, potentially enhancing disentanglement precision. The adaptation of this framework to other domains of machine learning that require disentangled representations, such as image or video analysis, also presents an intriguing avenue for research.

In conclusion, SpeechSplit stands as a methodologically robust contribution to the field of unsupervised learning in speech processing, providing a template that could inspire further innovations in disentangled representation learning.
