GenSE: Generative Speech Enhancement via Language Models using Hierarchical Modeling

Published 5 Feb 2025 in eess.AS and cs.SD | (2502.02942v1)

Abstract: Semantic information refers to the meaning conveyed through words, phrases, and contextual relationships within a given linguistic structure. Humans can leverage semantic information, such as familiar linguistic patterns and contextual cues, to reconstruct incomplete or masked speech signals in noisy environments. However, existing speech enhancement (SE) approaches often overlook the rich semantic information embedded in speech, which is crucial for improving intelligibility, speaker consistency, and overall quality of enhanced speech signals. To enrich the SE model with semantic information, we employ LLMs as an efficient semantic learner and propose a comprehensive framework tailored for LLM-based speech enhancement, called GenSE. Specifically, we approach SE as a conditional language modeling task rather than a continuous signal regression problem defined in existing works. This is achieved by tokenizing speech signals into semantic tokens using a pre-trained self-supervised model and into acoustic tokens using a custom-designed single-quantizer neural codec model. To improve the stability of LLM predictions, we propose a hierarchical modeling method that decouples the generation of clean semantic tokens and clean acoustic tokens into two distinct stages. Moreover, we introduce a token chain prompting mechanism during the acoustic token generation stage to ensure timbre consistency throughout the speech enhancement process. Experimental results on benchmark datasets demonstrate that our proposed approach outperforms state-of-the-art SE systems in terms of speech quality and generalization capability.

Summary

The paper introduces a hierarchical modeling approach that treats speech enhancement as a conditional language modeling task using semantic and acoustic tokens.
The methodology uses a two-stage process, noise-to-semantic (N2S) transformation followed by semantic-to-speech (S2S) generation, with XLSR supplying semantic tokens and SimCodec supplying acoustic tokens.
Experimental results show GenSE outperforming baseline SE systems on PESQ, MOS, and DNSMOS, with better naturalness and speaker similarity across noisy conditions.

Introduction

GenSE approaches speech enhancement (SE) by using language models (LMs) to model the semantic content of speech. The framework addresses a limitation of existing methods, which often neglect semantic information that is critical for high-quality speech, especially under challenging noise conditions. GenSE reframes SE as a conditional language modeling problem over two kinds of discrete tokens: semantic tokens derived from a self-supervised speech model and acoustic tokens derived from a custom neural codec.
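The contrast between the regression view and the conditional language modeling view can be written out as follows. The notation is an assumption made for illustration and is not taken from the paper: y is the noisy signal, x the clean signal, n the token sequence of the noisy input, and c_{1:T} the clean token sequence to be generated.

```latex
% Regression view (assumed notation): map the noisy signal y directly to an estimate of x
\hat{x} = f_\theta(y)

% Conditional language modeling view: generate clean tokens autoregressively, conditioned on n
p_\theta(c_{1:T} \mid n) = \prod_{t=1}^{T} p_\theta(c_t \mid c_{<t}, n)
```

In the second form the model predicts a distribution over a finite token vocabulary at every step, which is what lets a standard LM be used for enhancement.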

Figure 1: The hierarchical modeling framework of LLM in GenSE.

Framework Design

Semantic and Acoustic Tokenization

GenSE uses two kinds of tokens. Semantic tokens, extracted with the pre-trained self-supervised model XLSR, capture high-level linguistic information. Acoustic tokens come from SimCodec, a new codec model designed to reduce token prediction complexity. Tokenization is the bridge between continuous speech and LMs: once speech is a sequence of discrete tokens, the LM can model its content rather than only suppress noise.
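To make the idea of turning continuous features into discrete tokens concrete, here is a minimal sketch that assigns each feature frame to its nearest codebook entry. The nearest-centroid rule, the function names, and the toy two-dimensional data are assumptions for illustration; the summary above does not specify how XLSR features or SimCodec latents are actually quantized.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def tokenize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each frame to the index of its nearest codebook entry.

    Hypothetical helper: stands in for whatever quantizer a real system uses.
    """
    tokens = []
    for frame in frames:
        nearest = min(range(len(codebook)),
                      key=lambda k: squared_distance(frame, codebook[k]))
        tokens.append(nearest)
    return tokens


# Toy data: three codebook entries and four 2-D "feature frames".
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [0.2, 1.9], [1.1, 0.8]]
tokens = tokenize(frames, codebook)
print(tokens)
```

Each frame becomes one integer, so an utterance becomes a token sequence that an LM can consume like text.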

Hierarchical Modeling

Hierarchical modeling splits token generation into two stages: a noise-to-semantic (N2S) transformation that predicts clean semantic tokens from the noisy input, and a semantic-to-speech (S2S) generation that predicts clean acoustic tokens. Separating the stages limits the effect of noise on semantic token accuracy, which in turn leads to better acoustic token prediction and higher-quality output speech. During the S2S stage, a token chain prompting mechanism is used to keep timbre consistent throughout enhancement.
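The two-stage flow can be sketched as a small pipeline. Everything here is illustrative: the two "language models" are toy stand-in functions, and the prompt ordering used for the second stage (noisy semantic tokens, then clean semantic tokens, then noisy acoustic tokens) is an assumption, since this summary does not describe the exact prompt layout.

```python
from typing import Callable, List, Tuple

TokenSeq = List[int]
# Assumption: a "language model" is any callable from a prompt to generated tokens.
LanguageModel = Callable[[TokenSeq], TokenSeq]


def enhance(noisy_semantic: TokenSeq,
            noisy_acoustic: TokenSeq,
            n2s_lm: LanguageModel,
            s2s_lm: LanguageModel) -> Tuple[TokenSeq, TokenSeq]:
    """Run the N2S stage, then the S2S stage, and return both outputs."""
    # Stage 1 (N2S): predict clean semantic tokens from noisy semantic tokens.
    clean_semantic = n2s_lm(list(noisy_semantic))

    # Stage 2 (S2S): chain the available token sequences into one prompt.
    # The ordering is an assumption; a real system would also need separators
    # or disjoint ID ranges for semantic and acoustic tokens (omitted here).
    prompt = list(noisy_semantic) + clean_semantic + list(noisy_acoustic)
    clean_acoustic = s2s_lm(prompt)
    return clean_semantic, clean_acoustic


def toy_n2s(prompt: TokenSeq) -> TokenSeq:
    """Toy stand-in: 'denoise' by rounding each token down to a multiple of 10."""
    return [t // 10 * 10 for t in prompt]


def toy_s2s(prompt: TokenSeq) -> TokenSeq:
    """Toy stand-in: shift the final third of the prompt (the acoustic part) by 1."""
    n = len(prompt) // 3
    return [t + 1 for t in prompt[-n:]]


clean_semantic, clean_acoustic = enhance([11, 25, 32], [7, 8, 9], toy_n2s, toy_s2s)
print(clean_semantic, clean_acoustic)
```

The point of the structure is that the second stage sees the already-cleaned semantic tokens, so errors caused by noise are handled before acoustic detail is generated.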

Figure 2: The detailed architecture and training process of SimCodec, with the reorganization process of the group quantizer highlighted in the red dashed block.

SimCodec and Quantization Process

SimCodec reduces computational complexity by using a single quantizer with an expanded codebook, facilitated by a unique reorganization process. This approach carefully balances the number of tokens and reconstruction quality, ensuring efficient LM prediction while maintaining audio fidelity.

Reorganization Process: SimCodec selects the most frequently used tokens from the initial group-quantization stage and reorganizes them into a single, larger codebook. This substantially improves codebook utilization and improves reconstruction quality as measured by PESQ and MOS.
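One way to picture the reorganization step is as a frequency-based remapping. The sketch below assumes that the initial stage emits a pair of code indices per frame and that the new codebook keeps only the most frequent pairs, each under a single new index. This reading, the function names, and the toy counts are assumptions for illustration, not the paper's procedure.

```python
from collections import Counter
from typing import Dict, List, Tuple

CodePair = Tuple[int, int]


def build_reorganized_codebook(pairs: List[CodePair],
                               size: int) -> Dict[CodePair, int]:
    """Keep the `size` most frequent index pairs and give each one new index."""
    counts = Counter(pairs)
    most_used = [pair for pair, _ in counts.most_common(size)]
    return {pair: new_index for new_index, pair in enumerate(most_used)}


def coverage(pairs: List[CodePair], mapping: Dict[CodePair, int]) -> float:
    """Fraction of observed frames whose pair is representable by the new codebook."""
    covered = sum(1 for pair in pairs if pair in mapping)
    return covered / len(pairs)


# Toy usage statistics: 11 frames, four distinct pairs with distinct counts.
observed = [(0, 1)] * 5 + [(2, 3)] * 3 + [(1, 1)] * 2 + [(3, 0)] * 1
mapping = build_reorganized_codebook(observed, size=3)
print(mapping, coverage(observed, mapping))
```

With one index per frame instead of two, the LM has a shorter sequence to predict, at the cost of a larger vocabulary and of dropping rarely used combinations.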

Experimental Results

GenSE was evaluated against several state-of-the-art SE systems using objective metrics such as DNSMOS and SECS (speaker embedding cosine similarity), alongside subjective assessments of speech naturalness and speaker similarity. The framework consistently outperformed the baselines and remained robust in unseen acoustic environments such as the CHiME-4 dataset.

Figure 3: Violin plots for speech naturalness and speaker similarity, comparing the signals enhanced by baseline systems and GenSE.

Objective Metrics: GenSE improved scores in both reverberant and non-reverberant conditions, with substantial gains in OVL and SECS, indicating that it preserves speech quality and speaker identity.

Subjective Metrics: The violin plots in Figure 3 show higher median scores for GenSE than for the baselines on both naturalness and speaker similarity.
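SECS, as the name is commonly expanded, is a cosine similarity between speaker embeddings of two utterances. The sketch below shows only that final similarity computation. The embedding vectors are made-up toy values, and the speaker encoder that would produce them in a real evaluation is not specified in this summary.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy "speaker embeddings" (hypothetical values, 2-D for readability).
reference = [3.0, 4.0]
enhanced_close = [4.0, 3.0]
enhanced_far = [4.0, -3.0]

score_close = cosine_similarity(reference, enhanced_close)
score_far = cosine_similarity(reference, enhanced_far)
print(score_close, score_far)
```

A score near 1 means the enhanced speech keeps the reference speaker's identity; a score near 0 means the embeddings are unrelated.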

Conclusion

GenSE is a robust SE framework built on hierarchical modeling and efficient tokenization, and it outperforms state-of-the-art SE systems in speech quality and generalization. SimCodec significantly reduces token prediction complexity while preserving the semantic and acoustic attributes needed for high-quality speech. Future research will explore model scalability and real-time processing optimizations to broaden applicability across diverse acoustic settings.