
Single-Microphone Speaker Separation and Voice Activity Detection in Noisy and Reverberant Environments

Published 7 Jan 2024 in eess.AS and cs.SD | (arXiv:2401.03448v1)

Abstract: Speech separation involves extracting an individual speaker's voice from a multi-speaker audio signal. The increasing complexity of real-world environments, where multiple speakers might converse simultaneously, underscores the importance of effective speech separation techniques. This work presents a single-microphone speaker separation network with TF attention aimed at noisy and reverberant environments. We dub this new architecture Separation TF Attention Network (Sep-TFAnet). In addition, we present a variant of the separation network, dubbed $\text{Sep-TFAnet}^{\text{VAD}}$, which incorporates a voice activity detector (VAD) into the separation network. The separation module is based on a temporal convolutional network (TCN) backbone inspired by the Conv-TasNet architecture with multiple modifications. Rather than a learned encoder and decoder, we use the short-time Fourier transform (STFT) and inverse short-time Fourier transform (iSTFT) for analysis and synthesis, respectively. Our system is specially developed for human-robot interaction and should support online mode. The separation capabilities of $\text{Sep-TFAnet}^{\text{VAD}}$ and Sep-TFAnet were evaluated and extensively analyzed under several acoustic conditions, demonstrating their advantages over competing methods. Since separation networks trained on simulated data tend to perform poorly on real recordings, we also demonstrate the ability of the proposed scheme to better generalize to realistic examples recorded in our acoustic lab by a humanoid robot. Project page: https://Sep-TFAnet.github.io

References (30)
  1. J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 31–35, 2016.
  2. D. Yu, M. Kolbæk, Z.-H. Tan, and J. Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 241–245, 2017.
  3. Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.
  4. Y. Luo, Z. Chen, and T. Yoshioka, “Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 46–50, 2020.
  5. N. Zeghidour and D. Grangier, “Wavesplit: End-to-end speech separation by speaker clustering,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 2840–2849, 2021.
  6. S. Zhao and B. Ma, “Mossformer: Pushing the performance limit of monaural speech separation using gated single-head transformer with convolution-augmented joint self-attentions,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.
  7. E. Nachmani, Y. Adi, and L. Wolf, “Voice separation with an unknown number of multiple speakers,” in International Conference on Machine Learning (ICML), pp. 7164–7175, 2020.
  8. S. Lutati, E. Nachmani, and L. Wolf, “SepIt: Approaching a single channel speech separation bound,” arXiv preprint arXiv:2205.11801, 2022.
  9. J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR–half-baked or well done?,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 626–630, 2019.
  10. C. Subakan, M. Ravanelli, S. Cornell, M. Bronzi, and J. Zhong, “Attention is all you need in speech separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 21–25, 2021.
  11. C. Subakan, M. Ravanelli, S. Cornell, F. Grondin, and M. Bronzi, “On using transformers for speech-separation,” arXiv preprint arXiv:2202.02884, 2022.
  12. E. Tzinis, Z. Wang, X. Jiang, and P. Smaragdis, “Compute and memory efficient universal sound source separation,” Journal of Signal Processing Systems, vol. 94, no. 2, pp. 245–259, 2022.
  13. G. Wichern, J. Antognini, M. Flynn, L. R. Zhu, E. McQuinn, D. Crow, E. Manilow, and J. Le Roux, “WHAM!: Extending Speech Separation to Noisy Environments,” in Proc. Interspeech, pp. 1368–1372, 2019.
  14. T. Cord-Landwehr, C. Boeddeker, T. Von Neumann, C. Zorilă, R. Doddipatla, and R. Haeb-Umbach, “Monaural source separation: From anechoic to reverberant environments,” in International Workshop on Acoustic Signal Enhancement (IWAENC), 2022.
  15. J. Heitkaemper, D. Jakobeit, C. Boeddeker, L. Drude, and R. Haeb-Umbach, “Demystifying TasNet: A dissecting approach,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6359–6363, 2020.
  16. D. Wang, T. Yoshioka, Z. Chen, X. Wang, T. Zhou, and Z. Meng, “Continuous speech separation with ad hoc microphone arrays,” in 29th European Signal Processing Conference (EUSIPCO), pp. 1100–1104, 2021.
  17. Q. Lin, L. Yang, X. Wang, L. Xie, C. Jia, and J. Wang, “Sparsely overlapped speech training in the time domain: Joint learning of target speech separation and personal VAD benefits,” in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 689–693, 2021.
  18. M. Maciejewski, G. Wichern, E. McQuinn, and J. Le Roux, “WHAMR!: Noisy and reverberant single-channel speech separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 696–700, 2020.
  19. C. Lea, R. Vidal, A. Reiter, and G. D. Hager, “Temporal convolutional networks: A unified approach to action segmentation,” in European Conference on Computer Vision (ECCV), (Amsterdam, The Netherlands), pp. 47–54, Springer, Oct. 2016.
  20. Z.-Q. Wang, G. Wichern, S. Watanabe, and J. Le Roux, “STFT-domain neural speech enhancement with very low algorithmic latency,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 397–410, 2022.
  21. W. Ravenscroft, S. Goetze, and T. Hain, “Deformable temporal convolutional networks for monaural noisy reverberant speech separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.
  22. Q. Zhang, X. Qian, Z. Ni, A. Nicolson, E. Ambikairajah, and H. Li, “A time-frequency attention module for neural speech enhancement,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 462–475, 2023.
  23. F. Liu, X. Ren, Z. Zhang, X. Sun, and Y. Zou, “Rethinking skip connection with layer normalization,” in Proceedings of the 28th International Conference on Computational Linguistics, pp. 3586–3598, 2020.
  24. M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, “Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 10, pp. 1901–1913, 2017.
  25. Y. Yemini, E. Fetaya, H. Maron, and S. Gannot, “Scene-agnostic multi-microphone speech dereverberation,” in Proc. Interspeech, (Brno, Czech Republic), 2021.
  26. M. Delcroix, T. Yoshioka, A. Ogawa, Y. Kubo, M. Fujimoto, N. Ito, K. Kinoshita, M. Espi, T. Hori, T. Nakatani, et al., “Linear prediction-based dereverberation with advanced speech enhancement and recognition technologies for the reverb challenge,” in Reverb workshop, 2014.
  27. V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210, 2015.
  28. J. B. Allen and D. A. Berkley, “Image method for efficiently simulating small-room acoustics,” The Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943–950, 1979.
  29. E. A. Habets, “Room impulse response generator,” tech. rep., Friedrich-Alexander-Universität Erlangen-Nürnberg, 2014.
  30. E. Tzinis, Z. Wang, and P. Smaragdis, “Sudo rm -rf: Efficient networks for universal audio source separation,” in IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2020.

Summary

  • The paper introduces Sep-TFAnet that integrates a TCN backbone with an embedded VAD module to separate mixed speech signals in challenging acoustic scenarios.
  • It employs attention-augmented 1-D convolutional (AttConv) blocks and an SI-SDR training loss to recover high-fidelity speech from reverberant and noisy mixtures.
  • Experimental results demonstrate significant SI-SDR improvements and enhanced VAD performance, validating the model for real-time applications in robotics and telecommunication.


Introduction

The paper "Single-Microphone Speaker Separation and Voice Activity Detection in Noisy and Reverberant Environments" (2401.03448) tackles the crucial task of speaker separation in complex acoustic environments using a single-microphone setup. This task is pivotal for applications in robot audition, speech recognition, and telecommunications. The authors introduce a novel architecture named Sep-TFAnet, which incorporates a temporal convolutional network (TCN) backbone, inspired by Conv-Tasnet, but with enhancements aimed at improving robustness in noisy and reverberant conditions.

Problem Formulation

The study focuses on the separation of mixed speech signals captured by a single microphone: each speaker's signal is convolved with a room impulse response (RIR), and the results are summed together with additive noise (a standard formulation is sketched below). The challenge lies in segregating these signals while maintaining high fidelity, despite the presence of reverberation and noise. The approach is grounded in the scale-invariant signal-to-distortion ratio (SI-SDR) to optimize separation quality, leveraging a fully supervised learning framework.
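
As a sketch (the notation here is ours and may differ from the paper's), the observed microphone signal $y(t)$, mixing $N$ speaker signals $s_i(t)$ through RIRs $h_i(t)$ with additive noise $n(t)$, can be written as

$$y(t) = \sum_{i=1}^{N} \{s_i * h_i\}(t) + n(t),$$

where $*$ denotes convolution. The SI-SDR criterion referenced above, following the definition of Le Roux et al. [9], scores an estimate $\hat{s}$ against a target $s$ as

$$\text{SI-SDR}(s, \hat{s}) = 10 \log_{10} \frac{\|\alpha s\|^2}{\|\alpha s - \hat{s}\|^2}, \qquad \alpha = \frac{\hat{s}^\top s}{\|s\|^2},$$

which is invariant to a rescaling of the estimate.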

Proposed Model

The authors propose a comprehensive framework consisting of Sep-TFAnet, integrated with a voice activity detection (VAD) module. Key components include:

  1. Separation Module: Utilizes a TCN backbone with the STFT/iSTFT for analysis and synthesis of the signals, capitalizing on the advantages of these fixed transforms in reverberant settings. The processing module applies a stack of 1-D AttConv blocks that incorporate an attention layer to enhance performance in complex environments. The architecture is depicted in Figure 1, highlighting the interplay between the network's learned blocks and its fixed, signal-processing-based components.

    Figure 1: Sep-TFAnet architecture, integrating learnable and data-driven blocks for effective speaker separation.

  2. VAD Integration: The system concurrently operates a VAD network, which infers activity patterns of separated signals, offering potential benefits for downstream tasks like beamforming and localization. This integration is critical in applications requiring human-robot interactions, where accurate voice activity detection is essential for dialog management and environmental awareness.
  3. Online Mode: The model supports an online operating mode essential for real-time applications. It processes short overlapping segments to ensure low latency without substantial performance degradation, a crucial factor for interactive robot scenarios.
  4. Objective Functions: The separation network is trained with an SI-SDR loss and the VAD with binary cross-entropy, enabling robust joint learning of both tasks; a minimal sketch of these losses appears after this list.
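
To make the objective functions concrete, here is a minimal PyTorch sketch (an illustration, not the authors' implementation): a negative SI-SDR loss under utterance-level permutation-invariant training (uPIT, reference 24 above) together with a frame-wise binary cross-entropy VAD loss. Function names, tensor shapes, and the fixed number of speakers are assumptions for the example.

```python
# Illustrative sketch (assumed, not the paper's code) of the training losses.
import itertools

import torch
import torch.nn.functional as F


def si_sdr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SDR in dB; est and ref have shape (..., samples)."""
    est = est - est.mean(dim=-1, keepdim=True)  # zero-mean, per Le Roux et al. [9]
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference to find the optimal scaling.
    alpha = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps)
    target = alpha * ref
    noise = est - target
    return 10 * torch.log10(target.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)


def pit_si_sdr_loss(est: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
    """Negative SI-SDR under the best speaker permutation (uPIT).

    est, ref: (batch, n_spk, samples).
    """
    n_spk = est.shape[1]
    scores = [
        si_sdr(est[:, list(perm)], ref).mean(dim=1)  # (batch,)
        for perm in itertools.permutations(range(n_spk))
    ]
    best = torch.stack(scores, dim=0).max(dim=0).values  # best permutation per utterance
    return -best.mean()


def vad_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Frame-wise binary cross-entropy for the VAD head; labels in {0, 1}."""
    return F.binary_cross_entropy_with_logits(logits, labels.float())
```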

Experimental Results

The experimental analysis conducted on both simulated and real-world datasets illustrates the model's capabilities:

  1. Simulated Data: The model was tested against challenging simulated conditions, showing significant improvements in SI-SDR compared to baseline models such as SuDoRmRf and Conv-TasNet. The use of realistic simulation parameters increased the challenge, underscoring the model's robustness (Figure 2).

    Figure 2: SI-SDR vs. gender, indicating performance across different speaker combinations.

  2. Real-World Data from the ARI Robot: The model exhibited superior performance on data recorded in a controlled acoustic lab environment, highlighting its adaptability to real-world conditions. The results demonstrated substantial improvements in SI-SDR and word error rate (WER) on recordings from a humanoid robot setup (Figure 3).

    Figure 3: Recording setup with ARI at BIU acoustic lab, showcasing the experiment's geometric layout.

  3. VAD Performance: The integrated VAD network outperformed both conventional energy-based detectors and other state-of-the-art detection methods, emphasizing its effectiveness in noisy environments.

Conclusion

The research presented in this paper advances the field of single-microphone speaker separation by providing a robust framework suitable for both academic exploration and practical deployment. Sep-TFAnet and its VAD-equipped variant demonstrate enhanced performance in challenging scenarios, offering valuable insights into the integration of voice activity detection with separation networks. Future directions may include further optimization for low-memory environments and extending the model's capabilities to handle more diverse acoustic conditions, potentially improving its utility in augmented reality and advanced telecommunication systems.
