
Single-Microphone Speaker Separation and Voice Activity Detection in Noisy and Reverberant Environments

Published 7 Jan 2024 in eess.AS and cs.SD | (arXiv:2401.03448v1)

Abstract: Speech separation involves extracting an individual speaker's voice from a multi-speaker audio signal. The increasing complexity of real-world environments, where multiple speakers might converse simultaneously, underscores the importance of effective speech separation techniques. This work presents a single-microphone speaker separation network with TF attention aimed at noisy and reverberant environments. We dub this new architecture Separation TF Attention Network (Sep-TFAnet). In addition, we present a variant of the separation network, dubbed $\text{Sep-TFAnet}^{\text{VAD}}$, which incorporates a voice activity detector (VAD) into the separation network. The separation module is based on a temporal convolutional network (TCN) backbone inspired by the Conv-TasNet architecture with multiple modifications. Rather than a learned encoder and decoder, we use the short-time Fourier transform (STFT) and inverse short-time Fourier transform (iSTFT) for analysis and synthesis, respectively. Our system is specially developed for human-robot interaction and should support online mode. The separation capabilities of $\text{Sep-TFAnet}^{\text{VAD}}$ and Sep-TFAnet were evaluated and extensively analyzed under several acoustic conditions, demonstrating their advantages over competing methods. Since separation networks trained on simulated data tend to perform poorly on real recordings, we also demonstrate the ability of the proposed scheme to better generalize to realistic examples recorded in our acoustic lab by a humanoid robot. Project page: https://Sep-TFAnet.github.io

References (30)
  1. J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 31–35, 2016.
  2. D. Yu, M. Kolbæk, Z.-H. Tan, and J. Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 241–245, 2017.
  3. Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.
  4. Y. Luo, Z. Chen, and T. Yoshioka, “Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 46–50, 2020.
  5. N. Zeghidour and D. Grangier, “Wavesplit: End-to-end speech separation by speaker clustering,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 2840–2849, 2021.
  6. S. Zhao and B. Ma, “Mossformer: Pushing the performance limit of monaural speech separation using gated single-head transformer with convolution-augmented joint self-attentions,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.
  7. E. Nachmani, Y. Adi, and L. Wolf, “Voice separation with an unknown number of multiple speakers,” in International Conference on Machine Learning (ICML), pp. 7164–7175, 2020.
  8. S. Lutati, E. Nachmani, and L. Wolf, “SepIt: Approaching a single channel speech separation bound,” arXiv preprint arXiv:2205.11801, 2022.
  9. J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR–half-baked or well done?,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 626–630, 2019.
  10. C. Subakan, M. Ravanelli, S. Cornell, M. Bronzi, and J. Zhong, “Attention is all you need in speech separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 21–25, 2021.
  11. C. Subakan, M. Ravanelli, S. Cornell, F. Grondin, and M. Bronzi, “On using transformers for speech-separation,” arXiv preprint arXiv:2202.02884, 2022.
  12. E. Tzinis, Z. Wang, X. Jiang, and P. Smaragdis, “Compute and memory efficient universal sound source separation,” Journal of Signal Processing Systems, vol. 94, no. 2, pp. 245–259, 2022.
  13. G. Wichern, J. Antognini, M. Flynn, L. R. Zhu, E. McQuinn, D. Crow, E. Manilow, and J. Le Roux, “WHAM!: Extending Speech Separation to Noisy Environments,” in Proc. Interspeech, pp. 1368–1372, 2019.
  14. T. Cord-Landwehr, C. Boeddeker, T. Von Neumann, C. Zorilă, R. Doddipatla, and R. Haeb-Umbach, “Monaural source separation: From anechoic to reverberant environments,” in International Workshop on Acoustic Signal Enhancement (IWAENC), 2022.
  15. J. Heitkaemper, D. Jakobeit, C. Boeddeker, L. Drude, and R. Haeb-Umbach, “Demystifying TasNet: A dissecting approach,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6359–6363, 2020.
  16. D. Wang, T. Yoshioka, Z. Chen, X. Wang, T. Zhou, and Z. Meng, “Continuous speech separation with ad hoc microphone arrays,” in 29th European Signal Processing Conference (EUSIPCO), pp. 1100–1104, 2021.
  17. Q. Lin, L. Yang, X. Wang, L. Xie, C. Jia, and J. Wang, “Sparsely overlapped speech training in the time domain: Joint learning of target speech separation and personal VAD benefits,” in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 689–693, 2021.
  18. M. Maciejewski, G. Wichern, E. McQuinn, and J. Le Roux, “WHAMR!: Noisy and reverberant single-channel speech separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 696–700, 2020.
  19. C. Lea, R. Vidal, A. Reiter, and G. D. Hager, “Temporal convolutional networks: A unified approach to action segmentation,” in European Conference on Computer Vision (ECCV), (Amsterdam, The Netherlands), pp. 47–54, Springer, Oct. 2016.
  20. Z.-Q. Wang, G. Wichern, S. Watanabe, and J. Le Roux, “STFT-domain neural speech enhancement with very low algorithmic latency,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 397–410, 2022.
  21. W. Ravenscroft, S. Goetze, and T. Hain, “Deformable temporal convolutional networks for monaural noisy reverberant speech separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.
  22. Q. Zhang, X. Qian, Z. Ni, A. Nicolson, E. Ambikairajah, and H. Li, “A time-frequency attention module for neural speech enhancement,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 462–475, 2023.
  23. F. Liu, X. Ren, Z. Zhang, X. Sun, and Y. Zou, “Rethinking skip connection with layer normalization,” in Proceedings of the 28th International Conference on Computational Linguistics, pp. 3586–3598, 2020.
  24. M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, “Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 10, pp. 1901–1913, 2017.
  25. Y. Yemini, E. Fetaya, H. Maron, and S. Gannot, “Scene-agnostic multi-microphone speech dereverberation,” in Proc. Interspeech, (Brno, Czech Republic), 2021.
  26. M. Delcroix, T. Yoshioka, A. Ogawa, Y. Kubo, M. Fujimoto, N. Ito, K. Kinoshita, M. Espi, T. Hori, T. Nakatani, et al., “Linear prediction-based dereverberation with advanced speech enhancement and recognition technologies for the reverb challenge,” in Reverb workshop, 2014.
  27. V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210, 2015.
  28. J. B. Allen and D. A. Berkley, “Image method for efficiently simulating small-room acoustics,” The Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943–950, 1979.
  29. E. A. Habets, “Room impulse response generator,” tech. rep., Friedrich-Alexander-Universität Erlangen-Nürnberg, 2014.
  30. E. Tzinis, Z. Wang, and P. Smaragdis, “Sudo rm -rf: Efficient networks for universal audio source separation,” in IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2020.

Summary

  • The paper introduces Sep-TFAnet that integrates a TCN backbone with an embedded VAD module to separate mixed speech signals in challenging acoustic scenarios.
  • It employs attention-augmented 1-D convolutional (AttConv) blocks and an SI-SDR training loss to recover high-fidelity speech from reverberant and noisy mixtures.
  • Experimental results demonstrate significant SI-SDR improvements and enhanced VAD performance, validating the model for real-time applications in robotics and telecommunication.


Introduction

The paper "Single-Microphone Speaker Separation and Voice Activity Detection in Noisy and Reverberant Environments" (2401.03448) tackles the crucial task of speaker separation in complex acoustic environments using a single-microphone setup. This task is pivotal for applications in robot audition, speech recognition, and telecommunications. The authors introduce a novel architecture named Sep-TFAnet, which incorporates a temporal convolutional network (TCN) backbone, inspired by Conv-Tasnet, but with enhancements aimed at improving robustness in noisy and reverberant conditions.

Problem Formulation

The study focuses on the separation of mixed speech signals captured by a single microphone: each speaker's signal is convolved with a room impulse response (RIR), and the results are summed together with additive noise (a standard formulation is sketched below). The challenge lies in segregating these signals while maintaining high fidelity, despite the presence of reverberation and noise. The approach is grounded in the scale-invariant signal-to-distortion ratio (SI-SDR) to optimize separation quality, leveraging a fully supervised learning framework.
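
As a sketch (the notation here is ours and may differ from the paper's), the observed microphone signal $y(t)$, mixing $N$ speaker signals $s_i(t)$ through RIRs $h_i(t)$ with additive noise $n(t)$, can be written as

$$y(t) = \sum_{i=1}^{N} \{s_i * h_i\}(t) + n(t),$$

where $*$ denotes convolution. The SI-SDR criterion referenced above, following the definition of Le Roux et al. [9], scores an estimate $\hat{s}$ against a target $s$ as

$$\text{SI-SDR}(s, \hat{s}) = 10 \log_{10} \frac{\|\alpha s\|^2}{\|\alpha s - \hat{s}\|^2}, \qquad \alpha = \frac{\hat{s}^\top s}{\|s\|^2},$$

which is invariant to a rescaling of the estimate.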

Proposed Model

The authors propose a comprehensive framework consisting of Sep-TFAnet, integrated with a voice activity detection (VAD) module. Key components include:

  1. Separation Module: Utilizes a TCN backbone with the STFT/iSTFT for analysis and synthesis of the signals, capitalizing on the advantages of these fixed transforms in reverberant settings. The processing module applies a stack of 1-D AttConv blocks that incorporate an attention layer to enhance performance in complex environments. The architecture is depicted in Figure 1, highlighting the interplay between the network's learned blocks and its fixed, signal-processing-based components.

    Figure 1: Sep-TFAnet architecture, integrating learnable and data-driven blocks for effective speaker separation.

  2. VAD Integration: The system concurrently operates a VAD network, which infers activity patterns of separated signals, offering potential benefits for downstream tasks like beamforming and localization. This integration is critical in applications requiring human-robot interactions, where accurate voice activity detection is essential for dialog management and environmental awareness.
  3. Online Mode: The model supports an online operating mode essential for real-time applications. It processes short overlapping segments to ensure low latency without substantial performance degradation, a crucial factor for interactive robot scenarios.
  4. Objective Functions: The separation network is trained with an SI-SDR loss and the VAD with binary cross-entropy, enabling robust joint learning of both tasks; a minimal sketch of these losses appears after this list.
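
To make the objective functions concrete, here is a minimal PyTorch sketch (an illustration, not the authors' implementation): a negative SI-SDR loss under utterance-level permutation-invariant training (uPIT, reference 24 above) together with a frame-wise binary cross-entropy VAD loss. Function names, tensor shapes, and the fixed number of speakers are assumptions for the example.

```python
# Illustrative sketch (assumed, not the paper's code) of the training losses.
import itertools

import torch
import torch.nn.functional as F


def si_sdr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SDR in dB; est and ref have shape (..., samples)."""
    est = est - est.mean(dim=-1, keepdim=True)  # zero-mean, per Le Roux et al. [9]
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference to find the optimal scaling.
    alpha = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps)
    target = alpha * ref
    noise = est - target
    return 10 * torch.log10(target.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)


def pit_si_sdr_loss(est: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
    """Negative SI-SDR under the best speaker permutation (uPIT).

    est, ref: (batch, n_spk, samples).
    """
    n_spk = est.shape[1]
    scores = [
        si_sdr(est[:, list(perm)], ref).mean(dim=1)  # (batch,)
        for perm in itertools.permutations(range(n_spk))
    ]
    best = torch.stack(scores, dim=0).max(dim=0).values  # best permutation per utterance
    return -best.mean()


def vad_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Frame-wise binary cross-entropy for the VAD head; labels in {0, 1}."""
    return F.binary_cross_entropy_with_logits(logits, labels.float())
```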

Experimental Results

The experimental analysis conducted on both simulated and real-world datasets illustrates the model's capabilities:

  1. Simulated Data: The model was tested against challenging simulated conditions, showing significant improvements in SI-SDR compared to baseline models such as SuDoRmRf and Conv-TasNet. The use of realistic simulation parameters increased the challenge, underscoring the model's robustness (Figure 2).

    Figure 2: SI-SDR vs. gender, indicating performance across different speaker combinations.

  2. Real-World Data from the ARI Robot: The model exhibited superior performance on data recorded in a controlled acoustic lab environment, highlighting its adaptability to real-world conditions. The results demonstrated substantial improvements in SI-SDR and word error rate (WER) on recordings from a humanoid robot setup (Figure 3).

    Figure 3: Recording setup with ARI at BIU acoustic lab, showcasing the experiment's geometric layout.

  3. VAD Performance: The integrated VAD network outperformed both conventional energy-based detectors and other state-of-the-art detection methods, emphasizing its effectiveness in noisy environments.

Conclusion

The research presented in this paper advances the field of single-microphone speaker separation by providing a robust framework suitable for both academic exploration and practical deployment. Sep-TFAnet and its VAD-equipped variant demonstrate enhanced performance in challenging scenarios, offering valuable insights into the integration of voice activity detection with separation networks. Future directions may include further optimization for low-memory environments and extending the model's capabilities to handle more diverse acoustic conditions, potentially improving its utility in augmented reality and advanced telecommunication systems.
