Effective Noise-aware Data Simulation for Domain-adaptive Speech Enhancement Leveraging Dynamic Stochastic Perturbation
Abstract: Cross-domain speech enhancement (SE) often struggles because little noise and background information is available for an unseen target domain, which creates a mismatch between training and test conditions. This study proposes a data simulation method that addresses this issue by combining noise-extractive techniques with generative adversarial networks (GANs), using only a limited amount of noisy speech from the target domain. The method employs a noise encoder to extract noise embeddings from target-domain data. These embeddings guide the generator to synthesize utterances that acoustically match the target domain while preserving the phonetic content of the input clean speech. We also introduce dynamic stochastic perturbation, which injects controlled perturbations into the noise embeddings during inference so that the model generalizes better to unseen noise conditions. Experiments on the VoiceBank-DEMAND benchmark dataset show that the resulting domain-adaptive SE method outperforms a strong existing baseline based on data simulation.
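The abstract does not specify how the perturbation is drawn, so the following is only a minimal sketch of one plausible reading: zero-mean Gaussian noise is added to a noise embedding at inference time, with a strength that is redrawn on every call. The function name `perturb_embedding`, the `max_scale` parameter, the Gaussian form, and the RMS-based scaling are all assumptions made for illustration and are not taken from the paper.

```python
import math
import random
from typing import List, Optional, Sequence


def perturb_embedding(
    embedding: Sequence[float],
    max_scale: float = 0.1,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return a copy of `embedding` with zero-mean Gaussian noise added.

    Hypothetical illustration only: the noise standard deviation is
    `scale * rms(embedding)`, where `scale` is drawn uniformly from
    [0, max_scale] on each call. The paper may define the perturbation
    differently.
    """
    if max_scale < 0:
        raise ValueError("max_scale must be non-negative")
    rng = rng or random.Random()
    if not embedding:
        return []
    # RMS of the embedding sets the reference magnitude for the noise.
    rms = math.sqrt(sum(x * x for x in embedding) / len(embedding))
    # Assumed meaning of "dynamic": the strength changes from call to call.
    scale = rng.uniform(0.0, max_scale)
    sigma = scale * rms
    # "Controlled": the perturbation is bounded in expectation by max_scale.
    return [x + rng.gauss(0.0, sigma) for x in embedding]


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-in for the output of a noise encoder.
    noise_embedding = [0.5, -1.0, 0.25, 2.0]
    for _ in range(3):
        print(perturb_embedding(noise_embedding, max_scale=0.1, rng=rng))
```

In a setup like the one described, each perturbed embedding would condition the generator in place of the original one, so a single target-domain noise sample yields several slightly different simulated conditions. Setting `max_scale` to 0 returns the embedding unchanged.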