Single-channel speech enhancement using learnable loss mixup
Abstract: Generalization remains a major challenge in supervised learning of single-channel speech enhancement. In this work, we propose learnable loss mixup (LLM), a simple and easily adopted training paradigm, to improve the generalization of deep-learning-based speech enhancement models. Loss mixup, of which learnable loss mixup is a special variant, trains a model on virtual examples constructed as combinations of random sample pairs, optimizing the corresponding mixture of the pairs' loss functions. In learnable loss mixup, the loss functions are combined by a non-linear mixing function, conditioned on the mixed data and learned automatically via neural parameterization. Our experimental results on the VCTK benchmark show that learnable loss mixup achieves 3.26 PESQ, outperforming the state of the art.
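To make the idea concrete, below is a minimal PyTorch sketch of one learnable-loss-mixup training step, written from the abstract's description rather than the paper's code. The names `MixWeightNet` and `llm_training_step` are ours; the L1 spectrogram loss, the Beta(α, α) sampling of the mixing coefficient (standard in mixup), and feeding λ to the mixing network as an extra conditioning input are all assumptions, and the paper's actual model, loss, and parameterization may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MixWeightNet(nn.Module):
    """Hypothetical mixing network: predicts a per-example loss-mixing
    weight in (0, 1), conditioned on the mixed input (and, as an extra
    assumption here, on the sampled lambda)."""

    def __init__(self, in_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim + 1, hidden),  # +1 for the sampled lambda
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x_mix: torch.Tensor, lam: torch.Tensor) -> torch.Tensor:
        feats = x_mix.mean(dim=-1)                       # (B, F): average over time
        h = torch.cat([feats, lam.unsqueeze(1)], dim=1)  # condition on lambda too
        return torch.sigmoid(self.net(h)).squeeze(1)     # (B,) weights in (0, 1)


def llm_training_step(enhancer, weight_net, x1, y1, x2, y2, alpha=0.2):
    """One learnable-loss-mixup step on two (noisy, clean) batches of
    shape (B, F, T), e.g. magnitude spectrograms."""
    B = x1.size(0)
    lam = torch.distributions.Beta(alpha, alpha).sample((B,)).to(x1.device)
    lam_x = lam.view(B, 1, 1)
    x_mix = lam_x * x1 + (1.0 - lam_x) * x2  # virtual (vicinal) training input

    y_hat = enhancer(x_mix)
    # Per-example losses against both clean targets (L1 is an assumption;
    # the paper's enhancement loss may differ).
    loss1 = F.l1_loss(y_hat, y1, reduction="none").mean(dim=(1, 2))
    loss2 = F.l1_loss(y_hat, y2, reduction="none").mean(dim=(1, 2))

    # Vanilla loss mixup would weight the two losses by the fixed lambda;
    # LLM replaces it with a learned, input-conditioned weight.
    w = weight_net(x_mix, lam)
    return (w * loss1 + (1.0 - w) * loss2).mean()


if __name__ == "__main__":
    B, F_bins, T = 4, 257, 100
    enhancer = nn.Conv1d(F_bins, F_bins, kernel_size=3, padding=1)  # toy stand-in
    weight_net = MixWeightNet(in_dim=F_bins)
    x1, y1 = torch.randn(B, F_bins, T), torch.randn(B, F_bins, T)
    x2, y2 = torch.randn(B, F_bins, T), torch.randn(B, F_bins, T)
    llm_training_step(enhancer, weight_net, x1, y1, x2, y2).backward()
```

In this sketch the mixing network and the enhancer are trained jointly by the same backward pass; any constraints or regularization the paper places on the learned mixing function are beyond this illustration.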