
Single-channel speech enhancement using learnable loss mixup

Published 20 Dec 2023 in eess.AS, cs.LG, and cs.SD | arXiv:2312.17255v1

Abstract: Generalization remains a major problem in supervised learning of single-channel speech enhancement. In this work, we propose learnable loss mixup (LLM), a simple and effortless training paradigm, to improve the generalization of deep learning-based speech enhancement models. Loss mixup, of which learnable loss mixup is a special variant, trains a model on virtual training data constructed from random pairs of samples by optimizing a mixture of the loss functions of those pairs. In learnable loss mixup, the loss functions are mixed using a non-linear mixing function that is conditioned on the mixed data and learned automatically via neural parameterization. Our experimental results on the VCTK benchmark show that learnable loss mixup achieves 3.26 PESQ, outperforming the state-of-the-art.
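To make the mechanism concrete, below is a minimal PyTorch sketch of the idea described in the abstract. It is not the authors' implementation: the enhancement loss (MSE), the mean-pooled conditioning summary, and the names MixWeightNet, llm_loss, and enhancer are all illustrative assumptions, and the paper's actual mixing function and conditioning may differ.

```python
# Hedged sketch of learnable loss mixup (LLM), per the abstract's description.
# Assumptions (not from the paper): MSE enhancement loss, inputs shaped
# (batch, time, feat), and a mean-over-time summary for conditioning.
import torch
import torch.nn as nn

class MixWeightNet(nn.Module):
    """Small network mapping the mixed input to a mixing weight in (0, 1)."""
    def __init__(self, feat_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, x_mixed: torch.Tensor) -> torch.Tensor:
        # Condition on a summary of the mixed data (here: mean over time).
        return self.net(x_mixed.mean(dim=1))  # (batch, 1)

def llm_loss(enhancer, weight_net, x_i, y_i, x_j, y_j, lam: float):
    """Loss mixup with a learned, input-conditioned mixing weight.

    x_*: noisy inputs, y_*: clean targets, lam: data-mixing coefficient
    (e.g. sampled from a Beta distribution, as in standard mixup).
    """
    x_mixed = lam * x_i + (1.0 - lam) * x_j        # virtual training sample
    y_hat = enhancer(x_mixed)                      # enhanced output
    loss_i = nn.functional.mse_loss(y_hat, y_i)    # loss against target i
    loss_j = nn.functional.mse_loss(y_hat, y_j)    # loss against target j
    w = weight_net(x_mixed).mean()                 # learned mixing weight
    return w * loss_i + (1.0 - w) * loss_j
```

In standard loss mixup the loss-mixing coefficient would be the same fixed lam used to mix the inputs; the "learnable" variant instead predicts the coefficient from the mixed input itself, which is what the neural parameterization in the abstract refers to.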
