Separate And Diffuse: Using a Pretrained Diffusion Model for Improving Source Separation

Published 25 Jan 2023 in eess.AS and cs.AI (arXiv:2301.10752v2)

Abstract: The problem of speech separation, also known as the cocktail party problem, is the task of isolating a single speech signal from a mixture of speech signals. Previous work derived an upper bound for the source separation task in the domain of human speech; this bound applies to deterministic models. Recent advances in generative models challenge this bound. We show how the upper bound can be generalized to the case of random generative models. Applying a diffusion-model vocoder, pretrained to model single-speaker voices, to the output of a deterministic separation model leads to state-of-the-art separation results. We show that this requires combining the output of the separation model with that of the diffusion model: in our method, a linear combination is performed in the frequency domain, using weights inferred by a learned model. We report state-of-the-art results on 2, 3, 5, 10, and 20 speakers on multiple benchmarks. In particular, for two speakers, our method surpasses what was previously considered the upper performance bound.
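The combination step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `combine`, the non-overlapping framing, and the fixed `weights` array are all assumptions for brevity; in the paper the weights are inferred by a learned model, and a real system would use an overlap-add STFT.

```python
import numpy as np

def combine(sep_est, voc_est, weights, n_fft=256):
    """Linearly combine two single-speaker waveform estimates in the
    frequency domain: `sep_est` from a deterministic separator,
    `voc_est` from a diffusion vocoder. `weights` holds one mixing
    coefficient per frequency bin (n_fft // 2 + 1 values)."""
    n = len(sep_est) // n_fft * n_fft  # trim to whole frames
    out = np.empty(n)
    for start in range(0, n, n_fft):
        # Spectra of the current frame from each model.
        S = np.fft.rfft(sep_est[start:start + n_fft])
        V = np.fft.rfft(voc_est[start:start + n_fft])
        # Per-bin convex combination of the two spectra.
        C = weights * S + (1.0 - weights) * V
        out[start:start + n_fft] = np.fft.irfft(C, n=n_fft)
    return out
```

Setting all weights to 1 recovers the separator output and 0 recovers the vocoder output; intermediate, frequency-dependent values let the learned model favor whichever estimate is more reliable in each band.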
