Singer Identity Representation Learning using Self-Supervised Techniques

Published 10 Jan 2024 in cs.SD, cs.LG, and eess.AS (arXiv:2401.05064v1)

Abstract: Significant strides have been made in creating voice identity representations using speech data. However, the same level of progress has not been achieved for singing voices. To bridge this gap, we suggest a framework for training singer identity encoders to extract representations suitable for various singing-related tasks, such as singing voice similarity and synthesis. We explore different self-supervised learning techniques on a large collection of isolated vocal tracks and apply data augmentations during training to ensure that the representations are invariant to pitch and content variations. We evaluate the quality of the resulting representations on singer similarity and identification tasks across multiple datasets, with a particular emphasis on out-of-domain generalization. Our proposed framework produces high-quality embeddings that outperform both speaker verification and wav2vec 2.0 pre-trained baselines on singing voice while operating at 44.1 kHz. We release our code and trained models to facilitate further research on singing voice and related areas.
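
As a rough illustration of the kind of training the abstract describes, the sketch below shows a self-supervised contrastive step on isolated vocal excerpts: two augmented views of each excerpt are embedded and pulled together with an NT-Xent (SimCLR-style) loss, one of the self-supervised objectives the paper explores. This is not the authors' released code; the encoder, the gain/crop augmentations (standing in for the pitch- and content-related augmentations the paper uses), and all hyperparameters are illustrative placeholders, and PyTorch is assumed.

```python
# Minimal sketch (not the released implementation) of a contrastive
# singer-identity training step on 44.1 kHz isolated vocals.
import torch
import torch.nn as nn
import torch.nn.functional as F

SAMPLE_RATE = 44_100            # the paper's models operate at 44.1 kHz
EXCERPT_LEN = SAMPLE_RATE * 3   # 3-second training excerpts (placeholder)


class TinyEncoder(nn.Module):
    """Stand-in encoder: 1-D conv stack + mean pooling -> fixed-size embedding."""

    def __init__(self, emb_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=4, stride=2), nn.ReLU(),
        )
        self.proj = nn.Linear(256, emb_dim)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) -> L2-normalised embedding of shape (batch, emb_dim)
        h = self.net(wav.unsqueeze(1)).mean(dim=-1)
        return F.normalize(self.proj(h), dim=-1)


def augment(wav: torch.Tensor) -> torch.Tensor:
    """Cheap stand-in augmentations (random gain + random crop).
    The paper additionally applies pitch- and content-related augmentations
    so that the embeddings become invariant to them."""
    gain = torch.empty(wav.size(0), 1).uniform_(0.5, 1.0)
    start = torch.randint(0, wav.size(1) - EXCERPT_LEN + 1, (1,)).item()
    return gain * wav[:, start:start + EXCERPT_LEN]


def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """NT-Xent / InfoNCE loss over a batch of paired (positive) embeddings."""
    z = torch.cat([z1, z2], dim=0)                      # (2B, D)
    sim = z @ z.t() / tau                               # cosine similarities (embeddings are normalised)
    mask = torch.eye(z.size(0), dtype=torch.bool)
    sim = sim.masked_fill(mask, float("-inf"))          # exclude self-similarity
    b = z1.size(0)
    targets = torch.cat([torch.arange(b, 2 * b), torch.arange(0, b)])
    return F.cross_entropy(sim, targets)


if __name__ == "__main__":
    encoder = TinyEncoder()
    opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)

    # Dummy batch of isolated-vocal waveforms standing in for a real dataloader.
    batch = torch.randn(8, SAMPLE_RATE * 4)

    z1, z2 = encoder(augment(batch)), encoder(augment(batch))
    loss = nt_xent(z1, z2)
    loss.backward()
    opt.step()
    print(f"contrastive loss: {loss.item():.3f}")
```

At inference time, singer similarity would simply be the cosine similarity between two such embeddings, which is what the singer similarity and identification evaluations in the paper measure.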
