Pureformer-VC: Non-parallel One-Shot Voice Conversion with Pure Transformer Blocks and Triplet Discriminative Training

Published 3 Sep 2024 in cs.SD, cs.AI, and eess.AS | arXiv:2409.01668v3

Abstract: One-shot voice conversion (VC) aims to change the timbre of any source speech to match that of a target speaker given only a single speech sample. Existing style-transfer-based VC methods rely on speech representation disentanglement and struggle to encode each speech component accurately and independently, and to recompose the components into converted speech effectively. To tackle this, we propose Pureformer-VC, which uses Conformer blocks to build a disentangled encoder and Zipformer blocks to build a style-transfer decoder as the generator. In the decoder, Styleformer blocks integrate speaker characteristics into the generated speech. The model is trained with a generative VAE loss for the encoded components and a triplet loss for unsupervised discriminative training, and the Styleformer method is applied to the Zipformer's shared weights for style transfer. Experimental results show that the proposed model achieves comparable subjective scores and improved objective metrics compared with existing methods in the one-shot voice conversion scenario.
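The abstract combines two training signals: a VAE objective on the encoded speech components and a triplet loss that pulls embeddings of the same speaker together while pushing different speakers apart. Below is a minimal PyTorch sketch of how such losses can be combined, together with an AdaIN-style modulation of the kind Styleformer builds on for injecting speaker characteristics. All class and function names here are illustrative assumptions, not the authors' actual implementation.

```python
# Hedged sketch (PyTorch): VAE + triplet training objective and an
# AdaIN-style speaker-conditioning step. Names are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    """Toy stand-in for a speaker branch of the disentangled encoder."""
    def __init__(self, n_mels: int = 80, d_model: int = 256):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, n_mels) -> utterance embedding (batch, d_model)
        return self.proj(mel).mean(dim=1)

def vae_loss(recon, target, mu, logvar, beta: float = 1.0):
    """Standard VAE objective: reconstruction term plus KL divergence."""
    recon_term = F.l1_loss(recon, target)
    kl_term = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_term + beta * kl_term

def adain(content, style_mean, style_std, eps: float = 1e-5):
    """AdaIN (Huang & Belongie, 2017): renormalize content features
    with statistics predicted from a speaker embedding.
    content: (batch, frames, channels); style stats: (batch, 1, channels)."""
    mean = content.mean(dim=1, keepdim=True)
    std = content.std(dim=1, keepdim=True)
    return style_std * (content - mean) / (std + eps) + style_mean

triplet = nn.TripletMarginLoss(margin=0.3)  # margin is an assumed value

def total_loss(encoder, anchor_mel, positive_mel, negative_mel,
               recon, target, mu, logvar):
    """Anchor/positive share a speaker; negative comes from another speaker."""
    emb_a = encoder(anchor_mel)
    emb_p = encoder(positive_mel)
    emb_n = encoder(negative_mel)
    return vae_loss(recon, target, mu, logvar) + triplet(emb_a, emb_p, emb_n)
```

In a real system the reconstruction would come from the Zipformer-based decoder and the AdaIN statistics from the speaker embedding; the relative weighting of the two losses is a tuning choice the abstract does not specify.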

