
Electrolaryngeal Speech Intelligibility Enhancement Through Robust Linguistic Encoders

Published 18 Sep 2023 in cs.SD and eess.AS (arXiv:2309.09627v2)

Abstract: We propose a novel framework for electrolaryngeal (EL) speech intelligibility enhancement through the use of robust linguistic encoders. Pretraining and fine-tuning approaches have proven effective for this task, but in most cases, mismatches between the datasets used in each stage, such as a speech-type mismatch (electrolaryngeal vs. typical) or a speaker mismatch, can degrade the conversion performance of this framework. To resolve this issue, we propose a linguistic encoder robust enough to project both EL and typical speech into the same latent space while still extracting accurate linguistic information, creating a unified representation that reduces the speech-type mismatch. Furthermore, we introduce HuBERT output features into the proposed framework to reduce the speaker mismatch, making it possible to effectively use a large-scale parallel dataset during pretraining. We show that, compared to the conventional framework using mel-spectrogram input and output features, the proposed framework enables the model to synthesize more intelligible and natural-sounding speech, as shown by a significant 16% improvement in character error rate and a 0.83-point improvement in naturalness score.
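The core idea of projecting both EL and typical speech into one latent space can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's actual architecture: the feature dimension (matching HuBERT's 768-dim outputs), the encoder layers, and the MSE alignment loss on time-aligned parallel features are all assumptions made for the sake of a runnable example.

```python
import torch
import torch.nn as nn

class SharedLinguisticEncoder(nn.Module):
    """Hypothetical shared encoder mapping both electrolaryngeal (EL) and
    typical speech features into one latent space (a unified representation)."""

    def __init__(self, feat_dim=768, latent_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 512),
            nn.ReLU(),
            nn.Linear(512, latent_dim),
        )

    def forward(self, x):
        # x: (batch, frames, feat_dim), e.g. frame-level HuBERT features
        return self.net(x)

encoder = SharedLinguisticEncoder()

# Stand-ins for HuBERT features of a time-aligned parallel utterance pair
el_feats = torch.randn(2, 100, 768)       # EL speech
typical_feats = torch.randn(2, 100, 768)  # typical speech

z_el = encoder(el_feats)
z_typical = encoder(typical_feats)

# One simple way to encourage a unified representation: penalize the
# distance between the two projections of the same linguistic content.
align_loss = nn.functional.mse_loss(z_el, z_typical)
print(z_el.shape, align_loss.item())
```

In this sketch, minimizing `align_loss` alongside the downstream conversion objective would push the encoder to discard speech-type-specific variation while retaining the linguistic content both inputs share.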
