Multi-Task Learning for Front-End Text Processing in TTS
Abstract: We propose a multi-task learning (MTL) model for jointly performing three tasks that are commonly solved in a text-to-speech (TTS) front-end: text normalization (TN), part-of-speech (POS) tagging, and homograph disambiguation (HD). Our framework utilizes a tree-like structure with a trunk that learns shared representations, followed by separate task-specific heads. We further incorporate a pre-trained LLM to utilize its built-in lexical and contextual knowledge, and study how to best use its embeddings so as to most effectively benefit our multi-task model. Through task-wise ablations, we show that our full model trained on all three tasks achieves the strongest overall performance compared to models trained on individual or sub-combinations of tasks, confirming the advantages of our MTL framework. Finally, we introduce a new HD dataset containing a balanced number of sentences in diverse contexts for a variety of homographs and their pronunciations. We demonstrate that incorporating this dataset into training significantly improves HD performance over only using a commonly used, but imbalanced, pre-existing dataset.
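The shared-trunk, task-specific-head architecture described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class name, layer sizes, and label counts are hypothetical, and a trainable embedding table stands in for the pre-trained LLM embeddings the paper actually feeds into the trunk.

```python
import torch
import torch.nn as nn


class MultiTaskFrontEnd(nn.Module):
    """Sketch of a tree-like MTL model: one shared trunk, three task heads.

    All sizes are illustrative; the paper uses pre-trained LLM embeddings
    as input rather than the embedding table used here.
    """

    def __init__(self, vocab_size=1000, d_model=64,
                 n_tn_classes=10, n_pos_tags=17, n_hd_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True)
        # Shared trunk: learns representations common to all three tasks.
        self.trunk = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # Separate lightweight classification head per task.
        self.tn_head = nn.Linear(d_model, n_tn_classes)    # text normalization
        self.pos_head = nn.Linear(d_model, n_pos_tags)     # POS tagging
        self.hd_head = nn.Linear(d_model, n_hd_classes)    # homograph disamb.

    def forward(self, token_ids):
        shared = self.trunk(self.embed(token_ids))  # (batch, seq, d_model)
        return {
            "tn": self.tn_head(shared),
            "pos": self.pos_head(shared),
            "hd": self.hd_head(shared),
        }


model = MultiTaskFrontEnd()
tokens = torch.randint(0, 1000, (2, 8))  # batch of 2 sequences, length 8
outputs = model(tokens)
print(outputs["pos"].shape)
```

In training, each head contributes its own per-token loss; the trunk receives gradients from all three tasks, which is the mechanism behind the cross-task gains the ablations measure.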