Speech is More Than Words: Do Speech-to-Text Translation Systems Leverage Prosody?
Abstract: The prosody of a spoken utterance, including features like stress, intonation and rhythm, can significantly affect the underlying semantics, and as a consequence can also affect its textual translation. Nevertheless, prosody is rarely studied within the context of speech-to-text translation (S2TT) systems. In particular, end-to-end (E2E) systems have been proposed as well-suited for prosody-aware translation because they have direct access to the speech signal when making translation decisions, but the understanding of whether this is successful in practice is still limited. A main challenge is the difficulty of evaluating prosody awareness in translation. To address this challenge, we introduce an evaluation methodology and a focused benchmark (named ContraProST) aimed at capturing a wide range of prosodic phenomena. Our methodology uses LLMs and controllable text-to-speech (TTS) to generate contrastive examples. Through experiments in translating English speech into German, Spanish, and Japanese, we find that (a) S2TT models possess some internal representation of prosody, but the prosody signal is often not strong enough to affect the translations, (b) E2E systems outperform cascades of speech recognition and text translation systems, confirming their theoretical advantage in this regard, and (c) certain cascaded systems also capture prosodic information in the translation, but only to a lesser extent that depends on the particulars of the transcript's surface form.