Crowdsourced and Automatic Speech Prominence Estimation

Published 12 Oct 2023 in eess.AS and cs.SD | (2310.08464v2)

Abstract: The prominence of a spoken word is the degree to which an average native listener perceives the word as salient or emphasized relative to its context. Speech prominence estimation is the process of assigning a numeric value to the prominence of each word in an utterance. These prominence labels are useful for linguistic analysis, as well as training automated systems to perform emphasis-controlled text-to-speech or emotion recognition. Manually annotating prominence is time-consuming and expensive, which motivates the development of automated methods for speech prominence estimation. However, developing such an automated system using machine-learning methods requires human-annotated training data. Using our system for acquiring such human annotations, we collect and open-source crowdsourced annotations of a portion of the LibriTTS dataset. We use these annotations as ground truth to train a neural speech prominence estimator that generalizes to unseen speakers, datasets, and speaking styles. We investigate design decisions for neural prominence estimation as well as how neural prominence estimation improves as a function of two key factors of annotation cost: dataset size and the number of annotations per utterance.
