Speaker-independent Speech Inversion for Estimation of Nasalance

Published 31 May 2023 in eess.AS | (2306.00203v1)

Abstract: The velopharyngeal (VP) valve regulates the opening between the nasal and oral cavities. This valve opens and closes through a coordinated motion of the velum and pharyngeal walls. Nasalance is an objective measure derived from the oral and nasal acoustic signals that correlate with nasality. In this work, we evaluate the degree to which the nasalance measure reflects fine-grained patterns of VP movement by comparison with simultaneously collected direct measures of VP opening using high-speed nasopharyngoscopy (HSN). We show that nasalance is significantly correlated with the HSN signal, and that both match expected patterns of nasality. We then train a temporal convolution-based speech inversion system in a speaker-independent fashion to estimate VP movement for nasality, using nasalance as the ground truth. In further experiments, we also show the importance of incorporating source features (from glottal activity) to improve nasality prediction.

Summary

The paper introduces a speaker-independent speech inversion system that estimates velopharyngeal movement dynamics, using nasalance, validated against high-speed nasopharyngoscopy, as a proxy.
Findings show a significant correlation between nasalance and direct HSN measures of VP opening, supporting the use of nasalance for estimating nasality patterns non-invasively.
Adding source features from glottal activity, such as those derived from electroglottography, improves estimation accuracy, which benefits non-invasive articulatory measurement in clinical and linguistic applications.

Analyzing Speaker-Independent Speech Inversion for Estimation of Nasalance

The paper "Speaker-independent Speech Inversion for Estimation of Nasalance" examines how to evaluate and estimate movement of the velopharyngeal (VP) valve, which regulates the opening between the oral and nasal cavities and therefore plays a central role in speech production. The study first correlates nasalance, an objective measure derived from oral and nasal acoustic signals, with direct measures of VP opening obtained via high-speed nasopharyngoscopy (HSN). The results bear on non-invasive articulatory measurement, which matters in both clinical and linguistic contexts.

Summary of Approach and Findings

The authors present a temporal convolution-based speech inversion (SI) system designed to predict VP movement dynamics from speech signals in a speaker-independent manner. The study uses nasalance, validated against HSN, as the ground truth for training the SI models. The findings show a significant correlation between the nasalance and HSN signals, supporting the use of nasalance for estimating nasality patterns. Relying on a non-invasive measure (nasalance) also makes the approach applicable across a broader range of speakers.

To improve nasality prediction, the study also integrates source features from glottal activity, such as those derived from electroglottography (EGG), alongside the main acoustic inputs such as auditory spectrograms. Including these features improves the estimation of VP movement, as measured by Pearson product-moment correlation (PPMC) scores between the estimated and ground-truth signals.
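
As an illustration of the evaluation metric, the sketch below computes a PPMC score between an estimated trajectory and a reference trajectory. The function name `ppmc` and the toy signals are assumptions made for this example; the paper's own evaluation code is not shown in the source.

```python
import numpy as np

def ppmc(estimated, reference):
    """Pearson product-moment correlation between two equal-length 1-D signals."""
    x = np.asarray(estimated, dtype=float)
    y = np.asarray(reference, dtype=float)
    if x.ndim != 1 or x.shape != y.shape:
        raise ValueError("inputs must be 1-D arrays of equal length")
    # Remove the mean so the score reflects shape, not offset.
    xc = x - x.mean()
    yc = y - y.mean()
    denom = np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))
    if denom == 0:
        # A constant signal has no variance, so the correlation is undefined.
        return float("nan")
    return float(np.sum(xc * yc) / denom)

# Hypothetical toy trajectories (not data from the paper).
reference = np.array([0.1, 0.4, 0.8, 0.6, 0.2])
estimated = 0.5 * reference + 0.3  # same shape, different scale and offset
score = ppmc(estimated, reference)
```

Because PPMC ignores scale and offset, an estimate that tracks the shape of the reference scores close to 1 even if its absolute values differ.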

Detailed Analysis of Methodology

The methodology starts with data collection and processing. A subset of data from an ongoing collaboration was used, featuring both nasal and oral microphone recordings from multiple speakers with varying demographic attributes. Dataset preparation includes synchronizing the direct HSN measures with the acoustic recordings and computing the nasalance parameter. The processing steps include high-pass filtering, RMS calculation, and signal synchronization.
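
A minimal sketch of a frame-wise nasalance computation follows. It assumes the commonly used definition, nasal RMS divided by the sum of nasal and oral RMS; the summary does not state the paper's exact formula, and the frame length, hop size, and function names here are hypothetical. The high-pass filtering and HSN synchronization steps are omitted.

```python
import numpy as np

def frame_rms(signal, frame_len, hop):
    """Root-mean-square energy of each frame of a 1-D signal."""
    x = np.asarray(signal, dtype=float)
    if len(x) < frame_len:
        return np.array([])
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.array([
        np.sqrt(np.mean(x[i * hop : i * hop + frame_len] ** 2))
        for i in range(n_frames)
    ])

def nasalance(nasal, oral, frame_len=400, hop=160, eps=1e-12):
    """Frame-wise nasalance in [0, 1]: nasal RMS / (nasal RMS + oral RMS).

    Assumed definition; frame_len and hop are illustrative values
    (25 ms and 10 ms at a 16 kHz sampling rate).
    """
    n = frame_rms(nasal, frame_len, hop)
    o = frame_rms(oral, frame_len, hop)
    m = min(len(n), len(o))  # guard against unequal recording lengths
    return n[:m] / (n[:m] + o[:m] + eps)  # eps avoids division by zero in silence

# Hypothetical constant-amplitude channels: oral is three times louder than nasal.
nasal_sig = np.ones(1600)
oral_sig = 3.0 * np.ones(1600)
track = nasalance(nasal_sig, oral_sig)
```

With these toy inputs every frame has nasal RMS 1 and oral RMS 3, so the nasalance track is 0.25 throughout.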

The SI system employs a Temporal Convolution Network (TCN) architecture. Hyperparameters such as learning rate, batch size, and optimizer settings are tuned so that the SI models learn the mapping from acoustic signals to articulatory features.
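
To show the basic mechanism behind temporal convolution networks, the sketch below stacks dilated 1-D convolutions in plain NumPy. It is not the paper's architecture: the causal padding, kernel size, dilation schedule, ReLU activation, and all function names are assumptions chosen for illustration, and no training is performed.

```python
import numpy as np

def dilated_causal_conv1d(x, kernel, dilation):
    """y[t] = sum_k kernel[k] * x[t - k * dilation], with zeros before t = 0."""
    x = np.asarray(x, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    y = np.zeros_like(x)
    for k, w in enumerate(kernel):
        shift = k * dilation
        if shift == 0:
            y += w * x
        elif shift < len(x):
            y[shift:] += w * x[:-shift]
    return y

def receptive_field(kernel_size, dilations):
    """Number of input frames that can influence one output frame."""
    return 1 + (kernel_size - 1) * sum(dilations)

def tiny_tcn(x, kernels, dilations):
    """Stack of dilated causal convolutions, each followed by a ReLU."""
    h = np.asarray(x, dtype=float)
    for kernel, d in zip(kernels, dilations):
        h = np.maximum(dilated_causal_conv1d(h, kernel, d), 0.0)
    return h

# Feed a unit impulse through three layers with dilations 1, 2, 4.
impulse = np.zeros(8)
impulse[0] = 1.0
dilations = [1, 2, 4]
kernels = [[1.0, 1.0]] * 3  # hypothetical fixed weights, not learned
response = tiny_tcn(impulse, kernels, dilations)
span = receptive_field(2, dilations)
```

Doubling the dilation at each layer lets the receptive field grow exponentially with depth: here three layers with kernel size 2 cover 8 frames, which is why the impulse response fills all 8 positions.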

Implications and Future Prospects

The work improves understanding of the relationship between nasalance and velar constriction, and offers a methodological framework for non-invasive characterization of speech articulation. Potential applications include speech therapy, automated accent assessment, and other linguistic technologies that benefit from articulatory detail without direct measurement tools.

Future research may refine these models by incorporating larger and more varied datasets and by exploring additional acoustic features that could improve the robustness and reliability of such SI systems. The work may also lead to more accessible articulatory models, supporting research and clinical work in geographically isolated or resource-constrained settings.

By advancing the study of velopharyngeal articulation through non-invasive measures, this study contributes to speech science, with practical and theoretical relevance for both AI-driven speech applications and linguistics.
