Determine the cause of poor C50 control in natural-language-conditioned TTS
Determine why the speech language model trained on the MLS and LibriTTS-R datasets, conditioned solely on natural language descriptions that include recording quality attributes, performs poorly at generating audio with the specified value of the reverberation metric C50 (the ratio, usually expressed in dB, of sound energy arriving within the first 50 ms to the energy arriving afterwards). This contrasts with its satisfactory control over other attributes such as mean pitch, pitch variability, speaking rate, and signal-to-noise ratio.
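For readers unfamiliar with the metric, the following is a minimal sketch of the standard C50 definition applied to a room impulse response. It is an illustration only and not necessarily how the paper's annotations were obtained, since labelling speech corpora generally requires estimating C50 blindly from the speech signal rather than from a measured impulse response. The function name c50_db and the toy impulse response are hypothetical, and the sketch assumes the impulse response starts at the arrival of the direct sound.

```python
import math

def c50_db(impulse_response, sample_rate):
    """Clarity index C50 in dB: energy in the first 50 ms over the energy after it.

    Assumes impulse_response[0] coincides with the arrival of the direct sound.
    """
    split = int(round(0.050 * sample_rate))  # sample index at 50 ms
    early = sum(x * x for x in impulse_response[:split])
    late = sum(x * x for x in impulse_response[split:])
    if late == 0.0:
        return math.inf   # no late energy at all: a completely "dry" response
    if early == 0.0:
        return -math.inf  # no early energy at all
    return 10.0 * math.log10(early / late)

# Toy impulse response at 1 kHz (hypothetical data):
# direct sound at t = 0 and a single late echo at 100 ms.
sample_rate = 1000
h = [0.0] * 200
h[0] = 1.0
h[100] = math.sqrt(0.1)  # echo carries one tenth of the direct-sound energy

value = c50_db(h, sample_rate)
print(f"C50 = {value:.2f} dB")
```

Higher values correspond to drier, less reverberant audio and lower values to more reverberant audio, which is the axis the model is asked to control from the text description.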
References
We are unsure as to why the model performs poorly at generating audio with the appropriate C50, and further investigation is required.
— Natural language guidance of high-fidelity text-to-speech with synthetic annotations
(2402.01912 - Lyth et al., 2024) in Section 4.1 Objective evaluation