Determine the cause of poor C50 control in natural-language-conditioned TTS

Determine why the speech language model trained on the MLS and LibriTTS-R datasets, conditioned solely on natural language descriptions that include recording quality attributes, performs poorly at generating audio with the requested C50, a reverberation metric defined as the ratio of early to late reflections. The same model shows satisfactory control over mean pitch, pitch variability, speaking rate, and signal-to-noise ratio.

Background

The study labels recording quality using estimated signal-to-noise ratio (SNR) and estimated C50. C50 is the ratio, usually expressed in decibels, of the sound energy arriving within the first 50 ms to the energy arriving later, so lower values indicate a more reverberant recording. These labels are converted into natural language descriptions and used as conditioning to train a single speech language model that controls speaker identity, style, and recording conditions.
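
As an illustration of the metric itself, the sketch below computes C50 from a known room impulse response using the standard definition with a 50 ms boundary. This is not the study's procedure: the study works with C50 estimated from recordings, and the function name `c50_db`, its parameters, and the choice of the strongest sample as the direct-path onset are assumptions made for this example.

```python
import math


def c50_db(impulse_response, sample_rate, split_ms=50.0):
    """Clarity index C50 in dB for a room impulse response (illustrative sketch).

    Assumption: the sample with the largest magnitude marks the direct-path
    arrival, and the early window covers the split_ms milliseconds after it.
    """
    h = [float(x) for x in impulse_response]
    if not h:
        raise ValueError("impulse response is empty")

    # Index of the first sample with maximal magnitude, used as time zero.
    onset = max(range(len(h)), key=lambda i: abs(h[i]))
    split = onset + int(round(sample_rate * split_ms / 1000.0))

    early_energy = sum(x * x for x in h[onset:split])
    late_energy = sum(x * x for x in h[split:])

    if late_energy == 0.0:
        # No energy after the boundary: the ratio is unbounded.
        return math.inf
    return 10.0 * math.log10(early_energy / late_energy)
```

A response with most of its energy in the first 50 ms yields a large positive value, while a long reverberant tail pushes the value down toward or below 0 dB.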

In objective evaluations, the authors correlate synthesized audio features with the requested description labels. While the model shows good alignment for mean pitch, pitch standard deviation, speaking rate, and SNR, it fails to match the requested C50 values. The authors explicitly state uncertainty about this failure and call for further investigation, indicating an unresolved question regarding control of reverberation characteristics via natural language prompts.
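
One way to quantify this kind of alignment is a correlation coefficient between the requested attribute level and the value measured in the synthesized audio. The sketch below is a minimal, assumed setup rather than the authors' evaluation code: `pearson` is a hypothetical helper, the integer bin indices stand in for ordered description labels, and the measured values are toy numbers chosen only to contrast a tracked attribute with an ignored one.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy data, NOT results from the study.
# Requested levels: ordered bins of a description label (0 = lowest).
requested_bins = [0, 1, 2, 3, 4]
# Measured attribute that follows the request (hypothetical values).
measured_tracked = [10.0, 20.0, 30.0, 40.0, 50.0]
# Measured attribute that does not respond to the request (hypothetical values).
measured_ignored = [30.0, 31.0, 30.0, 31.0, 30.0]

r_tracked = pearson(requested_bins, measured_tracked)
r_ignored = pearson(requested_bins, measured_ignored)
```

In this toy setup `r_tracked` is close to 1 and `r_ignored` is close to 0, which mirrors the qualitative contrast described above between the well-controlled attributes and C50.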

References

We are unsure as to why the model performs poorly at generating audio with the appropriate C50, and further investigation is required.

Natural language guidance of high-fidelity text-to-speech with synthetic annotations (2402.01912 - Lyth et al., 2024), Section 4.1, Objective evaluation