Unreported processing speeds of prior in-the-wild speech preprocessing pipelines

Determine the processing speeds of two previously proposed automatic preprocessing pipelines for in-the-wild speech data, AutoPrep and WenetSpeech4TTS. Neither pipeline's efficiency was reported, so a fair comparison with the open-source Emilia-Pipe in terms of throughput and scalability is not yet possible.

Background

The paper contrasts Emilia-Pipe with earlier automatic preprocessing pipelines for in-the-wild speech data, such as AutoPrep and WenetSpeech4TTS. According to the authors, these prior pipelines rely heavily on proprietary models, which makes them less accessible to the broader community and harder to compare against.

The authors also note that the processing speeds of these prior pipelines were never reported. Without those figures, the efficiency and scalability of AutoPrep and WenetSpeech4TTS cannot be compared directly with those of Emilia-Pipe, which the authors do benchmark.
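To make "processing speed" concrete, the following is a minimal sketch of the metrics such a comparison might report: hours of raw audio processed per wall-clock hour, the same figure normalized per GPU, and the real-time factor. The `PipelineRun` record, its field names, and the function names are assumptions introduced here for illustration; they do not come from the paper or from any of the pipelines, and the sketch does not include any measured values.

```python
from dataclasses import dataclass


@dataclass
class PipelineRun:
    """One timed run of a preprocessing pipeline (hypothetical record)."""
    name: str
    audio_hours: float       # hours of raw audio fed to the pipeline
    wall_clock_hours: float  # elapsed wall-clock time for the run
    num_gpus: int            # accelerators used during the run


def throughput(run: PipelineRun) -> float:
    """Hours of raw audio processed per wall-clock hour (higher is faster)."""
    if run.wall_clock_hours <= 0:
        raise ValueError("wall_clock_hours must be positive")
    return run.audio_hours / run.wall_clock_hours


def per_gpu_throughput(run: PipelineRun) -> float:
    """Throughput divided by GPU count, to compare runs on different hardware."""
    if run.num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return throughput(run) / run.num_gpus


def real_time_factor(run: PipelineRun) -> float:
    """Wall-clock time per unit of audio (lower is faster)."""
    if run.audio_hours <= 0:
        raise ValueError("audio_hours must be positive")
    return run.wall_clock_hours / run.audio_hours
```

Reporting the per-GPU figure alongside raw throughput matters for fairness, since pipelines timed on different hardware budgets are otherwise not comparable. A fuller comparison would also fix the input data and the GPU model across runs.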

References

While previous works propose automatic preprocessing pipelines to address these issues, they rely heavily on proprietary models, making their pipelines less accessible to the broader community. Additionally, the processing speed of these pipelines remains unknown.

Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation (2407.05361 - He et al., 2024) in Section 1 (Introduction)