
Voice Attribute Editing with Text Prompt

Published 13 Apr 2024 in cs.SD, cs.AI, and eess.AS | (2404.08857v2)

Abstract: Despite recent advancements in speech generation with text prompts providing control over speech style, voice attributes in synthesized speech remain elusive and challenging to control. This paper introduces a novel task: voice attribute editing with text prompt, whose goal is to make relative modifications to voice attributes according to the actions described in the text prompt. To solve this task, VoxEditor, an end-to-end generative model, is proposed. In VoxEditor, to address the insufficiency of the text prompt, a Residual Memory (ResMem) block is designed that efficiently maps voice attributes and their descriptors into a shared feature space. Additionally, the ResMem block is enhanced with a voice attribute degree prediction (VADP) block to align voice attributes with the corresponding descriptors, addressing the imprecision of text prompts caused by non-quantitative descriptions of voice attributes. We also establish the open-source VCTK-RVA dataset, which leads the way in manual annotations detailing voice characteristic differences among different speakers. Extensive experiments demonstrate the effectiveness and generalizability of the proposed method in terms of both objective and subjective metrics. The dataset and audio samples are available on the website.

References (24)
  1. GPU accelerated t-distributed stochastic neighbor embedding. J. Parallel Distributed Comput., 131:1–13.
  2. WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE J. Sel. Top. Signal Process., 16(6):1505–1518.
  3. Cross-modal memory networks for radiology report generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 5904–5914. Association for Computational Linguistics.
  4. ECAPA-TDNN: emphasized channel attention, propagation and aggregation in TDNN based speaker verification. In Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, 25-29 October 2020, pages 3830–3834. ISCA.
  5. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.
  6. PromptTTS: Controllable text-to-speech with text descriptions. In IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP 2023, Rhodes Island, Greece, June 4-10, 2023, pages 1–5. IEEE.
  7. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
  8. TextrolSpeech: A text style control speech corpus with codec language text-to-speech models. CoRR, abs/2308.14430.
  9. Imagic: Text-based real image editing with diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 6007–6017. IEEE.
  10. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
  11. PromptTTS 2: Describing and generating voices with text prompt. CoRR, abs/2309.02285.
  12. FreeVC: Towards high-quality text-free one-shot voice conversion. In IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP 2023, Rhodes Island, Greece, June 4-10, 2023, pages 1–5. IEEE.
  13. PromptStyle: Controllable style transfer for text-to-speech with natural language descriptions. CoRR, abs/2305.19522.
  14. Seyed Hamidreza Mohammadi and Alexander Kain. 2017. An overview of voice conversion systems. Speech Commun., 88:65–82.
  15. Face-driven zero-shot voice conversion with memory-based face-voice alignment. In Proceedings of the 31st ACM International Conference on Multimedia, MM 2023, Ottawa, ON, Canada, 29 October 2023- 3 November 2023, pages 8443–8452. ACM.
  16. PromptTTS++: Controlling speaker identity in prompt-based text-to-speech using natural language descriptions. CoRR, abs/2309.08140.
  17. CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit.
  18. Zachary Wallmark and Roger A Kendall. 2018. Describing sound: The cognitive linguistics of timbre.
  19. COCO-NUT: corpus of japanese utterance and voice characteristics description for prompt-based control. In IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023, Taipei, Taiwan, December 16-20, 2023, pages 1–8. IEEE.
  20. Memory networks. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
  21. InstructTTS: Modelling expressive TTS in discrete latent space with natural language style prompt. CoRR, abs/2301.13662.
  22. PromptVC: Flexible stylistic voice conversion in latent space driven by natural language prompts. CoRR, abs/2309.09262.
  23. PromptSpeaker: Speaker generation based on text descriptions. In IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2023, Taipei, Taiwan, December 16-20, 2023, pages 1–7. IEEE.
  24. Content-dependent fine-grained speaker embedding for zero-shot speaker adaptation in text-to-speech synthesis. In Interspeech 2022, 23rd Annual Conference of the International Speech Communication Association, Incheon, Korea, 18-22 September 2022, pages 2573–2577. ISCA.

Summary

  • The paper introduces VoxEditor, an end-to-end model for editing voice attributes via text prompts.
  • It employs two dedicated modules, a Residual Memory (ResMem) block and a voice attribute degree prediction (VADP) block, to map qualitative voice features into a shared feature space and adjust them by degree.
  • Experimental evaluations on the VCTK-RVA dataset demonstrate significant enhancements in target voice attribute similarity.

Voice Attribute Editing with Text Prompt

Introduction to Voice Attribute Editing

The paper "Voice Attribute Editing with Text Prompt" introduces a novel task aimed at refining voice characteristics in synthesized speech through natural language cues. The primary goal is to achieve relative modifications to voice attributes - qualitative elements like "husky" or "bright" - dictated by textual prompts. This task stands apart from traditional voice conversion (VC), as it relies on text instead of reference audio to control voice attributes. This approach offers a practical solution for applications such as personalized voice creation, where finding specific reference audio is often challenging. Figure 1

Figure 1: Illustration of voice attribute editing with text prompt.

Methodology: VoxEditor

The proposed solution, VoxEditor, is an end-to-end generative model designed to address the insufficiencies and imprecision inherent in text prompts. It employs a novel Residual Memory (ResMem) block, alongside a voice attribute degree prediction (VADP) module, to align the text-provided voice attributes with their corresponding descriptors effectively.

Residual Memory (ResMem) Block: This component is key to mapping voice attributes into a shared feature space, compensating for aspects difficult to describe in text. It consists of a main memory which quantizes common characteristics and a residual memory that captures subtle nuances.
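To make the two-stage recall concrete, here is a minimal PyTorch sketch of one way such a block could work. The slot counts, embedding dimension, and plain dot-product attention are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResMemSketch(nn.Module):
    """Hypothetical two-stage memory recall.

    A main memory quantizes the common characteristics of a speaker
    embedding; a residual memory recalls the nuances the main memory
    misses. Because text descriptors can likewise be expressed as
    attention weights over the same slots, the two modalities end up
    in a shared feature space.
    """

    def __init__(self, dim: int = 256, n_main: int = 64, n_res: int = 64):
        super().__init__()
        self.main_mem = nn.Parameter(torch.randn(n_main, dim))
        self.res_mem = nn.Parameter(torch.randn(n_res, dim))

    def forward(self, spk_emb: torch.Tensor) -> torch.Tensor:
        # Coarse recall: attend over the main memory slots.
        w_main = F.softmax(spk_emb @ self.main_mem.t(), dim=-1)
        coarse = w_main @ self.main_mem
        # Fine recall: the residual memory accounts for what is left over.
        residual = spk_emb - coarse
        w_res = F.softmax(residual @ self.res_mem.t(), dim=-1)
        fine = w_res @ self.res_mem
        return coarse + fine  # recalled speaker embedding
```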

VADP Block: This module predicts the degree of difference in voice attributes between speakers, addressing the qualitative, non-quantitative nature of descriptors drawn from text prompts (Figure 2).

Figure 2: The overall flowchart of the proposed VoxEditor. During training, two speech segments (SpeechA and SpeechB) are used along with a voice attribute descriptor; at inference, the model takes the source speech and the text prompt as inputs to generate edited speech. Mel denotes Mel spectrograms and Linear denotes linear spectrograms.
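A degree predictor of this kind could be as simple as a small regressor over the two speaker embeddings and a descriptor embedding. The sketch below is a hypothetical layout; the layer sizes, the BERT-sized descriptor vector, and the sigmoid output range are all assumptions.

```python
import torch
import torch.nn as nn

class VADPSketch(nn.Module):
    """Hypothetical voice attribute degree predictor.

    Given embeddings of two speakers and an embedding of a descriptor
    (e.g. "husky"), regress a scalar degree in [0, 1] expressing how
    strongly the descriptor separates the two voices.
    """

    def __init__(self, spk_dim: int = 256, txt_dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * spk_dim + txt_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
            nn.Sigmoid(),  # degree in [0, 1]
        )

    def forward(self, emb_a, emb_b, descriptor_emb):
        # Concatenate both speakers with the descriptor and regress a degree.
        x = torch.cat([emb_a, emb_b, descriptor_emb], dim=-1)
        return self.net(x).squeeze(-1)
```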

Dataset and Experimental Validation

An essential contribution of this research is the creation of the open-source VCTK-RVA dataset, which manually annotates voice characteristic differences between speakers of the VCTK corpus. The dataset grounds qualitative voice attribute descriptors in concrete speaker pairs; a hypothetical annotation entry is sketched below.
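The summary does not show the released schema, but an annotation entry presumably pairs two speakers with the descriptor that distinguishes them. The record below is purely illustrative, with hypothetical field names and VCTK speaker IDs.

```python
# Hypothetical shape of one VCTK-RVA annotation entry; the actual
# field names and schema of the released dataset may differ.
annotation = {
    "source_speaker": "p225",  # VCTK speaker ID (illustrative)
    "target_speaker": "p226",
    "descriptor": "husky",     # how the target voice differs from the source
}
```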

Extensive experiments, utilizing both objective and subjective metrics, demonstrate VoxEditor's effectiveness. Results revealed that VoxEditor can produce high-quality speech that aligns closely with input text prompts while retaining the source speech's voice characteristics.

Evaluation and Results

Numerical evaluations showed significant improvements in metrics such as TVAS (Target Voice Attribute Similarity) for VoxEditor compared with existing methods such as PromptStyle. These results validate the model's ability to generate speech whose voice attributes are edited to match the text prompt (Figure 3).

Figure 3: The variation of the TVAS metric for generated speech edited with different attributes under various values of the editing degree alpha.
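The summary does not define TVAS precisely. One plausible reading, sketched below, scores an edited utterance by the cosine similarity between its speaker embedding and the centroid of embeddings from speakers exhibiting the target attribute, swept over the editing degree alpha as in Figure 3. Both the metric definition and the linear editing rule are assumptions, not the paper's formulas.

```python
import numpy as np

def tvas(edited_emb: np.ndarray, target_attr_embs: np.ndarray) -> float:
    """Cosine similarity between an edited utterance's speaker embedding
    and the centroid of embeddings for speakers with the target attribute
    (one plausible reading of TVAS, not the paper's definition)."""
    centroid = target_attr_embs.mean(axis=0)
    return float(
        edited_emb @ centroid
        / (np.linalg.norm(edited_emb) * np.linalg.norm(centroid))
    )

def edit_embedding(src_emb: np.ndarray, attr_dir: np.ndarray,
                   alpha: float) -> np.ndarray:
    """Hypothetical linear editing rule: move the source embedding
    toward an attribute direction by degree alpha."""
    return (1.0 - alpha) * src_emb + alpha * attr_dir

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    src_emb = rng.normal(size=256)
    attr_dir = rng.normal(size=256)
    target_attr_embs = rng.normal(size=(10, 256))
    # Sweep the editing degree, as in Figure 3.
    for alpha in (0.0, 0.5, 1.0):
        emb = edit_embedding(src_emb, attr_dir, alpha)
        print(f"alpha={alpha:.1f}  TVAS={tvas(emb, target_attr_embs):.3f}")
```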

Visualizations and User Study

Visual analyses, such as t-SNE visualizations of speaker embeddings, further demonstrate VoxEditor's ability to cluster edited speech by voice attribute (Figure 4). User studies additionally confirm that an editing degree alpha between 0.6 and 0.8 yields the most compelling balance of attribute modification and retention of the source voice characteristics (Figure 5).

Figure 4: The t-SNE visualization of the speaker embeddings extracted from generated speech edited with different attributes under various values of alpha.
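For reference, a projection like Figure 4 can be reproduced from any matrix of speaker embeddings in a few lines; the input file and the t-SNE hyperparameters below are illustrative.

```python
import numpy as np
from sklearn.manifold import TSNE

# Speaker embeddings extracted from generated speech with any
# off-the-shelf speaker encoder; the file name is hypothetical.
embeddings = np.load("speaker_embeddings.npy")  # shape: (n_utts, dim)

# Project to 2-D for visualization; hyperparameters are illustrative.
coords = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(embeddings)
```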

Figure 5: MOS-Cons and MOS-Corr scores with varying editing degrees alpha. Edited speech tends to match both the source speech and text prompt in the highlighted area.

Conclusion

VoxEditor represents a significant advancement in the field of AI-driven voice synthesis, offering a robust system for editing voice attributes via textual commands. Future research directions could include expanding the dataset and enhancing model capabilities to encompass a wider range of voice attribute adjustments. The outlined limitations suggest pathways for further refinement, particularly in addressing decreased performance in unseen conditions and improving dataset annotations.

The development of VoxEditor not only enhances the flexibility of voice editing tasks but also sets a precedent for employing natural language as a control mechanism in voice synthesis, highlighting the intersection of AI and linguistic descriptions in sophisticated audio processing tasks.
