
SoundCTM: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation

Published 28 May 2024 in cs.SD, cs.LG, and eess.AS (arXiv:2405.18503v3)

Abstract: Sound content creation, essential for multimedia works such as video games and films, often involves extensive trial-and-error, enabling creators to semantically reflect their artistic ideas and inspirations, which evolve throughout the creation process, into the sound. Recent high-quality diffusion-based Text-to-Sound (T2S) generative models provide valuable tools for creators. However, these models often suffer from slow inference speeds, imposing an undesirable burden that hinders the trial-and-error process. While existing T2S distillation models address this limitation through 1-step generation, the sample quality of 1-step generation remains insufficient for production use. Additionally, while multi-step sampling in those distillation models improves sample quality, it changes the semantic content because those models lack deterministic sampling capabilities. To address these issues, we introduce Sound Consistency Trajectory Models (SoundCTM), which allow flexible transitions between high-quality 1-step sound generation and superior sound quality through multi-step deterministic sampling. This allows creators to efficiently conduct trial-and-error with 1-step generation to semantically align samples with their intention, and subsequently refine sample quality while preserving semantic content through deterministic multi-step sampling. To develop SoundCTM, we reframe the CTM training framework, originally proposed in computer vision, and introduce a novel feature distance using the teacher network for a distillation loss. For production-level generation, we scale up our model to 1B trainable parameters, making SoundCTM-DiT-1B the first large-scale distillation model in the sound community to achieve both promising high-quality 1-step and multi-step full-band (44.1kHz) generation.


Summary

  • The paper introduces SoundCTM, a novel framework that integrates score-based and consistency models to achieve rapid one-step and refined multi-step text-to-sound generation.
  • The paper utilizes a distillation loss based on teacher network feature distances and classifier-free guided trajectories to optimize both inference speed and output quality.
  • The paper demonstrates that SoundCTM attains a one-step FAD of 2.17 and enables real-time, training-free controllable sound synthesis on both GPU and CPU platforms.

Sound Consistency Trajectory Models (SoundCTM)

The paper introduces Sound Consistency Trajectory Models (SoundCTM), a novel approach for text-to-sound (T2S) generation aimed at addressing the high inference latency typically associated with diffusion-based sound generation models. SoundCTM enables flexible transitioning between high-quality one-step sound generation and superior multi-step sound generation, providing creators with an efficient and versatile tool for real-time sound synthesis.
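The core object behind this flexibility is CTM's "anytime-to-anytime" jump: a student network that maps a sample at any time t directly to any earlier time s along the teacher's probability-flow ODE trajectory. The sketch below illustrates the sampling interface only; `G_theta` here is a toy linear stand-in (all names are illustrative, not the paper's implementation), chosen so that the self-consistency property of trajectory models holds exactly.

```python
import numpy as np

def G_theta(x_t, t, s, text_cond=None):
    """Toy stand-in for a trained SoundCTM jump G_theta(x_t, t, s).
    A real model is a neural network conditioned on (t, s, text);
    this linear map merely satisfies the trajectory self-consistency
    G(G(x, t, u), u, s) == G(x, t, s) so the sketch runs end to end."""
    return x_t * (s / t) if t > 0 else x_t

def one_step_sample(x_T, T=80.0):
    """1-step generation: jump directly from noise at t=T to s=0."""
    return G_theta(x_T, T, 0.0)

def multi_step_sample(x_T, timesteps):
    """Deterministic multi-step sampling via successive jumps t_i -> t_{i+1}.
    No fresh noise is injected between jumps, which is why semantic
    content is preserved as the step count grows."""
    x = x_T
    for t, s in zip(timesteps[:-1], timesteps[1:]):
        x = G_theta(x, t, s)
    return x
```

Because the jumps are deterministic and self-consistent, the 1-step and multi-step paths land on the same trajectory endpoint, which is exactly the property that lets a creator refine quality without the sample's semantics drifting.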

Background and Challenges

Recent advancements in diffusion-based models have demonstrated significant promise in generating high-quality sounds for multimedia applications. However, the iterative sampling process inherent in these models results in slow inference speeds. This latency is particularly burdensome for sound creators who require rapid feedback to refine and align sounds with their artistic intentions. Addressing the slow inference problem is crucial for making these models more practical and appealing to sound creators.

SoundCTM: A Novel Framework

SoundCTM offers a solution by allowing flexible switching between one-step high-quality sound generation and higher-quality multi-step generation. The framework introduces several innovations:

  1. Feature Distance from the Teacher Network: Rather than relying on an expensive pretrained feature extractor or an adversarial loss, SoundCTM derives a novel feature distance from the teacher network itself for its distillation loss, reducing memory usage while improving sample quality.
  2. Classifier-Free Guided Trajectories: The framework distills classifier-free guided text-conditional trajectories, simultaneously training conditional and unconditional student models.
  3. Interpolation During Inference: During sampling, SoundCTM leverages a new scaling term to interpolate between text-conditional and unconditional neural jumps, enhancing the flexibility and quality of the generated sounds.
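The inference-time interpolation in item 3 can be sketched as a convex/extrapolating combination of the conditional and unconditional student jumps, in the spirit of classifier-free guidance. The function and the scale name `nu` below are illustrative assumptions, not the paper's exact formulation:

```python
def guided_jump(x_t, t, s, nu, G_cond, G_uncond):
    """Blend text-conditional and unconditional neural jumps with a
    scaling term nu. nu=1 recovers the purely conditional jump, nu=0
    the unconditional one, and nu>1 extrapolates CFG-style, trading
    diversity for stronger text adherence."""
    return nu * G_cond(x_t, t, s) + (1.0 - nu) * G_uncond(x_t, t, s)
```

Because both student branches are distilled jointly (item 2), the same scale can be applied at every jump of a multi-step deterministic sampler.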

Experimental Results

The paper reports comprehensive experiments demonstrating SoundCTM's effectiveness across various metrics such as Fréchet Audio Distance (FAD), Inception Score (IS), and CLAP score. Key findings include:

  • High-Quality One-Step Generation: SoundCTM's one-step generation achieves a FAD of 2.17, outperforming other models like ConsistencyTTA.
  • Flexible Multi-Step Generation: With 16-step sampling, SoundCTM achieves superior performance, showcasing FAD improvements and real-time generation capabilities on both GPU and CPU platforms.
  • Training-Free Controllable Generation: SoundCTM supports training-free controllable sound generation, leveraging its anytime-to-anytime jump capability to optimize initial noise with significant efficiency.
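The training-free control in the last bullet amounts to treating generation as an optimization over the initial noise: because the anytime-to-anytime jump is deterministic and differentiable, a loss on the generated sample can be pushed back to the noise. The toy below replaces the network with a linear map and uses an analytic gradient; everything here (the generator, loss, and step sizes) is an illustrative assumption, not the paper's procedure.

```python
import numpy as np

def toy_generator(x_T):
    """Stand-in for a deterministic 1-step SoundCTM jump from noise."""
    return 0.5 * x_T

def control_loss(sample, target):
    """Example control objective: mean squared error to a target signal."""
    return float(np.mean((sample - target) ** 2))

def optimize_noise(x_T, target, lr=0.5, steps=50):
    """Gradient descent on the initial noise. A real system would
    backpropagate through the network; here the gradient of the toy
    objective is written out analytically."""
    x = x_T.copy()
    for _ in range(steps):
        grad = 0.5 * 2.0 * (toy_generator(x) - target) / x.size
        x -= lr * grad
    return x
```

Since every optimization step only needs a cheap deterministic jump rather than a full diffusion sampling chain, this style of control stays close to real time.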

Implications and Future Developments

The introduction of SoundCTM holds several implications for both practical applications and theoretical developments in sound generation:

  1. Real-Time Sound Synthesis: By addressing the issue of slow inference, SoundCTM can significantly enhance the efficiency of sound creation workflows, making it a valuable tool for Foley artists and multimedia content creators.
  2. Versatility Across Modalities: The domain-agnostic nature of the proposed framework suggests potential applicability to other modalities beyond sound, paving the way for broader adoption in multimedia generation tasks.
  3. Dynamic Sound Generation: The ability to achieve real-time dynamic sound generation opens new possibilities for live performances, interactive exhibitions, and immersive video game experiences.

Future research could further explore the integration of SoundCTM with other state-of-the-art models and techniques, as well as potential applications beyond the current scope. Enhancing the interpretability of the generated sounds and improving the robustness of the framework in diverse environments are also promising directions.

In conclusion, SoundCTM presents a significant step forward in the evolution of sound generation models, offering a blend of flexibility, efficiency, and high-quality output. The paper provides valuable insights and practical solutions that address key challenges in the field, making it a notable contribution to the ongoing development of advanced sound synthesis technologies.
