SoundCTM: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation
Abstract: Sound content creation, essential for multimedia works such as video games and films, often involves extensive trial and error, through which creators refine the sound to reflect their evolving artistic ideas and inspirations. Recent high-quality diffusion-based Text-to-Sound (T2S) generative models provide valuable tools for creators, but they often suffer from slow inference, imposing an undesirable burden on this trial-and-error process. Existing T2S distillation models address this limitation through 1-step generation, but the sample quality of 1-step generation remains insufficient for production use. Moreover, while multi-step sampling in these distillation models improves sample quality, it changes the semantic content because the models lack deterministic sampling capabilities. To address these issues, we introduce Sound Consistency Trajectory Models (SoundCTM), which enable flexible transitions between high-quality 1-step sound generation and even higher sound quality via multi-step deterministic sampling. Creators can thus efficiently iterate with 1-step generation to semantically align samples with their intentions, and then refine sample quality while preserving semantic content through deterministic multi-step sampling. To develop SoundCTM, we reframe the CTM training framework, originally proposed in computer vision, and introduce a novel feature distance based on the teacher network for the distillation loss. For production-level generation, we scale our model to 1B trainable parameters, making SoundCTM-DiT-1B the first large-scale distillation model in the sound community to achieve both promising high-quality 1-step and multi-step full-band (44.1 kHz) generation.
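The abstract's key property is that a trajectory model can jump between arbitrary noise levels deterministically, so a 1-step draft and a multi-step refinement follow the same path. The following is a minimal sketch of that sampling loop, not the paper's implementation: `ctm_jump`, `sample`, and the `toy_model` stand-in are hypothetical names, and the toy network simply interpolates toward the conditioning vector to illustrate why chaining deterministic jumps preserves the 1-step result's content.

```python
import numpy as np

def ctm_jump(model, x_t, t, s, cond):
    # Hypothetical trajectory model G_theta(x_t, t, s): one jump from
    # noise level t down to level s along the probability-flow ODE.
    return model(x_t, t, s, cond)

def sample(model, cond, sigmas, shape, seed=0):
    # Deterministic multi-step sampling: chain jumps along a fixed,
    # decreasing noise schedule. A 2-entry schedule [sigma_max, 0] is
    # 1-step generation; more entries refine quality, and because the
    # path is deterministic the semantic content is preserved.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape) * sigmas[0]  # start from pure noise
    for t, s in zip(sigmas[:-1], sigmas[1:]):
        x = ctm_jump(model, x, t, s, cond)
    return x

def toy_model(x_t, t, s, cond):
    # Toy stand-in for a trained network (illustration only): a linear
    # map that is exactly self-consistent, so any two schedules with the
    # same endpoints produce the same sample.
    alpha = s / t if t > 0 else 0.0
    return alpha * x_t + (1.0 - alpha) * cond

cond = np.zeros(4)
one_step = sample(toy_model, cond, [80.0, 0.0], (4,))
multi_step = sample(toy_model, cond, [80.0, 10.0, 1.0, 0.0], (4,))
```

With a self-consistent model and the same seed, `one_step` and `multi_step` coincide; a real network is only approximately self-consistent, so extra steps improve quality without changing the semantics.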