Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Published 14 Nov 2023 in eess.AS, cs.CL, and cs.LG | arXiv:2311.07919v2

Abstract: Recently, instruction-following audio-LLMs have received broad attention for audio interaction with humans. However, the absence of pre-trained audio models capable of handling diverse audio types and tasks has hindered progress in this field. Consequently, most existing works have only been able to support a limited range of interaction capabilities. In this paper, we develop the Qwen-Audio model and address this limitation by scaling up audio-language pre-training to cover over 30 tasks and various audio types, such as human speech, natural sounds, music, and songs, to facilitate universal audio understanding abilities. However, directly co-training all tasks and datasets can lead to interference issues, as the textual labels associated with different datasets exhibit considerable variations due to differences in task focus, language, granularity of annotation, and text structure. To overcome the one-to-many interference, we carefully design a multi-task training framework by conditioning on a sequence of hierarchical tags to the decoder for encouraging knowledge sharing and avoiding interference through shared and specified tags respectively. Remarkably, Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts. Building upon the capabilities of Qwen-Audio, we further develop Qwen-Audio-Chat, which allows for input from various audios and text inputs, enabling multi-turn dialogues and supporting various audio-central scenarios.


Summary

  • The paper introduces a unified model that integrates an audio encoder with a large language model to overcome task-specific limitations.
  • It implements a hierarchical tag-based conditioning framework for multi-task training, achieving robust performance across varied benchmarks.
  • The model demonstrates state-of-the-art accuracy in speech recognition and audio analysis, advancing universal audio understanding and cross-modal interactions.

Advancing Universal Audio Understanding via Qwen-Audio

The paper "Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models" introduces a model designed to enhance audio interaction capabilities by comprehensively perceiving and understanding various audio types. The model addresses a limitation of existing pre-trained audio models, which are typically constrained to specific tasks or audio types. Qwen-Audio signals a significant shift toward unifying audio-language pre-training across diverse tasks and audio types, offering expanded capabilities without requiring task-specific fine-tuning.

Model Architecture

Qwen-Audio consists of an audio encoder and an LLM. The audio encoder, initialized from the Whisper-large-v2 model, processes diverse audio inputs by converting waveforms into mel-spectrograms and encoding them with a Transformer. The LLM, initialized from Qwen-7B, is a 32-layer Transformer decoder responsible for generating text sequences conditioned on the audio representations (Figure 1).

Figure 1: Overview of the Qwen-Audio architecture and multi-task pretraining.

This strategic integration enables the unified processing of multiple audio modalities, facilitating a broad spectrum of audio-language tasks.
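The Whisper-style front end mentioned above can be sketched in plain NumPy. This is an illustrative approximation only (16 kHz audio, 25 ms window, 10 ms hop, 80 mel bins, matching Whisper's published front-end settings); the function name and exact parameter values are assumptions for the sketch, not details taken from the paper.

```python
import numpy as np

def log_mel_spectrogram(wave, sr=16000, n_fft=400, hop=160, n_mels=80):
    """Approximate a Whisper-style log-mel front end: frame the waveform,
    take the power spectrum, and project through a triangular mel filterbank."""
    # Frame the signal and apply a Hann window.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    # Triangular mel filterbank.
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    mel = power @ fbank.T
    return np.log10(np.maximum(mel, 1e-10))
```

In the actual model, the resulting mel frames are consumed by the Whisper encoder stack rather than used directly.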

Multi-task Training Framework

To manage interference in multi-task training, which arises from variations in textual labels across datasets, the paper proposes a hierarchical tag-based conditioning framework: the decoder is conditioned on a sequence of tags specifying transcription versus analysis, audio language, task, output-text language, and timestamp usage, followed by output instructions. Shared tags encourage knowledge transfer across related tasks, while task-specific tags avoid interference. Particularly salient is the inclusion of Speech Recognition with Word-level Timestamps (SRWT), which requires fine-grained timestamp prediction and proves crucial for grounding audio signals, improving performance on related tasks.
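The hierarchical tag scheme described above can be sketched as a small prompt-building function. The tag spellings and ordering here are illustrative assumptions, not the exact special tokens used in the paper.

```python
def build_task_prompt(task="transcribe", audio_lang="en", text_lang="en",
                      timestamps=False):
    """Assemble a hierarchical tag prefix for the decoder.
    Transcription tasks share one root tag; analysis tasks share another,
    so related tasks share a prefix (knowledge transfer) while task-specific
    tags keep their outputs separated (interference avoidance)."""
    root = ("<|startoftranscription|>" if task in ("transcribe", "translate")
            else "<|startofanalysis|>")
    tags = [
        root,
        f"<|{audio_lang}|>",   # language of the audio (or a none-language tag)
        f"<|{task}|>",         # task tag: transcribe, translate, caption, ...
        f"<|{text_lang}|>",    # language of the output text
        "<|timestamps|>" if timestamps else "<|notimestamps|>",
    ]
    return "".join(tags)
```

For example, an English ASR request and an English-to-German translation request would share the transcription root and audio-language tags but diverge at the task tag.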

Evaluation Results

The evaluation of Qwen-Audio was conducted over twelve datasets covering Automatic Speech Recognition (ASR), Speech-to-Text Translation (S2TT), and other audio analysis tasks such as sound classification and music note analysis. The results indicate superior performance compared to existing models, notably achieving state-of-the-art results on multiple benchmarks without fine-tuning, including the AISHELL-1 and CoVoST 2 datasets (Figure 2).

Figure 2: Performance of Qwen-Audio compared to top-tier models.

Qwen-Audio's comprehensive training framework effectively leverages shared tags for knowledge transfer, yielding high accuracy and robust performance across tasks.
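ASR benchmarks such as AISHELL-1 and LibriSpeech are scored by word (or character) error rate. As a reference point for how such numbers are computed, here is a minimal sketch of the standard Levenshtein-based WER; this is generic evaluation code, not taken from the paper.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference
    words, computed via edit-distance dynamic programming over word lists."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

For Mandarin benchmarks the same computation is typically applied at the character level (CER); for S2TT, BLEU is used instead.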

Implications and Future Directions

The introduction of Qwen-Audio offers significant implications for AI's capacity to understand and interact with audio. Future advancements may enhance model adaptivity, efficiency, and cross-modal integration, potentially influencing AGI development. Qwen-Audio's open-source nature will likely foster collaborative growth within the audio-text multimodal community, spurring further innovations in universal audio comprehension.

Conclusion

Qwen-Audio represents a marked progression in audio-LLMs, exhibiting versatility in handling a diverse array of audio types and tasks. Its hierarchical multitask training framework addresses interference issues, facilitating effective knowledge sharing. As a foundation, Qwen-Audio and its interactive variant, Qwen-Audio-Chat, promote universal understanding and multi-turn dialogue interactions, setting a precedent for future development in multimodal AI systems.
