Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Published 14 Nov 2023 in eess.AS, cs.CL, and cs.LG | arXiv:2311.07919v2

Abstract: Recently, instruction-following audio-LLMs have received broad attention for audio interaction with humans. However, the absence of pre-trained audio models capable of handling diverse audio types and tasks has hindered progress in this field. Consequently, most existing works have only been able to support a limited range of interaction capabilities. In this paper, we develop the Qwen-Audio model and address this limitation by scaling up audio-language pre-training to cover over 30 tasks and various audio types, such as human speech, natural sounds, music, and songs, to facilitate universal audio understanding abilities. However, directly co-training all tasks and datasets can lead to interference issues, as the textual labels associated with different datasets exhibit considerable variations due to differences in task focus, language, granularity of annotation, and text structure. To overcome the one-to-many interference, we carefully design a multi-task training framework by conditioning on a sequence of hierarchical tags to the decoder for encouraging knowledge sharing and avoiding interference through shared and specified tags respectively. Remarkably, Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts. Building upon the capabilities of Qwen-Audio, we further develop Qwen-Audio-Chat, which allows for input from various audios and text inputs, enabling multi-turn dialogues and supporting various audio-central scenarios.


Summary

  • The paper introduces a unified model that integrates an audio encoder with a large language model to overcome task-specific limitations.
  • It implements a hierarchical tag-based conditioning framework for multi-task training, achieving robust performance across varied benchmarks.
  • The model demonstrates state-of-the-art accuracy in speech recognition and audio analysis, advancing universal audio understanding and cross-modal interactions.

Advancing Universal Audio Understanding via Qwen-Audio

The paper "Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models" introduces a model designed to enhance audio interaction capabilities by comprehensively perceiving and understanding various audio types. The model addresses a limitation of existing pre-trained audio models, which are typically constrained to specific tasks or audio types. Qwen-Audio signals a significant shift toward unifying audio-language pre-training across diverse tasks and audio types, offering expanded capabilities without requiring task-specific fine-tuning.

Model Architecture

Qwen-Audio consists of an audio encoder and an LLM. The audio encoder, initialized from the Whisper-large-v2 model, processes diverse audio inputs by converting waveforms into mel-spectrograms and encoding them with a Transformer. The LLM, initialized from Qwen-7B, is a 32-layer Transformer decoder responsible for generating text sequences conditioned on the audio representations (Figure 1).

Figure 1: Overview of the Qwen-Audio architecture and multi-task pretraining.

This strategic integration enables the unified processing of multiple audio modalities, facilitating a broad spectrum of audio-language tasks.
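The Whisper-style front end mentioned above can be sketched in plain NumPy. This is an illustrative approximation only (16 kHz audio, 25 ms window, 10 ms hop, 80 mel bins, matching Whisper's published front-end settings); the function name and exact parameter values are assumptions for the sketch, not details taken from the paper.

```python
import numpy as np

def log_mel_spectrogram(wave, sr=16000, n_fft=400, hop=160, n_mels=80):
    """Approximate a Whisper-style log-mel front end: frame the waveform,
    take the power spectrum, and project through a triangular mel filterbank."""
    # Frame the signal and apply a Hann window.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    # Triangular mel filterbank.
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    mel = power @ fbank.T
    return np.log10(np.maximum(mel, 1e-10))
```

In the actual model, the resulting mel frames are consumed by the Whisper encoder stack rather than used directly.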

Multi-task Training Framework

To manage interference in multi-task training, which arises from variations in textual labels across datasets, the paper proposes a hierarchical tag-based conditioning framework: the decoder is conditioned on a sequence of tags specifying transcription versus analysis, audio language, task, output-text language, and timestamp usage, followed by output instructions. Shared tags encourage knowledge transfer across related tasks, while task-specific tags avoid interference. Particularly salient is the inclusion of Speech Recognition with Word-level Timestamps (SRWT), which requires fine-grained timestamp prediction and proves crucial for grounding audio signals, improving performance on related tasks.
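The hierarchical tag scheme described above can be sketched as a small prompt-building function. The tag spellings and ordering here are illustrative assumptions, not the exact special tokens used in the paper.

```python
def build_task_prompt(task="transcribe", audio_lang="en", text_lang="en",
                      timestamps=False):
    """Assemble a hierarchical tag prefix for the decoder.
    Transcription tasks share one root tag; analysis tasks share another,
    so related tasks share a prefix (knowledge transfer) while task-specific
    tags keep their outputs separated (interference avoidance)."""
    root = ("<|startoftranscription|>" if task in ("transcribe", "translate")
            else "<|startofanalysis|>")
    tags = [
        root,
        f"<|{audio_lang}|>",   # language of the audio (or a none-language tag)
        f"<|{task}|>",         # task tag: transcribe, translate, caption, ...
        f"<|{text_lang}|>",    # language of the output text
        "<|timestamps|>" if timestamps else "<|notimestamps|>",
    ]
    return "".join(tags)
```

For example, an English ASR request and an English-to-German translation request would share the transcription root and audio-language tags but diverge at the task tag.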

Evaluation Results

The evaluation of Qwen-Audio was conducted over twelve datasets covering Automatic Speech Recognition (ASR), Speech-to-Text Translation (S2TT), and other audio analysis tasks such as sound classification and music note analysis. The results indicate superior performance compared to existing models, notably achieving state-of-the-art results on multiple benchmarks without fine-tuning, including the AISHELL-1 and CoVoST 2 datasets (Figure 2).

Figure 2: Performance of Qwen-Audio compared to top-tier models.

Qwen-Audio's comprehensive training framework effectively leverages shared tags for knowledge transfer, yielding high accuracy and robust performance across tasks.
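ASR benchmarks such as AISHELL-1 and LibriSpeech are scored by word (or character) error rate. As a reference point for how such numbers are computed, here is a minimal sketch of the standard Levenshtein-based WER; this is generic evaluation code, not taken from the paper.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference
    words, computed via edit-distance dynamic programming over word lists."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

For Mandarin benchmarks the same computation is typically applied at the character level (CER); for S2TT, BLEU is used instead.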

Implications and Future Directions

The introduction of Qwen-Audio offers significant implications for AI's capacity to understand and interact with audio. Future advancements may enhance model adaptivity, efficiency, and cross-modal integration, potentially influencing AGI development. Qwen-Audio's open-source nature will likely foster collaborative growth within the audio-text multimodal community, spurring further innovations in universal audio comprehension.

Conclusion

Qwen-Audio represents a marked progression in audio-LLMs, exhibiting versatility in handling a diverse array of audio types and tasks. Its hierarchical multitask training framework addresses interference issues, facilitating effective knowledge sharing. As a foundation, Qwen-Audio and its interactive variant, Qwen-Audio-Chat, promote universal understanding and multi-turn dialogue interactions, setting a precedent for future development in multimodal AI systems.
