WavLLM: Towards Robust and Adaptive Speech Large Language Model
Abstract: Recent advancements in large language models (LLMs) have revolutionized the field of natural language processing, progressively broadening their scope to multimodal perception and generation. However, effectively integrating listening capabilities into LLMs poses significant challenges, particularly with respect to generalizing across varied contexts and executing complex auditory tasks. In this work, we introduce WavLLM, a robust and adaptive speech LLM with dual encoders and a prompt-aware LoRA weight adapter, optimized by a two-stage curriculum learning approach. Leveraging the dual encoders, we decouple different types of speech information, utilizing a Whisper encoder to process the semantic content of speech and a WavLM encoder to capture the unique characteristics of the speaker's identity. Within the curriculum learning framework, WavLLM first builds its foundational capabilities by optimizing on mixed elementary single tasks, followed by advanced multi-task training on more complex tasks, such as combinations of the elementary tasks. To enhance flexibility and adherence to different tasks and instructions, a prompt-aware LoRA weight adapter is introduced in the second, advanced multi-task training stage. We validate the proposed model on universal speech benchmarks, including automatic speech recognition (ASR), speech translation (ST), speaker verification (SV), and emotion recognition (ER), and also apply it to specialized datasets such as the Gaokao English listening comprehension set for spoken question answering (SQA) and a speech Chain-of-Thought (CoT) evaluation set. Experiments demonstrate that the proposed model achieves state-of-the-art performance across a range of speech tasks at the same model size, exhibiting robust generalization in executing complex tasks with a CoT approach. Furthermore, our model successfully completes Gaokao tasks without specialized training. The code, models, audio samples, and the Gaokao evaluation set can be accessed at \url{aka.ms/wavllm}.
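As a minimal, illustrative sketch of the two architectural ideas named in the abstract (dual-encoder feature fusion and a prompt-aware LoRA adapter), the following PyTorch snippet shows one plausible way to combine semantic (Whisper-style) and speaker (WavLM-style) features and to gate a low-rank update with a prompt embedding. The module names, dimensions, and the scalar gating scheme are assumptions for clarity, not the authors' released implementation.

```python
# Hypothetical sketch: dual-encoder fusion + prompt-aware LoRA gating.
# Shapes, module names, and the gating mechanism are illustrative assumptions.
import torch
import torch.nn as nn


class PromptAwareLoRALinear(nn.Module):
    """Frozen base linear layer plus a low-rank update scaled by a prompt-derived gate."""

    def __init__(self, d_in: int, d_out: int, rank: int = 8, d_prompt: int = 256):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)  # base weights stay frozen, LoRA-style
        self.lora_a = nn.Linear(d_in, rank, bias=False)
        self.lora_b = nn.Linear(rank, d_out, bias=False)
        # Small head mapping a prompt embedding to a scalar weight in [0, 1].
        self.gate = nn.Sequential(nn.Linear(d_prompt, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor, prompt_emb: torch.Tensor) -> torch.Tensor:
        scale = self.gate(prompt_emb)  # (batch, 1)
        return self.base(x) + scale.unsqueeze(1) * self.lora_b(self.lora_a(x))


class DualEncoderFusion(nn.Module):
    """Concatenate frame-aligned semantic and speaker features, project to LLM width."""

    def __init__(self, d_semantic: int = 1280, d_speaker: int = 1024, d_llm: int = 4096):
        super().__init__()
        self.proj = nn.Linear(d_semantic + d_speaker, d_llm)

    def forward(self, semantic_feats: torch.Tensor, speaker_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(torch.cat([semantic_feats, speaker_feats], dim=-1))


if __name__ == "__main__":
    batch, frames = 2, 50
    semantic = torch.randn(batch, frames, 1280)  # stand-in for Whisper encoder output
    speaker = torch.randn(batch, frames, 1024)   # stand-in for WavLM encoder output
    prompt = torch.randn(batch, 256)             # stand-in for an encoded task prompt

    fused = DualEncoderFusion()(semantic, speaker)              # (2, 50, 4096)
    adapted = PromptAwareLoRALinear(4096, 4096)(fused, prompt)  # (2, 50, 4096)
    print(fused.shape, adapted.shape)
```

In this sketch, the prompt-derived gate lets the strength of the low-rank adaptation vary with the instruction, which is the intuition behind making the LoRA adapter "prompt-aware"; the actual WavLLM adapter may condition and combine weights differently.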