WavLLM: Towards Robust and Adaptive Speech Large Language Model
Abstract: Recent advancements in large language models (LLMs) have revolutionized the field of natural language processing, progressively broadening their scope to multimodal perception and generation. However, effectively integrating listening capabilities into LLMs poses significant challenges, particularly with respect to generalizing across varied contexts and executing complex auditory tasks. In this work, we introduce WavLLM, a robust and adaptive speech LLM with dual encoders and a prompt-aware LoRA weight adapter, optimized by a two-stage curriculum learning approach. Leveraging the dual encoders, we decouple different types of speech information, utilizing a Whisper encoder to process the semantic content of speech and a WavLM encoder to capture the unique characteristics of the speaker's identity. Within the curriculum learning framework, WavLLM first builds its foundational capabilities by optimizing on mixed elementary single tasks, followed by advanced multi-task training on more complex tasks, such as combinations of the elementary tasks. To enhance flexibility and adherence to different tasks and instructions, a prompt-aware LoRA weight adapter is introduced in the second, advanced multi-task training stage. We validate the proposed model on universal speech benchmarks, including automatic speech recognition (ASR), speech translation (ST), speaker verification (SV), and emotion recognition (ER), and also apply it to specialized datasets such as the Gaokao English listening comprehension set for spoken question answering (SQA) and a speech Chain-of-Thought (CoT) evaluation set. Experiments demonstrate that the proposed model achieves state-of-the-art performance across a range of speech tasks at the same model size, exhibiting robust generalization in executing complex tasks with a CoT approach. Furthermore, our model successfully completes Gaokao tasks without specialized training. The code, models, audio samples, and the Gaokao evaluation set can be accessed at \url{aka.ms/wavllm}.
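As a minimal, illustrative sketch of the two architectural ideas named in the abstract (dual-encoder feature fusion and a prompt-aware LoRA adapter), the following PyTorch snippet shows one plausible way to combine semantic (Whisper-style) and speaker (WavLM-style) features and to gate a low-rank update with a prompt embedding. The module names, dimensions, and the scalar gating scheme are assumptions for clarity, not the authors' released implementation.

```python
# Hypothetical sketch: dual-encoder fusion + prompt-aware LoRA gating.
# Shapes, module names, and the gating mechanism are illustrative assumptions.
import torch
import torch.nn as nn


class PromptAwareLoRALinear(nn.Module):
    """Frozen base linear layer plus a low-rank update scaled by a prompt-derived gate."""

    def __init__(self, d_in: int, d_out: int, rank: int = 8, d_prompt: int = 256):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)  # base weights stay frozen, LoRA-style
        self.lora_a = nn.Linear(d_in, rank, bias=False)
        self.lora_b = nn.Linear(rank, d_out, bias=False)
        # Small head mapping a prompt embedding to a scalar weight in [0, 1].
        self.gate = nn.Sequential(nn.Linear(d_prompt, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor, prompt_emb: torch.Tensor) -> torch.Tensor:
        scale = self.gate(prompt_emb)  # (batch, 1)
        return self.base(x) + scale.unsqueeze(1) * self.lora_b(self.lora_a(x))


class DualEncoderFusion(nn.Module):
    """Concatenate frame-aligned semantic and speaker features, project to LLM width."""

    def __init__(self, d_semantic: int = 1280, d_speaker: int = 1024, d_llm: int = 4096):
        super().__init__()
        self.proj = nn.Linear(d_semantic + d_speaker, d_llm)

    def forward(self, semantic_feats: torch.Tensor, speaker_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(torch.cat([semantic_feats, speaker_feats], dim=-1))


if __name__ == "__main__":
    batch, frames = 2, 50
    semantic = torch.randn(batch, frames, 1280)  # stand-in for Whisper encoder output
    speaker = torch.randn(batch, frames, 1024)   # stand-in for WavLM encoder output
    prompt = torch.randn(batch, 256)             # stand-in for an encoded task prompt

    fused = DualEncoderFusion()(semantic, speaker)              # (2, 50, 4096)
    adapted = PromptAwareLoRALinear(4096, 4096)(fused, prompt)  # (2, 50, 4096)
    print(fused.shape, adapted.shape)
```

In this sketch, the prompt-derived gate lets the strength of the low-rank adaptation vary with the instruction, which is the intuition behind making the LoRA adapter "prompt-aware"; the actual WavLLM adapter may condition and combine weights differently.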