AudioPaLM: A Large Language Model That Can Speak and Listen
Abstract: We introduce AudioPaLM, an LLM for speech understanding and generation. AudioPaLM fuses text-based and speech-based LLMs, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech, with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits from AudioLM the capability to preserve paralinguistic information such as speaker identity and intonation, and from text-based LLMs such as PaLM-2 the linguistic knowledge present only in text. We demonstrate that initializing AudioPaLM with the weights of a text-only LLM improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with speech tasks. The resulting model significantly outperforms existing systems on speech translation tasks and can perform zero-shot speech-to-text translation for many languages whose input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio LLMs, such as transferring a voice across languages based on a short spoken prompt. We release examples of our method at https://google-research.github.io/seanet/audiopalm/examples
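The fusion the abstract describes — a single decoder whose vocabulary covers both text tokens and discrete audio tokens, initialized from a text-only LLM so the pretrained text rows are preserved — can be sketched as follows. All sizes and names here are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the real vocabulary and model dimensions are not
# stated in the abstract.
TEXT_VOCAB = 32_000   # subword tokens of the pretrained text LLM
AUDIO_VOCAB = 1_024   # discrete audio tokens (e.g. from an audio tokenizer)
D_MODEL = 512

# Stand-in for the pretrained text-LLM token embedding matrix.
text_embeddings = rng.normal(0.0, 0.02, size=(TEXT_VOCAB, D_MODEL))

# Extend the embedding matrix with freshly initialized rows for the audio
# tokens, leaving the text rows untouched so the text knowledge carries over.
audio_embeddings = rng.normal(0.0, 0.02, size=(AUDIO_VOCAB, D_MODEL))
combined_embeddings = np.concatenate([text_embeddings, audio_embeddings], axis=0)

def audio_token_id(i: int) -> int:
    """Map discrete audio token i into the combined vocabulary."""
    return TEXT_VOCAB + i

# The text rows of the combined matrix are exactly the pretrained ones.
assert np.array_equal(combined_embeddings[:TEXT_VOCAB], text_embeddings)
assert combined_embeddings.shape == (TEXT_VOCAB + AUDIO_VOCAB, D_MODEL)
```

A mixed sequence of text and audio ids can then be fed to one autoregressive decoder, which is what lets a single model handle recognition, translation, and speech generation tasks.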
- MusicLM: Generating music from text. arXiv preprint arXiv:2301.11325, 2023.
- Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
- PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
- Common voice: A massively-multilingual speech corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4218–4222, Marseille, France, May 2020. European Language Resources Association. ISBN 979-10-95546-34-4. URL https://aclanthology.org/2020.lrec-1.520.
- wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems, 33:12449–12460, 2020.
- mSLAM: Massively multilingual joint pre-training for speech and text. arXiv preprint arXiv:2202.01374, 2022.
- Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61. Association for Computational Linguistics, 2019. URL https://aclanthology.org/W19-5301.
- Findings of the 2020 conference on machine translation (WMT20). In Proceedings of the Fifth Conference on Machine Translation, pages 1–55. Association for Computational Linguistics, 2020. URL https://aclanthology.org/2020.wmt-1.1.
- Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 1–44. Association for Computational Linguistics, 2013. URL https://aclanthology.org/W13-2201.
- Findings of the 2015 workshop on statistical machine translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 1–46. Association for Computational Linguistics, 2015. URL https://aclanthology.org/W15-3001.
- Findings of the 2017 conference on machine translation (WMT17). In Proceedings of the Second Conference on Machine Translation, pages 169–214. Association for Computational Linguistics, 2017. URL https://aclanthology.org/W17-4717.
- Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 272–303. Association for Computational Linguistics, 2018. URL https://aclanthology.org/W18-6401.
- AudioLM: a language modeling approach to audio generation. arXiv preprint arXiv:2209.03143, 2022.
- SoundStorm: Efficient parallel audio generation. arXiv preprint arXiv:2305.09636, 2023.
- Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html.
- WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE J. Sel. Top. Signal Process., 16(6):1505–1518, 2022a. doi: 10.1109/JSTSP.2022.3188113. URL https://doi.org/10.1109/JSTSP.2022.3188113.
- PaLI: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022b.
- UNITER: Universal image-text representation learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX, pages 104–120. Springer, 2020.
- Maestro: Matched speech text representations through modality matching. arXiv preprint arXiv:2204.03409, 2022c.
- Self-supervised learning with random-projection quantizer for speech recognition. In International Conference on Machine Learning, pages 3915–3924. PMLR, 2022.
- PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
- W2V-Bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. In ASRU, 2021.
- FLEURS: Few-shot learning evaluation of universal representations of speech. In 2022 IEEE Spoken Language Technology Workshop (SLT), pages 798–805. IEEE, 2023.
- High fidelity neural audio compression. CoRR, abs/2210.13438, 2022. doi: 10.48550/arXiv.2210.13438. URL https://doi.org/10.48550/arXiv.2210.13438.
- BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- SingSong: Generating musical accompaniments from singing. CoRR, abs/2301.12662, 2023. doi: 10.48550/arXiv.2301.12662. URL https://doi.org/10.48550/arXiv.2301.12662.
- VIOLET: End-to-end video-language transformers with masked visual-token modeling. arXiv preprint arXiv:2111.12681, 2021.
- Low-resource speech recognition and keyword-spotting. In Speech and Computer: 19th International Conference, SPECOM 2017, Hatfield, UK, September 12-16, 2017, Proceedings 19, pages 3–19. Springer, 2017.
- Large-scale adversarial training for vision-and-language representation learning. Advances in Neural Information Processing Systems, 33:6616–6628, 2020.
- Textually pretrained speech language models. arXiv preprint arXiv:2305.13009, 2023.
- HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460, 2021.
- Leveraging weakly supervised data to improve end-to-end speech-to-text translation. In Proc. ICASSP, pages 7180–7184, 2019a.
- Direct speech-to-speech translation with a sequence-to-sequence model. In INTERSPEECH, 2019b.
- PnG BERT: Augmented BERT on phonemes and graphemes for neural TTS. Proc. Interspeech 2021, pages 151–155, 2021.
- Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation. arXiv preprint arXiv:2203.13339, 2022a.
- Translatotron 2: High-quality direct speech-to-speech translation with voice preservation. In International Conference on Machine Learning, pages 10120–10134. PMLR, 2022b.
- CVSS corpus and massively multilingual speech-to-speech translation. arXiv preprint arXiv:2201.03713, 2022c.
- Transformer-based direct speech-to-speech translation with transcoder. In Proc. IEEE SLT, pages 958–965, 2021.
- Speak, read and prompt: High-fidelity text-to-speech with minimal supervision. arXiv preprint arXiv:2302.03540, 2023.
- AudioGen: Textually guided audio generation. CoRR, abs/2209.15352, 2022. doi: 10.48550/arXiv.2209.15352. URL https://doi.org/10.48550/arXiv.2209.15352.
- T. Kudo and J. Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In E. Blanco and W. Lu, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018: System Demonstrations, Brussels, Belgium, October 31 - November 4, 2018, pages 66–71. Association for Computational Linguistics, 2018. doi: 10.18653/v1/d18-2012. URL https://doi.org/10.18653/v1/d18-2012.
- On generative spoken language modeling from raw audio. Transactions of the Association for Computational Linguistics, 9:1336–1354, 2021.
- JANUS-III: Speech-to-speech translation in multiple languages. In ICASSP, 1997.
- Textless speech-to-speech translation on real data. arXiv preprint arXiv:2112.08352, 2021.
- Direct speech-to-speech translation with discrete units. In ACL, 2022.
- Direct simultaneous speech to speech translation. arXiv preprint arXiv:2110.08250, 2021.
- The ATR multilingual speech-to-speech translation system. IEEE Transactions on Audio, Speech, and Language Processing, 2006.
- Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
- BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.
- M. Post. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Belgium, Brussels, Oct. 2018. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/W18-6319.
- MLS: A large-scale multilingual dataset for speech research. arXiv preprint arXiv:2012.03411, 2020.
- When and why are pre-trained word embeddings useful for neural machine translation? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 529–535. Association for Computational Linguistics, 2018. URL https://aclanthology.org/N18-2084.
- Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356, 2022.
- Scaling language models: Methods, analysis & insights from training gopher. CoRR, abs/2112.11446, 2021. URL https://arxiv.org/abs/2112.11446.
- Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
- DNSMOS: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.
- Scaling up models and data with t5x and seqio, 2022.
- Improving neural machine translation models with monolingual data. In ACL, 2016.
- Learning audio-visual speech representation by masked multimodal cluster prediction, 2022.
- Speech-to-speech translation between untranscribed unknown languages. In Proc. IEEE ASRU, pages 593–600, 2019.
- Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), pages 5998–6008, 2017.
- W. Wahlster. Verbmobil: Foundations of speech-to-speech translation. Springer, 2000.
- CoVoST 2: A massively multilingual speech-to-text translation corpus. CoRR, abs/2007.10310, 2020. URL https://arxiv.org/abs/2007.10310.
- VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation, 2021.
- Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111, 2023.
- OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning, pages 23318–23340. PMLR, 2022.
- Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022a.
- Joint pre-training with speech and bilingual text for direct speech to speech translation. arXiv:2210.17027, 2022b.
- CoCa: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022a.
- Scaling autoregressive models for content-rich text-to-image generation. arXiv:2206.10789, 2022b. doi: 10.48550/arXiv.2206.10789.
- Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.
- SoundStream: An end-to-end neural audio codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:495–507, 2021.
- UWSpeech: Speech to speech translation for unwritten languages. In AAAI, 2021.
- Google USM: Scaling automatic speech recognition beyond 100 languages. arXiv preprint arXiv:2303.01037, 2023a.
- Speak foreign languages with your own voice: Cross-lingual neural codec language modeling. arXiv preprint arXiv:2303.03926, 2023b.