Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis

Published 26 Nov 2024 in cs.MM, cs.CV, cs.SD, and eess.AS | arXiv:2411.17690v2

Abstract: The rapid progress of foundation models and LLMs has fueled significant improvement in the capabilities of machine learning systems that benefit from multimodal input data. However, existing multimodal models are predominantly built on top of pre-trained LLMs, which can limit accurate modeling of temporal dependencies across other modalities and thus limit the model's ability to jointly process and leverage multimodal inputs. To specifically investigate the alignment of text, video, and speech modalities in LLM-style (decoder-only) models, we consider a simplified multimodal generation task, Video-Text to Speech (VTTS): speech generation conditioned on both its corresponding text and video of talking people. The ultimate goal is to generate speech that not only follows the text but also aligns temporally with the video and is consistent with the facial expressions. In this paper, we first introduce Visatronic, a unified multimodal decoder-only transformer model that adopts an LLM-style architecture to embed visual, textual, and speech inputs into a shared subspace, treating all modalities as temporally aligned token streams. Next, we carefully explore different token mixing strategies to understand the best way to propagate information from the steps where video and text conditioning is input to the steps where the audio is generated. We extensively evaluate Visatronic on the challenging VoxCeleb2 dataset and demonstrate zero-shot generalization to LRS3, where Visatronic, trained on VoxCeleb2, achieves a 4.5% WER, outperforming prior SOTA methods trained only on LRS3, which report a 21.4% WER. Additionally, we propose a new objective metric, TimeSync, specifically designed to measure phoneme-level temporal alignment between generated and reference speech, further ensuring synchronization quality. Demo: https://apple.github.io/visatronic-demo/

Summary

  • The paper introduces a novel VTTS task that integrates video, text, and speech into a unified framework using a decoder-only transformer architecture.
  • The paper employs unified multimodal embeddings and autoregressive learning to generate discrete mel-spectrograms from video and textual inputs.
  • The paper demonstrates superior performance on LRS3 and VoxCeleb2, achieving lower word error rates and enhanced temporal alignment compared to traditional TTS models.

A Technical Overview of "Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis"

The paper, "Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis," proposes a method for synthesizing speech from visual inputs and transcripts of spoken language. This new task, termed Video-Text to Speech (VTTS), requires fusing multimodal data to produce human-like speech outputs. The proposed framework, Visatronic, embeds video, text, and speech inputs into a shared representation space within a decoder-only transformer architecture. This work advances the understanding of multimodal interactions and points to promising directions for future research.

Task Definition and Motivation

The primary task addressed by the authors is generating speech from video inputs and corresponding textual transcripts (VTTS). This complex task extends beyond traditional text-to-speech (TTS) models by integrating lip-reading and other visual cues directly into the speech generation process, eliminating the requirement for separate lip-tracking models. The proposed model is generalizable and can potentially apply to multilingual and cross-lingual tasks, such as dubbing videos in different languages, thus enhancing the toolset for automatic speech processing.

Model Architecture

Visatronic employs a decoder-only multimodal transformer model, which integrates various modalities—namely video, text, and speech—into a single, comprehensive learning framework. This approach leverages the synchronous nature of these inputs to produce discrete mel-spectrograms, which serve as the intermediate representation for synthesizing speech.
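
The paper does not include reference code, but the idea of a discrete mel-spectrogram target (dMel-style) can be illustrated with a minimal sketch: each log-mel channel value is mapped to one of a small number of uniform bins over a fixed range, yielding integer codes the transformer can predict with a cross-entropy loss. The bin count and value range below are illustrative assumptions, not the paper's actual hyperparameters.

```python
import numpy as np

def discretize_mel(log_mel, num_bins=16, min_val=-7.0, max_val=2.0):
    """Map each log-mel value to one of `num_bins` discrete codes by
    uniform binning over a fixed range (dMel-style discretization)."""
    clipped = np.clip(log_mel, min_val, max_val)
    scaled = (clipped - min_val) / (max_val - min_val)   # -> [0, 1]
    return np.minimum((scaled * num_bins).astype(np.int64), num_bins - 1)

def dediscretize_mel(codes, num_bins=16, min_val=-7.0, max_val=2.0):
    """Invert the binning by mapping each code back to its bin centre."""
    centres = (codes.astype(np.float64) + 0.5) / num_bins
    return centres * (max_val - min_val) + min_val
```

Because the binning is uniform, the round-trip reconstruction error is bounded by half a bin width, which is the trade-off such a discretization makes between vocabulary size and spectrogram fidelity.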

Key methodological elements include:

  • Unified Multimodal Embedding: Each input modality (video, text, and speech) is mapped into a shared embedding space. Video frames are processed via a VQ-VAE encoder, text is tokenized at the character level, and speech is quantized into mel-frequency values using dMel.
  • Autoregressive Learning: The model learns the distribution of mel-spectrograms conditioned on video and text through an autoregressive process, optimizing cross-entropy loss on the discretized speech representations.
  • Dynamic Input Strategies: The authors experiment with various strategies for temporal alignment of inputs, either preserving natural order or optimizing cross-modal interactions through attention-based fusion techniques.
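
One way to picture the unified token stream is the sketch below: each modality's discrete codes are offset into disjoint ranges of a shared vocabulary, concatenated into a single sequence, and a mask marks the speech positions where the autoregressive cross-entropy loss applies. The vocabulary sizes and the simple concatenation order are illustrative assumptions; the paper explores several token mixing strategies beyond this.

```python
import numpy as np

# Hypothetical per-modality vocabulary sizes (illustrative only).
VIDEO_VOCAB, TEXT_VOCAB, SPEECH_VOCAB = 1024, 64, 160

def build_stream(video_codes, text_codes, speech_codes):
    """Merge modality token streams into one sequence over a shared
    vocabulary by offsetting each modality's ids, and return a mask
    that is True only at speech positions (where the loss is applied)."""
    v = np.asarray(video_codes)                  # ids in [0, VIDEO_VOCAB)
    t = np.asarray(text_codes) + VIDEO_VOCAB     # shift past video ids
    s = np.asarray(speech_codes) + VIDEO_VOCAB + TEXT_VOCAB
    stream = np.concatenate([v, t, s])
    loss_mask = np.zeros(stream.shape, dtype=bool)
    loss_mask[len(v) + len(t):] = True           # supervise speech only
    return stream, loss_mask
```

With this layout, a standard decoder-only transformer can be trained on the whole stream while the conditioning (video and text) positions are excluded from the loss.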

Experimental Results

The Visatronic model was benchmarked on LRS3 and VoxCeleb2 datasets using both subjective and objective evaluation metrics:

  • Word Error Rate (WER): Visatronic, trained on VoxCeleb2, achieves a 4.5% WER in zero-shot evaluation on LRS3, outperforming prior state-of-the-art methods trained only on LRS3 (21.4% WER) as well as traditional TTS models and video-only lip-reading systems.
  • TimeSync: A novel metric used to measure temporal alignment between generated and reference speech, demonstrating that the inclusion of video inputs leads to improved synchrony.
  • Human Evaluation: Subjective tests on intelligibility, naturalness, and synchronization confirm the efficacy of Visatronic in producing fluid, coherent speech outputs, surpassing a baseline TTS model both perceptually and functionally.
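
TimeSync's exact formulation is defined in the paper; as a toy illustration of a phoneme-level alignment error, one could compare corresponding phoneme boundary times (e.g., obtained from a forced aligner) between reference and generated speech. The function below is a hypothetical simplification, not the paper's metric.

```python
def timesync_like(ref_bounds, gen_bounds):
    """Mean absolute difference (in seconds) between corresponding
    phoneme boundary times of reference and generated speech.
    Assumes both lists cover the same phoneme sequence, e.g. as
    produced by running a forced aligner on each waveform."""
    if len(ref_bounds) != len(gen_bounds):
        raise ValueError("boundary lists must align one-to-one")
    return sum(abs(r - g) for r, g in zip(ref_bounds, gen_bounds)) / len(ref_bounds)
```

A lower score indicates that phoneme onsets in the generated speech track the reference timing more closely, which is the property the video conditioning is meant to improve.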

Implications and Future Directions

Visatronic's use of a unified multimodal space for speech synthesis provides significant insight into how different sensory modalities can be integrated to improve generative models. The model's flexible architecture and ability to handle multiple input types pave the way for enhancements in application areas such as automated dubbing and assistive technologies for the hearing impaired.

Furthermore, the authors highlight potential for extending this research into multilingual domains, which could lead to breakthroughs in real-time cross-lingual communication tools. The release of clean transcriptions for VoxCeleb2 and the standardized VTTS evaluation protocol are expected to drive further advancements and refinements in this burgeoning field.

By merging audiovisual and textual information into a single generative framework, this paper makes a significant contribution to multimodal machine learning and sets a useful precedent for future research.
