
Towards Expressive Video Dubbing with Multiscale Multimodal Context Interaction

Published 25 Dec 2024 in cs.MM, cs.CL, cs.SD, and eess.AS | arXiv:2412.18748v2

Abstract: Automatic Video Dubbing (AVD) generates speech aligned with lip motion and facial emotion from scripts. Recent research focuses on modeling multimodal context to enhance prosody expressiveness but overlooks two key issues: 1) Multiscale prosody expression attributes in the context influence the current sentence's prosody. 2) Prosody cues in context interact with the current sentence, impacting the final prosody expressiveness. To tackle these challenges, we propose M2CI-Dubber, a Multiscale Multimodal Context Interaction scheme for AVD. This scheme includes two shared M2CI encoders to model the multiscale multimodal context and facilitate its deep interaction with the current sentence. By extracting global and local features for each modality in the context, utilizing attention-based mechanisms for aggregation and interaction, and employing an interaction-based graph attention network for fusion, the proposed approach enhances the prosody expressiveness of synthesized speech for the current sentence. Experiments on the Chem dataset show our model outperforms baselines in dubbing expressiveness. The code and demos are available at https://github.com/AI-S2-Lab/M2CI-Dubber.

Summary

  • The paper introduces M2CI-Dubber, which leverages multiscale feature extraction and multimodal fusion to enhance video dubbing expressiveness.
  • It employs self-attention, cross-attention, and graph attention networks to integrate global and local prosodic features with text signals.
  • Experimental results on the Chem dataset show statistically significant improvements over baselines, promising cost-effective, high-quality dubbing.

Overview of M2CI-Dubber for Expressive Video Dubbing

The paper "Towards Expressive Video Dubbing with Multiscale Multimodal Context Interaction" introduces M2CI-Dubber, a novel system for Automatic Video Dubbing (AVD). The work advances AVD methodology by enhancing the prosody expressiveness of synthesized speech through multiscale, multimodal context interaction.

Key Contributions and Methodology

M2CI-Dubber is designed to address two specific challenges in AVD. First, it accounts for multiscale prosody expression attributes in the contextual information that affect the prosody of the current sentence. Second, it models the interaction between prosody cues in the context and the current sentence, which shapes the final speech output. The paper proposes a Multiscale Multimodal Context Interaction (M2CI) scheme as a solution, which encompasses the following innovations:

  1. Multiscale Feature Extraction: This process uses dedicated encoders for each modality—video, text, and audio—to generate both global sentence-level and local phoneme-level features. These features are necessary for capturing the comprehensive prosody expression present in different contextual scales.
  2. Interaction-based Multiscale Aggregation (IMA): Within IMA, multiscale aggregators use self-attention and cross-attention mechanisms to let the global and local context features interact with the current sentence's text features, integrating the aggregated prosodic information into a unified representation.
  3. Interaction-based Multimodal Fusion (IMF): The paper employs a graph attention network with intra-modal and inter-modal edges to facilitate a deep, multimodal fusion of audio, video, and text features with the current text. This ensures enriched feature interaction and enhanced expressive dubbing.
  4. Video Dubbing Synthesizer: Utilizing HPMDubbing as a backbone, this synthesizer combines various multimodal and multiscale features to produce synthesized speech that closely mimics the natural prosodic variations present in reference audio.
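The attention-based aggregation described in the IMA step can be illustrated with a minimal sketch. Below is a toy single-head scaled dot-product cross-attention in NumPy, where the current sentence's text features (queries) attend over a context modality's concatenated global and local features (keys/values). The shapes, dimensions, and function names here are illustrative assumptions and do not reflect the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, keys, values):
    """Scaled dot-product attention: each query row attends over keys/values."""
    d = query.shape[-1]
    scores = query @ keys.T / np.sqrt(d)   # (T_query, T_context)
    weights = softmax(scores, axis=-1)     # rows sum to 1
    return weights @ values                # (T_query, d)

rng = np.random.default_rng(0)
d = 8
# Current sentence: phoneme-level text features (queries).
cur_text = rng.standard_normal((5, d))
# One context modality (e.g. audio): a global sentence-level vector
# plus local phoneme-level vectors, concatenated as keys/values.
ctx_global = rng.standard_normal((1, d))
ctx_local = rng.standard_normal((12, d))
ctx = np.concatenate([ctx_global, ctx_local], axis=0)

aggregated = cross_attention(cur_text, ctx, ctx)
print(aggregated.shape)  # (5, 8): one context-aware vector per query phoneme
```

The same pattern, applied per modality and combined with self-attention, yields the aggregated prosodic features that the fusion stage then consumes.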

Experimental Evaluation

The evaluation was conducted on the Chem dataset, highlighting the system's superiority in generating prosodically expressive speech compared to existing methods. The study reports lower Gross Pitch Error (GPE) and F0 Frame Error (FFE), along with higher Mean Opinion Scores (MOS) for context and similarity. The proposed M2CI-Dubber demonstrated statistically significant improvements (p < 0.001) over various baselines, including FastSpeech2, DSU-AVO, HPMDubbing, and MCDubber. This suggests that the paper's approach to multiscale and multimodal context modeling effectively enhances dubbing expressiveness.
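The pitch-based objective metrics can be made concrete. The sketch below implements GPE and FFE as they are commonly defined in the speech-synthesis literature: frame-level F0 arrays with 0 marking unvoiced frames and a 20% deviation threshold for a "gross" error. The exact pitch extractor and frame alignment used in the paper may differ.

```python
import numpy as np

def gpe(f0_ref, f0_syn, tol=0.2):
    """Gross Pitch Error: fraction of frames voiced in both signals
    whose pitch deviates from the reference by more than `tol` (20%)."""
    both_voiced = (f0_ref > 0) & (f0_syn > 0)
    if both_voiced.sum() == 0:
        return 0.0
    ref, syn = f0_ref[both_voiced], f0_syn[both_voiced]
    return float(np.mean(np.abs(syn - ref) > tol * ref))

def ffe(f0_ref, f0_syn, tol=0.2):
    """F0 Frame Error: fraction of all frames with either a voicing
    decision error or a gross pitch error."""
    voicing_err = (f0_ref > 0) != (f0_syn > 0)
    both_voiced = (f0_ref > 0) & (f0_syn > 0)
    gross = np.zeros_like(voicing_err)
    gross[both_voiced] = (
        np.abs(f0_syn[both_voiced] - f0_ref[both_voiced])
        > tol * f0_ref[both_voiced]
    )
    return float(np.mean(voicing_err | gross))

# Toy F0 contours in Hz (0 = unvoiced frame).
ref = np.array([0.0, 110.0, 112.0, 115.0, 0.0, 220.0])
syn = np.array([0.0, 111.0, 150.0, 115.0, 100.0, 0.0])
print(gpe(ref, syn))  # 1 gross error among 3 jointly voiced frames
print(ffe(ref, syn))  # (1 gross + 2 voicing errors) over 6 frames = 0.5
```

Lower values are better for both metrics, which is why reductions in GPE and FFE indicate more faithful prosody reproduction.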

Implications and Future Directions

The presented research signifies an advancement in integrating multiscale and multimodal data for prosody modeling in AVD. From a practical standpoint, M2CI-Dubber offers potential cost savings by improving the automation and quality of speech dubbing without the need for professional voice actors. Theoretically, it underscores the importance of deep contextual modeling and interaction, providing a template for future exploration in expressive audio synthesis.

Future work may explore the integration of emotion modeling within AVD systems, leveraging the insights provided by this study on multiscale multimodal interactions. Additionally, the expansion of the model to accommodate varying speaker characteristics and dialects could further enhance its adaptability and robustness in different dubbing scenarios.

In conclusion, this study lays a foundational framework for enriching expressive attributes in synthesized speech through advanced context interaction mechanisms, offering significant contributions to the field of Automatic Video Dubbing.


Authors (3)
