- The paper introduces a novel two-stage pipeline that converts speech to 3D skeletal movements and synthesizes expressive full-body videos.
- It leverages RNNs for dynamic 3D pose generation and conditional GANs with part-specific attention to achieve high visual fidelity.
- The study demonstrates superior audio-visual synchronization and expressiveness, validated by comprehensive comparative evaluations.
Analyzing "Speech2Video Synthesis with 3D Skeleton Regularization and Expressive Body Poses"
The paper "Speech2Video Synthesis with 3D Skeleton Regularization and Expressive Body Poses" presents a method to generate photo-realistic videos driven by speech inputs. The new approach effectively addresses inherent challenges in previous methods, which typically focused on mouth or facial motion, by generating full-body expressive videos with synchronized audio-visual coherence.
Approach and Methodology
The methodology comprises a two-stage pipeline built on recurrent neural networks (RNNs) and generative adversarial networks (GANs). First, an RNN converts the audio input into 3D skeleton movements, capturing the structure and dynamics of human gestures. Central to this stage are a 3D skeleton model, which regularizes the predicted poses, and a dictionary of personal gestures, which supplies the variability and speaker-specific character of human expression. Together these keep the generated gestures and body movements semantically tied to the speech, going beyond deterministic audio-to-pose mappings that tend to average out individual expressiveness.
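To make the first stage concrete, here is a minimal PyTorch sketch of the audio-to-pose idea. It is an illustration under stated assumptions, not the authors' implementation: the class name `AudioToPoseRNN`, the feature and joint dimensions, and the bone-length penalty are hypothetical stand-ins for the paper's 3D-skeleton regularization.

```python
import torch
import torch.nn as nn

class AudioToPoseRNN(nn.Module):
    """Map a sequence of audio features to per-frame 3D joint positions."""

    def __init__(self, audio_dim=80, hidden_dim=256, num_joints=54):
        super().__init__()
        self.num_joints = num_joints
        self.rnn = nn.LSTM(audio_dim, hidden_dim, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_joints * 3)

    def forward(self, audio_feats):              # (B, T, audio_dim)
        h, _ = self.rnn(audio_feats)             # (B, T, hidden_dim)
        poses = self.head(h)                     # (B, T, num_joints * 3)
        B, T, _ = audio_feats.shape
        return poses.view(B, T, self.num_joints, 3)

def bone_length_loss(poses, parents, rest_lengths):
    """One simple 3D-skeleton constraint: penalize predicted bone lengths
    that drift from the reference skeleton's rest lengths.

    poses:        (B, T, J, 3) predicted joint positions
    parents:      (J,) LongTensor, index of each joint's parent
    rest_lengths: (J - 1,) bone lengths of the reference skeleton
    """
    child = poses[:, :, 1:, :]                   # every non-root joint
    parent = poses[:, :, parents[1:], :]         # its parent joint
    lengths = (child - parent).norm(dim=-1)      # (B, T, J - 1)
    return ((lengths - rest_lengths) ** 2).mean()
```

In a full pipeline, a penalty like this would be added to the pose reconstruction loss, while key poses drawn from the personal gesture dictionary would be blended into the predicted sequence as a separate step.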
In the second stage, a conditional GAN synthesizes video frames from the generated 3D skeletons, with an emphasis on photo-realism and detail preservation. To counter blurring and distortion, part-specific attention mechanisms concentrate the GAN's discriminators on the detailed rendering of crucial regions such as the hands and face. This targeted supervision is what makes high-resolution, visually faithful output possible.
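The part-specific attention can be pictured as dedicating discriminator capacity to cropped hand and face regions, with crop centers taken from the skeleton's projected 2D keypoints. The sketch below illustrates that idea; `crop_part`, `PartDiscriminator`, and the loss arrangement are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def crop_part(frames, centers, size=64):
    """Crop a square patch around a part center (e.g., a hand or the face).

    frames:  (B, C, H, W) images
    centers: (B, 2) pixel coordinates (x, y) of the part
    """
    B, C, H, W = frames.shape
    half = size // 2
    patches = []
    for b in range(B):
        x, y = centers[b].long().tolist()
        x = max(half, min(W - half, x))          # keep the crop in-frame
        y = max(half, min(H - half, y))
        patches.append(frames[b:b + 1, :, y - half:y + half, x - half:x + half])
    return torch.cat(patches, dim=0)             # (B, C, size, size)

class PartDiscriminator(nn.Module):
    """A small PatchGAN-style discriminator dedicated to one body part."""

    def __init__(self, in_ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 4, padding=1),     # per-patch real/fake scores
        )

    def forward(self, x):
        return self.net(x)

def part_gan_losses(disc, real_frames, fake_frames, centers):
    """Adversarial losses computed only on the part's crop, so discriminator
    capacity is spent on fine detail rather than the whole frame."""
    real_crop = crop_part(real_frames, centers)
    fake_crop = crop_part(fake_frames, centers)
    real_score = disc(real_crop)
    fake_score_d = disc(fake_crop.detach())      # detach fakes for the D step
    d_loss = (F.binary_cross_entropy_with_logits(real_score, torch.ones_like(real_score))
              + F.binary_cross_entropy_with_logits(fake_score_d, torch.zeros_like(fake_score_d)))
    fake_score_g = disc(fake_crop)               # no detach for the G step
    g_loss = F.binary_cross_entropy_with_logits(fake_score_g, torch.ones_like(fake_score_g))
    return d_loss, g_loss
```

In practice one such discriminator per part (left hand, right hand, face) would be trained alongside a full-frame discriminator, so global coherence and local detail are supervised separately.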
Strong Results and Evaluation
The authors validated their approach on a dataset of video recordings of individuals reading texts from varying domains. In comparative evaluations against state-of-the-art methods, the technique achieved superior results, particularly in a user study assessing perceptual quality, expressiveness, and correlation with the audio input. The proposed method captured nuanced body language synchronized with speech, yielding richer and more lifelike animations.
Implications and Future Directions
The significance of this work lies in its contribution to synthesizing believable human avatars from speech, with implications for fields ranging from entertainment to virtual communication. By enhancing expressiveness through a pose dictionary and 3D skeletal constraints, the approach overcomes key limitations of earlier gesture modeling and improves both contextual and individual authenticity.
Looking ahead, this research opens pathways for refining low-cost, high-fidelity avatar generation systems. Future work might involve scaling the datasets to encompass a wider diversity of gestures and further optimizing part-specific attention mechanisms. Additionally, extending the framework to capture and synthesize diverse emotional cues could further improve the realism and applicability of synthesized videos.
Overall, this paper offers a thorough and nuanced exploration of speech-driven video synthesis, leveraging advanced neural architectures for improved realism and expressive motion capture. Its key contributions, the integration of 3D skeletal constraints and personalized gesture dictionaries, significantly enhance the expressiveness and accuracy of synthesized human videos.