- The paper introduces a hierarchical cascade GAN that converts audio cues into facial landmarks before generating refined video frames.
- It employs a dynamic pixel-wise loss with an attention mechanism to minimize artifacts and maintain temporal coherence.
- Comparative evaluations on GRID, LRW, and VoxCeleb show significant improvements in PSNR, SSIM, and audiovisual synchronization.
Hierarchical Cross-Modal Talking Face Generation with Dynamic Pixel-Wise Loss
To generate realistic talking face videos guided by audio cues, the paper introduces a novel cascade GAN framework. The work targets a core problem in audiovisual synthesis: producing face videos that remain realistic and well synchronized across diverse speakers and conditions. Its hierarchical structure addresses key challenges such as sensitivity to noise and the wide variability of facial dynamics.
Cascade GAN Structure
The central innovation of the paper lies in its hierarchical, two-stage approach that distinctly handles the translation of audio to video. Initially, the model converts audio signals into high-level facial landmarks rather than attempting a direct audio-to-frame mapping. This intermediate step mitigates the issue of irrelevant audiovisual correlations that can distort the video output. Subsequently, these landmarks serve as the basis for generating the final video frames, which enhances temporal continuity and synchronization with the audio input.
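The two-stage cascade can be illustrated with a minimal NumPy sketch. All names, shapes, and the linear/rasterization stand-ins below are illustrative assumptions, not the paper's actual networks; the point is only the data flow: audio features become landmarks, and landmarks plus an identity image become a frame.

```python
import numpy as np

# Illustrative dimensions (assumed, not from the paper)
N_LANDMARKS = 68          # standard facial landmark count
AUDIO_DIM = 28            # e.g. one MFCC feature window
FRAME_SIZE = 128          # output frame resolution

rng = np.random.default_rng(0)
W_audio = rng.standard_normal((AUDIO_DIM, N_LANDMARKS * 2)) * 0.01

def audio_to_landmarks(mfcc_window):
    """Stage 1: map an audio feature window to 2-D facial landmarks.
    A single linear map stands in for the audio-to-landmark network."""
    flat = mfcc_window @ W_audio
    return flat.reshape(N_LANDMARKS, 2)

def landmarks_to_frame(landmarks, identity_frame):
    """Stage 2: produce a frame conditioned on landmarks and an identity
    image. Rasterizing the landmarks onto a copy of the identity frame
    stands in for the frame-generation network."""
    frame = identity_frame.copy()
    # map normalized landmark coords in [-1, 1] to pixel indices
    px = np.clip(((landmarks + 1) / 2 * (FRAME_SIZE - 1)).astype(int),
                 0, FRAME_SIZE - 1)
    frame[px[:, 1], px[:, 0]] = 1.0
    return frame

# One step of the cascade: audio window -> landmarks -> frame
identity = np.zeros((FRAME_SIZE, FRAME_SIZE))
lm = audio_to_landmarks(rng.standard_normal(AUDIO_DIM))
frame = landmarks_to_frame(np.tanh(lm), identity)
```

The intermediate landmark representation is what decouples the two stages: stage 1 only has to explain speech-correlated geometry, while stage 2 only has to render appearance.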
Attention-Based Loss and Regression Discriminator
The paper presents a novel dynamically adjustable pixel-wise loss function coupled with an attention mechanism that concentrates on audiovisual-correlated regions, suppressing artifacts such as pixel jittering; this matters because viewers are highly sensitive to visual inconsistencies in video. The generator is built on a convolutional RNN, which maintains temporal coherence across frames. Complementing it, a regression-based discriminator evaluates realism at both the frame level and the sequence level: rather than merely classifying samples as real or fake, it regresses the facial landmarks, steering the generator toward accurate video sequences with improved audiovisual synchronization.
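The attention-weighted pixel loss can be sketched as follows. This is a simplified stand-in for the paper's loss, assuming only that an attention map in [0, 1] re-weights a pixel-wise L1 error so that audio-correlated regions (e.g. the mouth) dominate; the toy arrays and weighting scheme are illustrative.

```python
import numpy as np

def attention_pixel_loss(generated, target, attention):
    """Pixel-wise L1 error re-weighted by an attention map.
    High-attention pixels (speech-correlated regions) contribute more;
    static background pixels contribute less."""
    weights = attention / (attention.sum() + 1e-8)   # normalize to sum to 1
    return float(np.sum(weights * np.abs(generated - target)))

# Toy example: the only error is in the "mouth" row of a 4x4 frame
gen = np.zeros((4, 4)); tgt = np.zeros((4, 4))
gen[3, 1:3] = 0.5                                    # mismatch near the mouth
attn_mouth = np.zeros((4, 4)); attn_mouth[3, :] = 1.0  # attend to mouth row
attn_flat = np.ones((4, 4))                          # uniform attention

loss_focused = attention_pixel_loss(gen, tgt, attn_mouth)
loss_uniform = attention_pixel_loss(gen, tgt, attn_flat)
```

With focused attention the same mouth error incurs a larger penalty than under uniform weighting, which is the mechanism that discourages jitter in the regions viewers watch most closely.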
Numerical Results and Comparative Evaluations
Extensive experiments on the GRID, LRW, and VoxCeleb datasets demonstrate that the proposed model surpasses current state-of-the-art methods both qualitatively and quantitatively. The model achieves significant improvements in the image-quality metrics PSNR and SSIM, while a lower Landmarks Distance (LMD) indicates improved audiovisual synchronization. These results underscore the robustness of the hierarchical approach across diverse face types and noisy conditions, including dynamic head movements and varied facial characteristics.
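Two of the reported metrics are straightforward to compute; a minimal sketch follows. The function names and toy inputs are assumptions for illustration, and the full LMD metric would additionally average over all frames of a video.

```python
import numpy as np

def psnr(ref, gen, max_val=1.0):
    """Peak signal-to-noise ratio in dB; higher means frames are closer."""
    mse = np.mean((ref - gen) ** 2)
    return float(10 * np.log10(max_val ** 2 / mse))

def landmarks_distance(ref_lm, gen_lm):
    """LMD: mean Euclidean distance between corresponding landmark points;
    lower values indicate tighter lip synchronization."""
    return float(np.mean(np.linalg.norm(ref_lm - gen_lm, axis=-1)))

# Toy check: a uniform 0.1 pixel error gives PSNR = 20 dB, and 68
# landmarks each offset by (3, 4) give LMD = 5.
ref = np.zeros((8, 8))
quality = psnr(ref, ref + 0.1)
sync = landmarks_distance(np.zeros((68, 2)), np.tile([3.0, 4.0], (68, 1)))
```

PSNR and SSIM measure per-frame image fidelity, whereas LMD directly probes whether the generated mouth geometry tracks the audio, which is why the paper reports both kinds of metric.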
Implications and Future Directions
This work has practical implications for entertainment, virtual reality, and assistive technologies. In lip-reading aids for the hearing-impaired, for instance, accurately generated and synchronized facial motion has clear potential utility. Theoretically, the hierarchical cascade mechanism could stimulate further exploration of other cross-modal synthesis problems.
Future research could explore integrating more nuanced expressions and head dynamics beyond lip movements, potentially incorporating emotional context into the generated videos. This may involve developing more sophisticated representations of input modalities or enhancing the intermediate feature transformations to capture a broader range of human facial expressions. What is clear is that the framework established by this research provides a solid foundation for advancing the capabilities and applications of talking face generation technologies.