
PIRenderer: Controllable Portrait Image Generation via Semantic Neural Rendering

Published 17 Sep 2021 in cs.CV and cs.AI | (2109.08379v1)

Abstract: Generating portrait images by controlling the motions of existing faces is an important task of great consequence to social media industries. For easy use and intuitive control, semantically meaningful and fully disentangled parameters should be used as modifications. However, many existing techniques do not provide such fine-grained controls or use indirect editing methods i.e. mimic motions of other individuals. In this paper, a Portrait Image Neural Renderer (PIRenderer) is proposed to control the face motions with the parameters of three-dimensional morphable face models (3DMMs). The proposed model can generate photo-realistic portrait images with accurate movements according to intuitive modifications. Experiments on both direct and indirect editing tasks demonstrate the superiority of this model. Meanwhile, we further extend this model to tackle the audio-driven facial reenactment task by extracting sequential motions from audio inputs. We show that our model can generate coherent videos with convincing movements from only a single reference image and a driving audio stream. Our source code is available at https://github.com/RenYurui/PIRender.

Citations (192)

Summary

  • The paper introduces PIRenderer, a novel approach that uses 3DMM parameters for intuitive control over portrait image generation.
  • It employs a mapping network, warping network, and editing network to achieve photorealistic modifications while preserving key facial attributes.
  • Quantitative analyses using FID, AED, APD, and LPIPS demonstrate superior performance and accurate cross-identity motion imitation compared to state-of-the-art methods.

Controllable Portrait Image Generation via Semantic Neural Rendering

This essay provides an overview and analysis of the paper titled "PIRenderer: Controllable Portrait Image Generation via Semantic Neural Rendering." The authors introduce PIRenderer, a model designed to generate and control portrait images using semantically meaningful parameters extracted from three-dimensional morphable face models (3DMMs). This approach marks a shift from traditional methods that often rely on indirect editing techniques or subject-specific motion descriptors.

Technical Approach

PIRenderer leverages the parameters of 3DMMs, allowing intuitive control over facial expressions and movements. The model architecture consists of three main components:

  1. Mapping Network: This network transforms target motion descriptors into latent vectors that serve as the control signal for subsequent networks.
  2. Warping Network: Using the latent vectors, this network estimates deformations between the source and target images, providing a coarse rendering by warping the source image.
  3. Editing Network: This network refines the warped image to produce the final high-quality portrait image, ensuring realistic expressions and poses.

This structured approach enables PIRenderer to produce photo-realistic results while maintaining other source attributes like identity and illumination.
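The three-stage data flow above can be sketched as follows. This is a toy numpy illustration of the pipeline's structure only; all dimensions (motion descriptor size, latent size, resolution), the single-layer networks, and the nearest-neighbor warp are placeholder assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, not taken from the paper
MOTION_DIM = 73    # 3DMM motion coefficients (expression + pose)
LATENT_DIM = 256   # latent control vector size
H = W = 8          # toy image resolution

def mapping_network(p):
    """Map a 3DMM motion descriptor p to a latent control vector z."""
    W1 = rng.standard_normal((LATENT_DIM, MOTION_DIM)) * 0.01
    return np.maximum(W1 @ p, 0.0)  # one ReLU layer stands in for the MLP

def warping_network(source, z):
    """Derive a flow field from z and coarsely warp the source image."""
    flow = np.tanh(z[:2 * H * W].reshape(2, H, W)) * 2.0  # toy flow, in pixels
    ys, xs = np.mgrid[0:H, 0:W]
    sy = np.clip(np.round(ys + flow[0]).astype(int), 0, H - 1)
    sx = np.clip(np.round(xs + flow[1]).astype(int), 0, W - 1)
    return source[:, sy, sx]  # nearest-neighbor warp, shape (C, H, W)

def editing_network(source, warped):
    """Refine the coarse warp; a real network predicts residual details."""
    return 0.5 * warped + 0.5 * source  # placeholder refinement

p = rng.standard_normal(MOTION_DIM)   # target motion descriptor
source = rng.random((3, H, W))        # source portrait, channels-first

z = mapping_network(p)                # 1. mapping network
coarse = warping_network(source, z)   # 2. warping network
output = editing_network(source, coarse)  # 3. editing network
print(output.shape)  # (3, 8, 8)
```

The key design point the sketch preserves is that the latent vector `z`, not the raw 3DMM coefficients, conditions both the warping and editing stages, which is what lets the warp produce a coarse result that the editing network only has to refine.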

Quantitative Assessment

The performance of PIRenderer is assessed using several metrics:

  • Fréchet Inception Distance (FID): This measures the realism of generated images compared to real images.
  • Average Expression Distance (AED) and Average Pose Distance (APD): These metrics evaluate the accuracy of expression and pose reproduction, respectively.
  • Learned Perceptual Image Patch Similarity (LPIPS): This measures the perceptual similarity between generated and ground truth images, providing insight into the model's reconstruction capabilities.

The results indicate superior performance in both direct and indirect editing tasks, demonstrating the model's ability to generate realistic images and, in the reenactment setting, temporally coherent videos.
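The motion-accuracy metrics above can be illustrated with a short sketch: AED and APD compare the 3DMM coefficients re-extracted from generated frames against those of the driving frames. The coefficient splits (64 expression, 6 pose) and the use of a mean L1 distance here are illustrative assumptions; the paper's exact definitions may differ.

```python
import numpy as np

def average_distance(coeffs_gen, coeffs_real):
    """Mean absolute distance between corresponding coefficient vectors,
    averaged over all frames."""
    return float(np.mean(np.abs(coeffs_gen - coeffs_real)))

rng = np.random.default_rng(1)
# Hypothetical per-frame 3DMM coefficients for 10 frames:
expr_gen, expr_real = rng.random((10, 64)), rng.random((10, 64))  # expression
pose_gen, pose_real = rng.random((10, 6)), rng.random((10, 6))    # pose

aed = average_distance(expr_gen, expr_real)  # Average Expression Distance
apd = average_distance(pose_gen, pose_real)  # Average Pose Distance
```

Lower AED/APD means the generated faces track the target expressions and head poses more faithfully, complementing FID and LPIPS, which score realism and perceptual fidelity rather than motion accuracy.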

Comparisons and Implications

PIRenderer is compared with state-of-the-art methods such as X2Face and FOMM. The study highlights PIRenderer's ability to generate more realistic and accurate depictions of facial movements and expressions. In cross-identity motion imitation especially, the model benefits from its fully disentangled motion descriptors: target motions are specified as 3DMM parameters rather than relayed through another subject's face, reducing identity leakage.

Moreover, the model's extension to audio-driven facial reenactment demonstrates its capacity to handle more complex tasks. By mapping audio inputs to 3DMM parameters, PIRenderer generates meaningful facial and pose transformations from audio streams, showcasing its versatility and potential for applications like virtual avatars and real-time video synthesis.
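The audio-driven extension can be sketched as a function that maps a window of per-frame audio features to one motion descriptor per video frame, which is then fed to the renderer. Everything concrete here is an assumption for illustration: the MFCC feature size, the window length, and the single linear projection standing in for the trained sequence model.

```python
import numpy as np

rng = np.random.default_rng(2)
MOTION_DIM = 73   # hypothetical 3DMM motion descriptor size
MFCC_DIM = 13     # hypothetical per-frame audio feature size (e.g., MFCCs)
WINDOW = 5        # audio frames of context per video frame

def audio_to_motion(audio_feats):
    """Map a sliding window of audio features to a sequence of motion
    descriptors; a linear projection stands in for the trained model."""
    Wm = rng.standard_normal((MOTION_DIM, WINDOW * MFCC_DIM)) * 0.01
    T = audio_feats.shape[0]
    # Edge-pad so every frame has a full context window
    pad = np.pad(audio_feats, ((WINDOW // 2, WINDOW // 2), (0, 0)), mode="edge")
    motions = []
    for t in range(T):
        window = pad[t:t + WINDOW].reshape(-1)
        motions.append(Wm @ window)
    return np.stack(motions)  # (T, MOTION_DIM), one descriptor per frame

feats = rng.standard_normal((20, MFCC_DIM))  # 20 frames of audio features
motions = audio_to_motion(feats)
print(motions.shape)  # (20, 73)
```

Because the renderer already accepts arbitrary 3DMM motion descriptors, this audio front-end is the only new component needed: each predicted descriptor is applied to the single reference image frame by frame to produce the output video.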

Future Directions

The research opens several avenues for future exploration:

  • Enhanced Editing Capabilities: Further refining the latent space mapping to improve editing precision and zero-shot motion transfer capabilities.
  • Integration with Other Modalities: Broadening the input spectrum, such as integrating text or gesture controls, to create multi-modal interactive systems.
  • Real-time Processing: Optimizing computational efficiency for real-time applications in virtual reality and social media contexts.

Overall, the paper makes a significant contribution to the field of image-based facial animation, offering a robust tool for intuitive and controlled portrait image generation. The insights gained from this research may inform the development of advanced neural rendering systems that seamlessly integrate with multimedia content creation pipelines.
