Pixel3DMM: Versatile Screen-Space Priors for Single-Image 3D Face Reconstruction

Published 1 May 2025 in cs.CV and cs.AI | (2505.00615v1)

Abstract: We address the 3D reconstruction of human faces from a single RGB image. To this end, we propose Pixel3DMM, a set of highly-generalized vision transformers which predict per-pixel geometric cues in order to constrain the optimization of a 3D morphable face model (3DMM). We exploit the latent features of the DINO foundation model, and introduce a tailored surface normal and uv-coordinate prediction head. We train our model by registering three high-quality 3D face datasets against the FLAME mesh topology, which results in a total of over 1,000 identities and 976K images. For 3D face reconstruction, we propose a FLAME fitting optimization that solves for the 3DMM parameters from the uv-coordinate and normal estimates. To evaluate our method, we introduce a new benchmark for single-image face reconstruction, which features highly diverse facial expressions, viewing angles, and ethnicities. Crucially, our benchmark is the first to evaluate both posed and neutral facial geometry. Ultimately, our method outperforms the most competitive baselines by over 15% in terms of geometric accuracy for posed facial expressions.


Summary

The paper introduces Pixel3DMM, using dual vision transformer networks to predict UV-coordinates and surface normals for optimizing 3DMM parameters.
It improves geometric accuracy by over 15% compared with methods like DECA and EMOCA when reconstructing posed facial expressions.
The study establishes a new benchmark while addressing challenges in identity-expression disentanglement, paving the way for future research.

Overview of "Pixel3DMM: Versatile Screen-Space Priors for Single-Image 3D Face Reconstruction"

The paper presents Pixel3DMM, an approach to 3D face reconstruction from a single RGB image that relies on pixel-aligned geometric cues. The method employs a pair of Vision Transformers (ViTs) that predict per-pixel surface normals and UV-coordinates, and uses these predictions to constrain the optimization of a 3D morphable face model (3DMM). The aim is to improve the robustness and fidelity of 3D face reconstructions in the presence of long-standing challenges such as depth ambiguity and identity-expression disentanglement.

Methodology

Pixel3DMM Architecture

Pixel3DMM uses two vision transformer networks, one predicting UV-coordinates and one predicting surface normals. Both draw their latent features from the DINOv2 foundation model. The network architecture extends the standard ViT with a prediction head consisting of transformer blocks and up-convolutions, which produces high-resolution per-pixel outputs.

Figure 1: Method Overview: Pixel3DMM consists of (a) learning pixel-aligned geometric priors (left) and (b) test-time optimization against predicted UV-coordinates and normals (right).

Training and Data Preparation

The model is trained on three high-quality 3D face datasets, namely NPHM, FaceScape, and Ava256, registered against the FLAME mesh topology. Techniques such as non-rigid registration and random sampling of light and camera parameters enhance the model's robustness. The resulting training set covers over 1,000 identities and 976K images.

Fitting Strategy

The 3DMM parameters are optimized against the predicted surface normals and UV-coordinates, using a 2D vertex loss derived from the UV predictions and a loss on the estimated normals. The same optimization applies to single images and extends to monocular video sequences, where it additionally encourages smoothness across frames.
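The two loss terms described above could be sketched roughly as follows. This is a minimal illustration assuming NumPy, not the paper's implementation: the function names, the nearest-neighbour UV lookup, the L1 pixel residual, the cosine-based normal term, and the `max_uv_dist` threshold are all assumptions made for the example.

```python
import numpy as np

def uv_vertex_loss(pred_uv, vertex_uv, projected_xy, max_uv_dist=0.05):
    """2D vertex loss (illustrative): for each mesh vertex, find the pixel whose
    predicted UV value is closest to the vertex's canonical UV coordinate, then
    penalise the image-space distance between that pixel and the vertex's
    current projection.

    pred_uv:      (H, W, 2) per-pixel UV prediction
    vertex_uv:    (V, 2) canonical UV coordinate of each vertex
    projected_xy: (V, 2) current projection of each vertex in pixels (x, y)
    max_uv_dist:  hypothetical threshold that discards poor UV matches
    """
    w = pred_uv.shape[1]
    flat_uv = pred_uv.reshape(-1, 2)  # (H*W, 2), row-major pixel order
    # Pairwise UV distances between vertices and pixels: (V, H*W)
    d = np.linalg.norm(vertex_uv[:, None, :] - flat_uv[None, :, :], axis=-1)
    nearest = d.argmin(axis=1)
    valid = d[np.arange(len(vertex_uv)), nearest] < max_uv_dist
    if not valid.any():
        return 0.0
    # Convert flat pixel indices back to (x, y) pixel coordinates
    target_xy = np.stack([nearest % w, nearest // w], axis=1).astype(float)
    residual = np.abs(projected_xy - target_xy).sum(axis=1)  # L1, in pixels
    return float(residual[valid].mean())

def normal_loss(pred_normals, rendered_normals, mask):
    """Mean (1 - cosine similarity) between predicted and rendered normals,
    evaluated over the pixels selected by the boolean mask."""
    def unit(n):
        length = np.linalg.norm(n, axis=-1, keepdims=True)
        return n / np.maximum(length, 1e-8)
    cos = (unit(pred_normals) * unit(rendered_normals)).sum(axis=-1)
    return float((1.0 - cos)[mask].mean())
```

In an actual fitting loop, `projected_xy` and `rendered_normals` would come from a differentiable renderer of the FLAME mesh, and the losses would be minimised with respect to the 3DMM and camera parameters; the NumPy version here only shows how the residuals are formed.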

Benchmark Development

To evaluate the effectiveness of Pixel3DMM, a new benchmark was introduced using the multi-view video dataset NeRSemble. This benchmark is designed to test both posed and neutral geometries, offering a detailed evaluation protocol that includes metrics such as L1 and L2 Chamfer distances, normal consistency, and recall rates.
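For readers unfamiliar with these metrics, the sketch below shows one common way to compute them on point sets, assuming NumPy. Conventions differ between benchmarks (symmetric versus one-directional distances, squared versus unsquared L2, choice of recall threshold), and this summary does not specify which ones the benchmark uses, so the definitions and function names here are assumptions for illustration.

```python
import numpy as np

def nearest_distances(src, dst):
    """Distance from each point in src (N, 3) to its nearest neighbour in dst (M, 3)."""
    diff = src[:, None, :] - dst[None, :, :]
    return np.linalg.norm(diff, axis=-1).min(axis=1)

def chamfer_l1(pred, gt):
    """Symmetric mean of nearest-neighbour distances (assumed convention)."""
    return 0.5 * float(nearest_distances(pred, gt).mean()
                       + nearest_distances(gt, pred).mean())

def chamfer_l2(pred, gt):
    """Symmetric mean of squared nearest-neighbour distances (assumed convention)."""
    return 0.5 * float((nearest_distances(pred, gt) ** 2).mean()
                       + (nearest_distances(gt, pred) ** 2).mean())

def recall(pred, gt, threshold):
    """Fraction of ground-truth points with a predicted point within `threshold`."""
    return float((nearest_distances(gt, pred) <= threshold).mean())

def normal_consistency(pred, pred_normals, gt, gt_normals):
    """Mean cosine between each ground-truth normal and the normal of its
    nearest predicted point. Normals are assumed to be unit length."""
    diff = gt[:, None, :] - pred[None, :, :]
    idx = np.linalg.norm(diff, axis=-1).argmin(axis=1)
    return float((gt_normals * pred_normals[idx]).sum(axis=-1).mean())
```

The brute-force pairwise distance matrix keeps the example short; for scans with many points, a spatial index such as a k-d tree would be the usual replacement.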

Figure 2: 3D Face Reconstruction Benchmark Analysis. We show the 5 most diverse images from each benchmark dataset, as measured by the expression codes of EMOCA.

Experimental Results

Pixel3DMM outperforms existing methods such as DECA and EMOCA on the posed reconstruction task, improving geometric accuracy by over 15%. The results on the neutral task, however, highlight how difficult it is to disentangle identity and expression with an optimization-based approach.

Figure 3: Qualitative Comparison (Posed): We show overlays of the reconstructed meshes to judge the reconstruction alignment.

Implications and Future Work

Pixel3DMM marks a step forward in single-image 3D face reconstruction by integrating learned geometric priors into the optimization process. However, challenges remain in improving the speed and scalability of such methods, particularly for real-time and generative applications. Future work could explore integrating multi-view information into the foundational architecture or refining the optimization strategy to further improve identity-expression disentanglement.

Figure 4: Qualitative Comparison (Neutral): Alignment of the neutral prediction against the neutral image and scan of a person.

The new benchmark enables more nuanced comparisons across methods, particularly for posed geometry reconstruction and identity-expression disentanglement.

Conclusion

The paper introduces a method for single-image 3D face reconstruction that improves the fidelity and applicability of face modeling techniques. Pixel3DMM leverages pixel-aligned geometric priors to refine 3DMM parameter estimation, and its accompanying benchmark provides a reference point against which future methods in this domain can be assessed. The use of public datasets and efficient training setups encourages continued exploration in related areas.
