NeRFlame: Neural Face Rendering
- NeRFlame is a neural face rendering framework that fuses FLAME’s explicit 3D morphable mesh model with NeRF’s volumetric rendering to achieve semantic control and photorealism.
- It leverages FLAME mesh parameterization, density gating near the face, and FiLM-based expression conditioning to enable precise manipulation of facial identity, expression, and pose.
- Evaluations reveal competitive image fidelity with full NeRF methods while offering enhanced editability, despite challenges like potential reconstruction trade-offs and static background handling.
NeRFlame is a family of neural face rendering models that fuses the semantic controllability of FLAME (Faces Learned with an Articulated Model and Expressions), a 3D morphable mesh model, with the photorealistic synthesis capabilities of Neural Radiance Fields (NeRF). NeRFlame achieves explicit control over facial identity, expression, and pose while preserving fine-scale appearance details through volumetric rendering. This article summarizes the design principles, mathematical formulations, conditioning mechanisms, network architectures, training procedures, evaluation metrics, and limitations of NeRFlame and related FLAME-in-NeRF frameworks (Athar et al., 2021; Zając et al., 2023).
1. FLAME Mesh Parameterization and Semantic Control
At the core of NeRFlame is the FLAME model, which represents human head geometry as a low-dimensional, PCA-based morphable mesh. For a given subject, the vertex locations are parametrized as:

$$V(\beta, \psi, \theta) = R(\theta)\left(\bar{T} + B_s\,\beta + B_e\,\psi\right) + \mathbf{t},$$

where $\beta \in \mathbb{R}^{300}$ encodes identity, $\psi \in \mathbb{R}^{100}$ encodes expression, $R(\theta)$ and $\mathbf{t}$ provide the global rigid transformation, $\bar{T}$ is the mean head mesh, and $B_s$, $B_e$ are PCA bases learned from 4D scans [FLAME:SiggraphAsia2017]. In practice, NeRFlame fixes the identity $\beta$ per subject and estimates frame-wise expressions by 2D landmark fitting:

$$\hat{\psi} = \arg\min_{\psi}\; \sum_{i} \left\| \Pi\!\big(V_i(\beta, \psi, \theta)\big) - \ell_i \right\|_2^2 + \lambda \left\|\psi\right\|_2^2,$$

where $\ell_i$ are detected facial landmarks, $\Pi$ is a weak-perspective or perspective camera projection, and $\lambda\|\psi\|_2^2$ regularizes expressions. Once fit, FLAME provides explicit control over head shape, facial deformation, and pose.
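The linear blendshape structure of this parameterization can be illustrated with a minimal NumPy sketch. The dimensions, weights, and the `flame_vertices` helper below are illustrative stand-ins, not the released FLAME assets (FLAME itself also applies linear blend skinning for jaw and neck articulation, which this sketch omits):

```python
import numpy as np

# Illustrative sizes; the real FLAME model uses ~5023 vertices,
# 300 identity components, and 100 expression components.
N_VERTS, N_SHAPE, N_EXPR = 8, 4, 3

rng = np.random.default_rng(0)
T_bar = rng.normal(size=(N_VERTS, 3))          # mean head mesh
B_s = rng.normal(size=(N_VERTS, 3, N_SHAPE))   # identity PCA basis
B_e = rng.normal(size=(N_VERTS, 3, N_EXPR))    # expression PCA basis

def flame_vertices(beta, psi, R=np.eye(3), t=np.zeros(3)):
    """Linear blendshapes (mean + identity + expression offsets),
    followed by a global rigid transform."""
    V = T_bar + B_s @ beta + B_e @ psi
    return V @ R.T + t

# Zero coefficients and identity pose reproduce the mean mesh.
V = flame_vertices(np.zeros(N_SHAPE), np.zeros(N_EXPR))
assert np.allclose(V, T_bar)
```

Fixing `beta` per subject and varying `psi` frame by frame mirrors the fitting procedure described above.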
2. Volume Rendering and Density Formulation
NeRFlame integrates FLAME with NeRF-style differentiable volume rendering. Rather than modeling density and color via a global MLP on $\mathbb{R}^3$, NeRFlame gates volume density tightly around the FLAME mesh $M$. For any query point $\mathbf{x}$ in space:

$$\sigma(\mathbf{x}) = \begin{cases} \hat{\sigma}(\mathbf{x})\left(1 - \dfrac{d(\mathbf{x}, M)}{\varepsilon}\right), & d(\mathbf{x}, M) < \varepsilon,\\[4pt] 0, & \text{otherwise}, \end{cases}$$

where $d(\mathbf{x}, M)$ is the shortest Euclidean distance to the mesh, $\hat{\sigma}$ is the network-predicted density, and $\varepsilon$ is a shell thickness hyperparameter (e.g., $0.1$). This explicit density shell prevents color prediction outside the mesh vicinity and ensures semantic locality. Rendering proceeds along rays $\mathbf{r}(t) = \mathbf{o} + t\,\mathbf{d}$, accumulating color only inside the shell:

$$C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt,$$

where transmittance $T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right)$ and pixel color is approximated via quadrature:

$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right)\mathbf{c}_i, \qquad T_i = \exp\!\Big(-\sum_{j<i} \sigma_j \delta_j\Big),$$

with sampled depths $t_1 < \dots < t_N$, $\delta_i = t_{i+1} - t_i$.
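The shell-gated density and the standard NeRF quadrature can be sketched together in NumPy. This is a simplified reading of the scheme, with a linear falloff inside the shell and hypothetical helper names (`shell_density`, `render_ray`):

```python
import numpy as np

def shell_density(raw_sigma, dist_to_mesh, eps=0.1):
    """Gate raw density by distance to the FLAME mesh: linear falloff
    inside the eps-shell, exactly zero outside it."""
    gate = np.clip(1.0 - dist_to_mesh / eps, 0.0, None)
    return raw_sigma * gate

def render_ray(sigmas, colors, t_vals):
    """Standard NeRF alpha-compositing quadrature along one ray."""
    deltas = np.diff(t_vals, append=t_vals[-1] + 1e10)
    alphas = 1.0 - np.exp(-sigmas * deltas)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0)

# Only the sample inside the shell contributes to the pixel color.
t_vals = np.linspace(0.0, 1.0, 5)
dists = np.array([0.5, 0.3, 0.05, 0.3, 0.5])   # one sample within eps=0.1
sigmas = shell_density(np.full(5, 10.0), dists)
colors = np.tile(np.array([1.0, 0.0, 0.0]), (5, 1))
pixel = render_ray(sigmas, colors, t_vals)
assert sigmas[0] == 0.0 and sigmas[2] > 0.0
```

Because the gate is hard zero outside the shell, samples far from the face never receive gradient through the color branch, which is exactly the semantic-locality property described above.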
3. Conditioning and Disentanglement Mechanisms
NeRFlame and FLAME-in-NeRF implement semantic control by integrating FLAME parameters into density and color prediction. Two strategies arise:
- Density gating: The FLAME mesh directly defines the region of nonzero density, focusing learning on morphology and suppressing non-face regions (Zając et al., 2023).
- Expression conditioning via FiLM: In FLAME-in-NeRF, the facial expression vector $\psi$ modulates MLP activations via feature-wise affine (FiLM) transformations:

$$\mathbf{h}_{\ell}' = \boldsymbol{\gamma}_{\ell}(\psi) \odot \mathbf{h}_{\ell} + \boldsymbol{\beta}_{\ell}(\psi),$$

where $\boldsymbol{\gamma}_{\ell}$, $\boldsymbol{\beta}_{\ell}$ are learned functions of $\psi$. This injects expression control at multiple network depths (Athar et al., 2021).
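A minimal FiLM layer can be sketched as follows. The weight matrices are random stand-ins for learned parameters, and the `1 + W @ psi` parameterization of the scale is a common convention (so a zero expression code acts as the identity), not necessarily the exact form used in FLAME-in-NeRF:

```python
import numpy as np

rng = np.random.default_rng(1)
HIDDEN, EXPR = 16, 10
W_gamma = rng.normal(size=(HIDDEN, EXPR))  # stand-in for learned gamma net
W_beta = rng.normal(size=(HIDDEN, EXPR))   # stand-in for learned beta net

def film(h, psi):
    """Feature-wise affine modulation of hidden activations h by the
    expression code psi."""
    gamma = 1.0 + W_gamma @ psi   # per-channel scale; identity at psi == 0
    b = W_beta @ psi              # per-channel shift
    return gamma * h + b

h = rng.normal(size=HIDDEN)
assert np.allclose(film(h, np.zeros(EXPR)), h)  # neutral code leaves h unchanged
```

Stacking such layers at several MLP depths is what gives the expression code influence over both coarse and fine features.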
- Spatial prior loss: A proxy density $\tilde{\sigma}$ (a narrow Gaussian around the FLAME surface) is used as a spatial prior, promoting disentanglement:

$$\mathcal{L}_{\text{prior}} = \mathbb{E}_{\mathbf{x}}\Big[\big(\sigma(\mathbf{x}) - \tilde{\sigma}(\mathbf{x})\big)^2\Big], \qquad \tilde{\sigma}(\mathbf{x}) \propto \exp\!\left(-\frac{d(\mathbf{x}, M)^2}{2 s^2}\right).$$

This penalizes unwanted density drift in non-facial regions when expressions change.
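A sketch of such a prior in NumPy, with illustrative hyperparameters (`s`, `peak`) and hypothetical helper names:

```python
import numpy as np

def gaussian_proxy_density(dist_to_mesh, s=0.05, peak=10.0):
    """Narrow Gaussian proxy density centered on the FLAME surface
    (assumed form; s and peak are illustrative hyperparameters)."""
    return peak * np.exp(-dist_to_mesh**2 / (2 * s**2))

def spatial_prior_loss(sigma, dist_to_mesh):
    """Penalize deviation of predicted density from the mesh-centered
    proxy, discouraging density drift into non-facial regions."""
    return np.mean((sigma - gaussian_proxy_density(dist_to_mesh))**2)

# Appreciable density far from the mesh is penalized toward zero.
dists = np.array([0.0, 0.02, 0.5])
sigma = np.array([10.0, 9.0, 4.0])   # the 4.0 sits far from the mesh
assert spatial_prior_loss(sigma, dists) > 0.0
```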
4. Network Architecture and Training Procedures
Both NeRFlame and FLAME-in-NeRF utilize deep multi-layer perceptrons for color and fine density refinement:
- NeRFlame MLPs: 8 layers, 256 hidden units, ReLU, skip connections at layer 4, layer norm. Color MLP receives positional-encoded spatial and view vectors, with a 4-layer head for view dependence. Density MLP is activated in a second training phase for sub-shell refinement (Zając et al., 2023).
- FLAME-in-NeRF MLPs: 8 layers, width 256, skip connection at layer 4; final output splits into scalar density and view-dependent color branches. FiLM layers inject expression codes (Athar et al., 2021).
Training proceeds in either end-to-end or phased stages:
- Phase 1: Fix density via analytic FLAME shell, optimize mesh and color MLP.
- Phase 2: Freeze mesh parameters, learn refined density and color jointly, gradually increasing the shell width $\varepsilon$.
Dataset preparation typically involves video or multi-view portrait capture, landmark detection, FLAME fitting, camera pose estimation, and ray sampling. Supervision is given by the L2 photometric reconstruction loss:

$$\mathcal{L}_{\text{photo}} = \sum_{\mathbf{r} \in \mathcal{R}} \big\| \hat{C}(\mathbf{r}) - C_{\text{gt}}(\mathbf{r}) \big\|_2^2,$$

summed over sampled rays $\mathcal{R}$ with ground-truth pixel colors $C_{\text{gt}}$.
Optionally, regularizers on expression parameters, mesh priors, and total variation of density are included.
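The supervision and phased schedule can be sketched as follows; the epoch counts and helper names (`photometric_loss`, `training_phase`) are illustrative, not values from the papers:

```python
import numpy as np

def photometric_loss(pred_rgb, gt_rgb):
    """L2 reconstruction loss over a batch of sampled rays."""
    return np.mean(np.sum((pred_rgb - gt_rgb) ** 2, axis=-1))

def training_phase(epoch, phase1_epochs=50):
    """Phase 1: density fixed by the analytic FLAME shell; only mesh
    and color MLP are optimized.  Phase 2: mesh frozen, density and
    color refined jointly."""
    return 1 if epoch < phase1_epochs else 2

pred = np.array([[0.5, 0.5, 0.5]])
gt = np.array([[0.5, 0.5, 0.5]])
assert photometric_loss(pred, gt) == 0.0
assert training_phase(10) == 1 and training_phase(80) == 2
```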
5. Experimental Evaluation and Results
Quantitative and qualitative assessments demonstrate NeRFlame’s ability to balance photorealism and semantic editability:
| Model | PSNR | SSIM | LPIPS |
|---|---|---|---|
| Classical NeRF | 31–33 dB | 0.96–0.97 | 0.04–0.06 |
| NeRFlame (editable) | 25–32 dB | 0.92–0.96 | 0.05–0.10 |
| FLAME (mesh only) | 9–13 dB | -- | -- |
Values summarized from (Zając et al., 2023).
- Image fidelity: NeRFlame retains sharp skin and hair details, achieving PSNR and SSIM only slightly below unconstrained NeRF, and far above textured FLAME-only baselines.
- Editability: Arbitrary expressions, blendshapes, and facial manipulations are supported without retraining, via FLAME parameter variation.
- Localized edits: Changes in expression or pose affect only the face region; backgrounds and non-face features remain stable.
- Video synthesis: FLAME-in-NeRF enables free-viewpoint portrait animation with expression controls from short selfie videos (Athar et al., 2021).
6. Discussion of Limitations and Future Directions
Several limitations are noted for NeRFlame-based approaches:
- Density gating trade-offs: The use of hard FLAME-based density can slightly degrade reconstruction quality compared to global NeRF because color prediction is confined to the shell region, which may omit fine details at boundaries (Zając et al., 2023).
- Limited mesh coverage: Mouth interiors and regions not modeled by FLAME (e.g., open cavities, non-rigid hair or clothing) are sources of rendering artifacts; rays traversing these areas may need to be culled.
- Sensitivity: Hyperparameters such as $\varepsilon$ (shell thickness) and the duration of the initial mesh-aligned training phase require tuning to balance fitting against leakage.
- Person-specific fitting: FLAME-in-NeRF requires per-subject training and lacks immediate generalization to unseen identities (Athar et al., 2021).
- Static/rigid background: Modeling dynamic background is outside the current framework; backgrounds remain static unless further modeling is added.
Potential future developments include augmentation of FLAME geometry (mouth/eye interiors), stronger priors to prevent extreme mesh deformations, multi-subject joint training for identity generalization, and hybrid density–texture methods to approach unconstrained NeRF fidelity (Zając et al., 2023).
7. Context and Significance within Neural Rendering
NeRFlame exemplifies the synthesis of explicit statistical morphable models and implicit neural volumetric representations to produce high-fidelity, editable 3D faces. By restricting volumetric density to the analytically derived FLAME surface, NeRFlame enables direct semantic control of rendered faces without sacrificing detail, bridging gaps between the controllability of mesh-based models and the realism of radiance fields. This design pattern is significant for generative graphics, free-viewpoint video, and animation, where semantic control and visual fidelity must co-exist. NeRFlame’s formulation provides a foundation for future neural rendering research seeking modular, interpretable, and photorealistic model architectures (Athar et al., 2021, Zając et al., 2023).