
FLAME 3D Morphable Model (3DMM)

Updated 28 January 2026
  • FLAME 3DMM is a parametric model that represents the full human head using distinct shape, expression, and pose parameters.
  • It employs PCA-based bases and a learned skinning operator to generate anatomically plausible, poseable meshes for robust facial reconstruction.
  • Pipelines built on FLAME support applications such as neural volumetric rendering and facial expression inference, achieving state-of-the-art performance on several benchmarks.

The FLAME (Faces Learned with an Articulated Model and Expressions) 3D Morphable Model (3DMM) is a parametric model specifically designed to represent the full human head, including facial identity, expressions, and articulated pose, in a way that is both compact and highly disentangled. FLAME’s architecture allows for robust statistical modeling of facial geometry and animation-ready mesh deformations, supporting applications including facial reconstruction from images, neural volumetric rendering, animation, and facial expression inference.

1. Mathematical Formulation and Parameterization

FLAME models the human head mesh via a low-dimensional parameter space, enabling the generation of anatomically plausible, poseable 3D head shapes. The formulation consists of three main sets of parameters:

  • Shape coefficients $\alpha \in \mathbb{R}^{n_s}$ (typically $n_s = 100$), encoding subject-specific identity variation.
  • Expression coefficients $\delta \in \mathbb{R}^{n_e}$ (typically $n_e = 50$), parameterizing facial expressions.
  • Pose parameters $\theta \in \mathbb{R}^{n_p}$ (e.g., $n_p = 6$, covering global rotation and jaw articulation).

The mean template mesh $\bar S$ is combined with principal component analysis (PCA) bases for shape $B_s$ and expression $B_e$, and a pose-dependent corrective basis $B_p$. The unposed mesh is

$$T(\alpha, \delta, \theta) = \bar S + B_s\,\alpha + B_e\,\delta + B_p(\theta)$$

A learned linear blend skinning (LBS) operator $W$ applies pose-dependent articulation about the joints $J(\alpha)$. The posed mesh is then

$$M(\alpha, \delta, \theta) = W\big(T(\alpha, \delta, \theta),\, J(\alpha),\, \theta\big)$$

Optionally, a global scale $s$ and translation $t$ are applied:

$$M'(\alpha, \delta, \theta) = s\,M(\alpha, \delta, \theta) + t$$

This structure ensures a disentangled, interpretable, and differentiable mapping between the parameter vector $(\alpha, \delta, \theta)$ and the mesh vertices $V \in \mathbb{R}^{N \times 3}$, whose stacked dimension $3N$ is typically 10,000–20,000.
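The generative model above can be sketched numerically. The following is a minimal NumPy version under stated assumptions: the bases are random stand-ins rather than the real learned FLAME assets, pose correctives are omitted, and the skinning step uses a single identity joint.

```python
import numpy as np

# Illustrative dimensions (FLAME's released model has N = 5023 vertices);
# the bases here are random stand-ins, not the real FLAME assets.
N, n_s, n_e = 5023, 100, 50               # vertices, shape dim, expression dim
rng = np.random.default_rng(0)
S_bar = np.zeros((N, 3))                  # mean template mesh
B_s = rng.normal(size=(N, 3, n_s)) * 1e-3 # PCA shape basis
B_e = rng.normal(size=(N, 3, n_e)) * 1e-3 # PCA expression basis

def unposed_mesh(alpha, delta):
    """T(alpha, delta) = S_bar + B_s alpha + B_e delta (pose correctives omitted)."""
    return S_bar + B_s @ alpha + B_e @ delta

def lbs(T, joints_R, weights, joint_locs):
    """Minimal linear blend skinning: rotate vertices about each joint and
    blend with per-vertex weights. joints_R: (J,3,3), weights: (N,J)."""
    posed = np.zeros_like(T)
    for j in range(joints_R.shape[0]):
        rotated = (T - joint_locs[j]) @ joints_R[j].T + joint_locs[j]
        posed += weights[:, j:j + 1] * rotated
    return posed

alpha, delta = np.zeros(n_s), np.zeros(n_e)
T = unposed_mesh(alpha, delta)
# Identity pose: one joint at origin, identity rotation -> mesh unchanged.
M = lbs(T, np.eye(3)[None], np.ones((N, 1)), np.zeros((1, 3)))
assert np.allclose(M, T)
```

With non-zero $\alpha$ or $\delta$ the mesh deforms linearly along the corresponding basis directions, which is what makes gradient-based fitting of the low-dimensional codes straightforward.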

2. FLAME-Based Reconstruction Pipelines

Modern 3D face reconstruction pipelines leverage FLAME as the target parameter space for fitting 3D head geometry and appearance from monocular or multi-view images. KaoLRM (Zhu et al., 19 Jan 2026) exemplifies this by projecting features from a pretrained Large Reconstruction Model (LRM) into FLAME parameters through a gating and regression scheme. Specifically:

  • LRM produces triplane features, which are flattened into a sequence of tokens $\{t_i\}$.
  • A self-gating MLP predicts per-token gates $g_i = \sigma(\mathrm{MLP}(t_i))$, yielding gated tokens $\tilde t_i = g_i \odot t_i$.
  • A regressor maps the gated tokens to predicted FLAME parameters $(\hat\alpha, \hat\delta, \hat\theta)$.

Supervision can include a landmark loss comparing projected model landmarks to detected 2D landmarks, as well as $\ell_2$ regularizers on the shape and expression coefficients, e.g. $\mathcal{L}_{\mathrm{reg}} = \lambda_\alpha \lVert \alpha \rVert_2^2 + \lambda_\delta \lVert \delta \rVert_2^2$.
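The gating-and-regression scheme and the accompanying losses can be sketched in NumPy. All sizes, weight matrices, and the linear regressor below are illustrative stand-ins, not KaoLRM's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Hypothetical sizes: K triplane tokens of dimension D, regressed to FLAME params.
K, D, n_s, n_e, n_p = 256, 64, 100, 50, 6
tokens = rng.normal(size=(K, D))

# Self-gating: per-token scalar gate g_i = sigmoid(w . t_i), then t~_i = g_i * t_i.
w_gate = rng.normal(size=D)
gates = sigmoid(tokens @ w_gate)             # (K,)
gated = gates[:, None] * tokens              # (K, D)

# Stand-in regressor: pooled gated tokens -> (alpha, delta, theta).
W_reg = rng.normal(size=(D, n_s + n_e + n_p)) * 0.01
params = gated.mean(axis=0) @ W_reg
alpha, delta, theta = np.split(params, [n_s, n_s + n_e])

# Landmark loss on projected 2D landmarks, plus L2 priors on shape/expression.
lmk_pred = rng.normal(size=(68, 2))
lmk_gt = lmk_pred.copy()                     # perfect prediction for the demo
L_lmk = np.mean(np.sum((lmk_pred - lmk_gt) ** 2, axis=-1))
L_reg = 1e-4 * (alpha @ alpha) + 1e-4 * (delta @ delta)
```

The gates let the network suppress tokens that carry no head-related information before regression, which is the intuition behind the self-gating step described above.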

For appearance modeling, KaoLRM applies FLAME-based 2D Gaussian splatting. Points are densely sampled on the reconstructed mesh surface; each is transformed into a 2D Gaussian primitive in image space, and appearance is rendered by weighted splatting. Rendering and binding losses (relying on depth and normals from both mesh and Gaussian splats) enable end-to-end optimization.
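The dense surface sampling step can be illustrated with area-weighted barycentric sampling. This is a generic sketch, not KaoLRM's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_on_mesh(verts, faces, n):
    """Uniformly sample n surface points: pick faces with probability
    proportional to area, then draw barycentric coordinates per sample."""
    v0, v1, v2 = (verts[faces[:, i]] for i in range(3))
    areas = 0.5 * np.linalg.norm(np.cross(v1 - v0, v2 - v0), axis=1)
    fidx = rng.choice(len(faces), size=n, p=areas / areas.sum())
    u, v = rng.random(n), rng.random(n)
    flip = u + v > 1                     # reflect into the triangle
    u[flip], v[flip] = 1 - u[flip], 1 - v[flip]
    w = 1 - u - v
    return (w[:, None] * verts[faces[fidx, 0]]
            + u[:, None] * verts[faces[fidx, 1]]
            + v[:, None] * verts[faces[fidx, 2]])

# One unit triangle in the z = 0 plane: all samples must stay inside it.
verts = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.]])
pts = sample_on_mesh(verts, np.array([[0, 1, 2]]), 1000)
assert np.all(pts[:, 2] == 0) and np.all(pts.sum(axis=1) <= 1 + 1e-9)
```

Each sampled point would then seed one Gaussian primitive whose position stays bound to the mesh, so FLAME parameter changes move the splats consistently.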

Multi-stage training is common: landmark and regularization losses enable coarse alignment, then photometric and geometric binding losses refine fine structure and appearance. KaoLRM demonstrates this yields state-of-the-art reconstruction accuracy and cross-view consistency on benchmarks such as FaceVerse and NoW (Zhu et al., 19 Jan 2026).

3. Integration with Neural Volumetric Rendering

Several frameworks combine FLAME's explicit mesh structure with implicit neural representations like NeRF to obtain both photorealistic rendering and full expression/pose control.

NeRFlame (Zając et al., 2023) and FLAME-in-NeRF (Athar et al., 2021) both incorporate FLAME in radiance field pipelines:

  • The FLAME mesh generates a dense 3D surface; the volumetric density is defined to be nonzero only near the mesh surface, e.g. $\sigma(x) = \max(0, 1 - d(x)/\varepsilon)$, with $d(x)$ the minimum distance from $x$ to the mesh.
  • For color, a NeRF MLP $F_\Theta$ receives positional encodings of position and view direction and predicts RGB: $c = F_\Theta(\gamma(x), \gamma(d))$.
  • Control over expression and pose is achieved by manipulating FLAME parameters, which induce deformations of both the mesh and the NeRF density support.
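A minimal sketch of the distance-based density support described above, approximating $d(x)$ by the distance to the nearest of a set of mesh sample points (real systems use point-to-triangle distances and tuned $\varepsilon$):

```python
import numpy as np

def density(x, mesh_pts, eps=0.01):
    """Density supported only near the mesh: sigma(x) = max(0, 1 - d(x)/eps),
    with d(x) approximated as the distance to the nearest mesh sample point."""
    d = np.min(np.linalg.norm(mesh_pts - x, axis=1))
    return max(0.0, 1.0 - d / eps)

mesh_pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
print(density(np.array([0.0, 0.0, 0.0]), mesh_pts))  # on the surface -> 1.0
print(density(np.array([0.5, 0.5, 0.5]), mesh_pts))  # far from mesh -> 0.0
```

Because the density support moves rigidly with the mesh, editing FLAME's expression or pose codes relocates where the radiance field can be non-empty, which is what gives these hybrids their controllability.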

FLAME-in-NeRF further conditions the NeRF MLP on the FLAME expression code, concatenated to input layers, and uses a spatial prior (occupancy mask) to enforce that only facial regions respond to expression changes. Training employs combined losses: photometric, regularization on parameters, and novel disentanglement and spatial priors (Athar et al., 2021).

Joint optimization over neural rendering weights and FLAME parameters, often in multiple training phases, produces models with high-fidelity reconstructions that are directly controllable via FLAME's low-dimensional latent space (Zając et al., 2023).

4. Data-Driven FLAME Fitting from Images

FLAME parameter extraction is commonly performed via deep regression models trained to predict FLAME codes from monocular imagery under a battery of self-supervised and supervised objectives.

Anisetty et al. (Anisetty et al., 2022) develop an unsupervised encoder for in-the-wild images that outputs FLAME coefficients regulating both facial and full-head shape, even under severe hair occlusion. Core components include:

  • Dice consistency loss aligning the silhouette of rendered mesh (post-hair-inpainting) to observed skin.
  • Scale consistency loss ensuring shape invariance across varying crop levels (tight/loose framing).
  • Landmark detection for extended 71-point topology to constrain upper-head reconstruction.
  • Encoder consistency and regularization to stabilize predicted parameters.
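The Dice consistency term in the list above can be written as a soft Dice loss between the rendered silhouette and the observed skin mask. A sketch with illustrative binary masks:

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss between two (soft or binary) masks:
    1 - 2|A ∩ B| / (|A| + |B|), with eps for numerical stability."""
    inter = np.sum(pred * target)
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0
print(dice_loss(mask, mask))        # identical masks -> 0.0
print(dice_loss(mask, 1.0 - mask))  # disjoint masks  -> ~1.0
```

Unlike a pixelwise loss, Dice is insensitive to the large empty background, so it keeps pressure on silhouette overlap even when the head occupies few pixels.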

The system yields competitive performance on face (NoW) and full-head (CoMA, LYHM) evaluation datasets, confirming FLAME's utility for unsupervised, accurate geometry recovery from unconstrained images (Anisetty et al., 2022).

5. Applications in Facial Expression Inference and Recognition

FLAME-derived representations encode rich information on both facial identity and expression, and recent work has incorporated these 3D parameters as feature spaces for facial expression inference (FEI) tasks.

Ig3D (Dong et al., 2024) conducts a systematic study, evaluating both “short” (only the expression-related parameters) and “full” (all regressed FLAME parameters) embeddings extracted via EMOCA or SMIRK regressors. Two fusion strategies are analyzed:

  • Intermediate fusion: 3DMM parameters are projected and concatenated with late-stage 2D CNN features, then passed jointly through final MLPs.
  • Late fusion: 2D and 3D-based classifiers/regressors produce predictions which are fused at the score level (max, mean, weighted).
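The score-level (late) fusion variants can be sketched as follows; the 3-class probability vectors are hypothetical stand-ins for the 2D-branch and FLAME-branch outputs:

```python
import numpy as np

def late_fuse(p2d, p3d, mode="mean", w=0.5):
    """Score-level fusion of 2D-CNN and FLAME-branch class probabilities."""
    if mode == "max":
        return np.maximum(p2d, p3d)
    if mode == "weighted":
        return w * p2d + (1 - w) * p3d
    return 0.5 * (p2d + p3d)          # mean

p2d = np.array([0.6, 0.3, 0.1])      # hypothetical 3-class scores, 2D branch
p3d = np.array([0.2, 0.7, 0.1])      # hypothetical 3-class scores, 3D branch
print(late_fuse(p2d, p3d, "mean"))          # -> [0.4 0.5 0.1]
print(late_fuse(p2d, p3d, "max").argmax())  # -> 1
```

Late fusion keeps the two branches independent until the final scores, so each can be trained (and ablated) separately, unlike intermediate fusion.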

On AffectNet and RAF-DB, fusing FLAME embeddings, especially via late fusion, delivers consistent improvements over 2D-only baselines for both discrete expression classification (RAF-DB) and valence-arousal regression (AffectNet VA), thereby validating the complementary power of 3DMM-based features (Dong et al., 2024).

6. Supervision, Losses, and Optimization Strategies

FLAME-based models are typically supervised using a mix of geometric, photometric, and perceptual losses, as well as statistical priors on latent codes:

  • Landmark alignment (commonly an $\ell_2$ reprojection loss on detected 2D landmarks): improves geometric consistency across views.
  • Photometric and perceptual losses (e.g., pixelwise, VGG feature, D-SSIM).
  • Regularization (e.g., $\ell_2$ losses on $\alpha$, $\delta$, $\theta$) to prevent parameter drift.
  • Specialized terms such as dice loss for head silhouette, scale consistency for invariance to image crop, encoder consistency, and geometric binding in renderer-fitted pipelines.
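The multi-stage strategy can be sketched as a stage-dependent weighting of the individual terms. The weights below are illustrative, not values from any of the cited papers:

```python
def total_loss(losses, stage):
    """Hypothetical staged weighting: geometric cues dominate the coarse
    stage; photometric/binding terms are enabled in the fine stage."""
    weights = {
        "coarse": {"landmark": 1.0, "reg": 1e-4, "photo": 0.0, "binding": 0.0},
        "fine":   {"landmark": 0.1, "reg": 1e-4, "photo": 1.0, "binding": 0.5},
    }[stage]
    return sum(weights[k] * losses[k] for k in losses)

losses = {"landmark": 2.0, "reg": 10.0, "photo": 0.4, "binding": 0.2}
print(total_loss(losses, "coarse"))  # 1.0*2.0 + 1e-4*10 = 2.001
print(total_loss(losses, "fine"))    # 0.1*2.0 + 1e-4*10 + 0.4 + 0.5*0.2 = 0.701
```

Zeroing the appearance terms early prevents photometric gradients from fighting the landmark fit before the mesh is roughly aligned.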

Optimization can proceed in multi-stage fashion: coarse alignment is established using geometric cues, followed by photometric/appearance refinement. In end-to-end neural pipelines (e.g., NeRFlame), mesh and NeRF weights are co-optimized, sometimes with staged schedules to balance mesh rigidity and appearance flexibility (Zając et al., 2023, Zhu et al., 19 Jan 2026, Anisetty et al., 2022, Dong et al., 2024).

7. Quantitative Performance and Impact

FLAME-based 3DMM pipelines consistently deliver state-of-the-art results across multiple 3D face benchmarks:

  • KaoLRM: lower mean Chamfer distance on FaceVerse than DECA, EMOCA, and SMIRK, and a lower mean error on the NoW challenge split than DECA (Zhu et al., 19 Jan 2026).
  • Occlusion-robustness: Dice and scale invariant losses yield accurate full-head reconstructions even with occluding hair (Anisetty et al., 2022).
  • Expression transfer and recognition: FLAME-based features, when fused with 2D CNN or transformer pipelines, provide significant accuracy and robustness boosts for emotion recognition and valence-arousal estimation (Dong et al., 2024).

These results underline the model’s adaptability for controlled facial synthesis, neural rendering, and affective computing.


Relevant References:

KaoLRM (Zhu et al., 19 Jan 2026), NeRFlame (Zając et al., 2023), FLAME-in-NeRF (Athar et al., 2021), Full-head regulation (Anisetty et al., 2022), Ig3D (FEI fusion) (Dong et al., 2024).
