KaoLRM: 3D Face Reconstruction Neural Architecture
- KaoLRM is a neural architecture that repurposes a pre-trained LRM for FLAME-based 3D face regression, addressing cross-view consistency issues.
- It fuses triplane features from a multi-view-trained LRM backbone (built on a frozen DINOv2 encoder) with parametric FLAME mesh regression and 2D Gaussian Splatting for appearance modeling.
- Evaluations on FaceVerse and NoW benchmarks show reduced geometric variance and improved stability compared to conventional 3DMM regressors.
KaoLRM is a neural architecture for parametric 3D face reconstruction that repurposes the learned prior of a pre-trained Large Reconstruction Model (LRM) and fuses it with a parametric 3D Morphable Model (3DMM) regression pipeline. It targets the longstanding challenge in 3DMM-based facial reconstruction of achieving accurate, robust, and cross-view-consistent shape and expression prediction from single-view images. By projecting LRM’s multi-view-trained 3D triplane features into the FLAME parameter space and coupling a 2D Gaussian Splatting module to the parametric mesh, KaoLRM achieves high-fidelity geometry, reliable appearance modeling, and substantially improved consistency under viewpoint variability or self-occlusion (Zhu et al., 19 Jan 2026).
1. Motivation and Background
Parametric 3DMMs, especially FLAME, are widely used in 3D face reconstruction for their compact latent spaces and interpretable coefficients. Conventional 3DMM regressors, such as DECA, EMOCA, and SMIRK, are typically trained via analysis-by-synthesis using only single-view photometric and facial landmark supervision. In these regimes, the regressors tend to overfit to 2D cues and can "explain away" view-dependent appearance changes by spuriously altering shape (β), expression (ψ), or pose (θ) parameters. Quantitatively, the variance of predicted β and ψ across different views of the same subject can exceed 2–4 PCA-coefficient units, undermining cross-view consistency.
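The cross-view variance metric above can be made concrete: for one subject, collect the regressor's shape coefficients from several views and measure the per-dimension variance. The sketch below uses synthetic data; the 8 views, 50-dimensional β, and 0.1 noise scale are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Illustrative only: simulate one subject's shape coefficients predicted from
# 8 views by a regressor whose per-view error is small i.i.d. noise.
rng = np.random.default_rng(1)
true_beta = rng.normal(size=50)                      # "ground-truth" identity
views = true_beta + 0.1 * rng.normal(size=(8, 50))   # per-view predictions
per_dim_var = views.var(axis=0)                      # variance per PCA dim
mean_var = per_dim_var.mean()                        # cross-view instability
```

A consistent regressor keeps `per_dim_var` small across all dimensions; the inflated Var(β)/Var(ψ) figures reported for single-view-trained regressors correspond to this quantity being large.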
In contrast, recent LRMs, such as those trained on the Objaverse dataset with multi-view photometric supervision (e.g., the method of Hong et al., ICLR ’24), excel in encoding stable, robust 3D representations via triplane radiance fields. However, their outputs are typically unstructured (e.g., implicit volumes/radiance fields) and not directly exploitable for downstream tasks requiring explicit surface correspondence, editing, or animation.
KaoLRM fuses these paradigms by "retargeting" the powerful, pre-trained multi-view 3D prior of an LRM into a lightweight, 3DMM-style regressor, and tightly linking mesh and appearance domains through a Gaussian Splatting-based renderer. This yields both robust cross-view identity/expression consistency and the flexibility of parametric 3D model editing.
2. Architecture and Pipeline
The KaoLRM architecture comprises three main modules: (a) a frozen LRM triplane backbone for 3D-aware feature extraction, (b) parametric regression for FLAME coefficient prediction, and (c) a differentiable 2D Gaussian Splatting appearance model.
a) Pre-trained LRM and Triplane Extraction
The LRM backbone uses a frozen DINOv2 feature extractor, followed by a camera embedding for view conditioning and a transformer that lifts 2D features into a triplane representation. The triplanes encode features on three orthogonal axis-aligned planes. Any 3D position x can be queried via bilinear lookups on each plane, sum-pooled into a feature vector f(x). During KaoLRM training, all LRM weights are frozen except a small camera embedder, facilitating stable and data-efficient domain adaptation.
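A minimal sketch of the triplane query, assuming the three planes are stored as dense (R, R, C) grids and points are normalized to [-1, 1]^3 (the actual LRM produces the planes with a transformer and samples them in batched GPU code):

```python
import numpy as np

def query_triplane(planes, xyz):
    """Query a triplane at 3D points via bilinear lookup and sum-pooling.

    planes: dict with 'xy', 'xz', 'yz' feature grids, each (R, R, C).
    xyz: (N, 3) points in [-1, 1]^3.
    Returns (N, C): the sum of the three per-plane bilinear samples.
    """
    def bilerp(grid, uv):
        R = grid.shape[0]
        p = (uv + 1.0) * 0.5 * (R - 1)            # map [-1, 1] -> [0, R-1]
        i0 = np.clip(np.floor(p).astype(int), 0, R - 2)
        f = p - i0                                 # fractional offsets
        g = lambda di, dj: grid[i0[:, 0] + di, i0[:, 1] + dj]
        return ((1 - f[:, :1]) * (1 - f[:, 1:2]) * g(0, 0)
                + f[:, :1] * (1 - f[:, 1:2]) * g(1, 0)
                + (1 - f[:, :1]) * f[:, 1:2] * g(0, 1)
                + f[:, :1] * f[:, 1:2] * g(1, 1))

    x, y, z = xyz[:, 0:1], xyz[:, 1:2], xyz[:, 2:3]
    return (bilerp(planes['xy'], np.concatenate([x, y], axis=1))
            + bilerp(planes['xz'], np.concatenate([x, z], axis=1))
            + bilerp(planes['yz'], np.concatenate([y, z], axis=1)))
```

Sum-pooling (rather than concatenation) keeps the queried feature width equal to the per-plane channel count C.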
b) FLAME Parameter Regression
The triplane feature tokens f_i are flattened and gated by a self-gating MLP that outputs per-token importance scores w_i. The gated features are pooled and provided to a regression MLP:

(β, ψ, θ, s, t) = MLP_reg(pool({w_i f_i})),

where
- β: identity PCA coefficients,
- ψ: expression PCA coefficients,
- θ: global head and jaw rotation,
- s, t: scale and translation.
FLAME mesh vertices are reconstructed as:

V = s · FLAME(β, ψ, θ) + t,

with FLAME(·) applying blendshapes and linear blend skinning to output vertex positions.
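The gating-and-regression step can be sketched as follows. The single-linear-layer gate and per-parameter linear heads are simplifications of the paper's MLPs, and the token count (24) and feature width (16) are hypothetical; the output sizes follow standard FLAME conventions (100 identity, 50 expression).

```python
import numpy as np

def self_gated_pool(tokens, Wg, bg):
    """Pool flattened triplane tokens using per-token importance scores.
    The gate here is a single linear layer + sigmoid, a simplification of
    the paper's self-gating MLP."""
    scores = 1.0 / (1.0 + np.exp(-(tokens @ Wg + bg)))  # (T, 1), in (0, 1)
    weights = scores / scores.sum()                     # normalize over tokens
    return (weights * tokens).sum(axis=0)               # (C,) pooled feature

def regress_flame(pooled, heads):
    """Map the pooled feature to FLAME coefficients via per-parameter linear
    heads (beta: identity, psi: expression, theta: pose, plus scale/trans)."""
    return {name: pooled @ W + b for name, (W, b) in heads.items()}

# Hypothetical dimensions: 24 tokens of width 16; FLAME-style output sizes.
rng = np.random.default_rng(0)
C = 16
tokens = rng.normal(size=(24, C))
Wg, bg = rng.normal(size=(C, 1)), np.zeros(1)
dims = {'beta': 100, 'psi': 50, 'theta': 6, 'scale': 1, 'trans': 3}
heads = {k: (rng.normal(size=(C, d)), np.zeros(d)) for k, d in dims.items()}
params = regress_flame(self_gated_pool(tokens, Wg, bg), heads)
```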
c) 2D Gaussian Splatting for Appearance
To address the inefficiency of implicit LRM radiance-field rendering, KaoLRM samples surface points x_k on the mesh V using differentiable barycentric interpolation. Each x_k is mapped to appearance attributes by the triplane-based MLP:
- α_k: opacity,
- q_k: rotation quaternion,
- s_k: scales,
- c_k: color.
Each point is projected into image space as a 2D Gaussian disc with center μ_k and covariance Σ_k. The rendered image is produced by alpha-blending the depth-sorted set of 2D Gaussians:

C(p) = Σ_k c_k α_k G_k(p) Π_{j<k} (1 − α_j G_j(p)),

where G_k is a normalized Gaussian kernel.
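A per-pixel version of this alpha-blending can be sketched as below, assuming the splats arrive pre-sorted front to back and using an unnormalized Gaussian kernel for simplicity:

```python
import numpy as np

def composite_pixel(p, centers, covs, opacities, colors):
    """Alpha-blend 2D Gaussians at a single pixel p (front-to-back order).

    Implements C(p) = sum_k c_k a_k G_k(p) * prod_{j<k} (1 - a_j G_j(p)),
    with an unnormalized Gaussian falloff G_k for simplicity.
    """
    color = np.zeros(3)
    transmittance = 1.0                  # running prod_{j<k} (1 - a_j G_j(p))
    for mu, Sigma, a, c in zip(centers, covs, opacities, colors):
        d = p - mu
        g = np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d)   # Gaussian falloff
        alpha = a * g
        color += transmittance * alpha * c
        transmittance *= 1.0 - alpha
    return color
```

Because the accumulated transmittance shrinks multiplicatively, an opaque splat near the front occludes everything behind it, mirroring the behavior of volumetric alpha compositing.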
3. Loss Functions and Supervision
KaoLRM is trained using an analysis-by-synthesis objective that balances landmark accuracy, photometric fidelity, mesh/appearance consistency, and regularization:

L = λ_lmk L_lmk + λ_photo L_photo + λ_cons L_cons + λ_reg L_reg

Key components:
- L_lmk: Squared distance between the 68 predicted and ground-truth facial landmarks.
- L_photo: Blended photometric loss, supplemented by D-SSIM and VGG perceptual terms; blends the face region (mask M) and background by a weight w.
- L_cons: Enforces consistency between mesh and appearance depth/normal outputs within the face region.
- L_reg: Regularization on the magnitudes of β and ψ.
This composite objective ensures that geometry and appearance remain tightly coupled and penalizes geometric drift under self-occlusion and difficult viewpoints.
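A stripped-down sketch of this composite objective, omitting the D-SSIM/VGG terms and the depth/normal consistency loss for brevity; the loss weights and the face/background blend weight w = 0.7 are illustrative assumptions:

```python
import numpy as np

def kaolrm_style_loss(pred_lmk, gt_lmk, pred_img, gt_img, mask,
                      beta, psi, lam=(1.0, 1.0, 1e-3)):
    """Reduced analysis-by-synthesis objective:
        L = lam_lmk * L_lmk + lam_photo * L_photo + lam_reg * L_reg.
    D-SSIM/VGG and the mesh/appearance consistency term are omitted;
    the blend weight w = 0.7 is hypothetical."""
    # Landmark term: mean squared distance over the 68 landmarks.
    l_lmk = np.mean(np.sum((pred_lmk - gt_lmk) ** 2, axis=-1))
    # Photometric term: blend face region (mask) and background by weight w.
    w = 0.7
    err = np.abs(pred_img - gt_img)
    l_photo = w * (mask * err).mean() + (1 - w) * ((1 - mask) * err).mean()
    # Regularization on coefficient magnitudes.
    l_reg = (beta ** 2).sum() + (psi ** 2).sum()
    return lam[0] * l_lmk + lam[1] * l_photo + lam[2] * l_reg
```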
4. Training Procedure and Datasets
KaoLRM leverages the pre-trained OpenLRM image-to-triplane network (trained on Objaverse, 100K objects), freezing all but the camera embedder. Training comprises two stages:
- Stage I: 50K iterations, optimizing the landmark and regularization terms to fix orientation and scale.
- Stage II: 150K further iterations, introducing the photometric and mesh/appearance consistency terms.
Datasets include:
- Controlled multi-view head scans (FaceScape, Multiface, FaceVerse, Headspace): 32 random views/asset within the frontal hemisphere.
- In-the-wild faces (FFHQ, CelebA): single view per subject.
Training is performed at fixed input and render resolutions, with batches of multiple subjects and views per GPU, on 4 NVIDIA A100 GPUs (4 days multi-view, 6 days in-the-wild).
5. Quantitative and Qualitative Results
KaoLRM was evaluated on the FaceVerse dataset (using the NoW alignment protocol) and the NoW benchmark under multiple conditions (neutral, expressions, occlusions, selfies). The results demonstrate notably improved geometry and cross-view stability compared to DECA, EMOCA, and SMIRK regressors:
| Metric | DECA | EMOCA | SMIRK | KaoLRM |
|---|---|---|---|---|
| FaceVerse Chamfer (×10⁻²) | 3.17 | 3.15 | 3.20 | 2.68 |
| Var(β) (shape) | 2.02 | 2.01 | 8.48 | 1.54 |
| Var(ψ) (expression) | 2.48 | 4.67 | 47.0 | 1.10 |
| NoW Chamfer (mm) | 1.24 | 1.21 | 1.02 | 0.99 |
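The Chamfer numbers above are symmetric nearest-neighbor distances between predicted and ground-truth surface point sets. A brute-force sketch of the metric (benchmark protocols additionally apply rigid alignment and face-region cropping, which are omitted here):

```python
import numpy as np

def chamfer(A, B):
    """Symmetric Chamfer distance between point sets A (N, 3) and B (M, 3):
    the mean nearest-neighbor distance in each direction, summed. Brute-force
    O(N*M); real evaluations use KD-trees and an alignment step first."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)  # (N, M)
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```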
Qualitatively, KaoLRM produces meshes that remain stable under head rotation, with lower geometric drift and more reliable expression recovery. The binding of parametric mesh and appearance avoids the artifacts and instability seen in single-view-trained 3DMM regressors (Zhu et al., 19 Jan 2026).
6. Significance and Implications
By integrating a 3D-aware, frozen LRM feature extractor with a parametric regression and 2D Gaussian Splatting rendering pipeline, KaoLRM achieves cross-view consistency, robustness to self-occlusion, and interpretable/reusable mesh representations. This architecture demonstrates the effectiveness of leveraging large-scale, multi-view pretraining for 3D priors in downstream tasks traditionally hampered by data scarcity and viewpoint generalization. A plausible implication is the applicability of similar feature-retargeting strategies for other structured mesh regression tasks that demand both robustness and editability. The release of code and models supports reproducibility and downstream research applications.