KaoLRM: 3D Face Reconstruction Neural Architecture

Updated 26 January 2026
  • KaoLRM is a neural architecture that repurposes a pre-trained LRM for FLAME-based 3D face regression, addressing cross-view consistency issues.
  • It fuses multi-view-trained triplane features from a frozen DINOv2-based LRM backbone with parametric FLAME mesh regression and 2D Gaussian Splatting for appearance modeling.
  • Evaluations on FaceVerse and NoW benchmarks show reduced geometric variance and improved stability compared to conventional 3DMM regressors.

KaoLRM is a neural architecture for parametric 3D face reconstruction that repurposes the learned prior of a pre-trained Large Reconstruction Model (LRM) and fuses it with a parametric 3D Morphable Model (3DMM) regression pipeline. It targets the longstanding challenge in 3DMM-based facial reconstruction of achieving accurate, robust, and cross-view-consistent shape and expression prediction from single-view images. By projecting LRM’s multi-view-trained 3D triplane features into the FLAME parameter space and coupling a 2D Gaussian Splatting module to the parametric mesh, KaoLRM achieves high-fidelity geometry, reliable appearance modeling, and substantially improved consistency under viewpoint variability or self-occlusion (Zhu et al., 19 Jan 2026).

1. Motivation and Background

Parametric 3DMMs, especially FLAME, are widely used in 3D face reconstruction for their compact latent spaces and interpretable coefficients. Conventional 3DMM regressors, such as DECA, EMOCA, and SMIRK, are typically trained via analysis-by-synthesis using only single-view photometric and facial landmark supervision. In these regimes, the regressors tend to overfit to 2D cues and can "explain away" view-dependent appearance changes by spuriously altering shape ($\beta$), expression ($\psi$), or pose parameters. Quantitatively, the variance in predicted $\beta$ and $\psi$ across different views of the same subject can exceed 2–4 PCA dimensions, undermining cross-view consistency.

In contrast, recent LRMs, such as those trained on the Objaverse dataset with multi-view photometric supervision (e.g., the method of Hong et al., ICLR ’24), excel in encoding stable, robust 3D representations via triplane radiance fields. However, their outputs are typically unstructured (e.g., implicit volumes/radiance fields) and not directly exploitable for downstream tasks requiring explicit surface correspondence, editing, or animation.

KaoLRM fuses these paradigms by "retargeting" the powerful, pre-trained multi-view 3D prior of an LRM into a lightweight, 3DMM-style regressor, and tightly linking mesh and appearance domains through a Gaussian Splatting-based renderer. This yields both robust cross-view identity/expression consistency and the flexibility of parametric 3D model editing.

2. Architecture and Pipeline

The KaoLRM architecture comprises three main modules: (a) a frozen LRM triplane backbone for 3D-aware feature extraction, (b) parametric regression for FLAME coefficient prediction, and (c) a differentiable 2D Gaussian Splatting appearance model.

a) Pre-trained LRM and Triplane Extraction

The LRM backbone uses a frozen DINOv2 feature extractor, followed by a camera embedding for view conditioning and a transformer that lifts 2D features into a triplane representation. The triplanes $\mathcal{T} = \{\mathcal{T}_x, \mathcal{T}_y, \mathcal{T}_z\}$ encode features $\mathcal{T}_i \in \mathbb{R}^{C \times H \times W}$ along three orthogonal axes. Any 3D spatial position $x$ can be queried via bilinear lookups on each plane and sum-pooling to produce a feature $F_{\mathrm{tri}}(x) \in \mathbb{R}^C$. During KaoLRM training, all LRM weights are frozen except a small camera embedder, facilitating stable and data-efficient domain adaptation.
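The triplane query can be sketched in pure Python (a minimal illustration with hypothetical names; the actual model operates on learned feature planes with batched tensor operations):

```python
import math

def bilinear(plane, u, v):
    """Bilinearly sample a feature plane (H x W grid of C-dim features) at continuous (u, v)."""
    H, W = len(plane), len(plane[0])
    u0, v0 = int(math.floor(u)), int(math.floor(v))
    u1, v1 = min(u0 + 1, H - 1), min(v0 + 1, W - 1)
    du, dv = u - u0, v - v0
    C = len(plane[0][0])
    out = []
    for c in range(C):
        top = plane[u0][v0][c] * (1 - dv) + plane[u0][v1][c] * dv
        bot = plane[u1][v0][c] * (1 - dv) + plane[u1][v1][c] * dv
        out.append(top * (1 - du) + bot * du)
    return out

def query_triplane(T_x, T_y, T_z, x, y, z):
    """F_tri(x) = sum-pool of bilinear lookups on the three orthogonal planes."""
    f1 = bilinear(T_x, x, y)
    f2 = bilinear(T_y, y, z)
    f3 = bilinear(T_z, x, z)
    return [a + b + c for a, b, c in zip(f1, f2, f3)]
```

In the real architecture each plane holds learned $C$-dimensional features at $H \times W$ resolution; here tiny nested lists stand in for them.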

b) FLAME Parameter Regression

The triplane feature tokens are flattened and gated by a self-gating MLP that outputs importance scores $\sigma_j = \mathrm{sigmoid}(\mathrm{MLP}(t_j))$. The gated features are pooled and provided to a regression MLP:

$(\beta,\,\psi,\,\theta,\,s,\,t) = f_{\mathrm{reg}}(F_{\mathrm{tri}})$

where

  • $\beta \in \mathbb{R}^{100}$: identity PCA coefficients,
  • $\psi \in \mathbb{R}^{50}$: expression PCA coefficients,
  • $\theta \in \mathbb{R}^{6}$: global head and jaw rotation,
  • $s \in \mathbb{R}_+$, $t \in \mathbb{R}^3$: scale and translation.
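The gating-and-pooling step and the split of the regressed vector into FLAME parameter groups can be sketched as follows (a simplified illustration: the gate MLP is replaced by precomputed scalar scores, and all function names are hypothetical):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_pool(tokens, gate_scores):
    """Weight each triplane token by its sigmoid gate, then average-pool into one vector."""
    C = len(tokens[0])
    weights = [sigmoid(s) for s in gate_scores]
    total = sum(weights)
    pooled = [0.0] * C
    for tok, w in zip(tokens, weights):
        for c in range(C):
            pooled[c] += w * tok[c] / total
    return pooled

def split_flame_params(raw):
    """Split a flat 160-dim regression output into FLAME parameter groups
    (100 shape + 50 expression + 6 rotation + 1 scale + 3 translation)."""
    beta, rest = raw[:100], raw[100:]
    psi, rest = rest[:50], rest[50:]
    theta, rest = rest[:6], rest[6:]
    s, t = rest[0], rest[1:4]
    return beta, psi, theta, s, t
```

The 160-dim layout simply concatenates the parameter groups listed above; the paper does not specify the ordering, so this layout is an assumption.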

FLAME mesh vertices are reconstructed as:

$\mathcal{M} = s\,\mathbf{V}(\beta, \psi, \theta) + t$

with $\mathbf{V}$ applying blendshapes and linear blend skinning to output the $3 \times 5023$ vertex positions.
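Given the regressed scale and translation, the final similarity transform is straightforward (a minimal sketch; vertices are plain coordinate lists here rather than tensors):

```python
def similarity_transform(vertices, s, t):
    """Apply M = s * V + t to each FLAME vertex (V: list of [x, y, z] positions)."""
    return [[s * v[i] + t[i] for i in range(3)] for v in vertices]
```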

c) 2D Gaussian Splatting for Appearance

To address the inefficiency of implicit LRM radiance-field rendering, KaoLRM samples $N \approx 8{,}000$ surface points $x_i$ on $\mathcal{M}$ using differentiable barycentric interpolation. Each $x_i$ is mapped to appearance attributes by the triplane-based MLP:

  • $w_i \in (0, 1)$: opacity,
  • $q_i \in S^3$: rotation quaternion,
  • $s_{u,i}, s_{v,i}$: scales,
  • $c_i \in [0, 1]^3$: color.

Each point is projected into image space as a 2D Gaussian disc with center $\mu_i$ and covariance $\Sigma_i$. The rendered image is produced by alpha-blending the set of 2D Gaussians:

$I(u, v) = \sum_{i=1}^{N} w_i\,\mathcal{G}([u, v]^\top; \mu_i, \Sigma_i)\,c_i$

where $\mathcal{G}$ is a normalized Gaussian kernel.
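The per-pixel blend in the equation above can be sketched directly (a minimal illustration assuming diagonal covariances and omitting the depth ordering a real splatting renderer would perform):

```python
import math

def gaussian2d(u, v, mu, cov):
    """Normalized 2D Gaussian kernel G([u, v]; mu, Sigma) with diagonal Sigma (assumption)."""
    su, sv = cov  # diagonal covariance entries
    du, dv = u - mu[0], v - mu[1]
    norm = 1.0 / (2.0 * math.pi * math.sqrt(su * sv))
    return norm * math.exp(-0.5 * (du * du / su + dv * dv / sv))

def render_pixel(u, v, gaussians):
    """I(u, v) = sum_i w_i * G([u, v]; mu_i, Sigma_i) * c_i, per RGB channel."""
    rgb = [0.0, 0.0, 0.0]
    for w, mu, cov, c in gaussians:
        g = w * gaussian2d(u, v, mu, cov)
        for k in range(3):
            rgb[k] += g * c[k]
    return rgb
```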

3. Loss Functions and Supervision

KaoLRM is trained using an analysis-by-synthesis objective that balances landmark accuracy, regularization, photometric fidelity, and mesh/appearance consistency:

$L_{\mathrm{total}} = w_{\mathrm{lmk}}\,L_{\mathrm{lmk}} + w_{\mathrm{reg}}\left(\|\beta\|^2 + \|\psi\|^2\right) + w_{\mathrm{phot}}\,L_{\mathrm{photometric}} + w_{\mathrm{bind}}\,L_{\mathrm{binding}}$

Key components:

  • $L_{\mathrm{lmk}}$: squared $\ell_2$ distance between the 68 predicted and ground-truth landmarks.
  • $L_{\mathrm{photometric}}$: blended photometric loss, supplemented by D-SSIM and VGG perceptual terms; blends the face region (mask $m$) and background with weight $\lambda = 0.7$.
  • $L_{\mathrm{binding}}$: enforces consistency between mesh and appearance depth/normal outputs within the face region.
  • $L_{\mathrm{reg}}$: regularization on the magnitudes of $\beta$ and $\psi$.
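The composite objective is a direct weighted sum and can be sketched as follows (the individual loss terms are passed in as precomputed scalars; names are hypothetical):

```python
def total_loss(L_lmk, L_phot, L_bind, beta, psi, weights):
    """L_total = w_lmk*L_lmk + w_reg*(||beta||^2 + ||psi||^2) + w_phot*L_phot + w_bind*L_bind."""
    reg = sum(b * b for b in beta) + sum(p * p for p in psi)
    return (weights["lmk"] * L_lmk + weights["reg"] * reg
            + weights["phot"] * L_phot + weights["bind"] * L_bind)
```

The paper does not report the weight values, so any concrete choice of `weights` here would be an assumption.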

This composite objective ensures that geometry and appearance remain tightly coupled and penalizes geometric drift under self-occlusion and difficult viewpoints.

4. Training Procedure and Datasets

KaoLRM leverages the pre-trained OpenLRM image-to-triplane network (trained on Objaverse, ~100K objects), freezing all but the camera embedder. Training comprises two stages:

  • Stage I: ~50K iterations, optimizing $L_{\mathrm{lmk}} + L_{\mathrm{reg}}$ to fix orientation and scale.
  • Stage II: ~150K further iterations, introducing $L_{\mathrm{photometric}}$ and $L_{\mathrm{binding}}$.

Datasets include:

  • Controlled multi-view head scans (FaceScape, Multiface, FaceVerse, Headspace): ~32 random views per asset within the frontal hemisphere.
  • In-the-wild faces (FFHQ, CelebA): single view per subject, field of view $\approx 14.25^\circ$.

Training is performed at input resolution $224 \times 224$ and render size $192 \times 192$, with batch size $16 \times 4$ (subjects $\times$ views) per GPU on 4 NVIDIA A100 GPUs (~4 days multi-view, ~6 days in-the-wild).
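The reported setup can be collected into a single configuration sketch (a hypothetical structure; the paper does not publish its config format):

```python
# Hypothetical training configuration mirroring the reported setup.
TRAIN_CONFIG = {
    "input_resolution": (224, 224),
    "render_size": (192, 192),
    "batch_subjects": 16,     # subjects per GPU
    "views_per_subject": 4,   # views per subject
    "num_gpus": 4,            # NVIDIA A100
    "stage1_iters": 50_000,   # L_lmk + L_reg only
    "stage2_iters": 150_000,  # adds L_photometric and L_binding
}
```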

5. Quantitative and Qualitative Results

KaoLRM was evaluated on the FaceVerse dataset (using the NoW alignment protocol) and the NoW benchmark under multiple conditions (neutral, expressions, occlusions, selfies). The results demonstrate notably improved geometry and cross-view stability compared to DECA, EMOCA, and SMIRK regressors:

| Metric | DECA | EMOCA | SMIRK | KaoLRM |
|---|---|---|---|---|
| FaceVerse Chamfer (×10⁻²) | 3.17 | 3.15 | 3.20 | 2.68 |
| Var($\beta$) (shape) | 2.02 | 2.01 | 8.48 | 1.54 |
| Var($\psi$) (expression) | 2.48 | 4.67 | 47.0 | 1.10 |
| NoW Chamfer (mm) | 1.24 | 1.21 | 1.02 | 0.99 |

Qualitatively, KaoLRM produces meshes that remain stable under $\pm 45^\circ$ head rotation, with lower geometric drift and more reliable expression recovery. The binding of parametric mesh and appearance avoids the artifacts and instability seen in single-view-trained 3DMM regressors (Zhu et al., 19 Jan 2026).

6. Significance and Implications

By integrating a 3D-aware, frozen LRM feature extractor with a parametric regression and 2D Gaussian Splatting rendering pipeline, KaoLRM achieves cross-view consistency, robustness to self-occlusion, and interpretable/reusable mesh representations. This architecture demonstrates the effectiveness of leveraging large-scale, multi-view pretraining for 3D priors in downstream tasks traditionally hampered by data scarcity and viewpoint generalization. A plausible implication is the applicability of similar feature-retargeting strategies for other structured mesh regression tasks that demand both robustness and editability. The release of code and models supports reproducibility and downstream research applications.
