WildCap: Hybrid Inverse Rendering

Updated 19 December 2025
  • WildCap is a hybrid inverse rendering method that captures photorealistic facial reflectance from unconstrained smartphone video by combining data-driven 'delighting' with physics-based optimization.
  • It employs a texel-grid spherical harmonics lighting model to adaptively correct baked shadow artifacts and deliver artifact-free facial textures.
  • Joint optimization with a patch-level diffusion prior enforces physical consistency, achieving near studio-quality results even under complex, ambient lighting.

WildCap is a hybrid inverse rendering methodology designed for high-quality facial appearance capture from unconstrained smartphone video, narrowing the quality gap with Light-Stage studio solutions while operating in “the wild” under arbitrary illumination (Han et al., 12 Dec 2025). It integrates a data-driven “delighting” process with physics-motivated model-based optimization, introducing novel approaches to disentangle facial reflectance from confounding lighting and baked shadow artifacts. The resulting pipeline achieves photorealistic recovery of facial reflectance maps, supporting applications requiring relightable, artifact-free face reconstruction without controlled capture environments.

1. Hybrid Inverse Rendering Framework

WildCap’s workflow begins with a short (approximately 30 s) handheld smartphone video under ambient illumination. The preprocessing pipeline computes camera poses using COLMAP, acquires a high-quality mesh reconstruction via 2DGS, and performs template mesh registration with Wrap3D. A subset of $V \approx 16$ well-focused frames $I_{\text{raw}}^i$ is selected for processing.
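The source does not say how "well-focused" frames are chosen. As a minimal sketch, assuming a variance-of-Laplacian sharpness score and one pick per temporal window (both are illustrative choices, and the function names are hypothetical), the selection could look like this:

```python
import numpy as np

def laplacian_variance(gray: np.ndarray) -> float:
    """Sharpness score: variance of a 4-neighbor Laplacian (higher = sharper)."""
    g = gray.astype(np.float64)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return float(lap.var())

def select_sharp_frames(frames, num_keep=16):
    """Split the video into num_keep temporal windows and keep the sharpest
    frame of each, so the selection stays spread across viewpoints.
    (Assumed heuristic, not the authors' documented procedure.)"""
    windows = np.array_split(np.arange(len(frames)), num_keep)
    keep = []
    for w in windows:
        if len(w) == 0:          # fewer frames than windows
            continue
        scores = [laplacian_variance(frames[i]) for i in w]
        keep.append(int(w[int(np.argmax(scores))]))
    return keep
```

Picking per window rather than taking the globally sharpest frames avoids clustering all selected views around one head pose.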

The first stage applies SwitchLight, a pretrained data-driven network, to convert each frame to a diffuse-albedo image $I^i$ simulating uniform white lighting. This step removes most specular highlights and converts input frames into a constrained pseudo-studio regime, but preserves non-physical baked shadow artifacts.

In the second stage, these SwitchLight outputs are projected into a common UV texture atlas $I_{UV} \in \mathbb{R}^{H \times W \times 3}$ using registered geometry. A physically plausible albedo map $A \in \mathbb{R}^{H \times W \times 3}$ and a local lighting model $\Gamma_\theta$ are then jointly optimized so that rendering $A$ under $\Gamma_\theta$ matches $I_{UV}$. The key innovations are:

  • A per-texel grid-based spherical harmonics (SH) lighting field, enabling spatially adaptive correction and removal of baked shadow artifacts.
  • A patch-level diffusion prior over reflectance maps (diffuse albedo, detailed normals $N_d$, and specular albedo), enforcing output consistency with physical facial reflectance.

Final outputs comprise 1K diffuse albedo, normal, and specular maps, a texel-grid SH lighting field for relighting, and optional 4K super-resolved outputs via RCAN.

2. Data-Driven “Delighting” with SwitchLight

The SwitchLight “delighting” network, as described by Kim et al. (2024), uses an encoder–decoder backbone with physics-guided components to model low-frequency diffuse and higher-frequency specular removal. It ingests a single portrait image (e.g., a selected frame $I_{\text{raw}}^i$ in sRGB) under unknown, potentially complex illumination, and outputs a 3-channel diffuse-albedo image $I^i$ as though the face were illuminated by uniform white light.

SwitchLight’s output preserves geometry-driven shading while removing specularities but retains shadow-baking artifacts the network cannot disentangle fully. This step is crucial, as it transforms uncontrolled illumination into a domain amenable to model-based inverse rendering, substantially reducing the optimization burden and isolating shadow artifacts as the principal remaining confounder.

3. Texel-Grid Lighting Model for Artifact Correction

WildCap introduces a texel-grid lighting model to address baked shadows that global SH or environment maps cannot remove. A binary shadow mask in UV space marks the locations of baked shadows. Lighting is parameterized by a global SH coefficient vector (second order, which has nine basis functions) and a local grid of SH coefficients laid over the UV atlas at a fixed grid step.

For each UV coordinate, the local SH lighting is obtained by interpolating the coefficients stored at the surrounding grid nodes.
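The interpolation step can be sketched as follows. This is a minimal illustration that assumes bilinear weights and a regular grid with at least two nodes per axis; the source does not state the interpolation scheme, and `interpolate_sh_grid` is a hypothetical name.

```python
import numpy as np

def interpolate_sh_grid(grid: np.ndarray, uv: np.ndarray) -> np.ndarray:
    """Bilinearly interpolate SH coefficients stored on a regular grid.

    grid: (Gh, Gw, K) coefficients at grid nodes (K = 9 for second-order SH).
    uv:   (N, 2) coordinates in [0, 1]^2, ordered (u, v) = (column, row).
    Returns (N, K) interpolated coefficients.
    """
    gh, gw, _ = grid.shape
    x = np.clip(uv[:, 0], 0.0, 1.0) * (gw - 1)
    y = np.clip(uv[:, 1], 0.0, 1.0) * (gh - 1)
    # Index of the top-left node of the enclosing cell (clamped at the border).
    x0 = np.minimum(np.floor(x).astype(int), gw - 2)
    y0 = np.minimum(np.floor(y).astype(int), gh - 2)
    fx = (x - x0)[:, None]
    fy = (y - y0)[:, None]
    top = (1 - fx) * grid[y0, x0] + fx * grid[y0, x0 + 1]
    bottom = (1 - fx) * grid[y0 + 1, x0] + fx * grid[y0 + 1, x0 + 1]
    return (1 - fy) * top + fy * bottom
```

Because the grid is much coarser than the atlas, each texel's lighting varies smoothly between nodes, which keeps the number of free lighting parameters small.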

Each texel is rendered with Lambertian SH shading: its color is the albedo $A$ multiplied by a sum over SH terms, where each term is the product of an interpolated lighting coefficient, the corresponding SH integral of the Lambertian BRDF, and the SH basis function evaluated at the coarse normal.
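A minimal sketch of second-order Lambertian SH shading is given here, using the standard real SH basis constants and the clamped-cosine band weights $\pi$, $2\pi/3$, $\pi/4$. The array shapes, the per-texel RGB coefficient layout, and the $1/\pi$ albedo normalization are assumptions of this sketch, not details confirmed by the source.

```python
import numpy as np

# Clamped-cosine (Lambertian) convolution weights for SH bands l = 0, 1, 2,
# repeated for the 1 + 3 + 5 basis functions of each band.
LAMBERT_BAND_WEIGHTS = np.repeat([np.pi, 2 * np.pi / 3, np.pi / 4], [1, 3, 5])

def sh_basis(normals: np.ndarray) -> np.ndarray:
    """Real SH basis up to second order at unit normals: (N, 3) -> (N, 9)."""
    x, y, z = normals[:, 0], normals[:, 1], normals[:, 2]
    return np.stack([
        0.282095 * np.ones_like(x),
        0.488603 * y, 0.488603 * z, 0.488603 * x,
        1.092548 * x * y, 1.092548 * y * z,
        0.315392 * (3 * z**2 - 1),
        1.092548 * x * z, 0.546274 * (x**2 - y**2),
    ], axis=1)

def shade_lambertian(albedo, normals, sh_coeffs):
    """albedo (N, 3), normals (N, 3), sh_coeffs (N, 9, 3) -> shaded color (N, 3)."""
    basis = sh_basis(normals) * LAMBERT_BAND_WEIGHTS         # (N, 9)
    irradiance = np.einsum("nk,nkc->nc", basis, sh_coeffs)   # (N, 3)
    return albedo / np.pi * irradiance                       # assumed 1/pi convention
```

With this convention, a uniform white environment of unit radiance reproduces the albedo exactly, which is the "uniform white lighting" regime the delit texture is meant to simulate.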

In regions flagged by the shadow mask, the grid allows for local “dark SH lights” that cancel baked shadows. Elsewhere, only the smooth global SH is used, preserving physically plausible low-frequency shading.
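A small numeric illustration of this cancellation follows. It assumes, hypothetically, that the local contribution adds to the global shading inside the mask and that shading is a single scalar per texel; all values are made up for the example.

```python
import numpy as np

# Delit texture: the last two texels carry a baked shadow (illustrative values).
observed = np.array([0.60, 0.60, 0.30, 0.30])
mask = np.array([0.0, 0.0, 1.0, 1.0])   # binary shadow mask
global_shading = 1.0                    # smooth global SH shading
local_shading = -0.5                    # negative ("dark") local contribution

# With global lighting only, the shadow is absorbed into the albedo.
albedo_global_only = observed / global_shading

# With the masked local term, shading drops inside the mask and the albedo stays flat.
shading = global_shading + mask * local_shading
albedo_with_grid = observed / shading
```

Here `albedo_global_only` keeps the dark patch, while `albedo_with_grid` is uniform because the dark local light explains the shadow instead of the albedo.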

4. Joint Optimization and Diffusion Prior Integration

The core optimization employs a photometric objective that penalizes the difference between the texture rendered from $A$ under $\Gamma_\theta$ and the SwitchLight UV texture $I_{UV}$.

Regularization of the lighting comprises two terms:

  • A total variation term on the texel-grid lighting, which keeps neighboring grid nodes similar.
  • A negativity loss on the shadow SH lighting, consistent with the “dark” local lights used to cancel baked shadows.

The two terms are combined into a single lighting regularizer.
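The two regularizers can be sketched as below. The exact forms are not recoverable from the text, so this sketch assumes an anisotropic L1 total variation over the coefficient grid and a hinge-style negativity penalty on the local shading; the weights `w_tv` and `w_neg` are hypothetical.

```python
import numpy as np

def total_variation(grid: np.ndarray) -> float:
    """Anisotropic L1 total variation of a (Gh, Gw, K) coefficient grid."""
    dv = np.abs(grid[1:, :, :] - grid[:-1, :, :]).mean()   # vertical neighbors
    dh = np.abs(grid[:, 1:, :] - grid[:, :-1, :]).mean()   # horizontal neighbors
    return float(dv + dh)

def negativity_loss(local_shading: np.ndarray) -> float:
    """Penalize positive local shading so local lights can only darken (assumed form)."""
    return float(np.maximum(local_shading, 0.0).mean())

def lighting_regularizer(grid, local_shading, w_tv=1.0, w_neg=1.0) -> float:
    """Weighted sum of both terms; the weights are placeholders, not source values."""
    return w_tv * total_variation(grid) + w_neg * negativity_loss(local_shading)
```

The total variation term discourages the grid from fitting texture detail, while the negativity term stops local lights from brightening regions, which would let them explain albedo variation instead of shadows.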

A patch-level diffusion prior, pretrained on 48 Light-Stage scans, models the joint distribution of the reflectance maps (diffuse albedo, detailed normals $N_d$, and specular albedo) over texture patches. During optimization, reflectance maps are sampled via diffusion posterior sampling, in which photometric gradients steer the prior. Scale ambiguity is resolved by initializing the albedo $A$ from a Light-Stage reference scan matched to skin tone.
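The scale ambiguity mentioned here is easy to verify numerically: multiplying the albedo by a constant and dividing the shading by the same constant leaves the rendered texture unchanged. The values below are illustrative only.

```python
import numpy as np

albedo = np.array([0.2, 0.4, 0.8])
shading = np.array([1.0, 0.9, 0.5])
c = 2.0  # arbitrary global scale factor

render_a = albedo * shading
render_b = (c * albedo) * (shading / c)   # brighter albedo, dimmer light

same_image = bool(np.allclose(render_a, render_b))
```

Since the photometric objective cannot distinguish the two solutions, some external anchor for the overall albedo level is needed, which is the role the skin-tone-matched reference scan plays.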

Each sampling iteration combines three updates:

  • Reverse diffusion: the pretrained prior takes one denoising step on the current noisy reflectance sample.
  • Clean estimate: a noise-free estimate of the reflectance maps is predicted from the current sample.
  • Posterior sampling: gradient steps derived from the photometric objective, with sizes set by step-size schedules, steer the sample toward the observed texture.
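To make the three updates concrete, here is a toy diffusion posterior sampling loop. It is a sketch under strong simplifying assumptions, not WildCap's sampler: the learned patch prior is replaced by an analytic standard-normal prior (for which the optimal noise prediction is known in closed form), the measurement is the identity $y = x_0$, and the noise schedule, `zeta`, and `toy_dps` are hypothetical.

```python
import numpy as np

def toy_dps(y, num_steps=50, zeta=0.5, seed=0):
    """Diffusion posterior sampling with an analytic N(0, I) prior and y = x0."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-3, 0.2, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(y.shape)                 # start from pure noise
    for t in range(num_steps - 1, -1, -1):
        a, ab, b = alphas[t], alpha_bars[t], betas[t]
        # Noise prediction: exact for an N(0, I) prior; stands in for the network.
        eps_hat = np.sqrt(1.0 - ab) * x
        # Clean estimate of x0 from the current noisy sample.
        x0_hat = (x - np.sqrt(1.0 - ab) * eps_hat) / np.sqrt(ab)
        # Reverse diffusion: ancestral DDPM step, no noise on the final step.
        noise = rng.standard_normal(y.shape) if t > 0 else 0.0
        x_prev = (x - b / np.sqrt(1.0 - ab) * eps_hat) / np.sqrt(a) + np.sqrt(b) * noise
        # Posterior sampling: gradient of ||y - x0_hat||^2 with respect to x,
        # using d(x0_hat)/dx = sqrt(ab), which holds for this analytic prior.
        grad = -2.0 * np.sqrt(ab) * (y - x0_hat)
        x = x_prev - zeta * grad
    return x
```

In the real method the measurement term would be the photometric loss between the rendered and observed UV texture, and the gradient would flow through the renderer and the denoising network rather than a closed-form expression.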

5. Implementation and Practical Considerations

The lighting grid is laid over the UV atlas at a fixed grid step, which determines the grid's size. Diffusion posterior sampling runs for a fixed number of steps, and the gradient updates start from an initial learning rate that decays exponentially. Photometric fitting to the texture uses LPIPS plus a gradient loss. Maps are upsampled from 1K to 4K using RCAN in approximately eight minutes on NVIDIA RTX 4090 hardware.

Optimization converges reliably within a fixed step budget, and results are empirically stable under modest variations in the schedule parameters.

6. Evaluation Against Baselines and Prior Work

Ablation studies validate each component:

  • Omission of SwitchLight (“w/o hybrid”) degrades performance under complex illumination.
  • Removal of texel-grid lighting (“w/o TGL”) leaves persistent shadow artifacts.
  • Excluding the diffusion prior (“w/o prior”) yields physically implausible texture.

The chosen grid step gives the best balance between expressivity and resistance to overfitting. Comparisons with in-the-wild baselines (DeFace [Huang et al.], FLARE [Bharadwaj et al.], and variants feeding SwitchLight outputs) show notably improved artifact removal and fidelity for WildCap.

Quantitative reconstruction metrics (averaged over six subjects, PSNR/SSIM/LPIPS):

Method    PSNR    SSIM     LPIPS
DeFace*   22.20   0.9279   0.1192
FLARE*    27.81   0.9411   0.0929
WildCap   28.79   0.9520   0.0610
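For reference, PSNR (the first metric in the table) follows the standard definition $10 \log_{10}(\text{MAX}^2 / \text{MSE})$. The sketch below assumes images scaled to $[0, 1]$; the evaluation protocol actually used (masking, color space) is not specified in the text.

```python
import numpy as np

def psnr(pred, target, max_val=1.0) -> float:
    """Peak signal-to-noise ratio in dB; higher means a closer reconstruction."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

Because the scale is logarithmic, the roughly 1 dB gap between FLARE* and WildCap corresponds to about a 20% reduction in mean squared error.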

WildCap nearly matches DoRA [Han et al. 2025] under Light-Stage conditions, closing the gap between uncontrolled and studio capture quality.

Qualitative results highlight clean albedo free of baked shadows, retention of fine skin details, and photorealistic relighting under novel environments.

7. Limitations and Future Directions

WildCap currently depends on a closed-source SwitchLight API for the initial “delighting”. Automatic estimation of the shadow mask via DiFaReli++ is also slow and imprecise, and manual annotation gives the best results. Residual artifacts may persist at sharp shadow boundaries because the SH basis is smooth.

Future work will address these limitations by developing end-to-end portrait-delighting networks with baked-shadow uncertainty, curating large open Light-Stage databases using WildCap for studio capture processing (as in NeRSemble), and exploring higher-order or non-SH local basis expansions for artifact removal.

A plausible implication is that further development of open-source “delighting” and segmentation algorithms will improve WildCap’s accessibility and robustness for uncontrolled facial video acquisition.
