
LiNO-UniPS: Unified Photometric Stereo

Updated 28 January 2026
  • The paper introduces LiNO-UniPS, a deep learning framework that unifies lighting-invariant feature encoding with wavelet-based detail recovery for robust surface normal estimation.
  • It employs novel light-register tokens and interleaved multi-image attention to decouple illumination effects from intricate geometric details in multi-illumination images.
  • The approach integrates cascaded attention blocks with wavelet-aware upsampling, achieving state-of-the-art performance on standard universal photometric stereo benchmarks.

LiNO-UniPS denotes "Light of Normals: Unified Feature Representation for Universal Photometric Stereo," a deep learning framework targeting robust surface normal estimation from multi-illumination image sets without lighting, reflectance, or segmentation assumptions. It advances the universal photometric stereo (UPS) paradigm by unifying lighting-invariant feature encoding, detail-rich geometric recovery, and multi-domain learning in a single model architecture, validated on diverse benchmarks and synthetic datasets (Li et al., 23 Jun 2025).

1. Universal Photometric Stereo: Formulation and Challenges

Universal photometric stereo seeks to estimate the per-pixel normal map $N \in \mathbb{R}^{H\times W\times 3}$ of an object given a fixed-view collection $\{I_f \in \mathbb{R}^{H\times W\times 3}\}_{f=1}^F$ of images under unknown and potentially complex lighting, without presuming known light directions or simple reflectance. The underlying image-formation process is modeled as

$$I_f(p) = \mathcal{F}\big(n(p),\, L_f,\, m(p)\big)$$

where $p=(x,y)$ is the image pixel, $n(p)$ is the surface normal, $L_f$ aggregates lighting parameters for the $f$-th image (environment, point, area), and $m(p)$ describes material/BRDF. Unlike classical calibrated or Lambertian PS, both $L_f$ and $m(\cdot)$ are unknown and may include non-Lambertian and spatially varying characteristics.

The practical goal is to learn a feed-forward mapping $\{I_f\}_{f=1}^F \mapsto N$ that is robust across lighting and reflectance ambiguity. The main technical obstacles are: (1) disentangling illumination from surface-normal features where intensity variation is ambiguous; and (2) preserving high-frequency detail in surface geometry when image cues are intricate or degraded by shadow, interreflection, and non-ideal lighting (Li et al., 23 Jun 2025).
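In the classical calibrated Lambertian case, the general formation operator $\mathcal{F}$ reduces to a clipped dot product, which is why several lights suffice to constrain the normal. A minimal NumPy sketch of this special case (an illustration, not part of the paper):

```python
import numpy as np

# Lambertian special case of I_f(p) = F(n(p), L_f, m(p)):
# intensity = albedo * max(0, <n, l_f>) for a directional light l_f.
def render_lambertian(n, lights, albedo=1.0):
    """n: (3,) unit normal; lights: (F, 3) unit light directions."""
    return albedo * np.clip(lights @ n, 0.0, None)  # shape (F,)

# With >= 3 non-coplanar lights and no shadowing, the normal is
# recoverable by least squares: pinv(L) @ I gives albedo * n.
lights = np.array([[0.0, 0.0, 1.0],
                   [0.7, 0.0, 0.714],
                   [0.0, 0.7, 0.714]])
lights /= np.linalg.norm(lights, axis=1, keepdims=True)
n_true = np.array([0.3, 0.1, 1.0]); n_true /= np.linalg.norm(n_true)

I = render_lambertian(n_true, lights, albedo=0.8)
g = np.linalg.pinv(lights) @ I          # g = albedo * n
n_est = g / np.linalg.norm(g)
print(np.allclose(n_est, n_true, atol=1e-6))  # True
```

Universal photometric stereo must cope with exactly the settings where this linear inversion breaks down: unknown $L_f$, non-Lambertian $m(\cdot)$, shadows, and interreflection.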

2. Lighting-Invariant, Detail-Rich Feature Encoding

LiNO-UniPS introduces a multi-component encoder achieving both lighting invariance and fine geometric fidelity through three primary innovations:

  • Light-Register Tokens: Three learnable tokens are prepended to every per-image token stream, acting as attention gatherers for environment (HDRI), point, and area lighting cues across all frames. Self-attention aggregates global illumination context through these tokens.
  • Interleaved Multi-Image Attention: After ViT-style patch embedding, tokens are processed through four cascaded attention blocks (Frame, LightAxis, Global, LightAxis), alternating between intra-image, inter-image, and global contextualization. This enables information flow that disentangles lighting effects from geometry and fosters global spatial awareness.
  • Wavelet-Aware Down/Up Sampling: To maintain high-frequency geometry, each input image is decomposed into a bilinearly downsampled low-pass stream and four discrete Haar-type wavelet bands, which are separately tokenized and jointly encoded. A WaveUpSampler later performs inverse DWT followed by summation and smoothing to reconstruct the full-detail encoder output.

The fused encoder representation is explicitly designed to be invariant to lighting changes and rich in normal-sensitive, high-spatial-frequency content.
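The wavelet-aware sampling above builds on the standard single-level 2-D Haar DWT, which splits an image into one coarse band plus three detail bands and is exactly invertible. A minimal NumPy sketch (an illustration of the transform, not the paper's implementation):

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2-D Haar DWT: returns (LL, LH, HL, HH), each half-size."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]   # even rows: even/odd columns
    c, d = x[1::2, 0::2], x[1::2, 1::2]   # odd rows: even/odd columns
    ll = (a + b + c + d) / 2.0            # low-pass (coarse) band
    lh = (a - b + c - d) / 2.0            # horizontal detail
    hl = (a + b - c - d) / 2.0            # vertical detail
    hh = (a - b - c + d) / 2.0            # diagonal detail
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse transform: exact reconstruction of the original image."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))
bands = haar_dwt2(img)
print(np.allclose(haar_idwt2(*bands), img))  # True: the DWT is invertible
```

Because the transform is lossless, encoding the detail bands alongside the low-pass stream lets the network downsample aggressively without discarding the high-frequency cues that carry fine surface geometry.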

3. Network Architecture and Processing Pipeline

LiNO-UniPS adopts an encoder–decoder paradigm, integrating specialized stages for illumination-context integration and gradient-preserving detail recovery:

  • Input Preprocessing: The input images are decomposed into low-pass and high-frequency wavelet bands, with each stream receiving three light-register tokens.
  • Patch Embedding & Transformer Backbone: Each stream is patch-embedded, producing tokens processed by a DINOv2-initialized ViT backbone.
  • Cascaded Contextual Attention: Four interleaved attention blocks reinforce lighting–normal decoupling and spatial detail aggregation.
  • Multi-Scale Feature Fusion: A DPT-based fusion module forms a four-level spatial pyramid, fused via residual convolution blocks to ensure feature consistency and spatial resolution.
  • Wavelet-Based Detail Synthesis: The WaveUpSampler upsamples and reconstructs spatial features, using inverse DWT and smoothing to produce the final full-resolution feature map.
  • Pixel-Sampling Decoder: Pixel locations are randomly sampled per scene. For each sampled pixel, the corresponding encoder features and a high-dimensional projection of the raw multi-image observations form the input to pooling-by-multihead-attention and frame/light-axis self-attention, followed by a 2-layer MLP that predicts the normal $\hat{n}(p)$. Full maps are assembled by interpolation.

This architecture allows selective attention to both pixel-level and scene-global information, yielding robust reconstruction in complex lighting and material regimes.
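The alternation between Frame and LightAxis attention described above amounts to attending over different axes of a token tensor of shape (F, N, C): within each image, then across the F images at each spatial position. A toy NumPy sketch of this reshaping pattern (an illustrative stand-in, not the paper's code; the `self_attention` helper is a simplified single-head version):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Toy single-head self-attention over the middle (sequence) axis.
    x: (batch, seq, dim); queries = keys = values = x for illustration."""
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    return softmax(scores, axis=-1) @ x

def frame_attention(tokens):
    """Attend among the N tokens *within* each of the F images."""
    return self_attention(tokens)              # (F, N, C): F is the batch

def light_axis_attention(tokens):
    """Attend *across* the F images at each spatial token position."""
    x = tokens.transpose(1, 0, 2)              # (N, F, C): N is the batch
    return self_attention(x).transpose(1, 0, 2)

F, N, C = 6, 16, 32                            # 6 images, 16 tokens each
tokens = np.random.default_rng(0).standard_normal((F, N, C))
tokens = frame_attention(tokens)               # intra-image context
tokens = light_axis_attention(tokens)          # cross-image (lighting) context
print(tokens.shape)  # (6, 16, 32)
```

Interleaving the two axes is what lets the network compare how a single surface point appears under every light while still reasoning about spatial layout within each image.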

4. Supervision: Loss Functions and Training

The training objective combines several loss components targeting both lighting disentanglement and accurate normal regression:

  • Lighting Alignment Losses: For synthetic data with ground-truth lighting, each lighting parameter (HDRI, point, area) is mapped to a fixed-dimensional embedding via an MLP and encouraged to align with the corresponding learned light token via cosine similarity, i.e. losses of the form

$$\mathcal{L}_{\text{align}} = \sum_{k \in \{\text{env},\, \text{point},\, \text{area}\}} \Big(1 - \cos\big(t_k,\ \mathrm{MLP}_k(L_k)\big)\Big),$$

where $t_k$ denotes the light-register token for lighting type $k$.

  • Normal Regression and Gradient Loss: Predicted normals $\hat{n}(p)$ and their spatial gradients $\nabla \hat{n}(p)$ are compared to ground truth via a confidence-weighted quadratic loss and an explicit gradient loss:

$$\mathcal{L}_{\text{normal}} = \frac{1}{|P|} \sum_{p \in P} w(p)\, \big\lVert \hat{n}(p) - n(p) \big\rVert^2 + \lambda_{\nabla}\, \big\lVert \nabla \hat{n}(p) - \nabla n(p) \big\rVert^2,$$

where $w(p)$ is a per-pixel confidence weight and $P$ is the set of sampled pixels.

  • Total Loss: The overall objective sums the normal-regression, gradient, and lighting-alignment terms,

$$\mathcal{L} = \mathcal{L}_{\text{normal}} + \lambda_{\text{align}}\, \mathcal{L}_{\text{align}},$$

with the auxiliary weights adaptively set so that the alignment terms remain a small fraction of the main loss.
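A minimal sketch of the normal-regression supervision described above, combining a confidence-weighted L2 term with a finite-difference gradient term (an assumed-form illustration with a hypothetical `lam_grad` weight, not the paper's exact coefficients):

```python
import numpy as np

def normal_losses(pred, gt, conf, lam_grad=1.0):
    """pred, gt: (H, W, 3) normal maps; conf: (H, W) per-pixel weights.
    Returns a confidence-weighted L2 loss plus a gradient-matching term."""
    l2 = (conf[..., None] * (pred - gt) ** 2).mean()
    # Finite-difference spatial gradients supervise crisp geometric detail.
    gx = np.diff(pred, axis=1) - np.diff(gt, axis=1)
    gy = np.diff(pred, axis=0) - np.diff(gt, axis=0)
    grad = (gx ** 2).mean() + (gy ** 2).mean()
    return l2 + lam_grad * grad

rng = np.random.default_rng(0)
gt = rng.standard_normal((32, 32, 3))
gt /= np.linalg.norm(gt, axis=-1, keepdims=True)   # unit normals
conf = np.ones((32, 32))
loss = normal_losses(gt, gt, conf)
print(loss)  # 0.0 for a perfect prediction
```

The gradient term is what penalizes over-smoothed predictions: two normal maps with the same average error but different sharpness receive different losses.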

Training uses the PS-Verse synthetic set (100K rendered scenes over five complexity/material/illumination levels), AdamW optimization, and progressive fine-tuning from simple to texture-rich scenes.

5. Benchmark Performance and Empirical Analysis

LiNO-UniPS achieves state-of-the-art normal estimation across standard universal PS benchmarks:

Ablation analyses on PS-Verse show each model component (light tokens, global attention, loss terms, wavelets, normal-gradient loss) cumulatively reduces mean angular error (MAE), with steady improvements in encoder feature SSIM/CSIM.
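Mean angular error, the metric used throughout these comparisons, is the per-pixel angle between predicted and ground-truth unit normals, averaged over the map. A standard computation (not tied to any particular benchmark's evaluation script):

```python
import numpy as np

def mean_angular_error(pred, gt):
    """pred, gt: (H, W, 3) unit normal maps; returns MAE in degrees."""
    cos = np.clip((pred * gt).sum(axis=-1), -1.0, 1.0)
    return np.degrees(np.arccos(cos)).mean()

# A map uniformly tilted by 5 degrees has an MAE of exactly 5 degrees.
gt = np.zeros((4, 4, 3)); gt[..., 2] = 1.0          # all normals = +z
theta = np.radians(5.0)
pred = gt.copy()
pred[..., 1], pred[..., 2] = -np.sin(theta), np.cos(theta)
print(round(mean_angular_error(pred, gt), 6))  # 5.0
```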

Quantitative results:

  • DiLiGenT: LiNO-UniPS attains a lower mean angular error than Uni MS-PS, SDM-UniPS, and UniPS, achieving state of the art on most objects.
  • LUCES: lower MAE than Uni MS-PS.
  • DiLiGenT10²: error metrics uniformly lower than prior art.

Feature–accuracy correlation: Encoder feature CSIM/SSIM tightly predicts per-pixel normal accuracy, confirming the efficacy of the lighting-invariant, detail-rich representation.

High-resolution, real-world validation: 2K–4K, mask-free, complex-object scenes demonstrate superior fidelity to commercial 3D scanner outputs.

6. Strengths, Limitations, and Future Development

Strengths:

  • Decoupling of lighting and geometric features using light tokens and global attention.
  • Preservation of high-frequency detail by wavelet-based sampling and gradient-aware supervision.
  • Strong cross-domain generalization (materials, lighting, sparse input counts; robust results even with very few input lights).
  • CSIM/SSIM-based consistency of encoder features closely tracks output accuracy.

Limitations and directions for extension:

  • Global attention stages pose high computational cost; sparser mechanisms may lower resource use.
  • Occasional normal-flip ambiguity on near-planar regions in absence of explicit lighting cues.
  • Current approach is single-view only; extension to multi-view could exploit geometric consistency.
  • No explicit BRDF/material recovery; joint estimation with normals could enhance scene understanding.

7. Summary and Context

LiNO-UniPS represents a unification of illumination-invariant feature learning, detail-sensitive geometry representation, and scaled synthetic-supervised training for universal photometric stereo. The approach leverages light-register tokenization, interleaved attention, and wavelet-aware processing to overcome prevailing ambiguities in surface normal estimation under arbitrary lighting, reliably outperforming prior models on a broad spectrum of synthetic and real benchmarks (Li et al., 23 Jun 2025). The architecture and training strategies serve as a reference point for future research in photometric geometric learning and domain-agnostic scene reconstruction.
