
LiNO-UniPS: Unified Photometric Stereo

Updated 28 January 2026
  • The paper introduces LiNO-UniPS, a deep learning framework that unifies lighting-invariant feature encoding with wavelet-based detail recovery for robust surface normal estimation.
  • It employs novel light-register tokens and interleaved multi-image attention to decouple illumination effects from intricate geometric details in multi-illumination images.
  • The approach integrates cascaded attention blocks with wavelet-aware upsampling, achieving state-of-the-art performance on standard universal photometric stereo benchmarks.

LiNO-UniPS denotes "Light of Normals: Unified Feature Representation for Universal Photometric Stereo," a deep learning framework targeting robust surface normal estimation from multi-illumination image sets without lighting, reflectance, or segmentation assumptions. It advances the universal photometric stereo (UPS) paradigm by unifying lighting-invariant feature encoding, detail-rich geometric recovery, and multi-domain learning in a single model architecture, validated on diverse benchmarks and synthetic datasets (Li et al., 23 Jun 2025).

1. Universal Photometric Stereo: Formulation and Challenges

Universal photometric stereo seeks to estimate the per-pixel normal map $N \in \mathbb{R}^{H\times W\times 3}$ of an object given a fixed-view collection $\{I_f \in \mathbb{R}^{H\times W\times 3}\}_{f=1}^F$ of images under unknown and potentially complex lighting, without presuming known light directions or simple reflectance. The underlying image-formation process is modeled as

$$I_f(p) = \mathcal{F}\big(n(p),\, L_f,\, m(p)\big)$$

where $p=(x,y)$ is the image pixel, $n(p)$ is the surface normal, $L_f$ aggregates lighting parameters for the $f$-th image (environment, point, area), and $m(p)$ describes material/BRDF. Unlike classical calibrated or Lambertian PS, both $L_f$ and $m(\cdot)$ are unknown and may include non-Lambertian and spatially varying characteristics.

The practical goal is to learn a feed-forward mapping $\{I_f\}_{f=1}^F \mapsto N$ that is robust across lighting and reflectance ambiguity. The main technical obstacles are: (1) disentangling illumination from surface-normal features where intensity variation is ambiguous; and (2) preserving high-frequency detail in surface geometry when image cues are intricate or degraded by shadow, interreflection, and non-ideal lighting (Li et al., 23 Jun 2025).
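In the classical calibrated Lambertian case, the general formation operator $\mathcal{F}$ reduces to a clipped dot product, which is why several lights suffice to constrain the normal. A minimal NumPy sketch of this special case (an illustration, not part of the paper):

```python
import numpy as np

# Lambertian special case of I_f(p) = F(n(p), L_f, m(p)):
# intensity = albedo * max(0, <n, l_f>) for a directional light l_f.
def render_lambertian(n, lights, albedo=1.0):
    """n: (3,) unit normal; lights: (F, 3) unit light directions."""
    return albedo * np.clip(lights @ n, 0.0, None)  # shape (F,)

# With >= 3 non-coplanar lights and no shadowing, the normal is
# recoverable by least squares: pinv(L) @ I gives albedo * n.
lights = np.array([[0.0, 0.0, 1.0],
                   [0.7, 0.0, 0.714],
                   [0.0, 0.7, 0.714]])
lights /= np.linalg.norm(lights, axis=1, keepdims=True)
n_true = np.array([0.3, 0.1, 1.0]); n_true /= np.linalg.norm(n_true)

I = render_lambertian(n_true, lights, albedo=0.8)
g = np.linalg.pinv(lights) @ I          # g = albedo * n
n_est = g / np.linalg.norm(g)
print(np.allclose(n_est, n_true, atol=1e-6))  # True
```

Universal photometric stereo must cope with exactly the settings where this linear inversion breaks down: unknown $L_f$, non-Lambertian $m(\cdot)$, shadows, and interreflection.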

2. Lighting-Invariant, Detail-Rich Feature Encoding

LiNO-UniPS introduces a multi-component encoder achieving both lighting invariance and fine geometric fidelity through three primary innovations:

  • Light-Register Tokens: Three learnable tokens are prepended to every per-image token stream, acting as attention gatherers for environment (HDRI), point, and area lighting cues across all frames. Self-attention aggregates global illumination context through these tokens.
  • Interleaved Multi-Image Attention: After ViT-style patch embedding, tokens are processed through four cascaded attention blocks (Frame, LightAxis, Global, LightAxis), alternating between intra-image, inter-image, and global contextualization. This enables information flow that disentangles lighting effects from geometry and fosters global spatial awareness.
  • Wavelet-Aware Down/Up Sampling: To maintain high-frequency geometry, each input image is decomposed into a bilinearly downsampled low-pass stream and four discrete Haar-type wavelet bands, which are separately tokenized and jointly encoded. A WaveUpSampler later performs inverse DWT followed by summation and smoothing to reconstruct the full-detail encoder output.

The fused encoder representation is explicitly designed to be invariant to lighting changes and rich in normal-sensitive, high-spatial-frequency content.
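The wavelet-aware sampling above builds on the standard single-level 2-D Haar DWT, which splits an image into one coarse band plus three detail bands and is exactly invertible. A minimal NumPy sketch (an illustration of the transform, not the paper's implementation):

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2-D Haar DWT: returns (LL, LH, HL, HH), each half-size."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]   # even rows: even/odd columns
    c, d = x[1::2, 0::2], x[1::2, 1::2]   # odd rows: even/odd columns
    ll = (a + b + c + d) / 2.0            # low-pass (coarse) band
    lh = (a - b + c - d) / 2.0            # horizontal detail
    hl = (a + b - c - d) / 2.0            # vertical detail
    hh = (a - b - c + d) / 2.0            # diagonal detail
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse transform: exact reconstruction of the original image."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))
bands = haar_dwt2(img)
print(np.allclose(haar_idwt2(*bands), img))  # True: the DWT is invertible
```

Because the transform is lossless, encoding the detail bands alongside the low-pass stream lets the network downsample aggressively without discarding the high-frequency cues that carry fine surface geometry.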

3. Network Architecture and Processing Pipeline

LiNO-UniPS adopts an encoder–decoder paradigm, integrating specialized stages for illumination-context integration and gradient-preserving detail recovery:

  • Input Preprocessing: The input images are decomposed into low-pass and high-frequency wavelet bands, with each stream receiving three light-register tokens.
  • Patch Embedding & Transformer Backbone: Each stream is patch-embedded, producing tokens processed by a DINOv2-initialized ViT backbone.
  • Cascaded Contextual Attention: Four interleaved attention blocks reinforce lighting–normal decoupling and spatial detail aggregation.
  • Multi-Scale Feature Fusion: A DPT-based fusion module forms a four-level spatial pyramid, fused via residual convolution blocks to ensure feature consistency and spatial resolution.
  • Wavelet-Based Detail Synthesis: The WaveUpSampler upsamples and reconstructs spatial features, using inverse DWT and smoothing to produce the final full-resolution feature map.
  • Pixel-Sampling Decoder: Pixel locations are randomly sampled per scene. For each sampled pixel, the corresponding encoder features and a high-dimensional projection of the raw multi-image observations form the input to pooling-by-multihead-attention and frame/light-axis self-attention, followed by a 2-layer MLP that predicts the normal $\hat{n}(p)$. Full maps are assembled by interpolation.

This architecture allows selective attention to both pixel-level and scene-global information, yielding robust reconstruction in complex lighting and material regimes.
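The alternation between Frame and LightAxis attention described above amounts to attending over different axes of a token tensor of shape (F, N, C): within each image, then across the F images at each spatial position. A toy NumPy sketch of this reshaping pattern (an illustrative stand-in, not the paper's code; the `self_attention` helper is a simplified single-head version):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Toy single-head self-attention over the middle (sequence) axis.
    x: (batch, seq, dim); queries = keys = values = x for illustration."""
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    return softmax(scores, axis=-1) @ x

def frame_attention(tokens):
    """Attend among the N tokens *within* each of the F images."""
    return self_attention(tokens)              # (F, N, C): F is the batch

def light_axis_attention(tokens):
    """Attend *across* the F images at each spatial token position."""
    x = tokens.transpose(1, 0, 2)              # (N, F, C): N is the batch
    return self_attention(x).transpose(1, 0, 2)

F, N, C = 6, 16, 32                            # 6 images, 16 tokens each
tokens = np.random.default_rng(0).standard_normal((F, N, C))
tokens = frame_attention(tokens)               # intra-image context
tokens = light_axis_attention(tokens)          # cross-image (lighting) context
print(tokens.shape)  # (6, 16, 32)
```

Interleaving the two axes is what lets the network compare how a single surface point appears under every light while still reasoning about spatial layout within each image.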

4. Supervision: Loss Functions and Training

The training objective combines several loss components targeting both lighting disentanglement and accurate normal regression:

  • Lighting Alignment Losses: For synthetic data with ground-truth lighting, each lighting parameter (HDRI, point, area) is mapped to a fixed-dimensional embedding via an MLP and encouraged to align with the corresponding learned light token via cosine similarity, i.e. losses of the form

$$\mathcal{L}_{\text{align}} = \sum_{k \in \{\text{env},\, \text{point},\, \text{area}\}} \Big(1 - \cos\big(t_k,\ \mathrm{MLP}_k(L_k)\big)\Big),$$

where $t_k$ denotes the light-register token for lighting type $k$.

  • Normal Regression and Gradient Loss: Predicted normals $\hat{n}(p)$ and their spatial gradients $\nabla \hat{n}(p)$ are compared to ground truth via a confidence-weighted quadratic loss and an explicit gradient loss:

$$\mathcal{L}_{\text{normal}} = \frac{1}{|P|} \sum_{p \in P} w(p)\, \big\lVert \hat{n}(p) - n(p) \big\rVert^2 + \lambda_{\nabla}\, \big\lVert \nabla \hat{n}(p) - \nabla n(p) \big\rVert^2,$$

where $w(p)$ is a per-pixel confidence weight and $P$ is the set of sampled pixels.

  • Total Loss: The overall objective sums the normal-regression, gradient, and lighting-alignment terms,

$$\mathcal{L} = \mathcal{L}_{\text{normal}} + \lambda_{\text{align}}\, \mathcal{L}_{\text{align}},$$

with the auxiliary weights adaptively set so that the alignment terms remain a small fraction of the main loss.
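A minimal sketch of the normal-regression supervision described above, combining a confidence-weighted L2 term with a finite-difference gradient term (an assumed-form illustration with a hypothetical `lam_grad` weight, not the paper's exact coefficients):

```python
import numpy as np

def normal_losses(pred, gt, conf, lam_grad=1.0):
    """pred, gt: (H, W, 3) normal maps; conf: (H, W) per-pixel weights.
    Returns a confidence-weighted L2 loss plus a gradient-matching term."""
    l2 = (conf[..., None] * (pred - gt) ** 2).mean()
    # Finite-difference spatial gradients supervise crisp geometric detail.
    gx = np.diff(pred, axis=1) - np.diff(gt, axis=1)
    gy = np.diff(pred, axis=0) - np.diff(gt, axis=0)
    grad = (gx ** 2).mean() + (gy ** 2).mean()
    return l2 + lam_grad * grad

rng = np.random.default_rng(0)
gt = rng.standard_normal((32, 32, 3))
gt /= np.linalg.norm(gt, axis=-1, keepdims=True)   # unit normals
conf = np.ones((32, 32))
loss = normal_losses(gt, gt, conf)
print(loss)  # 0.0 for a perfect prediction
```

The gradient term is what penalizes over-smoothed predictions: two normal maps with the same average error but different sharpness receive different losses.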

Training uses the PS-Verse synthetic set (100K rendered scenes over five complexity/material/illumination levels), AdamW optimization, and progressive fine-tuning from simple to texture-rich scenes.

5. Benchmark Performance and Empirical Analysis

LiNO-UniPS achieves state-of-the-art normal estimation across standard universal PS benchmarks:

Ablation analyses on PS-Verse show each model component (light tokens, global attention, loss terms, wavelets, normal-gradient loss) cumulatively reduces mean angular error (MAE), with steady improvements in encoder feature SSIM/CSIM.
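Mean angular error, the metric used throughout these comparisons, is the per-pixel angle between predicted and ground-truth unit normals, averaged over the map. A standard computation (not tied to any particular benchmark's evaluation script):

```python
import numpy as np

def mean_angular_error(pred, gt):
    """pred, gt: (H, W, 3) unit normal maps; returns MAE in degrees."""
    cos = np.clip((pred * gt).sum(axis=-1), -1.0, 1.0)
    return np.degrees(np.arccos(cos)).mean()

# A map uniformly tilted by 5 degrees has an MAE of exactly 5 degrees.
gt = np.zeros((4, 4, 3)); gt[..., 2] = 1.0          # all normals = +z
theta = np.radians(5.0)
pred = gt.copy()
pred[..., 1], pred[..., 2] = -np.sin(theta), np.cos(theta)
print(round(mean_angular_error(pred, gt), 6))  # 5.0
```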

Quantitative results:

  • DiLiGenT: LiNO-UniPS attains a lower mean angular error than Uni MS-PS, SDM-UniPS, and UniPS, achieving state of the art on most objects.
  • LUCES: lower MAE than Uni MS-PS.
  • DiLiGenT10²: error metrics uniformly lower than prior art.

Feature–accuracy correlation: Encoder feature CSIM/SSIM tightly predicts per-pixel normal accuracy, confirming the efficacy of the lighting-invariant, detail-rich representation.

High-resolution, real-world validation: 2K–4K, mask-free, complex-object scenes demonstrate superior fidelity to commercial 3D scanner outputs.

6. Strengths, Limitations, and Future Development

Strengths:

  • Decoupling of lighting and geometric features using light tokens and global attention.
  • Preservation of high-frequency detail by wavelet-based sampling and gradient-aware supervision.
  • Strong cross-domain generalization (materials, lighting, sparse input counts; robust results even with very few input lights).
  • CSIM/SSIM-based consistency of encoder features closely tracks output accuracy.

Limitations and directions for extension:

  • Global attention stages pose high computational cost; sparser mechanisms may lower resource use.
  • Occasional normal-flip ambiguity on near-planar regions in absence of explicit lighting cues.
  • Current approach is single-view only; extension to multi-view could exploit geometric consistency.
  • No explicit BRDF/material recovery; joint estimation with normals could enhance scene understanding.

7. Summary and Context

LiNO-UniPS represents a unification of illumination-invariant feature learning, detail-sensitive geometry representation, and scaled synthetic-supervised training for universal photometric stereo. The approach leverages light-register tokenization, interleaved attention, and wavelet-aware processing to overcome prevailing ambiguities in surface normal estimation under arbitrary lighting, reliably outperforming prior models on a broad spectrum of synthetic and real benchmarks (Li et al., 23 Jun 2025). The architecture and training strategies serve as a reference point for future research in photometric geometric learning and domain-agnostic scene reconstruction.
