Augmented Latent Intrinsics (ALI)

Updated 8 February 2026
  • Augmented Latent Intrinsics (ALI) is a framework for image relighting that disentangles scene-intrinsic properties from illumination using dense, pixel-aligned priors and self-supervised objectives.
  • It employs a two-stream encoder and a staged self-supervised training protocol to balance semantic abstraction with photometric fidelity on complex materials.
  • Empirical results demonstrate that ALI outperforms previous methods, achieving improved RMSE, SSIM, and PSNR metrics, particularly on specular and non-diffuse surfaces.

Augmented Latent Intrinsics (ALI) constitutes a framework for image-to-image relighting, aiming to disentangle scene-intrinsic properties from illumination using dense pixel-aligned priors and self-supervised objectives. Unlike classical inverse-graphics pipelines that seek explicit recovery of scene albedo, normals, and shading, or purely latent-intrinsic methods that operate over entangled feature spaces, ALI achieves photometrically faithful image manipulations by fusing hierarchically structured visual priors with learned latent representations. Empirical studies of ALI reveal a fundamental trade-off between semantic abstraction and photometric fidelity: leveraging high-level semantic encoders, a common strategy in vision-language and contrastive representation learning, can degrade performance in physically grounded tasks such as relighting, as critical fine-grained photometric cues are lost. ALI overcomes this by integrating dense, pixel-level feature backbones and applying a staged, self-supervised refinement protocol, achieving robust improvements on complex, view-dependent materials (Xing et al., 1 Feb 2026).

1. Problem Formulation and Motivation

Image relighting, in the context of ALI, seeks to generate a new image $\hat I$ of a scene $s$ under target illumination $\ell_2$ given a source image $I_s^{\ell_1}$ under illumination $\ell_1$. The pipeline is formalized as:

$$\hat I_s^{\ell_1\to\ell_2} = \mathcal{G}(\mathcal{E}_{\rm intr}(I_s^{\ell_1}),\; \mathcal{E}_{\rm light}(I_s^{\ell_2}))$$

where $\mathcal{E}_{\rm intr}$ produces lighting-invariant "intrinsic" features, $\mathcal{E}_{\rm light}$ outputs a global lighting code, and $\mathcal{G}$ denotes the learned decoder.
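To make the composition concrete, the following is a minimal sketch of the relighting pipeline using toy stand-ins. The three functions are hypothetical placeholders, not ALI's networks: lighting is assumed to be a single global brightness scalar, so the "intrinsic" features are just the image with that brightness divided out.

```python
import numpy as np

def encode_intrinsic(image):
    """Toy lighting-invariant features: divide out the global brightness."""
    return image / (image.mean() + 1e-8)

def encode_light(image):
    """Toy global lighting code: a single scalar brightness."""
    return float(image.mean())

def decode(intrinsic, light_code):
    """Toy decoder: re-apply the lighting code to the intrinsic features."""
    return intrinsic * light_code

def relight(source, reference):
    """Intrinsics come from `source`, lighting comes from `reference`."""
    return decode(encode_intrinsic(source), encode_light(reference))

rng = np.random.default_rng(0)
albedo = rng.uniform(0.2, 0.8, size=(4, 4, 3))  # shared scene content
img_l1 = 0.5 * albedo  # scene under a dim light
img_l2 = 1.2 * albedo  # same scene under a brighter light
relit = relight(img_l1, img_l2)
```

In this toy setting the intrinsic features of `img_l1` and `img_l2` coincide and `relit` reproduces `img_l2`. Real scenes break the assumption of a single multiplicative lighting factor, which is why learned encoders and a learned decoder are needed.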

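In contrast to traditional inverse graphics, which reconstructs explicit maps (albedo $A(x)$, normal $N(x)$, shading $S(x)$), latent-intrinsic methods instead use hierarchical latent features $\mathcal{E}_{\rm intr}(I)$ (intrinsics) and a vector $\mathcal{E}_{\rm light}(I)$ (lighting), such that

$$I \approx \mathcal{G}(\mathcal{E}_{\rm intr}(I),\; \mathcal{E}_{\rm light}(I))$$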

Failure modes of purely latent-intrinsic approaches, especially with limited real multi-illumination data, are severe on scenes with strong view-dependent reflectance (e.g., metals, glass): specularities are often misattributed or blurred. The naive hypothesis that stronger semantic encoders (e.g., DINO, CLIP) would resolve these ambiguities is not supported by evidence—in fact, these features induce a loss of photometric granularity crucial for relighting (Xing et al., 1 Feb 2026).

2. Mathematical Structure and Training Objectives

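ALI is structured around two encoder streams and a diffusion decoder. Given input $I_s^{\ell_1}$, the feature decomposition is:

  • Intrinsic features $\mathcal{E}_{\rm intr}(I_s^{\ell_1})$ (albedo/geometry-like)
  • Lighting code $\mathcal{E}_{\rm light}(I_s^{\ell_1})$

The relighting function is:

$$\hat I_s^{\ell_1\to\ell_2} = \mathcal{G}(\mathcal{E}_{\rm intr}(I_s^{\ell_1}),\; \mathcal{E}_{\rm light}(I_s^{\ell_2}))$$

Training involves minimizing:

  • Reconstruction fidelity: $\mathcal{L}_{\rm rec} = \|\hat I_s^{\ell_1\to\ell_2} - I_s^{\ell_2}\|$
  • Lighting invariance: $\mathcal{L}_{\rm inv} = \|\mathcal{E}_{\rm intr}(I_s^{\ell_1}) - \mathcal{E}_{\rm intr}(I_s^{\ell_2})\|$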
  • Hyperspherical regularization: to enforce uniform feature coverage

These constraints are orchestrated to anchor the intrinsic representation while ensuring relighting generalizes across lighting conditions.
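The three objectives listed above can be sketched as plain functions. This is an illustration under stated assumptions, not the paper's exact losses: squared-error norms are assumed for the first two terms, the hyperspherical term is written as a pairwise Gaussian-potential uniformity loss (one common way to encourage uniform coverage of the sphere), and the weights `w_inv` and `w_unif` are hypothetical.

```python
import numpy as np

def reconstruction_loss(pred, target):
    """Mean squared error between the relit prediction and the target image."""
    return float(np.mean((pred - target) ** 2))

def invariance_loss(feat_a, feat_b):
    """Mean squared distance between intrinsic features of one scene
    observed under two different illuminations."""
    return float(np.mean((feat_a - feat_b) ** 2))

def uniformity_loss(features, t=2.0):
    """Log of the mean Gaussian potential over distinct pairs of
    unit-normalized feature vectors (rows). Lower values mean the
    features are spread more evenly over the hypersphere."""
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    sq_dists = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    iu = np.triu_indices(len(z), k=1)  # distinct pairs only
    return float(np.log(np.mean(np.exp(-t * sq_dists[iu]))))

def total_loss(pred, target, feat_a, feat_b, w_inv=1.0, w_unif=0.1):
    """Weighted sum of the three terms; the weights are assumed values."""
    feats = np.concatenate([feat_a, feat_b], axis=0)
    return (reconstruction_loss(pred, target)
            + w_inv * invariance_loss(feat_a, feat_b)
            + w_unif * uniformity_loss(feats))
```

The uniformity term is 0 when all features collapse to one direction and becomes more negative as they spread apart, so it counteracts the trivial solution in which the invariance term is satisfied by mapping every input to the same vector.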

3. Architecture and Pixel-Aligned Feature Fusion

ALI maintains a two-stream encoder architecture. The "semantic" stream $\mathcal{E}_{\rm sem}$ is a frozen, pixel-aligned visual backbone (RADIOv2.5H or MAE), from which a hierarchy of feature maps $\{F_k\}_{k=1}^{K}$ is extracted. Each feature map is upsampled to input resolution and concatenated into a per-pixel hypercolumn:

$$H(x) = \big[\,\mathrm{Up}(F_1)(x);\; \mathrm{Up}(F_2)(x);\; \dots;\; \mathrm{Up}(F_K)(x)\,\big]$$

A lightweight projection module $P$ performs additive fusion into the original intrinsic features:

$$\tilde{z}(x) = \mathcal{E}_{\rm intr}(I)(x) + P\big(H(x)\big)$$

This mechanism injects dense semantic and photometric information directly at the pixel level, carefully balancing the contextual coverage of the backbone with preservation of high-frequency image structure. Experiments show that while contrastive/semantic encoders (CLIP, DINO) slightly benefit downstream performance, dense reconstructive priors (RADIO, MAE) yield significantly superior relighting accuracy on photometrically challenging surfaces (Xing et al., 1 Feb 2026).
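The hypercolumn construction and additive fusion described above can be sketched in NumPy. This is a shape-level illustration only: nearest-neighbour upsampling, a single linear map acting as a 1x1 convolution for the projection, and all channel counts are assumptions made for the example, not details taken from ALI.

```python
import numpy as np

def upsample_nearest(feat, out_h, out_w):
    """Nearest-neighbour upsampling of a (C, h, w) map to (C, out_h, out_w)."""
    _, h, w = feat.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return feat[:, rows[:, None], cols[None, :]]

def hypercolumn(feature_maps, out_h, out_w):
    """Upsample every map to a common resolution and stack along channels."""
    return np.concatenate(
        [upsample_nearest(f, out_h, out_w) for f in feature_maps], axis=0
    )

def fuse(intrinsic, hyper, proj):
    """Additive fusion: project hypercolumn channels to the intrinsic
    channel width (equivalent to a 1x1 convolution) and add."""
    projected = np.einsum("oc,chw->ohw", proj, hyper)
    return intrinsic + projected

rng = np.random.default_rng(0)
maps = [rng.normal(size=(8, 4, 4)), rng.normal(size=(16, 2, 2))]  # backbone levels
intrinsic = rng.normal(size=(32, 8, 8))                            # intrinsic features
proj = rng.normal(size=(32, 24)) * 0.01                            # hypothetical projection weights

hyper = hypercolumn(maps, 8, 8)        # 8 + 16 = 24 channels per pixel
fused = fuse(intrinsic, hyper, proj)   # same shape as the intrinsic features
```

Because the fusion is additive, a zero projection leaves the intrinsic features unchanged, so the dense prior can be introduced gradually during training rather than replacing the original representation.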

4. Self-Supervised Refinement and Staged Training

ALI employs a three-stage training protocol, mitigating the scarcity of paired real-world relighting data:

  • Stage I: Train encoder fusion (freeze the semantic backbone and decoder; learn intrinsic encoder and projection)
  • Stage II: Decoder alignment (freeze encoders; fine-tune diffusion decoder)
  • Stage III: Self-supervised fine-tuning using a "Lighting Zoo"—synthetic pseudo-pairs sampled from batches where the model's own relighting serves as pseudo-ground truth. The denoising score-matching objective is used:

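$$\mathcal{L}_{\rm DSM} = \mathbb{E}_{x_0,\,\epsilon,\,t}\big[\,\|\epsilon - \epsilon_\theta(x_t, t, c)\|_2^2\,\big]$$

where $x_0$ is the pseudo-ground-truth relit image, $x_t$ its noised version at timestep $t$, $\epsilon$ the injected Gaussian noise, and $c$ the conditioning on intrinsic features and lighting code.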

Occasional identity relighting steps are mixed to preserve scene content.
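The staged protocol can be summarized as a freeze schedule plus a pseudo-pair sampler. The sketch below is hypothetical bookkeeping, not the authors' training code: module names are illustrative, the projection is assumed to be frozen together with the encoders in Stage II, Stage III trainability is left out because the text does not specify it, and the identity probability `p_identity` is an assumed parameter standing in for "occasional".

```python
import random

# True = trainable, False = frozen, for the stages described in the text.
STAGES = {
    "I": {"semantic_backbone": False, "intrinsic_encoder": True,
          "projection": True, "decoder": False},
    "II": {"semantic_backbone": False, "intrinsic_encoder": False,
           "projection": False, "decoder": True},
}

def trainable_modules(stage):
    """Names of the modules that receive gradient updates in a stage."""
    return sorted(name for name, flag in STAGES[stage].items() if flag)

def sample_pseudo_pairs(batch_size, p_identity, rng):
    """For each source index, pick the index of a lighting reference from
    the same batch. With probability `p_identity` the source is its own
    reference (an identity relighting step that preserves scene content)."""
    pairs = []
    for src in range(batch_size):
        if batch_size == 1 or rng.random() < p_identity:
            ref = src
        else:
            ref = rng.choice([j for j in range(batch_size) if j != src])
        pairs.append((src, ref))
    return pairs

rng = random.Random(0)
pairs = sample_pseudo_pairs(batch_size=8, p_identity=0.1, rng=rng)
```

In Stage III each `(src, ref)` pair would be relit by the current model, and that output would serve as the pseudo-ground truth for the denoising objective.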

Key datasets include MIT MIIW (985 scenes × 25 illuminations) and BigTime (460 scenes × 20–50 illuminations).

5. Empirical Results and Quantitative Analysis

ALI achieves state-of-the-art results in unsupervised relighting benchmarks, especially on scenes with non-diffuse, specular, or metallic materials—categories where semantic context and dense priors are critical. On the MIIW cross-scene benchmark:

  • RMSE: 0.294 (improved over LumiNet's 0.310)
  • SSIM: 0.464 (vs. LumiNet's 0.440)

In in-scene relighting:

  • PSNR: 18.87
  • RMSE: 0.119
  • LPIPS: 0.213
  • SSIM: 0.671
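For reference, RMSE and PSNR can be computed as below, assuming images scaled to [0, 1] (the `max_val=1.0` default is an assumption, not stated in the source). The last lines work out the relative cross-scene gains over LumiNet from the reported figures: roughly a 5.2% reduction in RMSE and a 5.5% increase in SSIM.

```python
import numpy as np

def rmse(pred, target):
    """Root-mean-square error over all pixels and channels."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images with peak value `max_val`."""
    err = rmse(pred, target)
    return float("inf") if err == 0 else float(20.0 * np.log10(max_val / err))

def relative_change(new, old):
    """Fractional change of `new` with respect to `old`."""
    return (new - old) / old

# Cross-scene MIIW figures quoted in the text (ALI vs. LumiNet).
rmse_gain = -relative_change(0.294, 0.310)  # fractional RMSE reduction
ssim_gain = relative_change(0.464, 0.440)   # fractional SSIM increase
```

Benchmark PSNR is typically averaged per image, so a reported mean PSNR need not equal the value obtained by plugging the mean RMSE into the formula.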

Material-wise breakdown indicates an approximate 6% improvement in SSIM for non-diffuse categories (metal/glass). Qualitative assessments confirm higher sharpness in specular highlights, improved caustics, and analytically correct shadow placements compared to prior art (SA-AE, Latent-Intrinsics, RGB↔X, LumiNet). Ablations demonstrate:

  • Minor or negative impact from high-level semantic encoders (CLIP, DINO)
  • Significant performance gain from dense reconstructive priors (RADIOv2.5H, MAE), with RADIOv2.5H giving the best scores (e.g., PSNR↑18.34, SSIM↑0.596, RMSE↓0.126)
  • Stage-wise training improves geometric fidelity, specular-dynamic quality, and in-the-wild artifact removal sequentially

6. Analysis of the Semantic–Photometric Trade-Off

Experimental results reveal a counter-intuitive phenomenon: increasing the strength of semantic encoder priors degrades relighting performance. Semantic encoders, optimized for invariance and abstraction (e.g., CLIP, DINO), tend to remove the very pixel-level photometric structures necessary for physically plausible relighting. Dense reconstructive backbones such as RADIO and MAE, in contrast, preserve pixel-aligned cues vital for reconstructing directional shadows, specularities, and subtle caustic phenomena. This trade-off, established through both quantitative and ablation analyses, argues against reflexive application of large semantic vision encoders for generative inverse problems involving fine-grained physics (Xing et al., 1 Feb 2026).

7. Limitations and Prospective Directions

Current ALI models are limited by reliance on learned priors rather than explicit 3D geometry. Subtle global effects—caustics, interreflections, or fine-scale albedo variation—may be blurred or misattributed under challenging conditions. Further, ALI can confuse minor albedo differences with illumination, especially with highly atypical materials. Future research avenues include:

  • Integrating single-view geometry estimation into the intrinsic inference stream
  • Leveraging multi-view, view-consistent data to improve physically plausible disentanglement
  • Extending probing methods to other inverse graphics tasks (e.g., explicit reflectance editing, HDR relighting)
  • Systematically clarifying which visual priors optimally support downstream generative tasks

ALI establishes that maximal semantic abstraction is not always compatible with photometric fidelity. Its hybrid approach—merging pixel-aligned visual priors with hierarchical latent intrinsics under self-supervised, multi-stage optimization—offers a robust template for physically grounded generative modeling, particularly in regimes characterized by view-dependent, specular, or complex materials (Xing et al., 1 Feb 2026).
