MatPedia: Unified Model for PBR Synthesis

Updated 16 January 2026
  • MatPedia Foundation Model is a universal generative architecture that unites natural image appearance with PBR maps using a dual-latent RGB-PBR representation.
  • It employs a 3D VAE and video diffusion transformer, enabling text-to-material, image-to-material, and intrinsic decomposition within a unified framework.
  • Empirical evaluations show high-fidelity synthesis with improved metrics over existing methods, supporting rapid generation and correction of complex material textures.

MatPedia Foundation Model is a universal generative architecture for high-fidelity synthesis and analysis of physically-based rendering (PBR) materials. It establishes a joint framework capable of bridging natural image appearance and physically-parameterized PBR maps, supporting tasks such as text-to-material and image-to-material generation as well as intrinsic decomposition, all within a unified model. MatPedia achieves this by leveraging a dual-latent RGB-PBR representation and a video diffusion backbone, trained on a hybrid corpus integrating real-world image and structured material data (Luo et al., 21 Nov 2025).

1. Motivation and Challenges in Material Synthesis

Physically-based rendering of surfaces critically depends on structured sets of PBR maps: basecolor, normal, roughness, and metallic. The creation of such assets is traditionally manual, requiring domain expertise and specialized equipment. Prior generative models for material synthesis tend to fall into one of two categories: (1) methods restricted to PBR-only datasets, resulting in limited material diversity and visual fidelity; or (2) approaches focusing on sub-tasks (e.g., intrinsic decomposition), necessitating fragmented pipelines and yielding weak generalization across modalities. A key challenge—left unaddressed by earlier models—is the development of a representation and framework capable of uniting in-the-wild RGB appearance data with physically meaningful PBR material parameters, thereby enabling both high-fidelity synthesis and a broadening of material diversity (Luo et al., 21 Nov 2025).

2. Joint RGB–PBR Representation

A central innovation of MatPedia is its unified 5-frame representation, which regards each material instance as a sequence of five “frames”: the first is an RGB shaded image ($I_{\mathrm{rgb}}$) and the remaining four are the PBR maps, basecolor ($\mathbf{a}$), normal ($\mathbf{n}$), roughness ($\mathbf{r}$), and metallic ($\mathbf{m}$). This temporal stack enables the model to employ 3D convolutional operations that capture redundancy and cross-correlation between visual appearance and physical properties.
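The 5-frame stack described above can be sketched in numpy; the array names, the toy resolution, and the broadcasting of single-channel maps to three channels are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

H, W = 256, 256  # illustrative resolution; the paper trains at 1024x1024

# One material instance: a shaded RGB image plus its four PBR maps.
rgb       = np.zeros((H, W, 3))  # shaded appearance I_rgb
basecolor = np.zeros((H, W, 3))  # a
normal    = np.zeros((H, W, 3))  # n
roughness = np.zeros((H, W, 1))  # r (single channel)
metallic  = np.zeros((H, W, 1))  # m (single channel)

def to_rgb(x):
    # Broadcast single-channel maps to 3 channels so every "frame"
    # shares one layout (an assumed convention for this sketch).
    return np.repeat(x, 3, axis=-1) if x.shape[-1] == 1 else x

# Stack along a leading "time" axis: shape [5, H, W, 3], which a
# 3D (spatio-temporal) convolution can consume directly.
stack = np.stack(
    [to_rgb(f) for f in (rgb, basecolor, normal, roughness, metallic)],
    axis=0,
)
print(stack.shape)  # (5, 256, 256, 3)
```

Treating the maps as a short video clip is what lets MatPedia reuse video-model machinery downstream.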

Feature encoding proceeds via two interdependent latent codes:

  • $z_{\mathrm{rgb}}$ encapsulates global appearance, color, and texture information.
  • $z_{\mathrm{pbr}}$ specializes in physical attributes such as microfacet parameters, fine normal detail, and metallic-vs-dielectric delineation.

Formally (with $\mathcal{E}_{\mathrm{rgb}}$ as the RGB encoder, $\mathcal{F}_{\mathrm{enc}}$ the intermediate RGB feature cache, and $\mathcal{E}_{\mathrm{pbr}}$ the PBR encoder):

$$\mathbf{z}_{\mathrm{rgb}} = \mathcal{E}_{\mathrm{rgb}}\left(I_{\mathrm{rgb}}\right), \qquad \mathbf{z}_{\mathrm{pbr}} = \mathcal{E}_{\mathrm{pbr}}\Big(\big[\mathcal{F}_{\mathrm{enc}}(\mathbf{z}_{\mathrm{rgb}}),\ \mathbf{a},\ \mathbf{n},\ \mathbf{r},\ \mathbf{m}\big]\Big)$$

Decoding mirrors this split, producing reconstructed RGB images and PBR maps conditioned on the appropriate latent codes (Luo et al., 21 Nov 2025).
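The two-stage encoding formula can be sketched with placeholder linear maps standing in for $\mathcal{E}_{\mathrm{rgb}}$, $\mathcal{F}_{\mathrm{enc}}$, and $\mathcal{E}_{\mathrm{pbr}}$; all dimensions and weights here are toy stand-ins, chosen only to show the dependency structure (the PBR latent is conditioned on cached RGB features).

```python
import numpy as np

rng = np.random.default_rng(0)
D_IMG, D_LAT, D_FEAT = 64, 16, 8  # toy dimensions, not the paper's

# Hypothetical linear stand-ins for the paper's encoders.
W_rgb  = rng.normal(size=(D_LAT, D_IMG))               # E_rgb
W_feat = rng.normal(size=(D_FEAT, D_LAT))              # F_enc (feature cache)
W_pbr  = rng.normal(size=(D_LAT, D_FEAT + 4 * D_IMG))  # E_pbr

def encode(I_rgb, a, n, r, m):
    # z_rgb depends only on the shaded image.
    z_rgb = W_rgb @ I_rgb
    # z_pbr is conditioned on cached RGB features plus the four PBR maps,
    # mirroring z_pbr = E_pbr([F_enc(z_rgb), a, n, r, m]).
    cond = np.concatenate([W_feat @ z_rgb, a, n, r, m])
    z_pbr = W_pbr @ cond
    return z_rgb, z_pbr

I_rgb, a, n, r, m = (rng.normal(size=D_IMG) for _ in range(5))
z_rgb, z_pbr = encode(I_rgb, a, n, r, m)
print(z_rgb.shape, z_pbr.shape)  # (16,) (16,)
```

The point of the asymmetry is that appearance can be encoded without the PBR maps, but the PBR code is always grounded in the appearance features.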

3. Architecture Overview

3.1 Latent Compression and Encoding

MatPedia relies on a 3D VAE, following the Wan2.2-VAE design. Spatial and temporal downsampling is applied to the 5-frame input stack, typically reducing the first (RGB) frame by $16\times$ spatially and the subsequent frames by $16\times$ spatially and $4\times$ temporally, yielding latent tensors of shape $[1+T/4,\ H/16,\ W/16]$. Only the VAE decoder is fine-tuned; the encoder remains frozen to preserve pretrained visual priors.

The VAE reconstruction loss combines a pixel-wise $\ell_1$ error and a perceptual loss in VGG feature space:

$$\mathcal{L}_{\mathrm{VAE}} = \lambda_1 \, \|\hat{\mathbf{x}}-\mathbf{x}\|_{1} + \lambda_2 \, \|\phi(\hat{\mathbf{x}})-\phi(\mathbf{x})\|_2^2$$

where $\mathbf{x}$ is the ground-truth 5-frame stack and $\phi$ denotes the VGG feature extractor.
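A numpy sketch of this objective, assuming a crude stand-in for $\phi$ (average pooling rather than actual VGG features) and arbitrary $\lambda$ weights; per-element means replace the raw $\ell_1$ and squared-$\ell_2$ norms to keep the terms on comparable scales.

```python
import numpy as np

def phi(x):
    # Stand-in for the VGG feature extractor: 4x4 average pooling.
    F, H, W, C = x.shape
    return x.reshape(F, H // 4, 4, W // 4, 4, C).mean(axis=(2, 4))

def vae_loss(x_hat, x, lam1=1.0, lam2=0.1):
    # L_VAE = lam1 * (pixel l1 term) + lam2 * (perceptual l2 term);
    # per-element means stand in for the norms in the formula above.
    l1   = np.abs(x_hat - x).mean()
    perc = ((phi(x_hat) - phi(x)) ** 2).mean()
    return lam1 * l1 + lam2 * perc

rng = np.random.default_rng(0)
x     = rng.uniform(size=(5, 32, 32, 3))   # ground-truth 5-frame stack
x_hat = x + 0.01 * rng.normal(size=x.shape)

print(vae_loss(x, x))      # 0.0 for a perfect reconstruction
print(vae_loss(x_hat, x))  # small positive value
```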

3.2 Video Diffusion Transformer Backbone

MatPedia employs a video DiT (Diffusion Transformer), initialized from standard video diffusion checkpoints (e.g., HunyuanVideo) and adapted using LoRA adapters on attention and feedforward components. The DiT operates directly on the concatenated $z_{\mathrm{rgb}}$ and $z_{\mathrm{pbr}}$ latent space, enabling end-to-end diffusion-based synthesis and sampling for all target tasks. Conditional generation is controlled by either text embeddings or observed image features.

The principal diffusion training loss is defined via rectified flow matching:

$$\mathcal{L}_{\mathrm{RF}} = \mathbb{E}_{\mathbf{x}_0,\mathbf{x}_1,t}\left[ \left\lVert v_\theta(\mathbf{x}_t, t, \mathbf{c}) - (\mathbf{x}_0 - \mathbf{x}_1)\right\rVert_2^2 \right],$$

where $\mathbf{x}_t = (1-t)\,\mathbf{x}_0 + t\,\mathbf{x}_1$, and the conditioning $\mathbf{c}$ is determined by the task context (Luo et al., 21 Nov 2025).
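The rectified-flow target can be sketched in numpy; the velocity predictor $v_\theta$ is hypothetical (here simply set to the exact target so the loss vanishes), and the batch shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 16))  # noise samples
x1 = rng.normal(size=(4, 16))  # data latents ([z_rgb, z_pbr] in MatPedia)
t  = rng.uniform(size=(4, 1))  # per-sample timesteps in [0, 1]

# Linear interpolation between noise and data: x_t = (1 - t) x0 + t x1.
x_t = (1.0 - t) * x0 + t * x1

def rf_loss(v_pred, x0, x1):
    # L_RF = E || v_theta(x_t, t, c) - (x0 - x1) ||^2, using this
    # document's sign convention (some codebases target x1 - x0).
    return ((v_pred - (x0 - x1)) ** 2).mean()

# A perfect predictor drives the loss to zero; a biased one does not.
print(rf_loss(x0 - x1, x0, x1))        # 0.0
print(rf_loss(x0 - x1 + 0.1, x0, x1))  # > 0
```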

4. MatHybrid-410K Training Corpus

Training leverages MatHybrid-410K, a hybridized dataset combining two complementary sources:

  • RGB-only Appearance Data (~50k images):
    • Procedural/synthetic planar renderings generated via Gemini 2.5 Flash Image.
    • Real photographs of flat surfaces, with programmatically generated text captions (Qwen2.5-VL) to enable text-to-material supervision.
  • Complete PBR SVBRDF Dataset (~6k SVBRDFs yielding ~360k image pairs):
    • Sourced from MatSynth, OpenSVBRDF, and similar datasets.
    • Rendered under 32 distinct HDR environment maps, producing planar and distorted (primitive-mapped) views for image-to-material training.

All samples are processed at native $1024\times1024$ resolution. No tileability augmentation was performed, though future expansions are suggested. Cropping enforces $\geq 70\%$ material coverage, ensuring training samples exhibit sufficient content for modeling (Luo et al., 21 Nov 2025).
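The coverage rule can be sketched as a mask-based filter. The mask source is an assumption (the paper does not specify how coverage is measured); here a boolean mask marks pixels showing the material of interest.

```python
import numpy as np

def keep_crop(material_mask, threshold=0.70):
    # material_mask: boolean [H, W] array, True where the crop shows
    # the material of interest. Keep the crop only if the fraction of
    # material pixels meets the threshold.
    return material_mask.mean() >= threshold

full = np.ones((64, 64), dtype=bool)  # 100% coverage -> kept
half = np.zeros((64, 64), dtype=bool)
half[:32] = True                      # 50% coverage -> rejected

print(keep_crop(full))  # True
print(keep_crop(half))  # False
```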

5. Unified Material Tasks and Conditioning

The MatPedia architecture unifies three material analysis and synthesis tasks, each handled via task-specific conditioning and branches within the common backbone:

| Task | Input Condition | Output |
| --- | --- | --- |
| Text-to-material | Text prompt (text encoder) | RGB + PBR maps (full latents) |
| Image-to-material | Distorted photo (RGB encoder) | Rectified RGB + PBR latent pair |
| Intrinsic decomposition | Planar RGB image | PBR maps conditioned on RGB |

  • Text-to-Material: Supervision alternates between RGB-only data (outputting $z_{\mathrm{rgb}}$ only) and paired data (joint RGB/PBR supervision). This enables prompt-driven synthesis with competitive fidelity and diversity.
  • Image-to-Material: Conditioned on an encoded image, the model reconstructs a planarized RGB image and associated PBR maps, facilitating material capture from unconstrained photos.
  • Intrinsic Decomposition: Given a planar RGB input, the latent $z_{\mathrm{rgb}}$ is frozen, and the diffusion transformer generates only the PBR latent, supporting separation of albedo and shading effects under unknown lighting.
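How the three tasks could share one backbone via task-specific conditioning can be sketched as a dispatch table; the field names and structure are illustrative, not the paper's API, but which latents are denoised versus held fixed follows the description above.

```python
# Illustrative dispatch of the three MatPedia tasks onto one backbone.
TASKS = {
    "text_to_material": {
        "condition": "text_embedding",
        "generate": ["z_rgb", "z_pbr"],
    },
    "image_to_material": {
        "condition": "image_features",       # encoded distorted photo
        "generate": ["z_rgb", "z_pbr"],      # rectified RGB + PBR pair
    },
    "intrinsic_decomposition": {
        "condition": "planar_rgb_latent",    # z_rgb is frozen as input
        "generate": ["z_pbr"],               # only the PBR latent is denoised
    },
}

def plan(task):
    spec = TASKS[task]
    return f"condition on {spec['condition']}, denoise {spec['generate']}"

print(plan("intrinsic_decomposition"))
# condition on planar_rgb_latent, denoise ['z_pbr']
```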

LoRA-based fine-tuning extends the base weights across tasks, maximizing parameter efficiency and knowledge transfer (Luo et al., 21 Nov 2025).
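The LoRA update itself can be illustrated in numpy: the frozen base weight $W$ is augmented with a scaled low-rank product $BA$. This shows the general LoRA scheme, not MatPedia's specific rank or scaling configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 32, 32, 4, 8  # illustrative hyperparameters

W = rng.normal(size=(d_out, d_in))        # frozen base weight
A = rng.normal(size=(rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))               # trainable up-projection,
                                          # zero-initialized so training
                                          # starts from the base model

def lora_forward(x, W, A, B, alpha, rank):
    # y = W x + (alpha / rank) * B (A x); only A and B are trained,
    # so far fewer parameters change than in full fine-tuning.
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B = 0, the adapted layer matches the frozen base exactly.
print(np.allclose(lora_forward(x, W, A, B, alpha, rank), W @ x))  # True
```

Zero-initializing one factor is the standard trick that makes each task adapter start as an identity perturbation of the shared backbone.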

6. Quantitative and Qualitative Performance

MatPedia demonstrates high-fidelity material generation at native $1024\times1024$ resolution, with further upsampling to 4K achieved via RealESRGAN. Key performance comparisons:

  • Text-to-Material (vs. MatFuse):
    • CLIP: 0.283 (MatPedia) vs. 0.261
    • DINO-FID: 1.31 (MatPedia) vs. 1.90
  • Image-to-Material (vs. MatFuse/MaterialPalette):
    • Basecolor CLIP: 0.943 (MatPedia) vs. 0.833 / 0.813
    • DINO (rendered): 0.843 (MatPedia) vs. 0.677 / 0.541
  • Intrinsic Decomposition (vs. MaterialPalette/RGB⇄X):
    • MSE (render): 0.008 (MatPedia) vs. 0.034 / 0.079
    • LPIPS (render): 0.627 (MatPedia) vs. 0.644 / 0.706

Ablation studies reveal substantial improvements from VAE decoder fine-tuning (e.g., +3.55 dB normal PSNR, +5.20 dB roughness PSNR) and from the hybrid data regime (text CLIP up from 0.275 to 0.283; DINO-FID down from 1.62 to 1.31). Applications demonstrated include rapid procedural generation from text prompts, correction and “cleaning” of smartphone-captured textures, and physically-informed decomposition under complex environmental illumination (Luo et al., 21 Nov 2025).

7. Broader Context and Adaptation

MatPedia’s approach to joint RGB–PBR latent modeling and multi-modal diffusion is strongly aligned with recent trends in foundation models for scientific and materials domains. Analogous strategies in quantum many-body state modeling—for instance, representing local and global features as tokens, conditioning on external parameters via attention, and leveraging fidelity or energy-based supervision—support cross-disciplinary generalization and transfer (Zaklama et al., 12 Dec 2025). Plausible implications include the extension of such unified architectures to encompass not only surface property synthesis but also predictive tasks in computational materials science, provided appropriate atomic-site descriptors and physical global tokens are adopted.

MatPedia establishes a scalable prototype for universal models in material synthesis, combining architectural innovations in representation learning, cross-modal conditioning, and high-resolution diffusion with empirical advances in synthesis fidelity and practical deployment (Luo et al., 21 Nov 2025).
