MatPedia: Unified Model for PBR Synthesis
- MatPedia Foundation Model is a universal generative architecture that unites natural image appearance with PBR maps using a dual-latent RGB-PBR representation.
- It employs a 3D VAE and video diffusion transformer, enabling text-to-material, image-to-material, and intrinsic decomposition within a unified framework.
- Empirical evaluations show high-fidelity synthesis with improved metrics over existing methods, supporting rapid generation and correction of complex material textures.
MatPedia Foundation Model is a universal generative architecture for high-fidelity synthesis and analysis of physically-based rendering (PBR) materials. It establishes a joint framework capable of bridging natural image appearance and physically-parameterized PBR maps, supporting tasks such as text-to-material and image-to-material generation as well as intrinsic decomposition, all within a unified model. MatPedia achieves this by leveraging a dual-latent RGB-PBR representation and a video diffusion backbone, trained on a hybrid corpus integrating real-world image and structured material data (Luo et al., 21 Nov 2025).
1. Motivation and Challenges in Material Synthesis
Physically-based rendering of surfaces critically depends on structured sets of PBR maps: basecolor, normal, roughness, and metallic. The creation of such assets is traditionally manual, requiring domain expertise and specialized equipment. Prior generative models for material synthesis tend to fall into one of two categories: (1) methods restricted to PBR-only datasets, resulting in limited material diversity and visual fidelity; or (2) approaches focusing on sub-tasks (e.g., intrinsic decomposition), necessitating fragmented pipelines and yielding weak generalization across modalities. A key challenge—left unaddressed by earlier models—is the development of a representation and framework capable of uniting in-the-wild RGB appearance data with physically meaningful PBR material parameters, thereby enabling both high-fidelity synthesis and a broadening of material diversity (Luo et al., 21 Nov 2025).
2. Joint RGB–PBR Representation
A central innovation of MatPedia is its unified 5-frame representation, which treats each material instance as a sequence of five "frames": the first is an RGB shaded image, and the remaining four are the PBR maps, i.e., basecolor, normal, roughness, and metallic. This temporal stack enables the model to employ 3D convolutional operations that capture redundancy and cross-correlation between visual appearance and physical properties.
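As an illustration of this frame stacking, the sketch below assembles one material instance into a 5-frame tensor. The helper `make_material_stack` and the exact frame ordering are hypothetical, following the description above rather than the paper's actual data pipeline.

```python
import numpy as np

def make_material_stack(rgb, basecolor, normal, roughness, metallic):
    """Stack one material instance as a 5-"frame" video-like tensor.

    Illustrative frame order: [RGB shaded image, basecolor, normal,
    roughness, metallic]. Single-channel maps (roughness, metallic) are
    broadcast to 3 channels so every frame shares shape (H, W, 3).
    """
    def to_3ch(m):
        return np.repeat(m[..., None], 3, axis=-1) if m.ndim == 2 else m

    frames = [to_3ch(f) for f in (rgb, basecolor, normal, roughness, metallic)]
    return np.stack(frames, axis=0)  # shape: (5, H, W, 3)

# Toy 4x4 maps: three 3-channel frames, two scalar maps.
H = W = 4
stack = make_material_stack(
    np.zeros((H, W, 3)), np.zeros((H, W, 3)), np.zeros((H, W, 3)),
    np.zeros((H, W)), np.zeros((H, W)),
)
```

The resulting `(5, H, W, 3)` tensor is what a 3D (spatio-temporal) convolution can then process jointly across appearance and physical channels.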
Feature encoding proceeds via two interdependent latent codes:
- $z_{\text{rgb}}$ encapsulates global appearance, color, and texture information.
- $z_{\text{pbr}}$ specializes in physical attributes such as microfacet parameters, fine normal detail, and metallic-vs-dielectric delineation.
Formally, with $\mathcal{E}_{\text{rgb}}$ the RGB encoder, $f_{\text{rgb}}$ the intermediate RGB feature cache, and $\mathcal{E}_{\text{pbr}}$ the PBR encoder:
$$z_{\text{rgb}} = \mathcal{E}_{\text{rgb}}(x_{\text{rgb}}), \qquad z_{\text{pbr}} = \mathcal{E}_{\text{pbr}}(x_{\text{pbr}}, f_{\text{rgb}}),$$
where $x_{\text{rgb}}$ denotes the RGB frame and $x_{\text{pbr}}$ the four PBR frames.
Decoding mirrors this split, producing reconstructed RGB images and PBR maps conditioned on the appropriate latent codes (Luo et al., 21 Nov 2025).
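The encode path described above can be sketched as follows. Both encoders here are toy stand-ins (global pooling and a scaled feature cache) for the real 3D VAE components; only the information flow, in which the PBR latent is conditioned on cached RGB features, mirrors the paper's design.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_rgb(rgb_frame):
    """Toy stand-in for the frozen RGB encoder: returns an appearance
    latent plus an intermediate feature cache reused by the PBR encoder."""
    z_rgb = rgb_frame.mean(axis=(0, 1))   # toy global pooling -> (3,)
    feature_cache = 0.5 * rgb_frame       # toy intermediate features
    return z_rgb, feature_cache

def encode_pbr(pbr_frames, feature_cache):
    """Toy stand-in for the PBR encoder: the PBR latent is conditioned
    on the cached RGB features, coupling physics to appearance."""
    fused = pbr_frames + feature_cache[None]  # broadcast cache over 4 maps
    return fused.mean(axis=(1, 2))            # toy per-frame latent -> (4, 3)

rgb = rng.random((8, 8, 3))        # one RGB frame
pbr = rng.random((4, 8, 8, 3))     # basecolor, normal, roughness, metallic
z_rgb, cache = encode_rgb(rgb)
z_pbr = encode_pbr(pbr, cache)
```

The key design point is that `encode_pbr` never sees the PBR maps in isolation: the RGB feature cache is always injected, so the two latents remain correlated by construction.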
3. Architecture Overview
3.1 Latent Compression and Encoding
MatPedia relies on a 3D VAE following the Wan2.2-VAE design. Spatial and temporal downsampling compresses the 5-frame input stack into a compact latent tensor: the first (RGB) frame is downsampled spatially, while the four subsequent PBR frames are downsampled both spatially and temporally. Only the VAE decoder is fine-tuned; the encoder remains frozen to preserve pretrained visual priors.
The VAE reconstruction loss combines pixel-wise error and a perceptual loss in VGG feature space:
$$\mathcal{L}_{\text{VAE}} = \lVert x - \hat{x} \rVert_1 + \lambda\, \lVert \phi(x) - \phi(\hat{x}) \rVert_2^2,$$
where $x$ is the ground-truth 5-frame stack, $\hat{x}$ its reconstruction, $\phi$ denotes the VGG feature extractor, and $\lambda$ weights the perceptual term.
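A minimal sketch of this combined objective, with a fixed random projection standing in for the VGG extractor (the real loss uses VGG features; `lam` is an assumed weighting hyperparameter):

```python
import numpy as np

def vae_loss(x, x_hat, phi, lam=1.0):
    """Combined reconstruction objective (sketch):
    L = ||x - x_hat||_1 + lam * ||phi(x) - phi(x_hat)||_2^2."""
    pixel = np.abs(x - x_hat).mean()
    perceptual = ((phi(x) - phi(x_hat)) ** 2).mean()
    return pixel + lam * perceptual

rng = np.random.default_rng(1)
# Toy "feature extractor": a fixed random projection, NOT a real VGG.
P = rng.standard_normal((48, 16))
phi = lambda x: x.reshape(-1, 48) @ P

x = rng.random((5, 4, 4, 3))                     # ground-truth 5-frame stack
x_hat = x + 0.01 * rng.standard_normal(x.shape)  # near-perfect reconstruction
loss = vae_loss(x, x_hat, phi)
```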
3.2 Video Diffusion Transformer Backbone
MatPedia employs a video DiT (Diffusion Transformer), initialized on standard video diffusion checkpoints (e.g., HunyuanVideo), and adapted using LoRA adapters on attention and feedforward components. The DiT operates directly on the concatenated and latent space, enabling end-to-end diffusion-based synthesis and sampling for all target tasks. Conditional generation is controlled by either text embeddings or observed image features.
The principal diffusion training loss is defined via rectified flow matching:
$$\mathcal{L}_{\text{RF}} = \mathbb{E}_{z_0,\, \epsilon,\, t}\!\left[ \lVert v_\theta(z_t, t, c) - (\epsilon - z_0) \rVert_2^2 \right],$$
where $z_t = (1 - t)\, z_0 + t\, \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, and the conditioning $c$ is determined by task context (Luo et al., 21 Nov 2025).
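One training step of this objective can be sketched as follows; the linear `v_theta` is a hypothetical stand-in for the LoRA-adapted video DiT, and the interpolation and velocity-target conventions follow the rectified-flow formulation above.

```python
import numpy as np

rng = np.random.default_rng(2)

def rectified_flow_loss(v_theta, z0, cond):
    """One rectified-flow training step (sketch): interpolate
    z_t = (1 - t) * z0 + t * eps and regress the velocity (eps - z0)."""
    t = rng.random()                        # t ~ U(0, 1)
    eps = rng.standard_normal(z0.shape)     # eps ~ N(0, I)
    z_t = (1.0 - t) * z0 + t * eps
    pred = v_theta(z_t, t, cond)
    return ((pred - (eps - z0)) ** 2).mean()

# Toy "DiT": a linear velocity predictor (hypothetical stand-in).
W = 0.1 * rng.standard_normal((16, 16))
v_theta = lambda z, t, c: z @ W + t * c

z0 = rng.standard_normal((8, 16))    # joint [z_rgb; z_pbr] latent tokens
cond = rng.standard_normal(16)       # text or image conditioning embedding
loss = rectified_flow_loss(v_theta, z0, cond)
```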
4. MatHybrid-410K Training Corpus
Training leverages MatHybrid-410K, a hybridized dataset combining two complementary sources:
- RGB-only appearance data (~50k images):
  - Procedural/synthetic planar renderings generated via Gemini 2.5 Flash Image.
  - Real photographs of flat surfaces, with programmatically generated text captions (Qwen2.5-VL) to enable text-to-material supervision.
- Complete PBR SVBRDF data (~6k SVBRDFs yielding ~360k image pairs):
  - Sourced from MatSynth, OpenSVBRDF, and similar datasets.
  - Rendered under 32 distinct HDR environment maps, producing planar and distorted (primitive-mapped) views for image-to-material training.
All samples are processed at native resolution. No tileability augmentation was performed, though future expansions are suggested. Cropping enforces material coverage, ensuring training samples exhibit sufficient content for modeling (Luo et al., 21 Nov 2025).
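The hybrid training regime can be sketched as a per-example draw between the two pools; the mixing probability `p_paired` and the pool contents are illustrative assumptions, not values from the paper.

```python
import random

random.seed(0)

def sample_training_example(rgb_only_pool, paired_pool, p_paired=0.5):
    """Hybrid-data sampling (sketch): each draw comes either from the
    fully paired SVBRDF pool (joint RGB+PBR supervision) or from the
    RGB-only appearance pool (RGB-latent supervision only).
    p_paired is a hypothetical mixing hyperparameter."""
    if random.random() < p_paired:
        return random.choice(paired_pool), "joint"
    return random.choice(rgb_only_pool), "rgb_only"

batch = [
    sample_training_example(["photo_1", "photo_2"], ["svbrdf_1"])
    for _ in range(4)
]
```

Alternating supervision in this way is what lets in-the-wild RGB data broaden material diversity without requiring PBR ground truth for every sample.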
5. Unified Material Tasks and Conditioning
The MatPedia architecture unifies three material analysis and synthesis tasks, each handled via task-specific conditioning and branches within the common backbone:
| Task | Input Condition | Output |
|---|---|---|
| Text-to-material | Text prompt (text encoder) | RGB + PBR maps (full latents) |
| Image-to-material | Distorted photo (RGB encoder) | Rectified RGB + PBR latent pair |
| Intrinsic decomposition | Planar RGB image | PBR maps conditioned on RGB |
- Text-to-Material: Supervision alternates between RGB-only data (supervising only the RGB latent) and fully paired data (joint RGB/PBR supervision). This enables prompt-driven synthesis with competitive fidelity and diversity.
- Image-to-Material: Conditioned on an encoded image, the model reconstructs a planarized RGB image and associated PBR maps, facilitating material capture from unconstrained photos.
- Intrinsic Decomposition: Given a planar RGB input, the RGB latent is frozen, and the diffusion transformer generates only the PBR latent, supporting separation of albedo and shading effects under unknown lighting.
LoRA-based fine-tuning extends the base weights across tasks, maximizing parameter efficiency and knowledge transfer (Luo et al., 21 Nov 2025).
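For intrinsic decomposition specifically, sampling can be sketched as Euler integration of a learned velocity field over the joint latent, with the observed RGB half held fixed throughout. The step count, latent dimensions, and linear `v_theta` are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def decompose(z_rgb_obs, v_theta, steps=8):
    """Intrinsic decomposition as conditional sampling (sketch):
    the observed RGB latent stays frozen while Euler integration of
    the velocity field denoises only the PBR half of the joint latent."""
    d = z_rgb_obs.shape[0]
    z_pbr = rng.standard_normal(d)          # PBR half starts from noise
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt                    # integrate from t=1 down to t=0
        z = np.concatenate([z_rgb_obs, z_pbr])
        v = v_theta(z, t)
        z_pbr = z_pbr - dt * v[d:]          # update the PBR half only
    return z_pbr

# Toy velocity field (hypothetical stand-in for the fine-tuned DiT).
W = 0.1 * rng.standard_normal((32, 32))
v_theta = lambda z, t: z @ W

z_pbr = decompose(rng.standard_normal(16), v_theta)
```

Because the frozen RGB latent enters every velocity evaluation, the generated PBR latent remains consistent with the observed appearance at each integration step.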
6. Quantitative and Qualitative Performance
MatPedia demonstrates high-fidelity, high-resolution native material generation, with further upsampling to 4K achieved via RealESRGAN. Key performance comparisons:
- Text-to-Material (vs. MatFuse):
- CLIP: 0.283 (MatPedia) vs. 0.261
- DINO-FID: 1.31 (MatPedia) vs. 1.90
- Image-to-Material (vs. MatFuse/MaterialPalette):
- Basecolor CLIP: 0.943 (MatPedia) vs. 0.833 / 0.813
- DINO (rendered): 0.843 (MatPedia) vs. 0.677 / 0.541
- Intrinsic Decomposition (vs. MaterialPalette/RGB⇄X):
- MSE (render): 0.008 (MatPedia) vs. 0.034 / 0.079
- LPIPS (render): 0.627 (MatPedia) vs. 0.644 / 0.706
Ablation studies reveal substantial improvements from VAE decoder fine-tuning (e.g., +3.55 dB normal PSNR, +5.20 dB roughness PSNR) and from the hybrid data regime (text CLIP up from 0.275 to 0.283; DINO-FID down from 1.62 to 1.31). Applications demonstrated include rapid procedural generation from text prompts, correction and “cleaning” of smartphone-captured textures, and physically-informed decomposition under complex environmental illumination (Luo et al., 21 Nov 2025).
7. Broader Context and Adaptation
MatPedia’s approach to joint RGB–PBR latent modeling and multi-modal diffusion is strongly aligned with recent trends in foundation models for scientific and materials domains. Analogous strategies in quantum many-body state modeling—for instance, representing local and global features as tokens, conditioning on external parameters via attention, and leveraging fidelity or energy-based supervision—support cross-disciplinary generalization and transfer (Zaklama et al., 12 Dec 2025). Plausible implications include the extension of such unified architectures to encompass not only surface property synthesis but also predictive tasks in computational materials science, provided appropriate atomic-site descriptors and physical global tokens are adopted.
MatPedia establishes a scalable prototype for universal models in material synthesis, combining architectural innovations in representation learning, cross-modal conditioning, and high-resolution diffusion with empirical advances in synthesis fidelity and practical deployment (Luo et al., 21 Nov 2025).