
GO-MLVTON: Diffusion-Based Multi-Layer VTON

Updated 27 January 2026
  • GO-MLVTON is a multi-layer virtual try-on approach that explicitly learns pixel-level occlusion relationships between inner and outer garments.
  • It integrates a Garment Occlusion Learning module with a StableDiffusion-based morphing module to achieve realistic deformation and artifact-free layering.
  • Evaluated on the MLG dataset, GO-MLVTON outperforms previous methods in FID, SSIM, LPIPS, and the newly introduced LACD metric, with better overall visual coherence.

GO-MLVTON (Garment Occlusion-Aware Multi-Layer Virtual Try-On with Diffusion Models) is a methodology for multi-layer garment virtual try-on (ML-VTON) that jointly addresses the challenges of garment occlusion modeling, spatial deformation, and layer-to-layer visual coherence. By explicitly learning pixel-level occlusion relationships between inner and outer garments, GO-MLVTON produces artifact-free, realistic multi-layer try-on results, addressing limitations in prior single-layer or multi-garment VTON approaches. The system integrates a dedicated Garment Occlusion Learning module, a StableDiffusion-based morphing and fitting module, and is benchmarked on the newly introduced MLG dataset with a layer- and edge-sensitive evaluation metric, LACD (Yu et al., 20 Jan 2026).

1. Multi-Layer Virtual Try-On and the Challenge of Occlusion

Image-based virtual try-on (VTON) research has traditionally focused on either single-garment try-on (SG-VTON), synthesizing one target garment onto a person, or multi-garment (MG-VTON), compositing non-overlapping garments without enforcing physical overlap or occlusion. Real-world usage, however, typically involves dressing multiple, sometimes overlapping, layers (e.g., T-shirt under a jacket), where physical realism demands:

  • Accurate spatial occlusion—outer garments must mask overlapping inner garment regions.
  • Realistic deformation—proper draping and fitting of both inner and outer layers to the target body and pose.
  • Artifact-free compositing—elimination of “bleed-through” or ghosting from occluded inner garment pixels.

Previous methods lack explicit pixel-level occlusion reasoning or the ability to deform and layer garments in a physically plausible manner. GO-MLVTON directly addresses these by introducing two interlinked components: the Garment Occlusion Learning (GOL) module and the Garment Morphing & Fitting (GMF) module built atop a diffusion model.

2. Garment Occlusion Learning (GOL) Module

The GOL module is designed to infer a spatially-varying attention mask that determines which pixels of an inner garment remain visible after layering with an outer garment. The process involves:

  • Dual-encoder feature extraction: Separate encoders $E_i$ and $E_o$, each with five layers (downsample convolutions followed by residual blocks), process the inner ($g_i$) and outer ($g_o$) garment images to yield feature maps $f_i$ and $f_o$.
  • Mapping to occlusion attention: Channel-concatenated features $[f_o \Vert f_i]$ are passed through a mapping network $U$ and a final $1\times 1$ convolution, followed by a sigmoid nonlinearity, producing the occlusion mask $A \in [0,1]^{1 \times H' \times W'}$.
  • Occlusion-weighted inner garment latent: The attention mask is applied element-wise to the latent representation of the inner garment ($z_i$ from the shared VAE encoder), yielding $z_{iv} = A \odot z_i$.

Loss function: A reconstruction loss enforces that $z_{iv}$ matches the visible inner garment as cropped from the real person image $x_{pi}$:

$$\mathcal{L}_{\rm OCC} = \left\| \varepsilon(x_{pi}) - z_{iv} \right\|_2^2,$$

weighted by $\lambda_2 = 0.1$ in the overall objective. This direct supervision trains the GOL module to suppress occluded regions, particularly at garment boundaries where masking errors are most visible.
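
Under the definitions above, the GOL forward pass can be sketched in PyTorch. The layer count, channel widths, and spatial sizes below are illustrative assumptions (and the residual blocks are omitted for brevity), not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class GOLModule(nn.Module):
    """Sketch of Garment Occlusion Learning: dual encoders, a mapping
    network U, a 1x1 conv head, and a sigmoid producing the mask A."""
    def __init__(self, in_ch=3, feat_ch=32, latent_ch=4):
        super().__init__()
        def encoder():
            # Downsample convolutions (paper: 5 layers with residual
            # blocks; 3 plain conv layers here to keep the sketch small)
            layers, c = [], in_ch
            for _ in range(3):
                layers += [nn.Conv2d(c, feat_ch, 3, stride=2, padding=1),
                           nn.ReLU()]
                c = feat_ch
            return nn.Sequential(*layers)
        self.E_i, self.E_o = encoder(), encoder()
        # Mapping network U, then a 1x1 conv head -> single-channel mask
        self.U = nn.Sequential(nn.Conv2d(2 * feat_ch, feat_ch, 3, padding=1),
                               nn.ReLU())
        self.head = nn.Conv2d(feat_ch, 1, kernel_size=1)

    def forward(self, g_i, g_o, z_i):
        f_i, f_o = self.E_i(g_i), self.E_o(g_o)
        # A = sigmoid(1x1 conv(U([f_o ‖ f_i]))), values in [0, 1]
        A = torch.sigmoid(self.head(self.U(torch.cat([f_o, f_i], dim=1))))
        return A * z_i, A  # z_iv = A ⊙ z_i (mask broadcast over channels)

def occlusion_loss(z_iv, z_gt_visible):
    # L_OCC: squared error between z_iv and the VAE latent of the
    # visible inner garment crop (mean-squared form assumed here)
    return torch.mean((z_gt_visible - z_iv) ** 2)
```

The mask is computed at the encoder's output resolution, so the latent $z_i$ is assumed to share that spatial size.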

3. StableDiffusion-Based Garment Morphing & Fitting (GMF) Module

Following occlusion reasoning, the GMF module synthesizes the final try-on output, adapting diffusion modeling to multi-layer conditioning:

  • Latent conditioning: The core conditional input is the concatenation $[z_a \Vert z_o \Vert z_{iv}]$, where $z_a$ is the clothing-agnostic person latent (body pose/shape), $z_o$ is the outer garment latent, and $z_{iv}$ is the occlusion-weighted inner garment latent.
  • Diffusion UNet: A StableDiffusion v1.5 UNet, with cross-attention layers removed for architectural alignment with ML-VTON tasks, jointly models the diffusion denoising process conditioned on these latents as well as a binary inpainting mask for masked person areas.
  • Training objective: The denoising loss per diffusion step $t$ is

$$\mathcal{L}_{\rm GMF} = \mathbb{E}_{t, \epsilon} \left\| \epsilon - \epsilon_\theta(z^t, t, z_{\rm in}, m_{\rm in}) \right\|_2^2,$$

and the combined objective is

$$\mathcal{L} = \mathcal{L}_{\rm GMF} + \lambda_2\,\mathcal{L}_{\rm OCC}.$$

  • Optimization and inference: Model initialization starts from InstructPix2Pix-pretrained StableDiffusion weights; most weights remain frozen except for the GOL and self-attention layers. AdamW is used with learning rate $1\times10^{-5}$, batch size $8$, and $10\%$ conditional dropout. Inference incorporates classifier-free guidance (CFG) with scale $s=2.5$ for an optimal fidelity/diversity balance.
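
The combined objective and the conditional dropout can be sketched as follows; the mean-squared form of both loss terms and the zeroed-conditioning convention for dropout are assumptions of this sketch, not details stated in the source:

```python
import torch

def combined_loss(eps, eps_pred, z_iv, z_gt_visible, lam2=0.1):
    """L = L_GMF + λ2 · L_OCC with λ2 = 0.1 (sketch).
    eps / eps_pred: true vs. predicted noise at step t;
    z_iv: occlusion-weighted inner latent from the GOL module;
    z_gt_visible: VAE latent of the visible inner garment crop."""
    l_gmf = torch.mean((eps - eps_pred) ** 2)
    l_occ = torch.mean((z_gt_visible - z_iv) ** 2)
    return l_gmf + lam2 * l_occ

def maybe_drop_condition(z_in, p=0.1):
    # 10% conditional dropout during training, so the model also learns
    # an "unconditional" prediction for classifier-free guidance at
    # inference (zeroed conditioning stands in for the null condition)
    return torch.zeros_like(z_in) if torch.rand(()) < p else z_in
```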

4. Multi-Layer Garment (MLG) Dataset

MLG is a dataset specifically constructed for the ML-VTON task, supporting robust training and evaluation:

  • Composition: 3,538 samples, split into 2,783 training and 755 test quadruplets $(g_i, g_o, x_p, x_a)$.
    • $g_i$: inner garment product shot
    • $g_o$: outer garment product shot
    • $x_p$: person image wearing both garments
    • $x_a$: clothing-agnostic image ($x_p$ with the upper-body region masked using SCHP parsing)
  • Garment and pose diversity: Categories include T-shirts, vests, cardigans, jackets, dresses, and coats; poses cover standing, walking, and dynamic actions in varied environments.
  • Annotation schema: Product crops are obtained or segmented via SAM; person parsing via SCHP provides the upper-body mask $M$; for each layer $i$, pixelwise masks $A_i$, boundary bands $B_i$, and interiors $C_i$ are defined for metric evaluation.
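
As one concrete step of this schema, the clothing-agnostic image can be derived from the SCHP mask roughly as follows (a minimal sketch assuming image values in [0, 1] and a binary H×W mask):

```python
import numpy as np

def make_agnostic(x_p, M):
    """Build the clothing-agnostic image x_a by zeroing out the SCHP
    upper-body mask M in the person image x_p (H×W×3), i.e.
    x_a = x_p ⊙ (1 - M), with M broadcast over the RGB channels."""
    return x_p * (1.0 - M)[..., None]
```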

5. Evaluation Metric: Layered Appearance Coherence Difference (LACD)

Standard metrics such as FID, SSIM, and LPIPS provide limited sensitivity to inter-layer artifacts and occlusion errors. LACD is introduced to address these limitations by directly measuring per-layer and seam consistency:

  • Definitions:
    • For each garment layer $i$, $A_i$ is its pixel region, $B_i$ is the seam band adjoining layer $i+1$, and $C_i = A_i \setminus B_i$ is the interior.
  • Layer discrepancy:

$$\ell_i = \lambda_1 \sum_{j\in B_i} \left\| x_{gt}^{(i,j)} - x_{gen}^{(i,j)} \right\|_2 + \sum_{k\in C_i} \left\| x_{gt}^{(i,k)} - x_{gen}^{(i,k)} \right\|_2,$$

where $\lambda_1 = 3$ upweights seam artifacts.

  • Overall metric:

$$\mathrm{LACD} = \frac{1}{N} \sum_{i=1}^N \ell_i$$

  • Interpretation: The LACD emphasizes seam fidelity and inter-layer coherence, penalizing unrealistic edge transitions and “leakage” of occluded layers, making it highly discriminative for ML-VTON scenarios.
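
A minimal implementation of LACD under these definitions might look like the following; the per-pixel L2 over RGB channels and the boolean-mask representation of $A_i$ and $B_i$ are assumptions of this sketch:

```python
import numpy as np

def lacd(x_gt, x_gen, layer_masks, seam_bands, lam1=3.0):
    """Sketch of LACD. layer_masks[i] is the boolean region A_i and
    seam_bands[i] the boolean seam band B_i (a subset of A_i); the
    interior is C_i = A_i minus B_i. Images are H×W×3 arrays."""
    # Per-pixel L2 distance over the channel axis
    diffs = np.linalg.norm(x_gt - x_gen, axis=-1)
    total = 0.0
    for A, B in zip(layer_masks, seam_bands):
        C = A & ~B  # interior pixels
        # Seam-band errors are upweighted by λ1 = 3
        total += lam1 * diffs[B].sum() + diffs[C].sum()
    return total / len(layer_masks)
```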

6. Experimental Results and Ablations

Extensive experiments on MLG benchmark the quantitative and qualitative advantages of GO-MLVTON:

Quantitative Comparison:

Method              FID ↓   KID ↓   SSIM ↑   LPIPS ↓   LACD ↓
CAT-DM (CVPR’24)    32.36   5.97    0.845    0.138     0.719
MV-VTON (AAAI’25)   36.31   7.60    0.830    0.175     0.973
CATVTON (ICLR’25)   30.54   4.61    0.841    0.127     0.626
GO-MLVTON           22.82   0.35    0.858    0.108     0.623

GO-MLVTON outperforms all baselines, achieving significant improvements (e.g., a 7.7-point drop in FID, 4.26 lower KID versus the best baseline, and consistent gains in SSIM, LPIPS, and LACD).

GOL Module Ablation:

Ablation studies reveal that the GOL module alone (without $\mathcal{L}_{\rm OCC}$) may leave occluded inner-layer details visible. Full supervision with $\mathcal{L}_{\rm OCC}$ is necessary for optimal suppression and minimal seam and LACD errors.

Classifier-Free Guidance (CFG) Hyperparameter:

The optimal CFG scale for the StableDiffusion UNet is $s=2.5$, producing the best accuracy/diversity trade-off.
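
Assuming the standard classifier-free guidance formulation, the guided noise prediction at scale $s=2.5$ combines the conditional and unconditional UNet outputs as:

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, s=2.5):
    """Standard classifier-free guidance: extrapolate from the
    unconditional noise prediction toward the conditional one.
    s = 2.5 is the scale reported as optimal for GO-MLVTON."""
    return eps_uncond + s * (eps_cond - eps_uncond)
```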

Qualitative Assessment:

Visual results confirm the elimination of inner garment bleed-through, preservation of garment boundary sharpness, and physically coherent layer interaction and deformation.

7. GO-MLVTON Pipeline Summary

A high-level pseudocode outline of the GO-MLVTON process is as follows:

Input:
  x_p     ← person image wearing both garments
  g_i, g_o ← product-shot images of inner & outer garments
  M       ← upper-body mask from SCHP

1. Build agnostic input:
   x_a ← x_p ⊙ (1 - M)

2. VAE encode:
   z_a ← ε(x_a)
   z_i ← ε(g_i)
   z_o ← ε(g_o)

3. Garment Occlusion Learning (GOL):
   f_i ← E_i(g_i)
   f_o ← E_o(g_o)
   A   ← sigmoid( Linear( U( [f_o‖f_i] ) ) )
   z_iv ← A ⊙ z_i

4. Prepare diffusion input:
   z_in ← [ z_a ‖ z_o ‖ z_iv ]
   m_in ← [ M ‖ 0 ‖ 0 ]
   ε    ← sample Gaussian noise

5. StableDiffusion-based Garment Morphing & Fitting (GMF):
   for t = T…1 do
     predict   ε_θ ← UNet( z^t , t | z_in, m_in )
     compute   L_GMF += || ε - ε_θ ||^2      (training only)
     z^{t-1}   ← denoise_step(z^t, ε_θ)
   end for

6. Decode output:
   x_g ← D( z^0 )

7. Total loss (train):
   L_total = L_GMF + 0.1 × L_OCC

GO-MLVTON thus establishes an effective paradigm for multi-layer virtual try-on, innovating across occlusion modeling, diffusion-based compositing, and dedicated evaluation for fashion-oriented generative vision tasks (Yu et al., 20 Jan 2026).
