GO-MLVTON: Diffusion-Based Multi-Layer VTON
- GO-MLVTON is a multi-layer virtual try-on approach that explicitly learns pixel-level occlusion relationships between inner and outer garments.
- It integrates a Garment Occlusion Learning module with a StableDiffusion-based morphing module to achieve realistic deformation and artifact-free layering.
- Evaluated on the MLG dataset using the LACD metric, GO-MLVTON outperforms previous methods in FID, SSIM, and overall visual coherence.
GO-MLVTON (Garment Occlusion-Aware Multi-Layer Virtual Try-On with Diffusion Models) is a methodology for multi-layer garment virtual try-on (ML-VTON) that jointly addresses the challenges of garment occlusion modeling, spatial deformation, and layer-to-layer visual coherence. By explicitly learning pixel-level occlusion relationships between inner and outer garments, GO-MLVTON produces artifact-free, realistic multi-layer try-on results, addressing limitations in prior single-layer or multi-garment VTON approaches. The system integrates a dedicated Garment Occlusion Learning module, a StableDiffusion-based morphing and fitting module, and is benchmarked on the newly introduced MLG dataset with a layer- and edge-sensitive evaluation metric, LACD (Yu et al., 20 Jan 2026).
1. Multi-Layer Virtual Try-On and the Challenge of Occlusion
Image-based virtual try-on (VTON) research has traditionally focused on either single-garment try-on (SG-VTON), synthesizing one target garment onto a person, or multi-garment (MG-VTON), compositing non-overlapping garments without enforcing physical overlap or occlusion. Real-world usage, however, typically involves dressing multiple, sometimes overlapping, layers (e.g., T-shirt under a jacket), where physical realism demands:
- Accurate spatial occlusion—outer garments must mask overlapping inner garment regions.
- Realistic deformation—proper draping and fitting of both inner and outer layers to the target body and pose.
- Artifact-free compositing—elimination of “bleed-through” or ghosting from occluded inner garment pixels.
Previous methods lack explicit pixel-level occlusion reasoning or the ability to deform and layer garments in a physically plausible manner. GO-MLVTON directly addresses these by introducing two interlinked components: the Garment Occlusion Learning (GOL) module and the Garment Morphing & Fitting (GMF) module built atop a diffusion model.
2. Garment Occlusion Learning (GOL) Module
The GOL module is designed to infer a spatially-varying attention mask that determines which pixels of an inner garment remain visible after layering with an outer garment. The process involves:
- Dual-encoder feature extraction: Separate encoders $E_i$ and $E_o$, each with five layers (downsample convolutions followed by residual blocks), process the inner ($g_i$) and outer ($g_o$) garment images to yield feature maps $f_i$ and $f_o$.
- Mapping to occlusion attention: The channel-concatenated features $[f_o \,\|\, f_i]$ are passed through a mapping network $U$ and a final convolution, followed by a sigmoid nonlinearity, producing the occlusion mask $A$.
- Occlusion-weighted inner garment latent: The attention mask is applied element-wise to the latent representation of the inner garment ($z_i$ from the shared VAE encoder $\varepsilon$), yielding $z_{iv} = A \odot z_i$.
Loss function: A reconstruction loss enforces that the occlusion-weighted latent $z_{iv} = A \odot z_i$ matches the visible inner garment as cropped from the real person image:

$$\mathcal{L}_{\text{occ}} = \big\| z_{iv} - \varepsilon(g_i^{\text{vis}}) \big\|_2^2,$$

where $g_i^{\text{vis}}$ denotes the visible inner-garment crop and $\varepsilon$ the shared VAE encoder; $\mathcal{L}_{\text{occ}}$ enters the overall objective with weight $0.1$. This direct supervision ensures the GOL module learns to suppress occluded regions effectively, producing occlusion-aware masking that is especially accurate at garment boundaries.
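The GOL computation above can be sketched with toy numpy stand-ins. The feature encoders are replaced by random feature maps and the mapping network $U$ plus final convolution by a single 1×1 projection; all shapes, weights, and the reconstruction target are illustrative, not the paper's:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gol_mask(f_i, f_o, w, b):
    """Occlusion attention A from channel-concatenated garment features.

    f_i, f_o : (C, H, W) feature maps for inner/outer garments
    w, b     : weights of a 1x1-projection to a single channel
               (stand-in for the mapping network U + final conv)
    """
    f = np.concatenate([f_o, f_i], axis=0)       # channel concat [f_o ; f_i]
    logits = np.einsum("c,chw->hw", w, f) + b    # 1x1 projection per pixel
    return sigmoid(logits)                       # A in (0, 1)

def occlusion_loss(A, z_i, z_vis, lam=0.1):
    """Weighted L2 between A ⊙ z_i and the visible inner-garment latent."""
    z_iv = A[None] * z_i                         # broadcast mask over channels
    return lam * float(np.mean((z_iv - z_vis) ** 2)), z_iv

rng = np.random.default_rng(0)
C, H, Wd = 4, 8, 8
f_i, f_o = rng.normal(size=(C, H, Wd)), rng.normal(size=(C, H, Wd))
A = gol_mask(f_i, f_o, w=rng.normal(size=2 * C), b=0.0)
loss, z_iv = occlusion_loss(A, z_i=rng.normal(size=(C, H, Wd)),
                            z_vis=rng.normal(size=(C, H, Wd)))
```

The only structural points carried over from the text are the channel concatenation, the sigmoid-bounded mask, the element-wise weighting of the inner-garment latent, and the $0.1$-weighted reconstruction loss.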
3. StableDiffusion-Based Garment Morphing & Fitting (GMF) Module
Following occlusion reasoning, the GMF module synthesizes the final try-on output, adapting diffusion modeling to multi-layer conditioning:
- Latent conditioning: The core conditional input is the concatenation $z_{\text{in}} = [z_a \,\|\, z_o \,\|\, z_{iv}]$, where $z_a$ is the clothing-agnostic person latent (body pose/shape), $z_o$ is the outer garment latent, and $z_{iv}$ is the occlusion-weighted inner garment latent.
- Diffusion UNet: A StableDiffusion v1.5 UNet, with cross-attention layers removed for architectural alignment with ML-VTON tasks, jointly models the diffusion denoising process conditioned on these latents as well as a binary inpainting mask for masked person areas.
- Training objective: The denoising loss per diffusion step is

$$\mathcal{L}_{\text{GMF}} = \mathbb{E}_{t,\epsilon}\big[\, \| \epsilon - \epsilon_\theta(z^t, t, z_{\text{in}}, m_{\text{in}}) \|_2^2 \,\big],$$

and the combined objective is

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{GMF}} + 0.1\,\mathcal{L}_{\text{occ}}.$$
- Optimization and inference: Model initialization starts from InstructPix2Pix-pretrained StableDiffusion weights; most weights remain frozen except for the GOL module and the self-attention layers. Training uses AdamW with batch size $8$ and conditional dropout. Inference incorporates classifier-free guidance (CFG), with the guidance scale tuned for the best fidelity/diversity balance.
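A single training step of this conditioning scheme can be illustrated with numpy toys. The UNet is a stand-in function, the noise schedule collapses to one $\bar\alpha_t$ value, and $\mathcal{L}_{\text{occ}}$ is a placeholder constant; none of these stand for the actual StableDiffusion components:

```python
import numpy as np

rng = np.random.default_rng(1)
C, H, W = 4, 8, 8

# Conditioning latents (stand-ins for VAE encodings)
z_a  = rng.normal(size=(C, H, W))   # clothing-agnostic person latent
z_o  = rng.normal(size=(C, H, W))   # outer garment latent
z_iv = rng.normal(size=(C, H, W))   # occlusion-weighted inner garment latent

z_in = np.concatenate([z_a, z_o, z_iv], axis=0)   # channel-wise conditioning stack
m_in = np.zeros((1, H, W)); m_in[:, :4] = 1.0     # toy binary inpainting mask

def toy_unet(z_t, t, cond, mask):
    """Stand-in for the conditioned SD-v1.5 UNet: any function of its inputs."""
    return 0.1 * z_t + 0.01 * cond[:C] + 0.0 * t + 0.0 * mask.mean()

# One denoising training step: predict the injected noise, take the MSE
z0  = rng.normal(size=(C, H, W))
eps = rng.normal(size=(C, H, W))
alpha_bar = 0.7                                   # toy noise-schedule value at step t
z_t = np.sqrt(alpha_bar) * z0 + np.sqrt(1 - alpha_bar) * eps
l_gmf = float(np.mean((eps - toy_unet(z_t, t=10, cond=z_in, mask=m_in)) ** 2))
l_total = l_gmf + 0.1 * 0.5                       # + 0.1 * L_occ (0.5 is a placeholder)
```

What the sketch preserves from the text: the three-way latent concatenation, the binary inpainting mask as side input, the per-step noise-prediction MSE, and the $0.1$-weighted occlusion term in the total loss.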
4. Multi-Layer Garment (MLG) Dataset
MLG is a dataset specifically constructed for the ML-VTON task, supporting robust training and evaluation:
- Composition: 3,538 samples, split into 2,783 training and 755 test quadruplets $(g_i, g_o, x_p, x_a)$:
- $g_i$: inner garment product shot
- $g_o$: outer garment product shot
- $x_p$: person image wearing both garments
- $x_a$: clothing-agnostic image ($x_p$ with the upper-body region masked using SCHP parsing)
- Garment and pose diversity: Categories include T-shirts, vests, cardigans, jackets, dresses, and coats; poses cover standing, walking, and dynamic actions in varied environments.
- Annotation schema: Product crops are obtained or segmented via SAM; person parsing via SCHP provides the upper-body mask $M$; for each layer $\ell$, pixelwise masks $M_\ell$, boundary bands $B_\ell$, and interiors $I_\ell$ are defined for metric evaluation.
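Given the masking convention above, building the clothing-agnostic input $x_a = x_p \odot (1 - M)$ is a one-liner; the image and mask here are synthetic:

```python
import numpy as np

def make_agnostic(x_p, M):
    """Clothing-agnostic input: zero out the parsed upper-body region.

    x_p : (H, W, 3) person image
    M   : (H, W) binary upper-body mask (1 = masked region)
    """
    return x_p * (1.0 - M)[..., None]   # broadcast mask over RGB channels

H, W = 6, 6
x_p = np.ones((H, W, 3))                # toy all-ones "person image"
M = np.zeros((H, W)); M[1:4, 2:5] = 1.0 # toy upper-body mask
x_a = make_agnostic(x_p, M)
```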
5. Evaluation Metric: Layered Appearance Coherence Difference (LACD)
Standard metrics such as FID, SSIM, and LPIPS provide limited sensitivity to inter-layer artifacts and occlusion errors. LACD is introduced to address these limitations by directly measuring per-layer and seam consistency:
- Definitions: For each garment layer $\ell$, $M_\ell$ is its pixel region, $B_\ell$ is the seam band where layer $\ell$ adjoins the adjacent layer, and $I_\ell = M_\ell \setminus B_\ell$ is the interior.
- Layer discrepancy:

$$d_\ell = D(I_\ell) + \alpha\, D(B_\ell),$$

where $D(\cdot)$ is a per-region appearance discrepancy between the generated and reference images and $\alpha > 1$ upweights seam artifacts.
- Overall metric:

$$\mathrm{LACD} = \frac{1}{L} \sum_{\ell=1}^{L} d_\ell.$$
- Interpretation: The LACD emphasizes seam fidelity and inter-layer coherence, penalizing unrealistic edge transitions and “leakage” of occluded layers, making it highly discriminative for ML-VTON scenarios.
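A minimal sketch of a LACD-style score, assuming a mean-absolute-difference discrepancy per region and a seam-band weight $\alpha = 2$ (both are illustrative choices, not the paper's definitions):

```python
import numpy as np

def region_disc(gen, ref, mask):
    """Mean absolute appearance discrepancy inside a binary region mask."""
    m = mask.astype(bool)
    return float(np.abs(gen[m] - ref[m]).mean()) if m.any() else 0.0

def lacd(gen, ref, interiors, bands, alpha=2.0):
    """Layer-averaged discrepancy with seam bands upweighted by alpha."""
    d = [region_disc(gen, ref, I) + alpha * region_disc(gen, ref, B)
         for I, B in zip(interiors, bands)]
    return sum(d) / len(d)

H, W = 8, 8
ref = np.zeros((H, W))                        # toy grayscale reference
I = np.zeros((H, W)); I[2:6, 2:6] = 1         # toy layer interior
B = np.zeros((H, W)); B[1, :] = 1             # toy seam band
gen_seam = ref.copy(); gen_seam[1, :]    = 0.5  # error confined to the seam
gen_int  = ref.copy(); gen_int[2:6, 2:6] = 0.5  # same-magnitude interior error
```

Under this toy definition, an error of equal per-pixel magnitude scores $\alpha\times$ worse when it lies on a seam than in an interior, which is the discriminative behavior the metric is described as providing.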
6. Experimental Results and Ablations
Extensive experiments on MLG benchmark the quantitative and qualitative advantages of GO-MLVTON:
Quantitative Comparison:
| Method | FID ↓ | KID ↓ | SSIM ↑ | LPIPS ↓ | LACD ↓ |
|---|---|---|---|---|---|
| CAT-DM (CVPR’24) | 32.36 | 5.97 | 0.845 | 0.138 | 0.719 |
| MV-VTON (AAAI’25) | 36.31 | 7.60 | 0.830 | 0.175 | 0.973 |
| CATVTON (ICLR’25) | 30.54 | 4.61 | 0.841 | 0.127 | 0.626 |
| GO-MLVTON | 22.82 | 0.35 | 0.858 | 0.108 | 0.623 |
GO-MLVTON outperforms all baselines, achieving significant improvements (e.g., a 7.7-point drop in FID, 4.26 lower KID versus the best baseline, and consistent gains in SSIM, LPIPS, and LACD).
GOL Module Ablation:
Ablation studies reveal that the GOL module alone (without $\mathcal{L}_{\text{occ}}$) may leave occluded inner-layer details visible. Full supervision with $\mathcal{L}_{\text{occ}}$ is necessary for optimal suppression and minimal seam and LACD errors.
Classifier-Free Guidance (CFG) Hyperparameter:
Sweeping the CFG scale of the StableDiffusion UNet shows that an intermediate setting produces the best accuracy/diversity trade-off.
Qualitative Assessment:
Visual results confirm the elimination of inner garment bleed-through, preservation of garment boundary sharpness, and physically coherent layer interaction and deformation.
7. GO-MLVTON Pipeline Summary
A high-level pseudocode outline of the GO-MLVTON process is as follows:
```
Input:
    x_p       ← person image wearing both garments
    g_i, g_o  ← product-shot images of inner & outer garments
    M         ← upper-body mask from SCHP

1. Build agnostic input:
    x_a ← x_p ⊙ (1 - M)
2. VAE encode (shared encoder ε, decoder D):
    z_a ← ε(x_a);  z_i ← ε(g_i);  z_o ← ε(g_o)
3. Garment Occlusion Learning (GOL):
    f_i ← E_i(g_i);  f_o ← E_o(g_o)
    A ← sigmoid( Linear( U( [f_o ‖ f_i] ) ) )
    z_iv ← A ⊙ z_i
4. Prepare diffusion input:
    z_in ← [ z_a ‖ z_o ‖ z_iv ]
    m_in ← [ M ‖ 0 ‖ 0 ]
    ε ← sample Gaussian noise
5. StableDiffusion-based Garment Morphing & Fitting (GMF):
    for t = T…1 do
        ε_θ ← UNet( z^t, t | z_in, m_in )
        L_GMF += ‖ ε - ε_θ ‖²
        z^{t-1} ← denoise_step(z^t, ε_θ)
    end for
6. Decode output:
    x_g ← D( z^0 )
7. Total loss (train):
    L_total = L_GMF + 0.1 × L_OCC
```
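The inference half of this pipeline reduces to a conditioned reverse loop followed by a decode. With a drastically simplified noise predictor and reverse step (the real system uses the conditioned StableDiffusion UNet, CFG, and a proper DDPM/DDIM schedule, none of which are modeled here):

```python
import numpy as np

rng = np.random.default_rng(2)
C, H, W = 4, 8, 8
z_in = rng.normal(size=(3 * C, H, W))    # [z_a ; z_o ; z_iv] conditioning stack

def eps_theta(z_t, t, cond):
    """Toy noise predictor standing in for the conditioned UNet."""
    return 0.05 * z_t + 0.01 * cond[:C]

# Simplified reverse loop: repeatedly subtract predicted noise (schedule omitted)
T = 5
z_t = rng.normal(size=(C, H, W))         # start from pure Gaussian noise
for t in range(T, 0, -1):
    z_t = z_t - 0.1 * eps_theta(z_t, t, z_in)   # stand-in for denoise_step

x_g = np.tanh(z_t)                       # stand-in for the VAE decode D(z^0)
```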
GO-MLVTON thus establishes an effective paradigm for multi-layer virtual try-on, innovating across occlusion modeling, diffusion-based compositing, and dedicated evaluation for fashion-oriented generative vision tasks (Yu et al., 20 Jan 2026).