GO-MLVTON: Diffusion-Based Multi-Layer VTON
- GO-MLVTON is a multi-layer virtual try-on approach that explicitly learns pixel-level occlusion relationships between inner and outer garments.
- It integrates a Garment Occlusion Learning module with a StableDiffusion-based morphing module to achieve realistic deformation and artifact-free layering.
- Evaluated on the MLG dataset using the LACD metric, GO-MLVTON outperforms previous methods in FID, SSIM, and overall visual coherence.
GO-MLVTON (Garment Occlusion-Aware Multi-Layer Virtual Try-On with Diffusion Models) is a methodology for multi-layer garment virtual try-on (ML-VTON) that jointly addresses the challenges of garment occlusion modeling, spatial deformation, and layer-to-layer visual coherence. By explicitly learning pixel-level occlusion relationships between inner and outer garments, GO-MLVTON produces artifact-free, realistic multi-layer try-on results, addressing limitations in prior single-layer or multi-garment VTON approaches. The system integrates a dedicated Garment Occlusion Learning module, a StableDiffusion-based morphing and fitting module, and is benchmarked on the newly introduced MLG dataset with a layer- and edge-sensitive evaluation metric, LACD (Yu et al., 20 Jan 2026).
1. Multi-Layer Virtual Try-On and the Challenge of Occlusion
Image-based virtual try-on (VTON) research has traditionally focused on either single-garment try-on (SG-VTON), synthesizing one target garment onto a person, or multi-garment (MG-VTON), compositing non-overlapping garments without enforcing physical overlap or occlusion. Real-world usage, however, typically involves dressing multiple, sometimes overlapping, layers (e.g., T-shirt under a jacket), where physical realism demands:
- Accurate spatial occlusion—outer garments must mask overlapping inner garment regions.
- Realistic deformation—proper draping and fitting of both inner and outer layers to the target body and pose.
- Artifact-free compositing—elimination of “bleed-through” or ghosting from occluded inner garment pixels.
Previous methods lack explicit pixel-level occlusion reasoning or the ability to deform and layer garments in a physically plausible manner. GO-MLVTON directly addresses these by introducing two interlinked components: the Garment Occlusion Learning (GOL) module and the Garment Morphing & Fitting (GMF) module built atop a diffusion model.
2. Garment Occlusion Learning (GOL) Module
The GOL module is designed to infer a spatially-varying attention mask that determines which pixels of an inner garment remain visible after layering with an outer garment. The process involves:
- Dual-encoder feature extraction: Separate encoders $E_i$ and $E_o$, each with five layers (downsample convolutions followed by residual blocks), process the inner ($g_i$) and outer ($g_o$) garment images to yield feature maps $f_i$ and $f_o$.
- Mapping to occlusion attention: The channel-concatenated features $[f_o \,\|\, f_i]$ are passed through a mapping network $U$ and a final convolution, followed by a sigmoid nonlinearity, producing the occlusion mask $A$.
- Occlusion-weighted inner garment latent: The attention mask is applied element-wise to the latent representation of the inner garment ($z_i$ from the shared VAE encoder $\varepsilon$), yielding $z_{iv} = A \odot z_i$.
Loss function: A reconstruction loss enforces that the occlusion-weighted latent $z_{iv} = A \odot z_i$ matches the visible inner garment as cropped from the real person image:

$$\mathcal{L}_{\text{occ}} = \big\| z_{iv} - \varepsilon(g_i^{\text{vis}}) \big\|_2^2,$$

where $g_i^{\text{vis}}$ denotes the visible inner-garment crop and $\varepsilon$ the shared VAE encoder; $\mathcal{L}_{\text{occ}}$ enters the overall objective with weight $0.1$. This direct supervision ensures the GOL module learns to suppress occluded regions effectively, producing occlusion-aware masking that is especially accurate at garment boundaries.
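The GOL computation above can be sketched with toy numpy stand-ins. The feature encoders are replaced by random feature maps and the mapping network $U$ plus final convolution by a single 1×1 projection; all shapes, weights, and the reconstruction target are illustrative, not the paper's:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gol_mask(f_i, f_o, w, b):
    """Occlusion attention A from channel-concatenated garment features.

    f_i, f_o : (C, H, W) feature maps for inner/outer garments
    w, b     : weights of a 1x1-projection to a single channel
               (stand-in for the mapping network U + final conv)
    """
    f = np.concatenate([f_o, f_i], axis=0)       # channel concat [f_o ; f_i]
    logits = np.einsum("c,chw->hw", w, f) + b    # 1x1 projection per pixel
    return sigmoid(logits)                       # A in (0, 1)

def occlusion_loss(A, z_i, z_vis, lam=0.1):
    """Weighted L2 between A ⊙ z_i and the visible inner-garment latent."""
    z_iv = A[None] * z_i                         # broadcast mask over channels
    return lam * float(np.mean((z_iv - z_vis) ** 2)), z_iv

rng = np.random.default_rng(0)
C, H, Wd = 4, 8, 8
f_i, f_o = rng.normal(size=(C, H, Wd)), rng.normal(size=(C, H, Wd))
A = gol_mask(f_i, f_o, w=rng.normal(size=2 * C), b=0.0)
loss, z_iv = occlusion_loss(A, z_i=rng.normal(size=(C, H, Wd)),
                            z_vis=rng.normal(size=(C, H, Wd)))
```

The only structural points carried over from the text are the channel concatenation, the sigmoid-bounded mask, the element-wise weighting of the inner-garment latent, and the $0.1$-weighted reconstruction loss.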
3. StableDiffusion-Based Garment Morphing & Fitting (GMF) Module
Following occlusion reasoning, the GMF module synthesizes the final try-on output, adapting diffusion modeling to multi-layer conditioning:
- Latent conditioning: The core conditional input is the concatenation $z_{\text{in}} = [z_a \,\|\, z_o \,\|\, z_{iv}]$, where $z_a$ is the clothing-agnostic person latent (body pose/shape), $z_o$ is the outer garment latent, and $z_{iv}$ is the occlusion-weighted inner garment latent.
- Diffusion UNet: A StableDiffusion v1.5 UNet, with cross-attention layers removed for architectural alignment with ML-VTON tasks, jointly models the diffusion denoising process conditioned on these latents as well as a binary inpainting mask for masked person areas.
- Training objective: The denoising loss per diffusion step is

$$\mathcal{L}_{\text{GMF}} = \mathbb{E}_{t,\epsilon}\big[\, \| \epsilon - \epsilon_\theta(z^t, t, z_{\text{in}}, m_{\text{in}}) \|_2^2 \,\big],$$

and the combined objective is

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{GMF}} + 0.1\,\mathcal{L}_{\text{occ}}.$$
- Optimization and inference: Model initialization starts from InstructPix2Pix-pretrained StableDiffusion weights; most weights remain frozen except for the GOL module and the self-attention layers. Training uses AdamW with batch size $8$ and conditional dropout. Inference incorporates classifier-free guidance (CFG), with the guidance scale tuned for the best fidelity/diversity balance.
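A single training step of this conditioning scheme can be illustrated with numpy toys. The UNet is a stand-in function, the noise schedule collapses to one $\bar\alpha_t$ value, and $\mathcal{L}_{\text{occ}}$ is a placeholder constant; none of these stand for the actual StableDiffusion components:

```python
import numpy as np

rng = np.random.default_rng(1)
C, H, W = 4, 8, 8

# Conditioning latents (stand-ins for VAE encodings)
z_a  = rng.normal(size=(C, H, W))   # clothing-agnostic person latent
z_o  = rng.normal(size=(C, H, W))   # outer garment latent
z_iv = rng.normal(size=(C, H, W))   # occlusion-weighted inner garment latent

z_in = np.concatenate([z_a, z_o, z_iv], axis=0)   # channel-wise conditioning stack
m_in = np.zeros((1, H, W)); m_in[:, :4] = 1.0     # toy binary inpainting mask

def toy_unet(z_t, t, cond, mask):
    """Stand-in for the conditioned SD-v1.5 UNet: any function of its inputs."""
    return 0.1 * z_t + 0.01 * cond[:C] + 0.0 * t + 0.0 * mask.mean()

# One denoising training step: predict the injected noise, take the MSE
z0  = rng.normal(size=(C, H, W))
eps = rng.normal(size=(C, H, W))
alpha_bar = 0.7                                   # toy noise-schedule value at step t
z_t = np.sqrt(alpha_bar) * z0 + np.sqrt(1 - alpha_bar) * eps
l_gmf = float(np.mean((eps - toy_unet(z_t, t=10, cond=z_in, mask=m_in)) ** 2))
l_total = l_gmf + 0.1 * 0.5                       # + 0.1 * L_occ (0.5 is a placeholder)
```

What the sketch preserves from the text: the three-way latent concatenation, the binary inpainting mask as side input, the per-step noise-prediction MSE, and the $0.1$-weighted occlusion term in the total loss.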
4. Multi-Layer Garment (MLG) Dataset
MLG is a dataset specifically constructed for the ML-VTON task, supporting robust training and evaluation:
- Composition: 3,538 samples, split into 2,783 training and 755 test quadruplets $(g_i, g_o, x_p, x_a)$:
- $g_i$: inner garment product shot
- $g_o$: outer garment product shot
- $x_p$: person image wearing both garments
- $x_a$: clothing-agnostic image ($x_p$ with the upper-body region masked using SCHP parsing)
- Garment and pose diversity: Categories include T-shirts, vests, cardigans, jackets, dresses, and coats; poses cover standing, walking, and dynamic actions in varied environments.
- Annotation schema: Product crops are obtained or segmented via SAM; person parsing via SCHP provides the upper-body mask $M$; for each layer $\ell$, pixelwise masks $M_\ell$, boundary bands $B_\ell$, and interiors $I_\ell$ are defined for metric evaluation.
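Given the masking convention above, building the clothing-agnostic input $x_a = x_p \odot (1 - M)$ is a one-liner; the image and mask here are synthetic:

```python
import numpy as np

def make_agnostic(x_p, M):
    """Clothing-agnostic input: zero out the parsed upper-body region.

    x_p : (H, W, 3) person image
    M   : (H, W) binary upper-body mask (1 = masked region)
    """
    return x_p * (1.0 - M)[..., None]   # broadcast mask over RGB channels

H, W = 6, 6
x_p = np.ones((H, W, 3))                # toy all-ones "person image"
M = np.zeros((H, W)); M[1:4, 2:5] = 1.0 # toy upper-body mask
x_a = make_agnostic(x_p, M)
```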
5. Evaluation Metric: Layered Appearance Coherence Difference (LACD)
Standard metrics such as FID, SSIM, and LPIPS provide limited sensitivity to inter-layer artifacts and occlusion errors. LACD is introduced to address these limitations by directly measuring per-layer and seam consistency:
- Definitions: For each garment layer $\ell$, $M_\ell$ is its pixel region, $B_\ell$ is the seam band where layer $\ell$ adjoins the adjacent layer, and $I_\ell = M_\ell \setminus B_\ell$ is the interior.
- Layer discrepancy:

$$d_\ell = D(I_\ell) + \alpha\, D(B_\ell),$$

where $D(\cdot)$ is a per-region appearance discrepancy between the generated and reference images and $\alpha > 1$ upweights seam artifacts.
- Overall metric:

$$\mathrm{LACD} = \frac{1}{L} \sum_{\ell=1}^{L} d_\ell.$$
- Interpretation: The LACD emphasizes seam fidelity and inter-layer coherence, penalizing unrealistic edge transitions and “leakage” of occluded layers, making it highly discriminative for ML-VTON scenarios.
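A minimal sketch of a LACD-style score, assuming a mean-absolute-difference discrepancy per region and a seam-band weight $\alpha = 2$ (both are illustrative choices, not the paper's definitions):

```python
import numpy as np

def region_disc(gen, ref, mask):
    """Mean absolute appearance discrepancy inside a binary region mask."""
    m = mask.astype(bool)
    return float(np.abs(gen[m] - ref[m]).mean()) if m.any() else 0.0

def lacd(gen, ref, interiors, bands, alpha=2.0):
    """Layer-averaged discrepancy with seam bands upweighted by alpha."""
    d = [region_disc(gen, ref, I) + alpha * region_disc(gen, ref, B)
         for I, B in zip(interiors, bands)]
    return sum(d) / len(d)

H, W = 8, 8
ref = np.zeros((H, W))                        # toy grayscale reference
I = np.zeros((H, W)); I[2:6, 2:6] = 1         # toy layer interior
B = np.zeros((H, W)); B[1, :] = 1             # toy seam band
gen_seam = ref.copy(); gen_seam[1, :]    = 0.5  # error confined to the seam
gen_int  = ref.copy(); gen_int[2:6, 2:6] = 0.5  # same-magnitude interior error
```

Under this toy definition, an error of equal per-pixel magnitude scores $\alpha\times$ worse when it lies on a seam than in an interior, which is the discriminative behavior the metric is described as providing.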
6. Experimental Results and Ablations
Extensive experiments on MLG benchmark the quantitative and qualitative advantages of GO-MLVTON:
Quantitative Comparison:
| Method | FID ↓ | KID ↓ | SSIM ↑ | LPIPS ↓ | LACD ↓ |
|---|---|---|---|---|---|
| CAT-DM (CVPR’24) | 32.36 | 5.97 | 0.845 | 0.138 | 0.719 |
| MV-VTON (AAAI’25) | 36.31 | 7.60 | 0.830 | 0.175 | 0.973 |
| CATVTON (ICLR’25) | 30.54 | 4.61 | 0.841 | 0.127 | 0.626 |
| GO-MLVTON | 22.82 | 0.35 | 0.858 | 0.108 | 0.623 |
GO-MLVTON outperforms all baselines, achieving significant improvements (e.g., a 7.7-point drop in FID, 4.26 lower KID versus the best baseline, and consistent gains in SSIM, LPIPS, and LACD).
GOL Module Ablation:
Ablation studies reveal that the GOL module alone (without $\mathcal{L}_{\text{occ}}$) may leave occluded inner-layer details visible. Full supervision with $\mathcal{L}_{\text{occ}}$ is necessary for optimal suppression and minimal seam and LACD errors.
Classifier-Free Guidance (CFG) Hyperparameter:
Sweeping the CFG scale of the StableDiffusion UNet shows that an intermediate setting produces the best accuracy/diversity trade-off.
Qualitative Assessment:
Visual results confirm the elimination of inner garment bleed-through, preservation of garment boundary sharpness, and physically coherent layer interaction and deformation.
7. GO-MLVTON Pipeline Summary
A high-level pseudocode outline of the GO-MLVTON process is as follows:
```
Input:
    x_p       ← person image wearing both garments
    g_i, g_o  ← product-shot images of inner & outer garments
    M         ← upper-body mask from SCHP

1. Build agnostic input:
    x_a ← x_p ⊙ (1 - M)
2. VAE encode (shared encoder ε, decoder D):
    z_a ← ε(x_a);  z_i ← ε(g_i);  z_o ← ε(g_o)
3. Garment Occlusion Learning (GOL):
    f_i ← E_i(g_i);  f_o ← E_o(g_o)
    A ← sigmoid( Linear( U( [f_o ‖ f_i] ) ) )
    z_iv ← A ⊙ z_i
4. Prepare diffusion input:
    z_in ← [ z_a ‖ z_o ‖ z_iv ]
    m_in ← [ M ‖ 0 ‖ 0 ]
    ε ← sample Gaussian noise
5. StableDiffusion-based Garment Morphing & Fitting (GMF):
    for t = T…1 do
        ε_θ ← UNet( z^t, t | z_in, m_in )
        L_GMF += ‖ ε - ε_θ ‖²
        z^{t-1} ← denoise_step(z^t, ε_θ)
    end for
6. Decode output:
    x_g ← D( z^0 )
7. Total loss (train):
    L_total = L_GMF + 0.1 × L_OCC
```
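The inference half of this pipeline reduces to a conditioned reverse loop followed by a decode. With a drastically simplified noise predictor and reverse step (the real system uses the conditioned StableDiffusion UNet, CFG, and a proper DDPM/DDIM schedule, none of which are modeled here):

```python
import numpy as np

rng = np.random.default_rng(2)
C, H, W = 4, 8, 8
z_in = rng.normal(size=(3 * C, H, W))    # [z_a ; z_o ; z_iv] conditioning stack

def eps_theta(z_t, t, cond):
    """Toy noise predictor standing in for the conditioned UNet."""
    return 0.05 * z_t + 0.01 * cond[:C]

# Simplified reverse loop: repeatedly subtract predicted noise (schedule omitted)
T = 5
z_t = rng.normal(size=(C, H, W))         # start from pure Gaussian noise
for t in range(T, 0, -1):
    z_t = z_t - 0.1 * eps_theta(z_t, t, z_in)   # stand-in for denoise_step

x_g = np.tanh(z_t)                       # stand-in for the VAE decode D(z^0)
```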
GO-MLVTON thus establishes an effective paradigm for multi-layer virtual try-on, innovating across occlusion modeling, diffusion-based compositing, and dedicated evaluation for fashion-oriented generative vision tasks (Yu et al., 20 Jan 2026).