Mean Opinion Scores (MOSs)
- Mean Opinion Scores (MOSs) are scalar metrics computed as the average of human ratings on a fixed scale to assess perceived quality in multimedia applications.
- They are essential in benchmarking image synthesis and virtual try-on systems, often comparing results from GANs and diffusion models using standardized human evaluations.
- Robust MOS protocols employ diverse rater pools, randomization, and statistical analysis to ensure reliable insights into visual realism and system performance.
Mean Opinion Scores (MOSs) are scalar summary measures derived from subjective human evaluations, widely used to assess the perceived quality of generated content in multimedia applications, especially in image- and video-based virtual try-on systems. MOSs serve as key ground-truth references for benchmarking computational image synthesis methods, including those employing GANs and diffusion models, by condensing large volumes of human ratings into a concise quantitative metric.
1. Definition and Mathematical Basis
A Mean Opinion Score is classically defined as the arithmetic mean of a set of human subject ratings over a fixed scale (typically 1–5 or 1–10, where higher scores denote higher perceived quality or realism):

$$\mathrm{MOS} = \frac{1}{N} \sum_{i=1}^{N} r_i$$

Here, each $r_i$ is a subjective judgment, for example on a Likert scale or as a forced-choice preference mapped to a numerical value. Statistical analysis assumes the $r_i$ are sampled i.i.d. from the underlying subjective distribution for each stimulus.
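The arithmetic-mean definition can be sketched in a few lines of Python (a minimal illustration; the rating values below are hypothetical):

```python
from statistics import mean, stdev

def mos(ratings):
    """Arithmetic mean of subjective ratings on a fixed scale (e.g., 1-5)."""
    return mean(ratings)

# Hypothetical ratings from 8 raters for one generated try-on image.
ratings = [4, 5, 3, 4, 4, 5, 4, 3]
print(f"MOS = {mos(ratings):.2f}")   # 4.00
print(f"std = {stdev(ratings):.2f}")
```

Reporting the standard deviation alongside the mean, as shown, anticipates the dispersion-reporting practice discussed below.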
2. MOS in Virtual Try-On: Roles and Methodological Protocols
In image-based virtual try-on (VTON), MOSs are deployed as primary indicators of system-level perceptual quality—encompassing naturalness, faithfulness, garment-body alignment, and plausibility under varied poses. State-of-the-art works in vision-based try-on, such as CatV2TON (Chong et al., 20 Jan 2025), report MOSs to compare against both previous GAN-based and diffusion-based pipelines. Protocols routinely involve:
- Selection of generated try-on images (and, optionally, ground truth or baseline outputs).
- Recruitment of human raters—either trained experts or crowdsourced workers (e.g., via Amazon Mechanical Turk).
- Standardized presentation instructions: "Please rate the photo-realism and garment realism of this person’s outfit."
- Use of a closed set of ratings (e.g., 1–5, integers).
- Replicates and randomization to minimize presentation and order bias.
For robust estimation, contemporary studies prefer >30 raters per sample and randomized presentation. Outlier rejection or z-score normalization is sometimes used to control for rater calibration drift.
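Z-score-based outlier rejection of the kind mentioned above can be sketched as follows (a simple per-stimulus filter; the threshold of 2.0 and the rating values are hypothetical, and real protocols may filter per rater rather than per rating):

```python
import statistics

def reject_outliers(ratings, z_thresh=2.0):
    """Drop ratings whose z-score exceeds a threshold (hypothetical protocol)."""
    mu = statistics.mean(ratings)
    sigma = statistics.stdev(ratings)
    if sigma == 0:
        return list(ratings)
    return [r for r in ratings if abs(r - mu) / sigma <= z_thresh]

ratings = [4, 4, 5, 4, 3, 4, 1, 4, 5, 4]
cleaned = reject_outliers(ratings)  # drops the outlier rating of 1
```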
3. Statistical Properties and Interpretation
MOSs exhibit statistical properties governed by the sampling scheme:
- The sample MOS is an unbiased estimate of the population mean subjective rating for the stimulus.
- Confidence intervals can be derived via the central limit theorem as
$$\mathrm{MOS} \pm z_{1-\alpha/2} \, \frac{s}{\sqrt{N}},$$
where $s^2$ is the sample variance and $z_{1-\alpha/2}$ is the corresponding quantile of the standard normal distribution.
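A normal-approximation confidence interval for a sample MOS can be computed as follows (a sketch; the ratings are hypothetical and z = 1.96 corresponds to a 95% interval):

```python
import math
import statistics

def mos_confidence_interval(ratings, z=1.96):
    """Normal-approximation CI for the mean rating (z=1.96 gives ~95%)."""
    n = len(ratings)
    m = statistics.mean(ratings)
    s = statistics.stdev(ratings)       # sample standard deviation
    half_width = z * s / math.sqrt(n)
    return (m - half_width, m + half_width)

lo, hi = mos_confidence_interval([4, 5, 3, 4, 4, 5, 4, 3])
```

For the small rater counts common in pilot studies, a Student-t quantile in place of the normal quantile would be more conservative.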
A key consideration is inter-rater variability. While the mean gives a global quality metric, best practice also reports standard deviation or median to reflect subjective dispersion, especially in datasets showing multimodality in perception (e.g., due to artifacts or diverse cultural standards).
4. MOS versus Computational Perceptual Metrics in Research Practice
Although image-based try-on systems are evaluated by paired and unpaired computational metrics (e.g., SSIM, FID, LPIPS), MOSs remain the definitive external perceptual reference (Chong et al., 20 Jan 2025, Xie et al., 2023, Zhang et al., 19 Nov 2025). For example, CatV2TON (Chong et al., 20 Jan 2025) is benchmarked using SSIM, LPIPS, FID, and KID, as well as human opinion scores, in its comprehensive experimental validation. GP-VTON specifically incorporates Human Evaluation (HE, an MOS measure), in which human raters indicate perceptual realism preferences; GP-VTON achieves a 50.9% MOS-based HE score on VITON-HD, outperforming its baselines.
Notably, the literature establishes that agreement between a high MOS and favorable computational scores (low FID/LPIPS, high SSIM) is not guaranteed. MOS is sensitive to subtle aspects (garment-body misalignment, semantic inconsistencies, photo-compositing artifacts, and user preferences for specific textures) that may escape both reference-free and reference-based computational metrics (Chong et al., 20 Jan 2025, Xie et al., 2023, Song et al., 2023).
5. Advanced and Variant MOS Protocols
Several VTON studies introduce MOS variants to dissect specific qualities:
- Pairwise Preference MOS: Each participant rates, for a set of systems A/B, which output appears more realistic or faithful. The MOS is then computed as the percentage of wins per system, e.g., “GP-VTON (ours) achieves 50.9% MOS-based human preference.”
- Attribute-specific MOS: Raters score realism separately for face, cloth alignment, garment texture, or scenario-specific criteria (e.g., sleeve-hand alignment). PG-VTON and PL-VTON (Zhang et al., 18 Mar 2025, Han et al., 16 Mar 2025) report MOS for limb region realism.
- Task-specific MOS: In multi-task VTON (single-garment, model-to-model, multi-garment), UniFit (Zhang et al., 19 Nov 2025) computes MOS per task and averages across synthetic scenarios, enabling fine-grained performance assessment.
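The pairwise preference variant reduces to a per-system win percentage over forced-choice votes; a minimal sketch (the vote data are hypothetical):

```python
from collections import Counter

def preference_mos(votes):
    """Percentage of A/B forced-choice votes won by each system."""
    counts = Counter(votes)
    total = len(votes)
    return {system: 100.0 * n / total for system, n in counts.items()}

# Hypothetical votes: each entry names the system the rater preferred.
votes = ["A", "B", "A", "A", "B", "A", "A", "B", "A", "A"]
shares = preference_mos(votes)  # {"A": 70.0, "B": 30.0}
```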
6. Best Practices and Limitations
Well-designed MOS acquisition requires:
- Large, demographically diverse rater pools.
- Anonymization and randomization of reference/baseline identities.
- Clear guidance and quality control (e.g., attention checks or consistency questions).
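Attention-check quality control can be enforced by filtering out raters who fail known-answer items before aggregating MOS; one possible filter (the item names, responses, and 80% accuracy threshold are all hypothetical):

```python
def passes_attention_checks(responses, expected, min_accuracy=0.8):
    """Keep a rater only if they answer enough known-answer items correctly."""
    correct = sum(1 for item, ans in expected.items()
                  if responses.get(item) == ans)
    return correct / len(expected) >= min_accuracy

# Hypothetical known-answer items embedded in the survey.
expected_answers = {"check_1": 5, "check_2": 1, "check_3": 5}
rater_responses = {"check_1": 5, "check_2": 1, "check_3": 5}
keep = passes_attention_checks(rater_responses, expected_answers)  # True
```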
Limitations are acknowledged:
- Training and rater fatigue can bias MOS downward in long surveys.
- MOS does not necessarily reflect target population preferences if raters differ systematically from end users.
- Subtle differences in instruction framing can yield significant MOS variation across studies.
A plausible implication is that the reported MOSs in the VTON literature provide a robust and reproducible, though not absolute, measure of end-user experience. While MOSs remain the gold-standard perceptual metric, they should be interpreted alongside auxiliary computational metrics for comprehensive system validation.
7. MOS Reporting in the State-of-the-Art VTON Literature
The following table summarizes the reporting of MOSs and human evaluation measures in representative VTON works:
| Framework | MOS/HE Protocol | Key Quantitative Human Results |
|---|---|---|
| CatV2TON (Chong et al., 20 Jan 2025) | MOS (photorealism, garment alignment) | Best SSIM, 2nd-best FID/KID, high subjective realism |
| GP-VTON (Xie et al., 2023) | Human Evaluation (HE, % MOS-like) | 50.9% on VITON-HD, superior to all baselines |
| UniFit (Zhang et al., 19 Nov 2025) | MOS per complex scenario, multi-garment | Consistently leading or matching SOTA in human rating |
| C-VTON (Fele et al., 2022) | MTurk pairwise preference (MOS) | 52–76% preference over baselines |
| PL-VTON (Han et al., 16 Mar 2025; Zhang et al., 18 Mar 2025) | User preference/MOS for sleeve/limb handling | 62–83% in A/B studies, >75% realism for limbs |
| PG-VTON (Fang et al., 2023) | User study (perceptual fit/skin realism) | Outperforms all baselines in high-difficulty scenarios |
In sum, Mean Opinion Scores provide a statistically grounded, human-centered, and broadly adopted standard for benchmarking perceptual quality in VTON research. Their rigorous deployment is critical for objectively comparing increasingly sophisticated image- and video-based virtual try-on models.