
Mean Opinion Score (MOS)

Updated 12 February 2026
  • Mean Opinion Score (MOS) is a scalar metric that averages human perceptual ratings to evaluate the quality of speech, audio, and image signals.
  • The methodology computes the arithmetic mean of discrete ratings with confidence intervals, addressing challenges like quantization and annotator bias.
  • MOS is widely applied in benchmarking TTS, generative systems, and even explainable AI, providing actionable insights for quality assessment.

The Mean Opinion Score (MOS) is a foundational metric in evaluation science, providing a scalar summary of human perceptual judgments over a range of signals, most prominently in speech and audio quality assessment but also with growing relevance in domains such as image quality and explainable AI. Originally introduced in telecommunication standards, MOS has become the “gold standard” for benchmarking generative, enhancement, and transmission systems due to its ability to capture human-centric quality attributes that are elusive to objective algorithms.

1. Formal Definition, Computation, and Statistical Properties

MOS is computed as the arithmetic mean of $K$ discrete ratings assigned by a group of human judges to a given stimulus, typically on an ordinal scale (e.g., 1 = “bad” to 5 = “excellent”). For a stimulus $u$ and observed scores $\{s_{u,1}, \dots, s_{u,K}\}$,

\mathrm{MOS}(u) = \frac{1}{K}\sum_{k=1}^{K} s_{u,k}

At the system level, MOS scores across all generated outputs are further averaged. In practice, MOS experiments employ Absolute Category Rating (ACR, ITU-T P.800), and results are often reported with confidence intervals based on t-statistics. For $N$ ratings:

\mathrm{CI} = \bar{X} \;\pm\; t_{\alpha/2,\,N-1}\;\frac{S}{\sqrt{N}}

where $S$ is the sample standard deviation of the ratings.
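The two formulas above can be sketched in a few lines of Python; `mos_with_ci` is an illustrative helper, not a standard API:

```python
import math
from statistics import mean, stdev

from scipy.stats import t


def mos_with_ci(ratings, alpha=0.05):
    """Arithmetic-mean MOS with a t-based confidence interval.

    `ratings` is the list of K discrete scores (1-5) for one stimulus.
    """
    n = len(ratings)
    m = mean(ratings)
    s = stdev(ratings)  # sample standard deviation S (ddof = 1)
    half_width = t.ppf(1 - alpha / 2, n - 1) * s / math.sqrt(n)
    return m, (m - half_width, m + half_width)


scores = [4, 5, 3, 4, 4, 5, 3, 4]
mos, (lo, hi) = mos_with_ci(scores)
print(f"MOS = {mos:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

Note that the half-width shrinks as $1/\sqrt{N}$, which is why per-system MOS (pooling many utterances) is far more stable than per-utterance MOS.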

In image and XAI quality, the protocol is unchanged: a set of annotators rates each stimulus, and the MOS is the arithmetic mean over all observers after filtering outliers (Yu et al., 2024).

2. Limitations, Biases, and Score Interpretation

MOS, despite its ubiquity, suffers critical limitations relating to data quantization, annotator bias, and context dependency:

  • Quantization Effects: MOS relies on discretized labels, implicitly assuming annotators’ internal judgments are perfectly captured by rounding to the nearest integer. This ignores the true continuous nature of perceptual quality (Kondo et al., 23 Jun 2025).
  • Range-Equalizing Bias: Listeners stretch their use of the scale to fill the available range. When evaluated in narrow quality contexts (e.g., only high-quality systems), even minor degradations are rated "bad," possibly resulting in MOS shifts of up to one point for the worst system in the narrowest subset (Cooper et al., 2023). This bias causes MOS to be fundamentally relative: the absolute placement of systems on the 1–5 scale can change with the composition of the test set.
  • Annotator Bias and Inconsistency: Different raters use the scale differently, and each utterance/system is rated by a small random subset. The average can be skewed by individual tendencies, making direct comparison across utterances problematic (Leng et al., 2021, Huang et al., 2021).
  • Information Collapse: By reducing K ratings to a mean, traditional workflows discard K – 1 data points per utterance, lose access to the variance or higher moment statistics, and can miss valuable distributional information (Leng et al., 2021, Tseng et al., 2022).
  • Interpretation in Univariate Setting: A low MOS can be due to many different underlying artifacts (e.g., distortion, noise, prosodic errors) but offers no diagnostic decomposition (Cumlin et al., 5 Jun 2025).

Strategies such as fitting a quantized latent distribution to model the unobservable continuous quality and using its mode (rather than the raw mean) as the estimator address destructive effects of quantization and reduce estimation bias (Kondo et al., 23 Jun 2025).
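One way to sketch this idea (a hedged illustration, not the exact estimator of Kondo et al.) is to treat each integer rating as a quantized observation of a latent normal opinion, fit the normal by maximum likelihood over the rating bins, and use its mode as the quality estimate; `latent_mode_estimate` below is a hypothetical helper:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm


def latent_mode_estimate(ratings, scale=(1, 5)):
    """Fit a latent normal to quantized ratings; return its (clipped) mode.

    Rating k is treated as the event that the continuous opinion fell in
    [k - 0.5, k + 0.5] (with open tails at the scale ends), and (mu, sigma)
    are fit by maximum likelihood. The mode of the fitted normal is mu.
    """
    lo, hi = scale
    ratings = np.asarray(ratings, dtype=float)

    def neg_log_lik(params):
        mu, log_sigma = params
        sigma = np.exp(log_sigma)
        upper = np.where(ratings >= hi, np.inf, ratings + 0.5)
        lower = np.where(ratings <= lo, -np.inf, ratings - 0.5)
        p = norm.cdf(upper, mu, sigma) - norm.cdf(lower, mu, sigma)
        return -np.sum(np.log(np.clip(p, 1e-12, None)))

    res = minimize(neg_log_lik, x0=[ratings.mean(), 0.0], method="Nelder-Mead")
    return float(np.clip(res.x[0], lo, hi))


# Ratings piled up at the top of the scale: the plain mean is pulled down
# by quantization and ceiling effects, while the latent mode is not.
ratings = [5, 5, 5, 4, 5, 5]
print(latent_mode_estimate(ratings), np.mean(ratings))
```

The censored tails at the scale ends are what let the fitted mode move past the ceiling that the raw average runs into.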

3. MOS Collection Protocols, Datasets, and Best Practices

MOS experiments should maximize both statistical reliability and annotation quality through carefully structured test designs and rigorous screening. Key elements include:

  • Stimulus Design: Balanced test sets spanning domains, lengths, and attributes (e.g., 20–40 utterances/system) are standard (Maniati et al., 2022).
  • Annotator Recruitment: Stringent qualification filters (native fluency, approval rate) and mandatory use of calibrated equipment (headphones) are implemented in large-scale datasets such as SOMOS (Maniati et al., 2022).
  • Quality Control: Embedded ground-truth and validation stimuli, rejection of HITs with uniform answers, or "unnatural" labeling of trusted ground-truth clips are critical for reliable aggregation (Maniati et al., 2022).
  • Aggregation and Reliability: Bootstrap-based split-half reliability measures, expert validation, and calculation of per-system and per-utterance correlations enable robust assessment of inter-rater agreement (Maniati et al., 2022).
  • Dataset Design: Use of a fixed vocoder across systems in TTS evaluation isolates acoustic-model effects and stabilizes ground-truth comparison (Maniati et al., 2022).

These curated datasets underpin algorithmic advances in MOS prediction and support community benchmarking.
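The bootstrap split-half reliability check mentioned above can be sketched as follows; `split_half_reliability` is a hypothetical helper operating on a systems-by-raters score matrix:

```python
import numpy as np
from scipy.stats import spearmanr


def split_half_reliability(ratings, n_boot=1000, seed=0):
    """Bootstrap split-half reliability of per-system MOS.

    `ratings` is an (n_systems, n_raters) array of scores. Raters are
    randomly split in half; the Spearman correlation between the two
    halves' per-system MOS is recorded and averaged over resamples.
    """
    rng = np.random.default_rng(seed)
    ratings = np.asarray(ratings, dtype=float)
    n_raters = ratings.shape[1]
    corrs = []
    for _ in range(n_boot):
        perm = rng.permutation(n_raters)
        a, b = perm[: n_raters // 2], perm[n_raters // 2 :]
        rho, _ = spearmanr(ratings[:, a].mean(axis=1),
                           ratings[:, b].mean(axis=1))
        corrs.append(rho)
    return float(np.mean(corrs))


# Toy example: three systems of clearly different quality, four raters.
scores = np.array([[1, 1, 2, 1],
                   [3, 3, 3, 2],
                   [5, 4, 5, 5]])
print(split_half_reliability(scores, n_boot=100))
```

A reliability near 1 indicates that independent halves of the rater pool rank the systems the same way; low values signal that more ratings per system are needed.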

4. Automatic MOS Prediction

Given the cost and time of subjective MOS collection, machine learning models for automatic MOS regression have become prominent. Architectures have evolved rapidly:

  • Classic Approaches: Early works (e.g., MOSNet) regressed the MOS directly from CNN or CNN-LSTM stacks over spectral features, ignoring per-rater variability.
  • Exploiting Full Opinion Distributions: MBNet (Leng et al., 2021) and LDNet (Huang et al., 2021) explicitly model judge-specific ratings by introducing mean/bias subnets or listener-ID embeddings. This dual modeling leverages the full $K$-rating label vector, which improves both data efficiency and generalization by learning annotator-specific tendencies.
  • Distributional and Permissive Approaches: DDOS (Tseng et al., 2022) augments wav2vec architectures with distribution heads, learning to predict the full histogram of possible ratings. Some models regress both MOS and per-rater scores, or jointly learn mean and variance, thus capturing uncertainty and label spread.
  • Advanced Pooling and Temporal Modeling: DRASP (Yang et al., 29 Aug 2025) combines global and segmental attentive pooling, and state-space sequence models such as MambaRate (Kakoulidis et al., 16 Jul 2025) employ self-supervised front ends and RBF encoded regression for cross-domain and cross-sampling-rate robustness.
  • Compact, Streaming Models: SALF-MOS (Agrawal et al., 2 Jun 2025) achieves full-scale competitive MOS prediction with sub-2k parameter models through aggressive downsampling and multi-scale latent feature fusion.
  • Pairwise and Rank-Oriented Methods: MOSPC (Wang et al., 2023) reframes MOS regression as a pairwise ranking problem, exploiting the practical importance of correct relative ordering over precise absolute scores. Techniques such as C-Mixup inject artificial examples for better out-of-distribution generalization.
  • Multitask Learning and Rater Bias Correction: Recent advances co-train MOS with orthogonal perceptual or acoustic properties (e.g., clarity, reverberation time) and leverage multi-dataset/self-supervised semi-supervised approaches for robust modeling of missing labels or rater-specific bias (Akrami et al., 2022).

Evaluation consistently relies on MSE, Pearson correlation (LCC), Spearman’s rank (SRCC), and Kendall’s tau. System-level metrics are generally more stable and reliable than utterance-level, but utterance-level performance is critical for model selection and rapid system iteration (Maniati et al., 2022, Agrawal et al., 2 Jun 2025).
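The standard evaluation metrics are all available in SciPy; `mos_prediction_metrics` is an illustrative wrapper comparing predicted and ground-truth MOS vectors:

```python
import numpy as np
from scipy.stats import kendalltau, pearsonr, spearmanr


def mos_prediction_metrics(predicted, true):
    """MSE, Pearson (LCC), Spearman (SRCC), and Kendall's tau for MOS prediction."""
    predicted = np.asarray(predicted, dtype=float)
    true = np.asarray(true, dtype=float)
    return {
        "MSE": float(np.mean((predicted - true) ** 2)),
        "LCC": float(pearsonr(predicted, true)[0]),
        "SRCC": float(spearmanr(predicted, true)[0]),
        "KTAU": float(kendalltau(predicted, true)[0]),
    }


true_mos = [2.1, 3.4, 3.9, 4.5, 4.7]
pred_mos = [2.4, 3.1, 4.0, 4.3, 4.9]
print(mos_prediction_metrics(pred_mos, true_mos))
```

Note how a predictor can have nonzero MSE yet perfect SRCC/KTAU: for model selection, getting the ordering right often matters more than matching absolute scores.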

5. Alternatives and Extensions to Univariate MOS

A single scalar MOS cannot disambiguate underlying factors that contribute to “poor” overall scores. Multivariate frameworks address this:

  • Joint Quality Modeling: Probabilistic models predicting a vector of subjective scores (e.g., MOS, noisiness, coloration, discontinuity, loudness) via multivariate Gaussian posteriors with full covariance matrices enable quantification of both per-aspect uncertainty and inter-aspect dependencies (Cumlin et al., 5 Jun 2025).
  • Relative Quality and Single Opinion Calibration: For problems where only single-opinion scores are available, methods such as Perceptual Constancy Constrained Calibration (PC3) combine learnable relative MOS-difference regressors and optimization with constraints to recover consistent MOS estimates across pairings (Wang et al., 2024).
  • Cross-Domain and OOD Applications: Recent models demonstrate robust generalization across languages, speakers, codecs, and novel domains using few-shot or C-Mixup-based adaptation (Wang et al., 2023, Kakoulidis et al., 16 Jul 2025).

In explainable AI, MOS is adopted as a metric for user-centric evaluation of explanation quality, and its (moderate) correlation with automatic metrics such as IAUC and DAUC is quantified, highlighting the complementarity and limitations of computational vs. subjective evaluation (Yu et al., 2024).

6. Mitigating Biases and Improving Validity

Rigorous experimental design is necessary to minimize the context dependency and range-equalizing bias of MOS:

  • Anchors and Z-Scoring: Including poor and excellent anchors in the test set helps calibrate listener use of the scale, while post-hoc normalization (e.g., z-scoring MOS within each test context) reduces cross-context MOS drift (Cooper et al., 2023).
  • Modeling Rater Bias: Both statistical and machine learning frameworks can incorporate bias correction. Leave-one-out MOS, subtractive bias or affine calibration per rater, and explicit bias subnetworks improve label fidelity and model accuracy (Leng et al., 2021, Akrami et al., 2022).
  • Distribution-Fitting and Mode Estimation: Fitting latent normal distributions to the observed rating histogram and using their mode as the regression target better captures the underlying perceptual quality by removing range and quantization bias compared to the simple average (Kondo et al., 23 Jun 2025).
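A minimal sketch of the per-rater normalization idea (z-scoring each rater's column so that individual offset and scale use cancel out; `zscore_per_rater` is a hypothetical helper, not a published method's exact procedure):

```python
import numpy as np


def zscore_per_rater(ratings):
    """Average z-scored ratings per stimulus, removing rater offset and scale use.

    `ratings` is an (n_stimuli, n_raters) array; each rater's column is
    z-scored so their personal mean and spread no longer bias the average.
    """
    ratings = np.asarray(ratings, dtype=float)
    mu = ratings.mean(axis=0, keepdims=True)   # per-rater mean
    sd = ratings.std(axis=0, keepdims=True)    # per-rater scale use
    sd = np.where(sd == 0, 1.0, sd)            # guard against constant raters
    z = (ratings - mu) / sd
    return z.mean(axis=1)                      # per-stimulus normalized score


# A harsh rater (column 0) and a lenient one (column 1) agree on ordering;
# after z-scoring, their contributions to each stimulus are comparable.
scores = np.array([[2, 4],
                   [3, 5],
                   [1, 3]])
print(zscore_per_rater(scores))
```

The output is in standard-deviation units rather than on the 1–5 scale, which is exactly why normalized scores support within-context comparison but not absolute MOS reporting.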

7. Applications, Impact, and Future Directions

Advances in MOS methodology have profound impact across domains:

  • Speech and Audio Synthesis: MOS remains the reference benchmark for evaluating and comparing TTS, voice conversion, coding, and enhancement systems. Recent MOS predictors replace or accelerate listening tests, enable fast iteration, and support deployment at scale (Yang et al., 29 Aug 2025, Leng et al., 2021).
  • Fake Audio Detection: MOS prediction contributes directly to downstream tasks such as fake audio detection, both for training sample selection and for MOS-gated model fusion, improving detection efficacy in the presence of synthetic waveform manipulations (Zhou et al., 2024).
  • Image and Explanation Assessment: The migration of MOS methodology into image quality and explainability research underscores its versatility and ability to bridge user-centered and algorithmic evaluation (Yu et al., 2024).
  • Robust Evaluation and Benchmarking: The development of large, open MOS datasets (e.g., SOMOS (Maniati et al., 2022)) catalyzes progress by providing standardized, highly-controlled benchmarks spanning systems, speakers, and domains.

Continued research is focused on: increasing reliability and objectivity in MOS collection; modeling richer perceptual aspects and their joint distributions; automating bias correction; implementing scalable, low-resource deployable predictors; and further validating MOS as a cross-domain user-centered metric (Cumlin et al., 5 Jun 2025, Yang et al., 29 Aug 2025, Agrawal et al., 2 Jun 2025).
