UTMOS System for VoiceMOS Challenge
- UTMOS is a system that leverages advanced self-supervised learning and ensemble stacking to predict human mean opinion scores for synthetic speech.
- It integrates listener, domain, and phoneme information with data augmentation to enhance performance across both in-domain and out-of-domain tracks.
- The architecture combines strong frame-level SSL models with weak classical regressors in a three-stage stacking ensemble to improve accuracy and ranking metrics.
The UTMOS system, developed by the UTokyo-SaruLab, addresses the automated prediction of human mean opinion scores (MOS) for synthetic speech, as required for the VoiceMOS Challenge 2022. UTMOS integrates end-to-end fine-tuned self-supervised learning (SSL) models, advanced listener- and domain-aware modeling, phoneme encoding, data augmentation, and powerful ensemble stacking with weak classical regression models to achieve state-of-the-art performance for both in-domain and low-resource out-of-domain tracks (Saeki et al., 2022).
1. Task Definition and Dataset Structure
The VoiceMOS Challenge task is to estimate MOS, the scalar average of multiple human listener ratings of speech naturalness, for utterances generated by a range of text-to-speech (TTS) and voice conversion (VC) systems. The challenge includes two distinct tracks:
- Main track ("in-domain"): English synthetic speech from 187 systems, compiled from the Blizzard and Voice Conversion Challenges (BVCC), plus ESPnet-TTS outputs.
- Out-of-domain (OOD) track: Chinese synthetic speech, collected under a different listening-test protocol, with only 136 labeled training utterances and 540 unlabeled. Each utterance is rated on a 1–5 scale in both tracks.
The main track provides 4,974 train, 1,066 dev, and 1,066 test clips, each with 7–8 MOS ratings and a large set of systems and listeners. The OOD track comprises 136 train/dev/test labeled utterances, each with 13–14 ratings, and 540 unlabeled utterances (Saeki et al., 2022).
2. UTMOS System Architecture: Strong and Weak Learners
UTMOS employs a three-stage stacking ensemble comprising:
- Strong learners: Fine-tuned SSL models with frame-level score prediction, extended by listener, domain, and phoneme information.
- Weak learners: Classical regressors applied to mean-pooled SSL utterance embeddings.
Strong learners process each utterance as:
- A 16 kHz waveform is encoded by a wav2vec 2.0-base SSL model (pretrained on LibriSpeech), yielding a sequence of frame-level embeddings.
- The embedding sequence is passed through a BLSTM followed by a linear head that outputs one score per frame; the utterance-level MOS prediction is the mean of the frame-level scores.
- Listener-ID and domain-ID embeddings (128-dimensional each) are concatenated to the frame features before the BLSTM.
- Phoneme sequences obtained via zero-shot ASR, clustered to generate a reference sequence, and encoded by a 3-layer BLSTM (256 units/layer). These representations are concatenated to augment the SSL features.
- Data augmentation is applied via speaking-rate and pitch perturbations using WavAugment.
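The strong learner's dataflow above can be sketched at the shape level; here the BLSTM and learned head are stood in for by a single random linear map, and the dimensions (768-dim wav2vec 2.0 features, 128-dim ID embeddings) follow the description above, so this is an illustrative sketch rather than the actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: T frames of 768-dim wav2vec 2.0 features,
# plus 128-dim listener-ID and domain-ID embeddings.
T, d_ssl, d_id = 200, 768, 128
ssl_frames = rng.normal(size=(T, d_ssl))          # frame-level SSL embeddings
listener_emb = rng.normal(size=(d_id,))           # listener-ID embedding
domain_emb = rng.normal(size=(d_id,))             # domain-ID embedding

# Concatenate listener/domain embeddings to every frame before the BLSTM.
ids = np.concatenate([listener_emb, domain_emb])  # (256,)
frames = np.concatenate([ssl_frames, np.tile(ids, (T, 1))], axis=1)  # (T, 1024)

# Placeholder for the BLSTM + linear head (learned end-to-end in the real system):
# a single random linear map producing one score per frame.
w_head = rng.normal(size=(frames.shape[1],)) / np.sqrt(frames.shape[1])
frame_scores = frames @ w_head                    # (T,) one score per frame

# Utterance-level MOS prediction = mean of the frame-level scores.
utt_score = frame_scores.mean()
print(frames.shape, frame_scores.shape, float(utt_score))
```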
Weak learners extract mean-pooled utterance embeddings from several SSL models (eight total, including HuBERT and WavLM), then apply six regression methods: ridge regression, linear SVR, kernel SVR, Gaussian process regression, random forest, and LightGBM. For the main track this yields 48 weak learners (8 models × 6 regressors); for the OOD track, 144 (from domain-specific training) (Saeki et al., 2022).
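As a minimal illustration of one such weak learner, the following fits closed-form ridge regression on mean-pooled utterance embeddings; the data, dimensions, and regularization strength are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 100 utterances, each reduced to a mean-pooled
# 768-dim SSL embedding (frame counts vary per utterance).
n, d = 100, 768
frame_counts = rng.integers(50, 300, size=n)
X = np.stack([rng.normal(size=(t, d)).mean(axis=0) for t in frame_counts])
y = rng.uniform(1.0, 5.0, size=n)  # MOS labels on the 1-5 scale

# One of the six weak learners: ridge regression in closed form,
# w = (X^T X + lam I)^{-1} X^T y, with the bias handled by centering.
lam = 1.0
Xc, yc = X - X.mean(axis=0), y - y.mean()
w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(d), Xc.T @ yc)
pred = (X - X.mean(axis=0)) @ w + y.mean()
print(pred.shape)
```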
Ensemble stacking is organized as follows:
- Stage 0: Out-of-fold predictions from all base models using five-fold cross-validation.
- Stage 1: Six meta-learners fuse Stage 0 predictions.
- Stage 2: Final fusion/averaging of Stage 1 outputs.
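The stage-0/stage-1 mechanics can be sketched with toy data, using ridge models as stand-ins for both the base learners and the meta-learner (the real system mixes SSL-based strong learners and classical regressors):

```python
import numpy as np

rng = np.random.default_rng(2)

def ridge_fit(X, y, lam=1.0):
    # Closed-form ridge weights; lam is a placeholder regularization strength.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Hypothetical setup: three base models, each seeing its own 16-dim
# feature view of 50 utterances with MOS labels.
n = 50
views = [rng.normal(size=(n, 16)) for _ in range(3)]
y = rng.uniform(1.0, 5.0, size=n)

# Stage 0: out-of-fold predictions from each base model (5-fold CV),
# so meta-features are never produced by a model that saw the label.
folds = np.array_split(rng.permutation(n), 5)
oof = np.zeros((n, len(views)))
for m, X in enumerate(views):
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        w = ridge_fit(X[train], y[train])
        oof[held_out, m] = X[held_out] @ w

# Stage 1: a meta-learner (ridge again here) fuses the stage-0 predictions;
# stage 2 in the paper further fuses/averages several such meta-learners.
w_meta = ridge_fit(oof, y)
fused = oof @ w_meta
print(fused.shape)
```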
3. Training Methodology and Optimization
Strong learner training processes:
- Inputs are down-sampled to 16 kHz and volume-normalized, and MOS labels are linearly rescaled before regression.
- Optimization uses Adam with linear warmup (4,000 steps) followed by linear decay over 15,000 steps; batch size is 12. Early stopping monitors system-level SRCC.
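A plausible reading of that schedule can be written as a step-to-rate function; the base learning rate below is a placeholder (the paper's value is not reproduced here), and decay to zero at step 15,000 is an assumption:

```python
def lr_at(step, base_lr=1e-5, warmup=4000, total=15000):
    """Linear warmup to base_lr over `warmup` steps, then linear decay
    to zero by `total` steps. base_lr is a placeholder value, and the
    decay endpoint is an assumption, not the paper's exact setting."""
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * max(0.0, (total - step) / (total - warmup))

print(lr_at(0), lr_at(2000), lr_at(4000), lr_at(15000))
```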
Loss function is a weighted sum:
- Clipped MSE regression: $L_{\mathrm{reg}} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\left(|\hat{y}_i - y_i| > \tau\right)(\hat{y}_i - y_i)^2$, where $\hat{y}_i$ is the predicted and $y_i$ the ground-truth score, and $\tau$ is the clipping margin.
- Contrastive loss: $L_{\mathrm{con}} = \frac{1}{N^2}\sum_{i,j}\max\left(0,\; \left|(\hat{y}_i - \hat{y}_j) - (y_i - y_j)\right| - \alpha\right)$ over all in-batch pairs $(i, j)$, with margin $\alpha$.
- Total loss: $L = L_{\mathrm{reg}} + \lambda L_{\mathrm{con}}$; the margins and the weight $\lambda$ are set per track.
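In NumPy, the two loss terms can be written as below; the margin values and the combination weight are illustrative, not the paper's settings:

```python
import numpy as np

def clipped_mse(pred, target, tau=0.25):
    """MSE that ignores errors already within a margin tau
    (the tau value here is illustrative)."""
    err = pred - target
    return np.where(np.abs(err) > tau, err ** 2, 0.0).mean()

def contrastive_loss(pred, target, alpha=0.1):
    """Penalize pairwise score gaps that deviate from the ground-truth
    gaps by more than a margin alpha (illustrative value)."""
    dp = pred[:, None] - pred[None, :]      # predicted pairwise gaps
    dt = target[:, None] - target[None, :]  # ground-truth pairwise gaps
    return np.maximum(0.0, np.abs(dp - dt) - alpha).mean()

pred = np.array([3.1, 4.0, 2.5])
target = np.array([3.0, 4.2, 2.4])
total = clipped_mse(pred, target) + 0.5 * contrastive_loss(pred, target)  # weight 0.5 is illustrative
print(total)
```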
Weak learner/meta-learner training employs Optuna for system-level SRCC hyperparameter optimization and identical cross-validation splits. Final stacking averages results across five runs.
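A self-contained stand-in for this tuning loop replaces Optuna with plain random search over a ridge regularization strength, scoring candidates by cross-validated system-level SRCC; the data, grouping, and search space are synthetic:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(3)

# Hypothetical data: 120 utterances evenly assigned to 12 systems.
n, d, n_sys = 120, 8, 12
X = rng.normal(size=(n, d))
sys_id = np.arange(n) % n_sys
y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=n) + 3.0  # synthetic MOS

def system_srcc(y_true, y_pred):
    """System-level SRCC: average scores per system, then rank-correlate."""
    t = np.array([y_true[sys_id == s].mean() for s in range(n_sys)])
    p = np.array([y_pred[sys_id == s].mean() for s in range(n_sys)])
    return spearmanr(t, p).correlation

def cv_srcc(lam):
    """5-fold CV system-level SRCC for ridge with strength lam."""
    folds = np.array_split(np.arange(n), 5)
    pred = np.zeros(n)
    for held in folds:
        tr = np.setdiff1d(np.arange(n), held)
        w = np.linalg.solve(X[tr].T @ X[tr] + lam * np.eye(d), X[tr].T @ y[tr])
        pred[held] = X[held] @ w
    return system_srcc(y, pred)

# Random search over lam (the paper uses Optuna; this is a simple stand-in).
candidates = 10.0 ** rng.uniform(-3, 3, size=20)
best = max(candidates, key=cv_srcc)
print(best, cv_srcc(best))
```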
For the OOD track, 540 unlabeled Chinese utterances were externally labeled by 32 native listeners; these additional pseudo-labeled utterances are combined with the original labeled set to improve low-resource performance (Saeki et al., 2022).
4. Evaluation Protocols, Metrics, and Results
Metrics:
- MSE: $\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$.
- SRCC: $\rho = 1 - \frac{6\sum_i d_i^2}{n(n^2 - 1)}$, with $d_i$ the per-item rank differences.
- Kendall's $\tau$ measures pair concordance.
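All three metrics are available off the shelf; a small worked example with hypothetical system-level scores:

```python
import numpy as np
from scipy.stats import spearmanr, kendalltau

# Hypothetical system-level scores: true vs. predicted MOS for 6 systems.
true_mos = np.array([2.1, 3.4, 3.9, 4.2, 2.8, 3.1])
pred_mos = np.array([2.3, 3.5, 3.8, 4.4, 2.6, 3.2])

mse = float(np.mean((true_mos - pred_mos) ** 2))   # squared-error magnitude
srcc = spearmanr(true_mos, pred_mos).correlation   # rank correlation
ktau = kendalltau(true_mos, pred_mos).correlation  # pair concordance
print(mse, srcc, ktau)
```

Because the predicted scores here preserve the true ranking exactly, SRCC and Kendall's tau are both 1 even though the MSE is nonzero, which is why the challenge reports both kinds of metric.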
Main track (Team T17 UTMOS):
- Utterance-level: MSE=0.165 (rank 1), SRCC=0.897 (rank 1)
- System-level: MSE=0.090 (rank 1), SRCC=0.936 (rank 3)
OOD track:
- Utterance-level: MSE=0.162 (rank 1), SRCC=0.893 (rank 2)
- System-level: MSE=0.030 (rank 1), SRCC=0.988 (rank 1)
Stacking effects:
| Ensemble | Main: MSE | Main: SRCC | OOD: MSE | OOD: SRCC |
|---|---|---|---|---|
| Single strong | 0.216 | 0.890 | 0.280 | 0.885 |
| Strong×17 | 0.169 | 0.893 | 0.155 | 0.896 |
| Weak only | 0.186 | 0.885 | 0.176 | 0.882 |
| Combined S+W | 0.165 | 0.896 | 0.162 | 0.892 |
Accuracy generally improves as ensemble size grows on both tracks.
5. Ablation Studies and Contributing Factors
Ablation results delineate the contribution of each architectural component:
- Listener ID provides the largest effect, particularly for OOD: removal causes MSE to rise from 0.378 to 0.636, and SRCC to drop from 0.871 to 0.825 (OOD).
- Contrastive loss raises ranking metrics; removing it degrades SRCC.
- Phoneme encoder confers measurable but smaller gains.
- Data augmentation and external data are essential for low-data OOD generalization.
- Even a single strong learner outperforms the SSL-MOS baseline, but ensembling further lowers bias (MSE) without harming rank-based metrics (Saeki et al., 2022).
6. Key Insights, Limitations, and Future Directions
UTMOS achieves high MOS prediction accuracy by:
- Combining frame-level regression/fine-tuning of large SSL models with listener/domain and phoneme-aware contextualization.
- Augmenting data and leveraging external datasets for enhanced robustness, especially in OOD contexts.
- Stacking strong and weak learners to synergistically reduce variance and bias.
Documented limitations include reliance on large pretrained models and significant training complexity. OOD generalization remains sensitive to the availability of some in-domain examples or external labeling.
Proposed directions include expanding speech and test diversity for universal MOS predictors, lighter SSL backbones for edge deployment, and improved phoneme/text alignment to more cleanly separate intelligibility and naturalness effects (Saeki et al., 2022).