UTMOS System for VoiceMOS Challenge
- UTMOS is a system that leverages advanced self-supervised learning and ensemble stacking to predict human mean opinion scores for synthetic speech.
- It integrates listener, domain, and phoneme information with data augmentation to enhance performance across both in-domain and out-of-domain tracks.
- The architecture combines strong frame-level SSL models with weak classical regressors in a three-stage stacking ensemble to improve accuracy and ranking metrics.
The UTMOS system, developed by the UTokyo-SaruLab, addresses the automated prediction of human mean opinion scores (MOS) for synthetic speech, as required for the VoiceMOS Challenge 2022. UTMOS integrates end-to-end fine-tuned self-supervised learning (SSL) models, advanced listener- and domain-aware modeling, phoneme encoding, data augmentation, and powerful ensemble stacking with weak classical regression models to achieve state-of-the-art performance for both in-domain and low-resource out-of-domain tracks (Saeki et al., 2022).
1. Task Definition and Dataset Structure
The VoiceMOS Challenge task is to estimate MOS, the scalar average of multiple human listener ratings of speech naturalness, for utterances generated by a range of text-to-speech (TTS) and voice conversion (VC) systems. The challenge includes two distinct tracks:
- Main track ("in-domain"): English synthetic speech from 187 systems, compiled from the Blizzard and Voice Conversion Challenges (BVCC), plus ESPnet-TTS outputs.
- Out-of-domain (OOD) track: Chinese synthetic speech, collected under a different listening-test protocol, with only 136 labeled training utterances and 540 unlabeled. Each utterance is rated on a 1–5 scale in both tracks.
The main track provides 4,974 train, 1,066 dev, and 1,066 test clips, each with 7–8 MOS ratings and a large set of systems and listeners. The OOD track comprises 136 train/dev/test labeled utterances, each with 13–14 ratings, and 540 unlabeled utterances (Saeki et al., 2022).
2. UTMOS System Architecture: Strong and Weak Learners
UTMOS employs a three-stage stacking ensemble comprising:
- Strong learners: Fine-tuned SSL models with frame-level score prediction, extended by listener, domain, and phoneme information.
- Weak learners: Classical regressors applied to mean-pooled SSL utterance embeddings.
Strong learners process each utterance as:
- A 16 kHz waveform is encoded by a wav2vec 2.0-base SSL model (pretrained on LibriSpeech), yielding a sequence of frame-level embeddings.
- The embedding sequence is passed through a BLSTM followed by a linear head that outputs one score per frame; the utterance-level MOS prediction is the mean of the frame-level scores.
- Listener-ID and domain-ID embeddings (128-dimensional each) are concatenated to the frame features before the BLSTM.
- Phoneme sequences obtained via zero-shot ASR, clustered to generate a reference sequence, and encoded by a 3-layer BLSTM (256 units/layer). These representations are concatenated to augment the SSL features.
- Data augmentation is applied via speaking-rate and pitch perturbations using WavAugment.
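The strong learner's dataflow above can be sketched at the shape level; here the BLSTM and learned head are stood in for by a single random linear map, and the dimensions (768-dim wav2vec 2.0 features, 128-dim ID embeddings) follow the description above, so this is an illustrative sketch rather than the actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: T frames of 768-dim wav2vec 2.0 features,
# plus 128-dim listener-ID and domain-ID embeddings.
T, d_ssl, d_id = 200, 768, 128
ssl_frames = rng.normal(size=(T, d_ssl))          # frame-level SSL embeddings
listener_emb = rng.normal(size=(d_id,))           # listener-ID embedding
domain_emb = rng.normal(size=(d_id,))             # domain-ID embedding

# Concatenate listener/domain embeddings to every frame before the BLSTM.
ids = np.concatenate([listener_emb, domain_emb])  # (256,)
frames = np.concatenate([ssl_frames, np.tile(ids, (T, 1))], axis=1)  # (T, 1024)

# Placeholder for the BLSTM + linear head (learned end-to-end in the real system):
# a single random linear map producing one score per frame.
w_head = rng.normal(size=(frames.shape[1],)) / np.sqrt(frames.shape[1])
frame_scores = frames @ w_head                    # (T,) one score per frame

# Utterance-level MOS prediction = mean of the frame-level scores.
utt_score = frame_scores.mean()
print(frames.shape, frame_scores.shape, float(utt_score))
```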
Weak learners extract mean-pooled utterance embeddings from several SSL models (eight total, including HuBERT and WavLM), then apply six regression methods: ridge regression, linear SVR, kernel SVR, Gaussian process regression, random forest, and LightGBM. For the main track this yields 48 weak learners (8 models × 6 regressors); for the OOD track, 144 (from domain-specific training) (Saeki et al., 2022).
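As a minimal illustration of one such weak learner, the following fits closed-form ridge regression on mean-pooled utterance embeddings; the data, dimensions, and regularization strength are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 100 utterances, each reduced to a mean-pooled
# 768-dim SSL embedding (frame counts vary per utterance).
n, d = 100, 768
frame_counts = rng.integers(50, 300, size=n)
X = np.stack([rng.normal(size=(t, d)).mean(axis=0) for t in frame_counts])
y = rng.uniform(1.0, 5.0, size=n)  # MOS labels on the 1-5 scale

# One of the six weak learners: ridge regression in closed form,
# w = (X^T X + lam I)^{-1} X^T y, with the bias handled by centering.
lam = 1.0
Xc, yc = X - X.mean(axis=0), y - y.mean()
w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(d), Xc.T @ yc)
pred = (X - X.mean(axis=0)) @ w + y.mean()
print(pred.shape)
```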
Ensemble stacking is organized as follows:
- Stage 0: Out-of-fold predictions from all base models using five-fold cross-validation.
- Stage 1: Six meta-learners fuse Stage 0 predictions.
- Stage 2: Final fusion/averaging of Stage 1 outputs.
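The stage-0/stage-1 mechanics can be sketched with toy data, using ridge models as stand-ins for both the base learners and the meta-learner (the real system mixes SSL-based strong learners and classical regressors):

```python
import numpy as np

rng = np.random.default_rng(2)

def ridge_fit(X, y, lam=1.0):
    # Closed-form ridge weights; lam is a placeholder regularization strength.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Hypothetical setup: three base models, each seeing its own 16-dim
# feature view of 50 utterances with MOS labels.
n = 50
views = [rng.normal(size=(n, 16)) for _ in range(3)]
y = rng.uniform(1.0, 5.0, size=n)

# Stage 0: out-of-fold predictions from each base model (5-fold CV),
# so meta-features are never produced by a model that saw the label.
folds = np.array_split(rng.permutation(n), 5)
oof = np.zeros((n, len(views)))
for m, X in enumerate(views):
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        w = ridge_fit(X[train], y[train])
        oof[held_out, m] = X[held_out] @ w

# Stage 1: a meta-learner (ridge again here) fuses the stage-0 predictions;
# stage 2 in the paper further fuses/averages several such meta-learners.
w_meta = ridge_fit(oof, y)
fused = oof @ w_meta
print(fused.shape)
```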
3. Training Methodology and Optimization
Strong learner training processes:
- Inputs are down-sampled to 16 kHz and volume-normalized, and MOS labels are linearly rescaled before regression.
- Optimization uses Adam with linear warmup (4,000 steps) followed by linear decay over 15,000 steps; batch size is 12. Early stopping monitors system-level SRCC.
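A plausible reading of that schedule can be written as a step-to-rate function; the base learning rate below is a placeholder (the paper's value is not reproduced here), and decay to zero at step 15,000 is an assumption:

```python
def lr_at(step, base_lr=1e-5, warmup=4000, total=15000):
    """Linear warmup to base_lr over `warmup` steps, then linear decay
    to zero by `total` steps. base_lr is a placeholder value, and the
    decay endpoint is an assumption, not the paper's exact setting."""
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * max(0.0, (total - step) / (total - warmup))

print(lr_at(0), lr_at(2000), lr_at(4000), lr_at(15000))
```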
Loss function is a weighted sum:
- Clipped MSE regression: $L_{\mathrm{reg}} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\left(|\hat{y}_i - y_i| > \tau\right)(\hat{y}_i - y_i)^2$, where $\hat{y}_i$ is the predicted and $y_i$ the ground-truth score, and $\tau$ is the clipping margin.
- Contrastive loss: $L_{\mathrm{con}} = \frac{1}{N^2}\sum_{i,j}\max\left(0,\; \left|(\hat{y}_i - \hat{y}_j) - (y_i - y_j)\right| - \alpha\right)$ over all in-batch pairs $(i, j)$, with margin $\alpha$.
- Total loss: $L = L_{\mathrm{reg}} + \lambda L_{\mathrm{con}}$; the margins and the weight $\lambda$ are set per track.
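In NumPy, the two loss terms can be written as below; the margin values and the combination weight are illustrative, not the paper's settings:

```python
import numpy as np

def clipped_mse(pred, target, tau=0.25):
    """MSE that ignores errors already within a margin tau
    (the tau value here is illustrative)."""
    err = pred - target
    return np.where(np.abs(err) > tau, err ** 2, 0.0).mean()

def contrastive_loss(pred, target, alpha=0.1):
    """Penalize pairwise score gaps that deviate from the ground-truth
    gaps by more than a margin alpha (illustrative value)."""
    dp = pred[:, None] - pred[None, :]      # predicted pairwise gaps
    dt = target[:, None] - target[None, :]  # ground-truth pairwise gaps
    return np.maximum(0.0, np.abs(dp - dt) - alpha).mean()

pred = np.array([3.1, 4.0, 2.5])
target = np.array([3.0, 4.2, 2.4])
total = clipped_mse(pred, target) + 0.5 * contrastive_loss(pred, target)  # weight 0.5 is illustrative
print(total)
```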
Weak learner/meta-learner training employs Optuna for system-level SRCC hyperparameter optimization and identical cross-validation splits. Final stacking averages results across five runs.
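A self-contained stand-in for this tuning loop replaces Optuna with plain random search over a ridge regularization strength, scoring candidates by cross-validated system-level SRCC; the data, grouping, and search space are synthetic:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(3)

# Hypothetical data: 120 utterances evenly assigned to 12 systems.
n, d, n_sys = 120, 8, 12
X = rng.normal(size=(n, d))
sys_id = np.arange(n) % n_sys
y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=n) + 3.0  # synthetic MOS

def system_srcc(y_true, y_pred):
    """System-level SRCC: average scores per system, then rank-correlate."""
    t = np.array([y_true[sys_id == s].mean() for s in range(n_sys)])
    p = np.array([y_pred[sys_id == s].mean() for s in range(n_sys)])
    return spearmanr(t, p).correlation

def cv_srcc(lam):
    """5-fold CV system-level SRCC for ridge with strength lam."""
    folds = np.array_split(np.arange(n), 5)
    pred = np.zeros(n)
    for held in folds:
        tr = np.setdiff1d(np.arange(n), held)
        w = np.linalg.solve(X[tr].T @ X[tr] + lam * np.eye(d), X[tr].T @ y[tr])
        pred[held] = X[held] @ w
    return system_srcc(y, pred)

# Random search over lam (the paper uses Optuna; this is a simple stand-in).
candidates = 10.0 ** rng.uniform(-3, 3, size=20)
best = max(candidates, key=cv_srcc)
print(best, cv_srcc(best))
```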
For the OOD track, 540 unlabeled Chinese utterances were externally labeled by 32 native listeners; these additional pseudo-labeled utterances are combined with the original labeled set to improve low-resource performance (Saeki et al., 2022).
4. Evaluation Protocols, Metrics, and Results
Metrics:
- MSE: $\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$.
- SRCC: $\rho = 1 - \frac{6\sum_i d_i^2}{n(n^2 - 1)}$, with $d_i$ the per-item rank differences.
- Kendall's $\tau$ measures pair concordance.
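All three metrics are available off the shelf; a small worked example with hypothetical system-level scores:

```python
import numpy as np
from scipy.stats import spearmanr, kendalltau

# Hypothetical system-level scores: true vs. predicted MOS for 6 systems.
true_mos = np.array([2.1, 3.4, 3.9, 4.2, 2.8, 3.1])
pred_mos = np.array([2.3, 3.5, 3.8, 4.4, 2.6, 3.2])

mse = float(np.mean((true_mos - pred_mos) ** 2))   # squared-error magnitude
srcc = spearmanr(true_mos, pred_mos).correlation   # rank correlation
ktau = kendalltau(true_mos, pred_mos).correlation  # pair concordance
print(mse, srcc, ktau)
```

Because the predicted scores here preserve the true ranking exactly, SRCC and Kendall's tau are both 1 even though the MSE is nonzero, which is why the challenge reports both kinds of metric.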
Main track (Team T17 UTMOS):
- Utterance-level: MSE=0.165 (rank 1), SRCC=0.897 (rank 1)
- System-level: MSE=0.090 (rank 1), SRCC=0.936 (rank 3)
OOD track:
- Utterance-level: MSE=0.162 (rank 1), SRCC=0.893 (rank 2)
- System-level: MSE=0.030 (rank 1), SRCC=0.988 (rank 1)
Stacking effects:
| Ensemble | Main: MSE | Main: SRCC | OOD: MSE | OOD: SRCC |
|---|---|---|---|---|
| Single strong | 0.216 | 0.890 | 0.280 | 0.885 |
| Strong×17 | 0.169 | 0.893 | 0.155 | 0.896 |
| Weak only | 0.186 | 0.885 | 0.176 | 0.882 |
| Combined S+W | 0.165 | 0.896 | 0.162 | 0.892 |
Accuracy generally improves as ensemble size grows on both tracks.
5. Ablation Studies and Contributing Factors
Ablation results delineate the contribution of each architectural component:
- Listener ID provides the largest effect, particularly for OOD: removal causes MSE to rise from 0.378 to 0.636, and SRCC to drop from 0.871 to 0.825 (OOD).
- Contrastive loss raises ranking metrics; removing it degrades SRCC.
- Phoneme encoder confers measurable but smaller gains.
- Data augmentation and external data are essential for low-data OOD generalization.
- Even a single strong learner outperforms the SSL-MOS baseline, but ensembling further lowers bias (MSE) without harming rank-based metrics (Saeki et al., 2022).
6. Key Insights, Limitations, and Future Directions
UTMOS achieves high MOS prediction accuracy by:
- Combining frame-level regression/fine-tuning of large SSL models with listener/domain and phoneme-aware contextualization.
- Augmenting data and leveraging external datasets for enhanced robustness, especially in OOD contexts.
- Stacking strong and weak learners to synergistically reduce variance and bias.
Documented limitations include reliance on large pretrained models and significant training complexity. OOD generalization remains sensitive to the availability of some in-domain examples or external labeling.
Proposed directions include expanding speech and test diversity for universal MOS predictors, lighter SSL backbones for edge deployment, and improved phoneme/text alignment to more cleanly separate intelligibility and naturalness effects (Saeki et al., 2022).