MOSNet: Deep Learning based Objective Assessment for Voice Conversion

Published 17 Apr 2019 in cs.SD, cs.LG, and eess.AS | (1904.08352v3)

Abstract: Existing objective evaluation metrics for voice conversion (VC) are not always correlated with human perception. Therefore, training VC models with such criteria may not effectively improve naturalness and similarity of converted speech. In this paper, we propose deep learning-based assessment models to predict human ratings of converted speech. We adopt the convolutional and recurrent neural network models to build a mean opinion score (MOS) predictor, termed as MOSNet. The proposed models are tested on large-scale listening test results of the Voice Conversion Challenge (VCC) 2018. Experimental results show that the predicted scores of the proposed MOSNet are highly correlated with human MOS ratings at the system level while being fairly correlated with human MOS ratings at the utterance level. Meanwhile, we have modified MOSNet to predict the similarity scores, and the preliminary results show that the predicted scores are also fairly correlated with human ratings. These results confirm that the proposed models could be used as a computational evaluator to measure the MOS of VC systems to reduce the need for expensive human rating.

Summary

  • The paper introduces MOSNet, a deep learning model that employs CNN, BLSTM, and hybrid architectures to predict MOS ratings with high system-level correlation.
  • It improves on traditional metrics such as MCD by minimizing both utterance- and frame-level MSE, yielding predictions that track human evaluations more closely.
  • Experimental results on VCC 2018 and VCC 2016 datasets demonstrate MOSNet’s strong performance and potential for automating VC evaluation.

MOSNet: Deep Learning-based Objective Assessment for Voice Conversion

The paper introduces MOSNet, a deep learning-based approach for the objective assessment of voice conversion (VC) systems, aiming to predict human ratings of converted speech more accurately than traditional objective measures. The authors address the limitations of metrics such as Mel-cepstral distortion (MCD), which do not align well with human perception of speech quality. The proposed system adopts convolutional and recurrent neural network architectures to build a Mean Opinion Score (MOS) predictor that correlates well with subjective human evaluations.

Summary of Methodology

The MOSNet model takes raw magnitude spectrograms as input features and employs three neural network architectures: CNN, BLSTM, and a hybrid CNN-BLSTM. These architectures extract features for predicting MOS, with the CNN-BLSTM variant performing best overall. The model is trained to minimize both utterance-level and frame-level mean squared error (MSE), aligning its predictions closely with human ratings. Notably, the objective function incorporates frame-level errors, which improves the model's ability to predict utterance-level MOS.
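
The combined objective described above can be sketched in plain NumPy. This is only an illustration of the idea (frame scores averaged into an utterance score, with both levels regressed toward the human MOS); the equal weighting of the two terms and all names are assumptions, not the paper's exact formulation.

```python
import numpy as np

def combined_mse_loss(frame_preds, true_mos):
    """Sketch of an utterance- plus frame-level MSE objective.

    frame_preds: 1-D array of per-frame MOS predictions for one utterance
    true_mos:    the human MOS rating for that utterance (scalar)
    """
    utt_pred = frame_preds.mean()                       # clip score = mean of frame scores
    utt_mse = (utt_pred - true_mos) ** 2                # utterance-level error
    frame_mse = np.mean((frame_preds - true_mos) ** 2)  # each frame pulled toward the MOS
    return utt_mse + frame_mse                          # assumed equal weighting

# Toy usage: frame predictions hovering around a human MOS of 3.5
loss = combined_mse_loss(np.array([3.4, 3.6, 3.5]), 3.5)
```

The frame-level term penalizes individual frames that drift from the utterance's rating, which is what keeps the per-frame scores stable rather than letting errors cancel out in the average.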

Experimental Validation and Results

The experiments leverage data from the Voice Conversion Challenge (VCC) 2018, which includes large-scale human evaluations of converted voice samples. Results indicate that MOSNet's predictions exhibit high correlation with human scores at the system level (LCC up to 0.957) and fair correlation at the utterance level (LCC up to 0.642), outperforming existing methods. The study also demonstrates the model's generalization capability: the VCC 2018-trained model maintains strong correlation when evaluated on VCC 2016 data.
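
The two evaluation levels differ only in when scores are aggregated. A minimal sketch of both computations, using made-up numbers rather than the paper's results:

```python
import numpy as np

# Hypothetical per-utterance MOS values for three VC systems (illustrative only).
human = {"sysA": [3.1, 3.4, 2.9], "sysB": [2.0, 2.3, 2.1], "sysC": [4.0, 4.2, 4.1]}
model = {"sysA": [3.0, 3.3, 3.2], "sysB": [2.2, 2.1, 2.4], "sysC": [3.9, 4.3, 4.0]}

def lcc(x, y):
    """Linear correlation coefficient (Pearson's r)."""
    return float(np.corrcoef(x, y)[0, 1])

# Utterance-level LCC: correlate every clip's predicted and human scores directly.
utt_lcc = lcc(sum(human.values(), []), sum(model.values(), []))

# System-level LCC: average each system's clips first, then correlate the means.
sys_lcc = lcc([np.mean(v) for v in human.values()],
              [np.mean(v) for v in model.values()])
```

Averaging over a system's clips cancels much of the per-clip noise, which is why system-level correlations in the paper are substantially higher than utterance-level ones.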

Furthermore, MOSNet's architecture is slightly modified to predict similarity scores between converted and target speech samples with fair correlation results, indicating its versatility in assessing both naturalness and similarity in VC systems.

Implications and Future Work

The introduction of MOSNet provides a meaningful advancement for automating VC evaluation, which traditionally depends on resource-intensive human evaluations. This model lays the groundwork for deploying non-intrusive evaluation systems capable of consistently predicting perceptual quality metrics, thereby reducing evaluation costs and potentially accelerating developments in VC technology.

Theoretical implications center on aligning machine learning models with human perception, encouraging further research into integrating perceptual theory with neural architectures. Future work may explore alternative alignment methods between human perception and computational models, potentially incorporating psychoacoustic principles into MOSNet’s architecture to further bridge the discrepancy between machine predictions and human evaluations.

The study opens avenues for refining objective speech assessment methodologies, possibly extending MOSNet’s applicability to other domains of speech processing, such as speech synthesis or speech enhancement. The potential to refine model components, improve generalization to diverse datasets, and incorporate additional acoustic features remains a subject of future exploration for both academia and industry stakeholders.

Explain it Like I'm 14

Overview

This paper is about teaching a computer to judge the quality of “voice conversion” speech the way people do. Voice conversion means taking a recording of one person and changing it so it sounds like another person, while keeping the words the same. People usually rate how natural the converted voice sounds and how similar it is to the target speaker using a 1–5 scale called MOS (Mean Opinion Score). But running big listening tests is slow and expensive. The authors build a system called MOSNet that listens to the audio and predicts the MOS automatically, aiming to match human opinions.

What questions did the paper try to answer?

  • Can a deep learning model predict human naturalness ratings (MOS) for voice-converted speech well enough to reduce the need for large listening tests?
  • Can the same idea also predict how similar the converted voice sounds to the target speaker?
  • How close can predictions get at two levels:
    • Utterance level: scores for individual audio clips.
    • System level: scores averaged over all clips from a given voice conversion system (like grading the whole bakery vs. a single cookie).

How did the researchers study it?

They used a large, real-world dataset from the Voice Conversion Challenge (VCC) 2018, where thousands of people rated many converted speech samples for naturalness and similarity.

Here’s the approach in everyday terms:

  • Turning sound into pictures: Each audio clip was turned into a “spectrogram,” which is like a heat map showing which tones (frequencies) are present over time. Think of it as a moving equalizer display captured frame by frame.
  • Pattern-spotting models:
    • CNN (Convolutional Neural Network): Great at spotting local patterns in images; here it finds time–frequency patterns in the spectrogram.
    • BLSTM (Bidirectional Long Short-Term Memory): A kind of neural network with memory that looks at a sequence both forward and backward, like reading a sentence in both directions to understand context.
    • CNN-BLSTM: Combines both—first spots local patterns (CNN), then understands long-term changes over time (BLSTM).
  • Predicting scores per moment, then averaging: The model predicts a tiny “quality score” for each short time slice (“frame”) of the audio and then averages these to get one score for the whole clip. This helps the model learn stable patterns instead of being fooled by brief glitches.
  • Training with human scores: They trained the models using the human MOS as the “correct answers,” trying to make the model’s predictions as close as possible. They measured success using correlation (how well the ups and downs of predictions match human ratings) and error.
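
The "turning sound into pictures" step above can be shown in a few lines of NumPy: slice the waveform into short frames and take the magnitude of each frame's FFT. The frame and hop sizes here are illustrative defaults, not the paper's exact settings.

```python
import numpy as np

def magnitude_spectrogram(audio, frame_len=512, hop=256):
    """Build the 'heat map' described above: one column of frequency
    magnitudes per short time slice of the audio."""
    frames = []
    for start in range(0, len(audio) - frame_len + 1, hop):
        frame = audio[start:start + frame_len] * np.hanning(frame_len)  # smooth the edges
        frames.append(np.abs(np.fft.rfft(frame)))  # how strong each frequency is
    return np.stack(frames)  # shape: (time_frames, frequency_bins)

# Toy usage: one second of a 440 Hz tone sampled at 16 kHz
t = np.arange(16000) / 16000
spec = magnitude_spectrogram(np.sin(2 * np.pi * 440 * t))
```

For a pure tone, every row of the resulting "picture" lights up at the same frequency bin, which is exactly the kind of time-frequency pattern a CNN can learn to spot.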

They also tried a modified version to predict speaker similarity, by feeding in a pair of clips (converted voice and target voice) and asking the model to say how similar they sound.

What did they find?

Here are the main results and why they matter:

  • Strong at the system level: When averaging over all clips from each voice conversion system, MOSNet’s predictions matched human judgments very well. The best model (CNN-BLSTM) reached a very high correlation (about 0.96), close to how consistent humans are with each other at that level. This means MOSNet can reliably rank and compare different voice conversion systems without needing humans to rate everything.
  • Fair at the utterance level: For individual clips, predictions had a moderate correlation with human scores (about 0.64). Even humans don’t agree perfectly on single clips (the paper shows human–human correlation is only around 0.80 at this level), so there’s a natural limit here. Still, the model does a reasonable job.
  • Predicting similarity is possible: With a simple extension, MOSNet could also predict whether the converted voice sounds like the target speaker with about 70% accuracy and fair correlation to human ratings—useful for judging identity similarity, not just naturalness.
  • Generalizes to other data: A model trained on VCC 2018 also worked well on the earlier VCC 2016 challenge (system-level correlation ~0.92), suggesting it isn’t just memorizing one dataset.
  • Why frame-by-frame training helped: Including “frame-level” errors during training made the model’s per-frame scores steadier, which led to better overall clip predictions.
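
Why system-level scores are so much more dependable can be seen with a tiny simulation (the numbers below are purely synthetic, not the paper's data): noisy per-clip predictions only moderately match per-clip truth, but averaging each system's clips washes the noise out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic setup (not from the paper): 10 systems, 50 clips each.
true_sys = rng.uniform(2.0, 4.5, size=10)                      # each system's "real" quality
true_utt = np.repeat(true_sys, 50) + rng.normal(0, 0.5, 500)   # clip-to-clip variation
pred_utt = true_utt + rng.normal(0, 0.5, 500)                  # a noisy automatic rater

def lcc(x, y):
    """Pearson correlation between two score lists."""
    return float(np.corrcoef(x, y)[0, 1])

utt_lcc = lcc(true_utt, pred_utt)                    # single clips: only moderate agreement
sys_lcc = lcc(true_utt.reshape(10, 50).mean(axis=1),
              pred_utt.reshape(10, 50).mean(axis=1))  # system averages: much stronger
```

This mirrors the paper's pattern: the same predictor that is only fair on single clips becomes very reliable once scores are averaged per system.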

Why does this matter, and what could come next?

  • Faster, cheaper evaluation: MOSNet can act like an automated, “always available” listener. This can save time and money by reducing how many human ratings are needed during development.
  • Better model training: If we can reliably score naturalness and similarity automatically, researchers can more easily tune and improve voice conversion systems.
  • Real-world limit acknowledged: Even people disagree on single-clip ratings, so utterance-level predictions will never be perfect. But system-level predictions are very dependable, making MOSNet especially useful for comparing systems.
  • Next steps: The authors suggest improving the training goals (beyond simple averaging errors) and model design to better handle tricky cases, balance score ranges, and align even more closely with how humans hear quality and similarity.

In short, MOSNet is like a trained judge that listens to converted speech and gives scores much like people do—especially reliable when comparing whole systems—helping push voice conversion research forward more efficiently.
