MOSNet: Deep Learning based Objective Assessment for Voice Conversion
Abstract: Existing objective evaluation metrics for voice conversion (VC) do not always correlate with human perception, so training VC models with such criteria may not effectively improve the naturalness and similarity of converted speech. In this paper, we propose deep learning-based assessment models that predict human ratings of converted speech. We adopt convolutional and recurrent neural network architectures to build a mean opinion score (MOS) predictor, termed MOSNet. The proposed models are evaluated on the large-scale listening-test results of the Voice Conversion Challenge (VCC) 2018. Experimental results show that the predicted scores of MOSNet are highly correlated with human MOS ratings at the system level and fairly correlated at the utterance level. We also modified MOSNet to predict similarity scores, and preliminary results show that these predictions are likewise fairly correlated with human ratings. These results confirm that the proposed models can serve as a computational evaluator of the MOS of VC systems, reducing the need for expensive human rating.
Explain it Like I'm 14
Overview
This paper is about teaching a computer to judge the quality of “voice conversion” speech the way people do. Voice conversion means taking a recording of one person and changing it so it sounds like another person, while keeping the words the same. People usually rate how natural the converted voice sounds and how similar it is to the target speaker using a 1–5 scale called MOS (Mean Opinion Score). But running big listening tests is slow and expensive. The authors build a system called MOSNet that listens to the audio and predicts the MOS automatically, aiming to match human opinions.
What questions did the paper try to answer?
- Can a deep learning model predict human naturalness ratings (MOS) for voice-converted speech well enough to reduce the need for large listening tests?
- Can the same idea also predict how similar the converted voice sounds to the target speaker?
- How close can predictions get at two levels:
- Utterance level: scores for individual audio clips.
- System level: scores averaged over all clips from a given voice conversion system (like grading the whole bakery vs. a single cookie).
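The difference between the two levels can be shown in a few lines of Python (the scores below are made up for illustration):

```python
import numpy as np

# Hypothetical predicted MOS per clip for two VC systems.
utterance_scores = {
    "system_A": [3.1, 2.8, 3.4, 3.0],
    "system_B": [4.2, 3.9, 4.1, 4.4],
}

# System-level score: average over all clips from one system.
system_scores = {name: float(np.mean(scores))
                 for name, scores in utterance_scores.items()}
```

Averaging over many clips smooths out clip-to-clip noise, which is part of why system-level scores are easier to predict accurately than individual utterance scores.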
How did the researchers study it?
They used a large, real-world dataset from the Voice Conversion Challenge (VCC) 2018, where thousands of people rated many converted speech samples for naturalness and similarity.
Here’s the approach in everyday terms:
- Turning sound into pictures: Each audio clip was turned into a “spectrogram,” which is like a heat map showing which tones (frequencies) are present over time. Think of it as a moving equalizer display captured frame by frame.
- Pattern-spotting models:
- CNN (Convolutional Neural Network): Great at spotting local patterns in images; here it finds time–frequency patterns in the spectrogram.
- BLSTM (Bidirectional Long Short-Term Memory): A kind of neural network with memory that looks at a sequence both forward and backward, like reading a sentence in both directions to understand context.
- CNN-BLSTM: Combines both—first spots local patterns (CNN), then understands long-term changes over time (BLSTM).
- Predicting scores per moment, then averaging: The model predicts a tiny “quality score” for each short time slice (“frame”) of the audio and then averages these to get one score for the whole clip. This helps the model learn stable patterns instead of being fooled by brief glitches.
- Training with human scores: They trained the models using the human MOS as the “correct answers,” trying to make the model’s predictions as close as possible. They measured success using correlation (how well the ups and downs of predictions match human ratings) and error.
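The "turning sound into pictures" step above can be sketched with a plain numpy short-time Fourier transform. The frame and hop sizes here are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=512, hop=256):
    """Slice the waveform into overlapping windowed frames and take the
    magnitude of each frame's FFT, giving a (frames x frequencies) map."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, frame_len//2 + 1)
```

Each row of the result is one short time slice ("frame"), and each column is a frequency band; the model reads this matrix like an image.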
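The CNN-BLSTM combination described above can be sketched in PyTorch. This is a hedged toy version, not the authors' exact architecture; the layer counts, channel sizes, and hidden dimensions are assumptions chosen for brevity:

```python
import torch
import torch.nn as nn

class TinyMOSNet(nn.Module):
    """Illustrative CNN-BLSTM MOS predictor (sizes are assumptions)."""
    def __init__(self, n_freq=257):
        super().__init__()
        # CNN stage: spots local time-frequency patterns; the second conv
        # downsamples the frequency axis by 4.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, stride=(1, 4), padding=1),
            nn.ReLU(),
        )
        # BLSTM stage: reads the frame sequence forward and backward.
        self.blstm = nn.LSTM(16 * ((n_freq + 3) // 4), 64,
                             batch_first=True, bidirectional=True)
        # Head: one quality score per frame; clip score = mean over frames.
        self.head = nn.Linear(128, 1)

    def forward(self, spec):                    # spec: (batch, frames, n_freq)
        x = self.conv(spec.unsqueeze(1))        # (batch, 16, frames, n_freq/4)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        x, _ = self.blstm(x)                    # (batch, frames, 128)
        frame_scores = self.head(x).squeeze(-1)  # one score per frame
        return frame_scores.mean(dim=1), frame_scores
```

The two return values mirror the paper's setup: per-frame scores for training stability, and their average as the clip-level MOS prediction.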
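The "predict per moment, then average" idea, combined with training against the human score at both levels, can be sketched as a numpy loss function. The frame term reuses the clip's single human MOS as the target for every frame; the weighting `alpha` is an assumption:

```python
import numpy as np

def utterance_score(frame_scores):
    """Clip-level score: average of the per-frame scores."""
    return float(np.mean(frame_scores))

def combined_loss(frame_scores, human_mos, alpha=1.0):
    """Utterance-level squared error plus a frame-level term that pushes
    every frame's score toward the clip's human MOS."""
    utt_err = (utterance_score(frame_scores) - human_mos) ** 2
    frame_err = float(np.mean((frame_scores - human_mos) ** 2))
    return utt_err + alpha * frame_err
```

The frame term penalizes wild swings between frames even when they average out, which is why it makes the per-frame scores steadier.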
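The "correlation" used to measure success (how well the ups and downs of predictions match human ratings) is the Pearson linear correlation coefficient; numpy's `np.corrcoef` computes it directly, but spelled out it is:

```python
import numpy as np

def pearson_lcc(pred, true):
    """Pearson linear correlation: +1 = perfect agreement in trend,
    0 = no linear relationship, -1 = perfectly opposite trend."""
    p = np.asarray(pred, float) - np.mean(pred)
    t = np.asarray(true, float) - np.mean(true)
    return float((p * t).sum() / np.sqrt((p ** 2).sum() * (t ** 2).sum()))
```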
They also tried a modified version to predict speaker similarity, by feeding in a pair of clips (converted voice and target voice) and asking the model to say how similar they sound.
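One plausible way to feed such a pair to the network (an assumption for illustration; the paper's exact pairing scheme may differ) is to stack the two spectrograms as channels, so the model can compare them frame by frame:

```python
import numpy as np

# Placeholder spectrograms of matching shape: (frames, frequency bins).
converted = np.random.rand(300, 257)   # converted-voice clip
target = np.random.rand(300, 257)      # target-speaker clip

# Stack as two channels; a CNN can then learn cross-channel comparisons.
paired_input = np.stack([converted, target], axis=0)  # (2, 300, 257)
```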
What did they find?
Here are the main results and why they matter:
- Strong at the system level: When averaging over all clips from each voice conversion system, MOSNet’s predictions matched human judgments very well. The best model (CNN-BLSTM) reached a very high correlation (about 0.96), close to how consistent humans are with each other at that level. This means MOSNet can reliably rank and compare different voice conversion systems without needing humans to rate everything.
- Fair at the utterance level: For individual clips, predictions had a moderate correlation with human scores (about 0.64). Even humans don’t agree perfectly on single clips (the paper shows human–human correlation is only around 0.80 at this level), so there’s a natural limit here. Still, the model does a reasonable job.
- Predicting similarity is possible: With a simple extension, MOSNet could also predict whether the converted voice sounds like the target speaker with about 70% accuracy and fair correlation to human ratings—useful for judging identity similarity, not just naturalness.
- Generalizes to other data: A model trained on VCC 2018 also worked well on the earlier VCC 2016 challenge (system-level correlation ~0.92), suggesting it isn’t just memorizing one dataset.
- Why frame-by-frame training helped: Including “frame-level” errors during training made the model’s per-frame scores steadier, which led to better overall clip predictions.
Why does this matter, and what could come next?
- Faster, cheaper evaluation: MOSNet can act like an automated, “always available” listener. This can save time and money by reducing how many human ratings are needed during development.
- Better model training: If we can reliably score naturalness and similarity automatically, researchers can more easily tune and improve voice conversion systems.
- Real-world limit acknowledged: Even people disagree on single-clip ratings, so utterance-level predictions will never be perfect. But system-level predictions are very dependable, making MOSNet especially useful for comparing systems.
- Next steps: The authors suggest improving the training goals (beyond simple averaging errors) and model design to better handle tricky cases, balance score ranges, and align even more closely with how humans hear quality and similarity.
In short, MOSNet is like a trained judge that listens to converted speech and gives scores much like people do—especially reliable when comparing whole systems—helping push voice conversion research forward more efficiently.