WhisperVQ: Robust Speech Quality Assessments
- WhisperVQ is a technique that uses Whisper's pre-trained, linguistically rich embeddings within MOSA-Net+ to assess speech quality and intelligibility non-intrusively.
- The model employs dual branches—one extracting spectral and learnable features and another leveraging a frozen Whisper encoder—with a bi-directional LSTM for feature fusion.
- Empirical results on TMHINT-QI and VoiceMOS Challenge 2023 show significant improvements, including lower MSE and higher correlation scores, over traditional and SSL-based methods.
WhisperVQ refers to the integration of Whisper's pre-trained embedding features into the MOSA-Net+ model architecture for robust, non-intrusive speech assessment, specifically targeting the prediction of both speech quality and intelligibility. Whisper, a large-scale weakly supervised model trained on audio–transcript pairs, produces embeddings with phonetic and linguistic richness that confer robustness to noise and enhancement artifacts. MOSA-Net+ operationalizes these embeddings in a multi-objective framework and demonstrates significant empirical gains over prior self-supervised learning (SSL) approaches and classical intrusive metrics in both controlled and challenge-style evaluation settings (Zezario et al., 2023).
1. Model Architecture and Insertion of Whisper Embeddings
The input raw waveform is processed in two parallel branches:
Branch A: The “spectral + learnable filters” pathway begins by extracting power-spectral (PS) features via short-time Fourier transform, $X_{\mathrm{PS}}$, and learnable filterbank (LFB) features via SincConv, $X_{\mathrm{LFB}}$. These are concatenated along the feature axis to form $X_0 = [X_{\mathrm{PS}}; X_{\mathrm{LFB}}]$, which is passed through a 12-layer 1D CNN yielding $X_{\mathrm{CNN}} \in \mathbb{R}^{T \times C}$, where the channel count $C$ increases across layers (16, 32, 64, 128).
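As a rough illustration of the PS front-end in Branch A, the following numpy sketch frames a waveform and computes power-spectral features; the window, FFT size, and hop length here are assumed toy values, not the paper's configuration:

```python
import numpy as np

def power_spectral_features(wave, n_fft=512, hop=256):
    """Frame the waveform, window each frame, and compute the
    power spectrum -- a minimal stand-in for the STFT front-end."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=-1)   # complex STFT, (T, n_fft//2 + 1)
    return np.abs(spec) ** 2              # power-spectral features

rng = np.random.default_rng(0)
wave = rng.standard_normal(16000)         # 1 s of toy audio at 16 kHz
ps = power_spectral_features(wave)
print(ps.shape)                           # (61, 257)
```

The LFB branch would replace the fixed Hann/FFT pipeline with learnable SincConv filters operating on the same raw waveform.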
Branch B: The input is also processed by a frozen, pre-trained Whisper encoder $E_W$. The encoder produces final-layer hidden states $H_W \in \mathbb{R}^{T' \times d}$ (Medium: $d = 1024$, Large-v3: $d = 1280$), which, after a lightweight adapter, become $\tilde{H}_W$.
Feature Fusion: Outputs from both branches are concatenated along the time-frame axis:

$$X_{\mathrm{fused}} = \mathrm{Concat}\big(X_{\mathrm{CNN}},\, \tilde{H}_W\big)$$

This is processed by a bi-directional LSTM (hidden size 128) followed by a fully connected layer (128 → 128). Two task-specific heads—one for Quality, one for Intelligibility—apply attention pooling and global average pooling to yield final utterance-level predictions $\hat{Q}_s$, $\hat{I}_s$.
2. Multi-Objective Loss Formulation
Dual-task training is implemented with a composite mean squared error (MSE) loss:

$$L = L_Q + L_I,$$

with individual components:

$$L_Q = \frac{1}{S}\sum_{s=1}^{S}\Big[(\hat{Q}_s - Q_s)^2 + \frac{\alpha_Q}{F_s}\sum_{f=1}^{F_s}(\hat{q}_{s,f} - Q_s)^2\Big], \qquad
L_I = \frac{1}{S}\sum_{s=1}^{S}\Big[(\hat{I}_s - I_s)^2 + \frac{\alpha_I}{F_s}\sum_{f=1}^{F_s}(\hat{i}_{s,f} - I_s)^2\Big],$$

where $Q_s$, $I_s$ are ground-truth utterance labels and $\hat{Q}_s$, $\hat{I}_s$ are predicted scores (utterance), with $\hat{q}_{s,f}$, $\hat{i}_{s,f}$ as frame-level predictions. $\alpha_Q$ and $\alpha_I$ are task- and granularity-balancing coefficients, and $S$, $F_s$ are the batch size and number of frames per utterance, respectively.
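A minimal numpy sketch of one task's composite loss, following the formulation above (the toy batch values are illustrative only):

```python
import numpy as np

def composite_mse(y_true, y_utt, y_frames, alpha=1.0):
    """Per-task loss: utterance-level MSE plus an alpha-weighted
    frame-level MSE, where every frame prediction is compared
    against its utterance's ground-truth label."""
    utt = np.mean((np.asarray(y_utt) - np.asarray(y_true)) ** 2)
    frame = np.mean([np.mean((f - t) ** 2)
                     for f, t in zip(y_frames, y_true)])
    return utt + alpha * frame

# Toy batch: one utterance with label 3.0, utterance prediction 2.0,
# and two frame-level predictions.
L_Q = composite_mse([3.0], [2.0], [np.array([3.0, 1.0])])
print(L_Q)  # 1.0 (utterance term) + 2.0 (frame term) = 3.0
```

The full objective sums this term over both tasks, `L = L_Q + L_I`, with each task's `alpha` balancing the frame-level contribution.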
3. Fusion of Whisper and Other SSL Features
Whisper embeddings ($\tilde{H}_W$) can be combined with self-supervised models such as Wav2Vec 2.0 (W2V) or MMS by extracting their respective embeddings, applying individual adapter layers (down-projection to a common dimension), and concatenating along the channel axis:

$$H_{\mathrm{fused}} = \mathrm{Concat}\big(\tilde{H}_W,\, \tilde{H}_{\mathrm{SSL}}\big).$$
This fused feature is appended to the main CNN-driven pathway as described above. Notably, channel-wise concatenation suffices, with no additional weighting or gating. Empirically, combining Whisper and SSL model embeddings yields only marginal improvements over Whisper alone, suggesting that Whisper embeddings already sufficiently capture the relevant latent structure for subjective quality and intelligibility prediction.
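The adapter-plus-concatenation fusion can be sketched with random stand-in embeddings; the adapter width of 256 and the plain linear adapters are assumptions for illustration, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
h_whisper = rng.standard_normal((T, 1280))  # Whisper Large-v3 hidden states
h_ssl = rng.standard_normal((T, 1024))      # e.g. Wav2Vec 2.0 hidden states

d_adapt = 256                               # assumed common adapter width
W_w = rng.standard_normal((1280, d_adapt)) * 0.02  # toy linear adapters
W_s = rng.standard_normal((1024, d_adapt)) * 0.02

# Down-project each embedding, then concatenate along the channel axis.
fused = np.concatenate([h_whisper @ W_w, h_ssl @ W_s], axis=-1)
print(fused.shape)  # (50, 512)
```

No gating or learned weighting is applied after concatenation, matching the observation that plain channel-wise fusion suffices.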
4. Empirical Results: TMHINT-QI and Baselines
On the TMHINT-QI dataset, the following Pearson’s linear correlation (LCC), Spearman’s rank correlation (SRCC), and MSE values are reported for various feature branches in MOSA-Net+:
Quality Prediction
| Model | LCC | SRCC | MSE |
|---|---|---|---|
| HuBERT | 0.777 | 0.724 | 0.411 |
| Wav2Vec 2.0 | 0.804 | 0.758 | 0.360 |
| MMS | 0.811 | 0.766 | 0.362 |
| Whisper | 0.815 | 0.776 | 0.344 |
| Whisper+MMS | 0.816 | 0.777 | 0.344 |
| Whisper+W2V | 0.816 | 0.778 | 0.343 |
Intelligibility Prediction
| Model | LCC | SRCC | MSE |
|---|---|---|---|
| HuBERT | 0.740 | 0.698 | 0.023 |
| Wav2Vec 2.0 | 0.796 | 0.712 | 0.018 |
| MMS | 0.809 | 0.732 | 0.018 |
| Whisper | 0.807 | 0.738 | 0.017 |
| Whisper+MMS | 0.785 | 0.744 | 0.020 |
| Whisper+W2V | 0.807 | 0.733 | 0.017 |
When compared with previous state-of-the-art and classical baselines (MOSA-Net, MOS-SSL, intrusive metrics), MOSA-Net+ (with Whisper) achieves the highest correlations and lowest MSEs, outperforming both intrusive and SSL-based methodologies. For example, on Quality, MOSA-Net+ attains LCC=0.815, SRCC=0.776, MSE=0.344, significantly exceeding proxies such as CSIG/CBAK/COVL and surpassing MOSA-Net and MOS-SSL.
5. VoiceMOS Challenge 2023 and Robustness Evaluations
MOSA-Net+ was evaluated in the noisy-and-enhanced track of the VoiceMOS Challenge 2023. It achieved the top rank among nine systems across both utterance-level (UTT) and system-level (SYS) performance criteria:
| Metric | MOSA-Net+ | 2nd-Place (LE-SSL) |
|---|---|---|
| UTT-MSE | 0.343 | 0.688 |
| UTT-LCC | 0.803 | 0.684 |
| UTT-SRCC | 0.780 | 0.636 |
| UTT-KTAU | 0.594 | 0.475 |
| SYS-MSE | 0.082 | 0.404 |
| SYS-LCC | 0.952 | 0.769 |
| SYS-SRCC | 0.956 | 0.749 |
| SYS-KTAU | 0.828 | 0.635 |
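The evaluation criteria in the tables above (MSE, LCC, SRCC, KTAU) are standard and can be computed with `scipy.stats` on any pair of predicted and ground-truth score vectors; the toy scores below are illustrative only:

```python
import numpy as np
from scipy import stats

pred = np.array([3.1, 2.4, 4.0, 1.8, 3.6])   # toy predicted MOS
true = np.array([3.0, 2.5, 4.2, 2.0, 3.5])   # toy ground-truth MOS

mse = float(np.mean((pred - true) ** 2))      # mean squared error
lcc, _ = stats.pearsonr(pred, true)           # Pearson's linear correlation
srcc, _ = stats.spearmanr(pred, true)         # Spearman's rank correlation
ktau, _ = stats.kendalltau(pred, true)        # Kendall's tau
```

System-level (SYS) variants average predictions and labels per system before computing the same statistics, whereas utterance-level (UTT) variants use the raw per-utterance scores.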
Compared to baseline non-Whisper systems (e.g., UTMOS, SSL-MOS), MOSA-Net+ reduces system-level MSE by over 90% and increases correlation measures by approximately 0.18. This performance establishes Whisper embeddings as highly robust under diverse noise and enhancement conditions.
6. Analysis of Whisper Embedding Effectiveness
Whisper’s large-scale weak supervision (training on audio–transcript pairs) imparts both phonetic and linguistic content to the produced embeddings, enhancing their capacity to represent features critical for both quality and intelligibility prediction in noisy, variable conditions. The adapter layer is essential for re-projecting these high-dimensional features into a task-specific subspace suitable for downstream modeling. The empirical results indicate that Whisper’s representations facilitate generalization to unseen noise and enhancement scenarios without fine-tuning of the Whisper encoder.
A plausible implication is that models integrating Whisper-like weakly supervised embeddings may represent a new performance ceiling for non-intrusive speech assessment frameworks, especially in variability-prone evaluation environments.
7. Context and Implications for Non-Intrusive Speech Assessment
WhisperVQ, as instantiated in MOSA-Net+, demonstrates that weakly supervised, linguistically rich speech representations can outperform classical and self-supervised acoustic features in non-intrusive speech assessment tasks. The architectural modularity allows straightforward integration with other SSL features, but only marginal performance gains are observed beyond Whisper alone. This suggests diminishing utility in fusing multiple SSL sources when the pre-trained embedding already carries sufficient linguistic and phonetic content.
Further, the model’s scalability to challenge-style, real-world evaluation (such as VoiceMOS 2023) signals a robust pathway forward in the field, potentially affecting both academic and industrial standards for automatic, reference-free speech quality and intelligibility assessment (Zezario et al., 2023).