
WhisperVQ: Robust Speech Quality Assessments

Updated 30 January 2026
  • WhisperVQ is a technique that uses Whisper’s pre-trained, linguistically-rich embeddings within MOSA-Net+ to assess speech quality and intelligibility non-intrusively.
  • The model employs dual branches—one extracting spectral and learnable features and another leveraging a frozen Whisper encoder—with a bi-directional LSTM for feature fusion.
  • Empirical results on TMHINT-QI and VoiceMOS Challenge 2023 show significant improvements, including lower MSE and higher correlation scores, over traditional and SSL-based methods.

WhisperVQ refers to the integration of Whisper’s pre-trained embedding features into the MOSA-Net+ model architecture for robust, non-intrusive speech assessment, specifically targeting joint prediction of speech quality and intelligibility. Whisper, a large-scale weakly supervised model trained on audio–transcript pairs, produces embeddings rich in phonetic and linguistic information, which confers robustness to noise and enhancement artifacts. MOSA-Net+ operationalizes these embeddings in a multi-objective framework and demonstrates significant empirical gains over prior self-supervised learning (SSL) approaches and classical intrusive metrics in both controlled and challenge-style evaluation settings (Zezario et al., 2023).

1. Model Architecture and Insertion of Whisper Embeddings

The input raw waveform $y$ is processed in two parallel branches:

Branch A: The “spectral + learnable filters” pathway begins by extracting power-spectral (PS) features via short-time Fourier transform, $PS \in \mathbb{R}^{T \times 257}$, and learnable filterbank (LFB) features via SincConv, $LFB \in \mathbb{R}^{T \times 257}$. These are concatenated along the feature axis to form $Concat_{PS+LFB} \in \mathbb{R}^{T \times 514}$, which is passed through a 12-layer 1D CNN yielding $h_{conv} \in \mathbb{R}^{T \times C}$, where $C$ increases across layers (16, 32, 64, 128).
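The Branch A feature shapes can be sketched in NumPy. In this sketch the SincConv learnable filterbank is replaced by a fixed random projection (illustration only, since SincConv weights are learned), and the 16 kHz waveform length, FFT size, and hop length are assumed values:

```python
import numpy as np

def power_spectrum(y, n_fft=512, hop=256):
    """Frame the waveform and take |STFT|^2 -> (T, n_fft//2 + 1) bins."""
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop : i * hop + n_fft] for i in range(n_frames)])
    window = np.hanning(n_fft)
    spec = np.fft.rfft(frames * window, axis=-1)
    return np.abs(spec) ** 2                    # (T, 257)

rng = np.random.default_rng(0)
y = rng.standard_normal(16000)                  # 1 s of 16 kHz audio (stand-in)

ps = power_spectrum(y)                          # PS features, (T, 257)
# Stand-in for the SincConv learnable filterbank: a fixed random
# projection onto 257 "filters" (the real LFB is learned end-to-end).
lfb = ps @ rng.standard_normal((257, 257)) * 1e-3   # LFB features, (T, 257)

concat = np.concatenate([ps, lfb], axis=-1)     # Concat_{PS+LFB}, (T, 514)
print(ps.shape, lfb.shape, concat.shape)
```

A 512-point FFT yields 257 one-sided frequency bins, matching the per-frame feature width stated above; the (T, 514) concatenation is what the 12-layer 1D CNN consumes.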

Branch B: The input $y$ is also processed by a frozen, pre-trained Whisper encoder $E_{WS}$. The encoder produces final-layer hidden states $WS = Whisper(y) \in \mathbb{R}^{T_{WS} \times D_{WS}}$ (Medium: $D_{WS}=1024$; Large-v3: $D_{WS}=1280$), which, after a lightweight adapter, become $WS_{adapter} \in \mathbb{R}^{T_{WS} \times 128}$.
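The adapter stage is a down-projection; a minimal NumPy sketch, where the frozen-encoder output and the adapter weights are random stand-ins (real embeddings would come from Whisper's final encoder layer, and the adapter would be trained):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frozen Whisper Medium encoder output: T_WS frames, D_WS = 1024.
ws = rng.standard_normal((150, 1024))

# Lightweight adapter: a linear down-projection to 128 dims
# (weights are random here; in the model they are learned).
W_adapt = rng.standard_normal((1024, 128)) / np.sqrt(1024)
ws_adapter = ws @ W_adapt                       # WS_adapter, (T_WS, 128)

print(ws_adapter.shape)
```

Because the encoder stays frozen, only this small projection (plus the rest of MOSA-Net+) receives gradients during training.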

Feature Fusion: Outputs from both branches share a 128-dimensional feature axis (the final CNN channel count matches the adapter output) and are concatenated along the time-frame axis:

$$Concat_{all} = [\, h_{conv} \,;\, WS_{adapter} \,] \in \mathbb{R}^{(T+T_{WS}) \times 128}$$

This is processed by a bi-directional LSTM (hidden size 128) followed by a fully connected layer (128 → 128). Two task-specific heads—one for Quality, one for Intelligibility—apply attention pooling and global average pooling to yield final utterance-level predictions $\hat{Q}_n$, $\hat{I}_n$.
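Assuming both branches output 128-dimensional frames, the fusion and pooling stages can be sketched as follows; the BLSTM and fully connected layers are replaced by an identity for brevity, and the attention pooling is simplified to a softmax weighting over frame scores:

```python
import numpy as np

rng = np.random.default_rng(0)
T, T_ws, D = 61, 150, 128

h_conv = rng.standard_normal((T, D))        # Branch A output (final CNN layer, C = 128)
ws_adapter = rng.standard_normal((T_ws, D)) # Branch B adapter output

# Time-axis fusion: both branches share the 128-dim feature axis.
concat_all = np.concatenate([h_conv, ws_adapter], axis=0)   # (T + T_WS, 128)

# Stand-in for the BLSTM + FC stage: identity, to keep shapes visible.
h = concat_all

# One task head: frame-level scores, then attention pooling to an
# utterance-level prediction (global averaging is folded into the weights).
w_frame = rng.standard_normal(D) / np.sqrt(D)
frame_scores = h @ w_frame                  # per-frame predictions, (T + T_WS,)
attn = np.exp(frame_scores - frame_scores.max())
attn /= attn.sum()                          # softmax attention weights
utt_score = float(attn @ frame_scores)      # utterance-level prediction

print(concat_all.shape, frame_scores.shape)
```

The second head (Intelligibility) would repeat the same pooling with its own parameters on the shared BLSTM output.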

2. Multi-Objective Loss Formulation

Dual-task training is implemented with a composite mean squared error (MSE) loss:

$$L_\text{All} = \gamma_1 L_\text{Quality} + \gamma_2 L_\text{Intelligibility}$$

with individual components:

$$L_\text{Quality} = \frac{1}{N}\sum_{n=1}^N \left[ (Q_n - \hat{Q}_n)^2 + \frac{\alpha_Q}{F_n} \sum_{l=1}^{F_n} (Q_n - \hat{q}_{nl})^2 \right]$$

$$L_\text{Intelligibility} = \frac{1}{N}\sum_{n=1}^N \left[ (I_n - \hat{I}_n)^2 + \frac{\alpha_I}{F_n} \sum_{l=1}^{F_n} (I_n - \hat{i}_{nl})^2 \right]$$

where $Q_n$, $I_n$ are ground-truth utterance labels and $\hat{Q}_n$, $\hat{I}_n$ are the predicted utterance-level scores, with $\hat{q}_{nl}$, $\hat{i}_{nl}$ as frame-level predictions. $\gamma_1, \gamma_2$ and $\alpha_Q, \alpha_I$ are task- and granularity-balancing coefficients, and $N$, $F_n$ are the batch size and number of frames per utterance, respectively.
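The composite loss can be transcribed directly into NumPy; the toy labels and the choice $\gamma_1 = \gamma_2 = \alpha_Q = \alpha_I = 1$ below are illustrative, not values from the paper:

```python
import numpy as np

def mosa_loss(Q, Q_hat, q_frames, I, I_hat, i_frames,
              gamma1=1.0, gamma2=1.0, alpha_q=1.0, alpha_i=1.0):
    """Composite utterance-level + frame-level MSE loss.

    Q, I:               (N,) ground-truth utterance labels
    Q_hat, I_hat:       (N,) utterance-level predictions
    q_frames, i_frames: lists of per-utterance frame predictions (F_n each)
    """
    N = len(Q)
    l_q = l_i = 0.0
    for n in range(N):
        # Utterance term plus frame-level term averaged over F_n frames.
        l_q += (Q[n] - Q_hat[n]) ** 2 + alpha_q * np.mean((Q[n] - q_frames[n]) ** 2)
        l_i += (I[n] - I_hat[n]) ** 2 + alpha_i * np.mean((I[n] - i_frames[n]) ** 2)
    return gamma1 * l_q / N + gamma2 * l_i / N

# Toy batch of N = 2 utterances with 3 frames each.
Q = np.array([4.0, 3.0]); Q_hat = np.array([3.5, 3.0])
I = np.array([0.9, 0.8]); I_hat = np.array([0.9, 0.7])
q_f = [np.array([3.5, 4.0, 4.5]), np.array([3.0, 3.0, 3.0])]
i_f = [np.array([0.9, 0.9, 0.9]), np.array([0.8, 0.8, 0.8])]
loss = mosa_loss(Q, Q_hat, q_f, I, I_hat, i_f)
print(round(loss, 4))  # → 0.2133
```

The frame-level term penalizes every frame prediction against the single utterance label, which regularizes the frame scores toward the utterance target.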

3. Fusion of Whisper and Other SSL Features

Whisper embeddings ($WS$) can be combined with self-supervised models such as Wav2Vec 2.0 (W2V) or MMS by extracting their respective embeddings, applying individual adapter layers (down-projection to $\mathbb{R}^{* \times 128}$), and concatenating along the channel axis:

$$H_\text{fused} = \left[\text{Adapter}_{WS}(WS) \;\big\|\; \text{Adapter}_{SSL}(X_{SSL})\right]$$

This fused feature is appended to the main CNN-driven pathway as described above. Notably, channel-wise concatenation suffices, with no additional weighting or gating. Empirically, combining Whisper and SSL model embeddings yields only marginal improvements over Whisper alone, suggesting that Whisper embeddings already sufficiently capture the relevant latent structure for subjective quality and intelligibility prediction.
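A minimal sketch of the channel-wise adapter fusion, assuming both encoders have been aligned to a common frame count; the embedding widths (1024 for Whisper Medium, 768 for a Wav2Vec 2.0 Base-sized model) and the random adapter weights are stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 150  # assume both encoders are aligned to a common frame count

ws = rng.standard_normal((T, 1024))    # Whisper embeddings (stand-in)
ssl = rng.standard_normal((T, 768))    # SSL embeddings, e.g. W2V (stand-in)

# Per-source adapters: linear down-projections to 128 dims each.
A_ws = rng.standard_normal((1024, 128)) / np.sqrt(1024)
A_ssl = rng.standard_normal((768, 128)) / np.sqrt(768)

# Channel-wise concatenation -- no extra weighting or gating.
h_fused = np.concatenate([ws @ A_ws, ssl @ A_ssl], axis=-1)  # (T, 256)
print(h_fused.shape)
```

Each source keeps its own adapter, so fusing an additional SSL model only widens the channel axis rather than changing the rest of the pipeline.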

4. Empirical Results: TMHINT-QI and Baselines

On the TMHINT-QI dataset, the following Pearson’s linear correlation (LCC), Spearman’s rank correlation (SRCC), and MSE values are reported for various feature branches in MOSA-Net+:

Quality Prediction

Model LCC SRCC MSE
HuBERT 0.777 0.724 0.411
Wav2Vec 2.0 0.804 0.758 0.360
MMS 0.811 0.766 0.362
Whisper 0.815 0.776 0.344
Whisper+MMS 0.816 0.777 0.344
Whisper+W2V 0.816 0.778 0.343

Intelligibility Prediction

Model LCC SRCC MSE
HuBERT 0.740 0.698 0.023
Wav2Vec 2.0 0.796 0.712 0.018
MMS 0.809 0.732 0.018
Whisper 0.807 0.738 0.017
Whisper+MMS 0.785 0.744 0.020
Whisper+W2V 0.807 0.733 0.017

When compared with previous state-of-the-art and classical baselines (MOSA-Net, MOS-SSL, intrusive metrics), MOSA-Net+ (with Whisper) achieves the highest correlations and lowest MSEs, outperforming both intrusive and SSL-based methodologies. For example, on Quality, MOSA-Net+ attains LCC=0.815, SRCC=0.776, MSE=0.344, significantly exceeding proxies such as CSIG/CBAK/COVL and surpassing MOSA-Net and MOS-SSL.

5. VoiceMOS Challenge 2023 and Robustness Evaluations

MOSA-Net+ was evaluated in the noisy-and-enhanced track of the VoiceMOS Challenge 2023. It achieved the top rank among nine systems across both utterance-level (UTT) and system-level (SYS) performance criteria:

Metric MOSA-Net+ 2nd-Place (LE-SSL)
UTT-MSE 0.343 0.688
UTT-LCC 0.803 0.684
UTT-SRCC 0.780 0.636
UTT-KTAU 0.594 0.475
SYS-MSE 0.082 0.404
SYS-LCC 0.952 0.769
SYS-SRCC 0.956 0.749
SYS-KTAU 0.828 0.635

Relative to the second-place LE-SSL system, MOSA-Net+ reduces system-level MSE by roughly 80% (0.082 vs. 0.404) and raises the system-level correlation measures by approximately 0.18–0.21. This performance establishes Whisper embeddings as highly robust under diverse noise and enhancement conditions.

6. Analysis of Whisper Embedding Effectiveness

Whisper’s large-scale weak supervision (training on audio–transcript pairs) endows its embeddings with both phonetic and linguistic content, enhancing their capacity to represent features critical for both quality and intelligibility prediction under noisy, variable conditions. The adapter layer is essential for reprojecting these high-dimensional features into a task-specific subspace suitable for downstream modeling. The empirical results indicate that Whisper’s representations generalize to unseen noise and enhancement scenarios without requiring fine-tuning of the Whisper encoder.

A plausible implication is that models integrating Whisper-like weakly supervised embeddings may represent a new performance ceiling for non-intrusive speech assessment frameworks, especially in variability-prone evaluation environments.

7. Context and Implications for Non-Intrusive Speech Assessment

WhisperVQ, as instantiated in MOSA-Net+, demonstrates that weakly supervised, linguistically rich speech representations can outperform classical and self-supervised acoustic features in non-intrusive speech assessment tasks. The architectural modularity allows for straightforward integration with other SSL features, but only marginal performance gains are observed beyond Whisper alone. This suggests diminishing utility in fusing multiple SSL sources when the pre-trained embedding already carries sufficient linguistic and phonetic content.

Further, the model’s scalability to challenge-style, real-world evaluation (such as VoiceMOS 2023) signals a robust pathway forward in the field, potentially affecting both academic and industrial standards for automatic, reference-free speech quality and intelligibility assessment (Zezario et al., 2023).
