Ichigo: Mixed-Modal Realtime Voice Assistant
- The paper demonstrates that integrating frozen Whisper embeddings into MOSA-Net+ significantly improves the estimation of speech quality and intelligibility under varied acoustic conditions.
- It employs a dual-branch architecture combining conventional spectral features with Whisper-based embeddings processed via a lightweight adapter and Bi-LSTM for robust fusion.
- Experimental evaluations on TMHINT-QI and VoiceMOS Challenge 2023 reveal marked gains in correlation metrics and reduced MSE compared to baseline models.
WhisperVQ refers to the application of embedding features extracted from the Whisper large-scale weakly supervised model within the MOSA-Net+ speech assessment framework. The integration of Whisper embeddings with conventional spectral features in MOSA-Net+ yields performance gains in estimating subjective speech quality and intelligibility under noisy and enhanced conditions, as demonstrated by extensive evaluation on the TMHINT-QI dataset and the VoiceMOS Challenge 2023 (Zezario et al., 2023).
1. Model Architecture and Embedding Integration
MOSA-Net+ receives raw waveform input and processes it via two parallel branches. The first branch extracts conventional spectral features: power-spectral (PS) features via STFT and learnable-filterbank (LFB) features via SincConv. These are concatenated and transformed by a 12-layer 1D CNN whose channel count increases from 16 to 128.
The second branch uses a frozen pre-trained Whisper encoder. Its final-layer hidden states are extracted (dimension $1024$ for Medium, $1280$ for Large-v3) and projected by a lightweight adapter to a compact representation.
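The projection step can be illustrated with a toy example. This is a minimal sketch that assumes the adapter is a single linear layer; the source only calls it "lightweight", so the function name `linear_adapter`, the toy sizes, and the random weights are all hypothetical.

```python
import random

def linear_adapter(frames, weight, bias):
    """Project every frame (length d_in) to length d_out via y = W x + b."""
    projected = []
    for x in frames:
        y = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weight, bias)]
        projected.append(y)
    return projected

random.seed(0)
# Toy sizes for readability; real Whisper widths are 1024 (Medium) or 1280 (Large-v3).
D_IN, D_OUT, T = 8, 3, 5
frames = [[random.gauss(0, 1) for _ in range(D_IN)] for _ in range(T)]
weight = [[random.gauss(0, 0.1) for _ in range(D_IN)] for _ in range(D_OUT)]
bias = [0.0] * D_OUT

projected = linear_adapter(frames, weight, bias)
print(len(projected), len(projected[0]))  # number of frames is unchanged, width shrinks
```

The encoder stays frozen, so in a setup like this only the adapter weights (and the downstream layers) would receive gradient updates.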
Feature fusion is performed by stacking the time-frame sequences from the two branches. The fused sequence is processed by a Bi-LSTM (hidden size 128) and a fully-connected layer. Two task-specific heads (for Quality and Intelligibility) use attention pooling, a linear layer, and global average pooling to produce frame-level and utterance-level predictions.
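The relationship between the two prediction granularities can be shown in a few lines. This sketch covers only the final global average pooling step, with made-up frame scores; the attention pooling and linear layer that precede it are not modeled, and `global_average_pool` is a hypothetical helper name.

```python
def global_average_pool(frame_scores):
    """Collapse per-frame scores into a single utterance-level score."""
    if not frame_scores:
        raise ValueError("need at least one frame")
    return sum(frame_scores) / len(frame_scores)

# Illustrative frame-level quality scores for one utterance (not real model output).
frame_quality = [3.0, 3.5, 4.0, 3.5]
utterance_quality = global_average_pool(frame_quality)
print(utterance_quality)  # 3.5
```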
2. Multi-Objective Loss Formulation
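Training employs a joint multi-objective loss that sums a Quality term and an Intelligibility term. Each term combines an utterance-level error with a frame-level error, averaged over the frames of each utterance and then over all utterances. Scalar weights control the balance between the two tasks and between the two granularities.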
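One way such a loss can be written out is sketched below. It assumes squared-error terms, and every symbol is an illustrative choice rather than the paper's notation: $N$ utterances, $F_n$ frames in utterance $n$, ground-truth labels $Q_n$ and $I_n$, utterance-level predictions $\hat{Q}_n$ and $\hat{I}_n$, frame-level predictions $\hat{q}_{n,f}$ and $\hat{i}_{n,f}$, task weights $\gamma_Q, \gamma_I$, and granularity weights $\alpha_Q, \alpha_I$.

```latex
\begin{aligned}
\mathcal{L} &= \gamma_Q \,\mathcal{L}_Q + \gamma_I \,\mathcal{L}_I, \\
\mathcal{L}_Q &= \frac{1}{N}\sum_{n=1}^{N}\left[\left(Q_n - \hat{Q}_n\right)^2
  + \frac{\alpha_Q}{F_n}\sum_{f=1}^{F_n}\left(Q_n - \hat{q}_{n,f}\right)^2\right], \\
\mathcal{L}_I &= \frac{1}{N}\sum_{n=1}^{N}\left[\left(I_n - \hat{I}_n\right)^2
  + \frac{\alpha_I}{F_n}\sum_{f=1}^{F_n}\left(I_n - \hat{i}_{n,f}\right)^2\right].
\end{aligned}
```

In this form the frame-level term pushes every frame prediction toward the utterance label, which acts as a regularizer on the utterance-level estimate.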
3. Fusion Strategies with Other Self-Supervised Learning (SSL) Features
Whisper embeddings can be combined with alternative SSL features (e.g., Wav2Vec 2.0, MMS) by extracting each model’s embeddings, passing each through its own adapter, and concatenating the adapted outputs channel-wise. No weighted sum or gating mechanism is employed. The fused representation is appended to the CNN branch and processed identically to the single-model case. Experimental results indicate that adding SSL features provides only marginal improvements over Whisper alone.
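Channel-wise concatenation can be sketched as follows. The example assumes the adapted streams are already frame-aligned (same number of frames), which the text does not state; `concat_channelwise` and the toy values are hypothetical.

```python
def concat_channelwise(*streams):
    """Join per-frame feature vectors from several streams along the channel axis."""
    lengths = {len(s) for s in streams}
    if len(lengths) != 1:
        raise ValueError("all streams must have the same number of frames")
    # For each time step, flatten the frames of all streams into one vector.
    return [[v for frame in frames_at_t for v in frame]
            for frames_at_t in zip(*streams)]

# Two frames each: adapted Whisper features (width 2) and adapted Wav2Vec 2.0 features (width 1).
whisper_adapted = [[1.0, 2.0], [3.0, 4.0]]
w2v_adapted = [[10.0], [20.0]]

fused = concat_channelwise(whisper_adapted, w2v_adapted)
print(fused)  # [[1.0, 2.0, 10.0], [3.0, 4.0, 20.0]]
```

Plain concatenation leaves it to the downstream layers to decide how much each stream matters, which is consistent with the absence of gating or weighted sums described above.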
4. Performance on TMHINT-QI Dataset
Evaluation metrics include Pearson’s linear correlation (LCC), Spearman’s rank correlation (SRCC), and mean squared error (MSE) for both Quality and Intelligibility tasks.
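For reference, the three metrics can be computed with the standard library alone. This is a minimal sketch with made-up scores; the helper names are illustrative, and ties in the rank correlation are handled with average ranks.

```python
import math

def mse(truth, pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth)

def pearson(x, y):
    """Pearson linear correlation coefficient (LCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Illustrative listener scores and model predictions (not from the paper).
truth = [1.0, 2.0, 3.0, 4.0, 5.0]
pred = [1.5, 1.9, 3.2, 3.9, 4.6]
print(round(pearson(truth, pred), 3), round(spearman(truth, pred), 3), round(mse(truth, pred), 3))
```

LCC rewards predictions that are linearly related to the labels, SRCC only checks that the ordering is preserved, and MSE penalizes absolute deviation, so a model can score well on one and poorly on another.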
Table: TMHINT-QI Model Comparison
| Model | Quality LCC | Quality SRCC | Quality MSE | Intelligibility LCC | Intelligibility SRCC | Intelligibility MSE |
|---|---|---|---|---|---|---|
| HuBERT | 0.777 | 0.724 | 0.411 | 0.740 | 0.698 | 0.023 |
| Wav2Vec 2.0 | 0.804 | 0.758 | 0.360 | 0.796 | 0.712 | 0.018 |
| MMS | 0.811 | 0.766 | 0.362 | 0.809 | 0.732 | 0.018 |
| Whisper | 0.815 | 0.776 | 0.344 | 0.807 | 0.738 | 0.017 |
| Whisper+MMS | 0.816 | 0.777 | 0.344 | 0.785 | 0.744 | 0.020 |
| Whisper+W2V | 0.816 | 0.778 | 0.343 | 0.807 | 0.733 | 0.017 |
In comparison to previous models (MOSA-Net, MOS-SSL, intrusive metrics), Whisper-based MOSA-Net+ demonstrates superior correlations and reduced MSE for both quality and intelligibility, confirming the efficacy of Whisper embeddings.
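The size of the gain from fusing a second SSL model can be read directly from the TMHINT-QI table. The sketch below copies the three Whisper rows from that table and prints the per-column difference against Whisper alone; `delta_vs` is a hypothetical helper.

```python
COLUMNS = ("Q-LCC", "Q-SRCC", "Q-MSE", "I-LCC", "I-SRCC", "I-MSE")

# Values copied from the TMHINT-QI table above.
ROWS = {
    "Whisper":     (0.815, 0.776, 0.344, 0.807, 0.738, 0.017),
    "Whisper+MMS": (0.816, 0.777, 0.344, 0.785, 0.744, 0.020),
    "Whisper+W2V": (0.816, 0.778, 0.343, 0.807, 0.733, 0.017),
}

def delta_vs(baseline, other):
    """Per-column difference (other minus baseline), rounded to table precision."""
    return {c: round(o - b, 3)
            for c, b, o in zip(COLUMNS, ROWS[baseline], ROWS[other])}

for name in ("Whisper+MMS", "Whisper+W2V"):
    print(name, delta_vs("Whisper", name))
```

The Quality columns move by at most 0.002 in either fused configuration, and the Intelligibility columns do not improve consistently.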
5. Benchmark Performance in VoiceMOS Challenge 2023
MOSA-Net+ achieved top-ranked results among nine systems in the noisy-and-enhanced track of the VoiceMOS Challenge 2023 across both utterance- and system-level metrics.
Table: VoiceMOS 2023 MOSA-Net+ vs. 2nd Place
| Metric | MOSA-Net+ | 2nd-Place (LE-SSL) |
|---|---|---|
| UTT-MSE | 0.343 | 0.688 |
| UTT-LCC | 0.803 | 0.684 |
| UTT-SRCC | 0.780 | 0.636 |
| UTT-KTAU | 0.594 | 0.475 |
| SYS-MSE | 0.082 | 0.404 |
| SYS-LCC | 0.952 | 0.769 |
| SYS-SRCC | 0.956 | 0.749 |
| SYS-KTAU | 0.828 | 0.635 |
Relative to baseline SSL-MOS and UTMOS systems, MOSA-Net+ reduces MSE by over 90% at system level and increases correlation metrics by approximately 0.18.
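The margin over the second-place system can be worked out from the table above. Note that this sketch uses only the two tabulated columns (MOSA-Net+ and LE-SSL), not the SSL-MOS or UTMOS baselines, whose scores are not listed here; the helper names are hypothetical.

```python
# (MOSA-Net+, second place) pairs copied from the VoiceMOS 2023 table above.
TABLE = {
    "UTT-MSE": (0.343, 0.688), "UTT-LCC": (0.803, 0.684),
    "UTT-SRCC": (0.780, 0.636), "UTT-KTAU": (0.594, 0.475),
    "SYS-MSE": (0.082, 0.404), "SYS-LCC": (0.952, 0.769),
    "SYS-SRCC": (0.956, 0.749), "SYS-KTAU": (0.828, 0.635),
}

def relative_reduction(ours, theirs):
    """Fractional drop in an error metric relative to the other system."""
    return (theirs - ours) / theirs

def absolute_gain(ours, theirs):
    """Absolute increase in a correlation metric."""
    return ours - theirs

for metric, (ours, theirs) in TABLE.items():
    if metric.endswith("MSE"):
        print(f"{metric}: {relative_reduction(ours, theirs):.1%} lower")
    else:
        print(f"{metric}: +{absolute_gain(ours, theirs):.3f}")
```

Against the second-place entry, system-level MSE is roughly 80% lower and the system-level correlation metrics are higher by about 0.18 to 0.21.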
6. Embedding Robustness and Underlying Mechanism
Whisper’s weak supervision (joint training on paired audio and transcripts) yields embeddings that contain rich phonetic and linguistic cues, which improves noise robustness over purely acoustic SSL features. The adapter layer projects Whisper’s high-dimensional embeddings into a compact, task-specific subspace, enabling MOSA-Net+ to emphasize phoneme-level distortions that align closely with human judgments of quality and intelligibility.
Empirical results indicate that Whisper embeddings drive consistent improvements in both correlation and error metrics, and generalize robustly to unseen noise and enhancement conditions without fine-tuning.
7. Comparative Effects and Interpretive Implications
The marginal benefit from concatenating Whisper embeddings with alternative SSL features suggests that Whisper’s embedding space already encapsulates the majority of latent dimensions needed for robust non-intrusive speech assessment. A plausible implication is that weakly supervised training with transcript data leverages cross-modal structure beneficial for speech quality and intelligibility prediction. Future work may focus on further exploiting transcript-conditioned models for additional downstream assessment tasks or combining with domain-adaptive adapters.
WhisperVQ establishes Whisper-integrated MOSA-Net+ as a state-of-the-art framework in robust non-intrusive speech assessment, supported by competitive results in both controlled (TMHINT-QI) and noisy/enhanced (VoiceMOS Challenge) conditions (Zezario et al., 2023).