Ichigo: Mixed-Modal Realtime Voice Assistant

Updated 30 January 2026

The paper demonstrates that integrating frozen Whisper embeddings into MOSA-Net+ significantly improves the estimation of speech quality and intelligibility under varied acoustic conditions.
It employs a dual-branch architecture combining conventional spectral features with Whisper-based embeddings processed via a lightweight adapter and Bi-LSTM for robust fusion.
Experimental evaluations on TMHINT-QI and VoiceMOS Challenge 2023 reveal marked gains in correlation metrics and reduced MSE compared to baseline models.

WhisperVQ refers to the application of embedding features extracted from the Whisper large-scale weakly supervised model within the MOSA-Net+ speech assessment framework. The integration of Whisper embeddings with conventional spectral features in MOSA-Net+ yields performance gains in estimating subjective speech quality and intelligibility under noisy and enhanced conditions, as demonstrated by extensive evaluation on the TMHINT-QI dataset and the VoiceMOS Challenge 2023 (Zezario et al., 2023).

1. Model Architecture and Embedding Integration

MOSA-Net+ receives raw waveform input $y$ and processes it via two parallel branches. The first branch extracts conventional spectral features: power-spectral (PS) features via STFT ($\mathrm{PS}\in\mathbb{R}^{T\times F}$, $F=257$) and learnable-filterbank (LFB) features via SincConv ($\mathrm{LFB}\in\mathbb{R}^{T\times F}$). These are concatenated and transformed by a 12-layer 1D CNN to produce $\mathbf{h}_{\text{conv}}\in\mathbb{R}^{T\times C}$, with the channel count $C$ increasing from 16 to 128 across layers.
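As a rough illustration of the PS feature for one frame, the sketch below computes a one-sided power spectrum in plain Python. The 512-point transform is an assumption inferred from $F=257$ (since $512/2+1=257$); the Hann window and the function name `power_spectrum` are likewise illustrative choices, not details given in the text.

```python
import cmath
import math


def power_spectrum(frame, n_fft=512):
    """Return the one-sided power spectrum of one frame (n_fft // 2 + 1 bins)."""
    # Zero-pad or truncate the frame to n_fft samples.
    padded = list(frame[:n_fft]) + [0.0] * max(0, n_fft - len(frame))
    # Hann window (assumed) to reduce spectral leakage.
    windowed = [
        x * 0.5 * (1.0 - math.cos(2.0 * math.pi * n / n_fft))
        for n, x in enumerate(padded)
    ]
    spectrum = []
    for k in range(n_fft // 2 + 1):
        # Naive DFT for clarity; a real pipeline would call an FFT routine.
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * n / n_fft)
            for n, x in enumerate(windowed)
        )
        spectrum.append(abs(coeff) ** 2)
    return spectrum


# A pure tone centred on bin 32 should produce its largest value at that bin.
tone = [math.sin(2.0 * math.pi * 32 * n / 512) for n in range(512)]
ps = power_spectrum(tone)
peak_bin = max(range(len(ps)), key=ps.__getitem__)
print(len(ps), peak_bin)
```

Stacking such vectors over $T$ frames gives the $T\times 257$ matrix that feeds the CNN branch.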

The second branch uses a frozen pre-trained Whisper encoder $E_{\text{WS}}$. The final-layer hidden states, $\mathrm{WS}=\text{Whisper}(y)\in\mathbb{R}^{T_{\text{WS}}\times D_{\text{WS}}}$, are extracted ($D_{\text{WS}}=1024$ for Medium, $1280$ for Large-v3) and projected by a lightweight adapter to $\mathrm{WS}_{\text{adapter}}\in\mathbb{R}^{T_{\text{WS}}\times D_{a}}$ ($D_{a}=128$).
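The projection step can be sketched as a per-frame linear map from $D_{\text{WS}}$ to $D_a$ dimensions. The text does not specify the adapter's internals, so the single linear layer, the random initialisation, and the helper names `make_linear` and `apply_adapter` below are assumptions made purely for illustration; the random input stands in for real Whisper hidden states.

```python
import random


def make_linear(in_dim, out_dim, seed=0):
    """Create a randomly initialised weight matrix (out_dim x in_dim) and zero bias."""
    rng = random.Random(seed)
    scale = 1.0 / in_dim ** 0.5
    weights = [
        [rng.uniform(-scale, scale) for _ in range(in_dim)]
        for _ in range(out_dim)
    ]
    bias = [0.0] * out_dim
    return weights, bias


def apply_adapter(frames, weights, bias):
    """Project every frame from in_dim to out_dim with the same linear map."""
    return [
        [sum(w * x for w, x in zip(row, frame)) + b for row, b in zip(weights, bias)]
        for frame in frames
    ]


D_WS, D_A, T_WS = 1024, 128, 3  # Whisper Medium width, adapter width, toy frame count
rng = random.Random(1)
ws = [[rng.gauss(0.0, 1.0) for _ in range(D_WS)] for _ in range(T_WS)]

weights, bias = make_linear(D_WS, D_A)
ws_adapter = apply_adapter(ws, weights, bias)
print(len(ws_adapter), len(ws_adapter[0]))
```

Because the Whisper encoder is frozen, only the adapter parameters (here `weights` and `bias`) would be updated during training.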

Feature fusion stacks the two frame sequences along the time axis, which is possible because both branches emit frames of the same width ($C = D_{a} = 128$): $\mathrm{Concat}_{\text{all}}=[\mathbf{h}_{\text{conv}} ; \mathrm{WS}_{\text{adapter}}]\in\mathbb{R}^{(T+T_{\text{WS}})\times 128}$. This is processed by a Bi-LSTM (hidden size 128) and a fully-connected layer. Two task-specific heads (for Quality and Intelligibility) use attention pooling, a linear layer, and global average pooling to produce frame-level ($\hat{q}_{nl}, \hat{i}_{nl}$) and utterance-level ($\hat{Q}_{n}, \hat{I}_{n}$) predictions.
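A minimal shape-level sketch of the fusion and pooling steps follows. It assumes the stacking is along the time axis, which requires both branches to share one feature width (here 128), and it models only the final global average pooling of a head; the Bi-LSTM and attention stages are omitted. The helper names are hypothetical.

```python
def concat_time(h_conv, ws_adapter):
    """Stack two frame sequences along the time axis; feature widths must match."""
    width = len(h_conv[0])
    if any(len(frame) != width for frame in h_conv + ws_adapter):
        raise ValueError("time-axis stacking needs equal feature widths")
    return h_conv + ws_adapter


def utterance_score(frame_scores):
    """Global average pooling: the utterance score is the mean frame score."""
    return sum(frame_scores) / len(frame_scores)


T, T_WS, C = 5, 3, 128  # toy frame counts, shared feature width
h_conv = [[0.1] * C for _ in range(T)]
ws_adapter = [[0.2] * C for _ in range(T_WS)]

fused = concat_time(h_conv, ws_adapter)
fused_shape = (len(fused), len(fused[0]))  # (T + T_WS, C)

frame_q = [3.0, 3.5, 4.0, 3.5]  # toy frame-level quality scores
Q_hat = utterance_score(frame_q)
print(fused_shape, Q_hat)
```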

2. Multi-Objective Loss Formulation

Training employs a joint multi-objective loss across Quality and Intelligibility at both frame and utterance levels: $L_{\mathrm{All}} = \gamma_{1} L_{\mathrm{Quality}} + \gamma_{2} L_{\mathrm{Intelligibility}}$ where

$L_{\mathrm{Quality}} = \frac{1}{N} \sum_{n=1}^{N} \left[ (Q_{n} - \hat{Q}_{n})^{2} + \frac{\alpha_{Q}}{F_{n}} \sum_{l=1}^{F_{n}} (Q_{n} - \hat{q}_{nl})^{2} \right]$

$L_{\mathrm{Intelligibility}} = \frac{1}{N} \sum_{n=1}^{N} \left[ (I_{n} - \hat{I}_{n})^{2} + \frac{\alpha_{I}}{F_{n}} \sum_{l=1}^{F_{n}} (I_{n} - \hat{i}_{nl})^{2} \right]$

with $N$ utterances, $F_{n}$ frames in utterance $n$, and $\gamma_1, \gamma_2, \alpha_Q, \alpha_I$ controlling task and granularity weights.
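The two loss terms share one form, so they can be written as a single helper applied once per task. The sketch below follows the formulas above term by term; the function names, the default weights of 1.0, and the toy scores are illustrative assumptions, since the text does not state the values used in training.

```python
def task_loss(targets, utt_preds, frame_preds, alpha):
    """Utterance-level squared error plus alpha times the mean frame-level squared error."""
    total = 0.0
    for target, utt_pred, frames in zip(targets, utt_preds, frame_preds):
        utt_term = (target - utt_pred) ** 2
        # Every frame is compared against the utterance-level label.
        frame_term = sum((target - f) ** 2 for f in frames) / len(frames)
        total += utt_term + alpha * frame_term
    return total / len(targets)


def joint_loss(quality, intelligibility,
               gamma1=1.0, gamma2=1.0, alpha_q=1.0, alpha_i=1.0):
    """L_All = gamma1 * L_Quality + gamma2 * L_Intelligibility."""
    return (gamma1 * task_loss(*quality, alpha_q)
            + gamma2 * task_loss(*intelligibility, alpha_i))


# One toy utterance: (labels, utterance predictions, frame predictions).
quality = ([3.0], [2.0], [[3.0, 3.0, 1.0, 1.0]])
intelligibility = ([1.0], [0.5], [[1.0, 0.5]])

l_quality = task_loss(*quality, alpha=1.0)          # 1.0 + 8.0 / 4 = 3.0
l_intell = task_loss(*intelligibility, alpha=1.0)   # 0.25 + 0.25 / 2 = 0.375
l_all = joint_loss(quality, intelligibility)
print(l_quality, l_intell, l_all)
```

Note that the frame-level term uses the utterance label $Q_n$ (or $I_n$) as the target for every frame, which is what pushes the frame scores toward the utterance score.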

3. Fusion Strategies with Other Self-Supervised Learning (SSL) Features

Whisper embeddings can be combined with alternative SSL features (e.g., Wav2Vec 2.0, MMS) by extracting each model’s embeddings, passing them through their respective adapters to $D_a$-dimensional spaces, and concatenating channel-wise: $H_{\mathrm{fused}} = [\mathrm{Adapter}_{\mathrm{WS}}(\mathrm{WS}) \ \| \ \mathrm{Adapter}_{\mathrm{SSL}}(X_{\mathrm{SSL}})]$. No weighted sum or gating mechanisms are employed. This fused representation is appended to the CNN branch and processed identically to the single-model case. Experimental results indicate that adding SSL features provides only marginal improvements over Whisper alone.
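Channel-wise concatenation can be sketched as joining the feature vectors of matching frames. The example assumes the adapted streams have already been brought to the same number of frames, which the text does not discuss; `concat_channels` is a hypothetical helper name.

```python
from itertools import chain


def concat_channels(*streams):
    """Join several frame sequences feature-wise; all must have the same length."""
    if len({len(stream) for stream in streams}) != 1:
        raise ValueError("channel-wise concatenation needs equal frame counts")
    return [list(chain.from_iterable(frames)) for frames in zip(*streams)]


T, D_A = 4, 128  # toy frame count, adapter width
ws_adapted = [[1.0] * D_A for _ in range(T)]    # stands in for Adapter_WS(WS)
ssl_adapted = [[2.0] * D_A for _ in range(T)]   # stands in for Adapter_SSL(X_SSL)

h_fused = concat_channels(ws_adapted, ssl_adapted)
print(len(h_fused), len(h_fused[0]))
```

With no gating or weighting, each frame of `h_fused` simply holds the Whisper features followed by the SSL features, so the downstream layers are left to learn how to weigh them.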

4. Performance on TMHINT-QI Dataset

Evaluation metrics include Pearson’s linear correlation (LCC), Spearman’s rank correlation (SRCC), and mean squared error (MSE) for both Quality and Intelligibility tasks.
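For reference, the three metrics can be computed with the standard library alone. This is a minimal sketch using the textbook definitions (Spearman as the Pearson correlation of average ranks); the toy scores are invented for illustration and are not taken from either dataset.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error between labels and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def lcc(x, y):
    """Pearson linear correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return lcc(average_ranks(x), average_ranks(y))


mos_true = [1.0, 2.0, 3.0, 4.0, 5.0]
mos_pred = [1.0, 1.5, 2.0, 4.0, 8.0]  # monotonic in mos_true, but not linear
print(lcc(mos_true, mos_pred), srcc(mos_true, mos_pred), mse(mos_true, mos_pred))
```

The toy data shows why the metrics are reported together: a predictor that orders utterances perfectly reaches an SRCC of 1 while still having an LCC below 1 and a nonzero MSE.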

Table: TMHINT-QI Model Comparison

Model	Quality LCC	Quality SRCC	Quality MSE	Intelligibility LCC	Intelligibility SRCC	Intelligibility MSE
HuBERT	0.777	0.724	0.411	0.740	0.698	0.023
Wav2Vec 2.0	0.804	0.758	0.360	0.796	0.712	0.018
MMS	0.811	0.766	0.362	0.809	0.732	0.018
Whisper	0.815	0.776	0.344	0.807	0.738	0.017
Whisper+MMS	0.816	0.777	0.344	0.785	0.744	0.020
Whisper+W2V	0.816	0.778	0.343	0.807	0.733	0.017

In comparison to previous models (MOSA-Net, MOS-SSL, intrusive metrics), Whisper-based MOSA-Net+ demonstrates superior correlations and reduced MSE for both quality and intelligibility, confirming the efficacy of Whisper embeddings.

5. Benchmark Performance in VoiceMOS Challenge 2023

MOSA-Net+ achieved top-ranked results among nine systems in the noisy-and-enhanced track of the VoiceMOS Challenge 2023 across both utterance- and system-level metrics.

Table: VoiceMOS 2023 MOSA-Net+ vs. 2nd Place

Metric	MOSA-Net+	2nd-Place (LE-SSL)
UTT-MSE	0.343	0.688
UTT-LCC	0.803	0.684
UTT-SRCC	0.780	0.636
UTT-KTAU	0.594	0.475
SYS-MSE	0.082	0.404
SYS-LCC	0.952	0.769
SYS-SRCC	0.956	0.749
SYS-KTAU	0.828	0.635

Relative to baseline SSL-MOS and UTMOS systems, MOSA-Net+ reduces MSE by over 90% at system level and increases correlation metrics by approximately 0.18.
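The margins against the second-place system can be recomputed directly from the table. Note that the table compares against LE-SSL rather than the SSL-MOS and UTMOS baselines, so the figures below are not the baseline-relative numbers quoted in the preceding sentence; the dictionary layout and helper name are illustrative.

```python
# (MOSA-Net+, 2nd place) pairs copied from the VoiceMOS 2023 table.
results = {
    "UTT-MSE": (0.343, 0.688),
    "UTT-LCC": (0.803, 0.684),
    "UTT-SRCC": (0.780, 0.636),
    "UTT-KTAU": (0.594, 0.475),
    "SYS-MSE": (0.082, 0.404),
    "SYS-LCC": (0.952, 0.769),
    "SYS-SRCC": (0.956, 0.749),
    "SYS-KTAU": (0.828, 0.635),
}


def relative_reduction(ours, other):
    """Fraction by which `ours` is lower than `other`."""
    return (other - ours) / other


# Error metrics: lower is better, so report the relative reduction.
mse_reduction = {
    name: relative_reduction(mosa, second)
    for name, (mosa, second) in results.items()
    if name.endswith("MSE")
}

# Correlation metrics: higher is better, so report the absolute gain.
corr_gain = {
    name: mosa - second
    for name, (mosa, second) in results.items()
    if not name.endswith("MSE")
}

for name, value in mse_reduction.items():
    print(f"{name}: {value:.1%} lower than 2nd place")
for name, value in corr_gain.items():
    print(f"{name}: +{value:.3f} over 2nd place")
```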

6. Embedding Robustness and Underlying Mechanism

Whisper’s weak supervision (joint training on paired audio and transcripts) results in embeddings containing rich phonetic and linguistic cues, enhancing noise robustness over purely acoustic SSL features. The adapter layer projects Whisper’s high-dimensional embeddings into a compact, task-specific subspace, enabling MOSA-Net+ to emphasize phoneme-level distortions that align closely with human judgments of quality and intelligibility.

Empirical results indicate that Whisper embeddings drive consistent improvements in both correlation and error metrics, and generalize robustly to unseen noise and enhancement conditions without fine-tuning.

7. Comparative Effects and Interpretive Implications

The marginal benefit from concatenating Whisper embeddings with alternative SSL features suggests that Whisper’s embedding space already encapsulates the majority of latent dimensions needed for robust non-intrusive speech assessment. A plausible implication is that weakly supervised training with transcript data leverages cross-modal structure beneficial for speech quality and intelligibility prediction. Future work may focus on further exploiting transcript-conditioned models for additional downstream assessment tasks or combining with domain-adaptive adapters.

WhisperVQ establishes Whisper-integrated MOSA-Net+ as a state-of-the-art framework in robust non-intrusive speech assessment, supported by competitive results in both controlled (TMHINT-QI) and noisy/enhanced (VoiceMOS Challenge) conditions (Zezario et al., 2023).


References

Zezario et al. (2023). A Study on Incorporating Whisper for Robust Speech Assessment.
