Speech Quality Assessment for Enhanced Speech
- Speech Quality Assessment for Enhanced Speech is the evaluation of perceptual speech quality using MOS and dimensions like noisiness, coloration, and discontinuity.
- It combines classical intrusive metrics with modern non-intrusive deep learning models to enable scalable, accurate prediction of speech quality.
- Recent advances incorporate multimodal assessment and auditory LLMs to deliver natural-language feedback and improve both training and diagnostic precision.
Speech quality assessment for enhanced speech comprises a set of methodologies, models, and evaluation protocols designed to predict or analyze the perceived quality of speech signals after processing by enhancement algorithms. This domain bridges classical intrusive metrics anchored to human Mean Opinion Score (MOS) judgments and advanced data-driven, non-intrusive, and multimodal systems capable of multidimensional and interactive assessment. The following article systematically presents the core methods, models, data, evaluation protocols, and integrative practices that define the field.
1. Fundamental Concepts and Quality Dimensions
Speech quality assessment (SQA) seeks to estimate human subjective evaluation of speech—commonly quantified as MOS on a 1–5 scale—under both channel degradations and algorithmic enhancement. Classical assessment relies on time-intensive, expert-controlled listening tests, while modern methodologies emphasize scalable, automatic, and perceptually aligned proxies. In the context of speech enhancement, "quality" subsumes not only basic clarity and intelligibility but also perceptual dimensions such as noisiness, coloration, discontinuity, and loudness, formalized in ITU-T P.804 and operationalized in modern datasets (e.g., NISQA corpus, P.804 crowdsourcing toolkit) (Mittag et al., 2021, Naderi et al., 2023, Monjur et al., 9 Dec 2025).
A multi-dimensional approach is now established as essential. Research and challenge protocols routinely assess:
- Noisiness: amount and perceptual character of background or coding noise
- Coloration: alterations in timbre, frequency response, or spectral balance
- Discontinuity: transient or non-stationary artifacts such as clicks, dropouts, or musical noise
- Loudness: deviation from expected or comfortable listening levels
- Overall Quality: holistic MOS integrating all of the above properties
The contemporary practice in SQA is to jointly predict overall and dimension-level scores, providing granular diagnostic feedback for algorithm development (Mittag et al., 2021, Monjur et al., 9 Dec 2025, Naderi et al., 2023).
2. Objective Metrics: Intrusive and Non-Intrusive Approaches
There are two broad categories of objective metrics:
- Intrusive (Full-reference): Require both the processed/enhanced signal and its clean reference.
- Non-intrusive (No-reference): Evaluate only the processed/enhanced signal, enabling deployment in practical scenarios where the clean reference is not accessible.
Widely adopted intrusive metrics include:
| Metric | Type | Output Range | System-level SRCC | Description |
|---|---|---|---|---|
| PESQ | Intrusive | [–0.5,4.5] | ≥0.90 | ITU-T P.862, perceptual model |
| POLQA | Intrusive | [1,5] | ≥0.92 | ITU-T P.863, wideband/super-wideband |
| STOI | Intrusive | [0,1] | ≥0.90 | Short-time objective intelligibility |
| SI-SDR | Intrusive | (–∞,∞) dB | 0.7–0.8 | Scale-invariant signal-to-distortion ratio |
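The intrusive metrics above are straightforward to compute when a clean reference is available. The following minimal sketch assumes the third-party PyPI packages `pesq`, `pystoi`, and `soundfile`, and implements SI-SDR directly from its definition; the file names are placeholders.

```python
# Minimal sketch: scoring one enhanced utterance against its clean reference.
# Assumes the PyPI packages `pesq`, `pystoi`, and `soundfile`; the .wav file
# names are placeholders.
import numpy as np
import soundfile as sf
from pesq import pesq    # ITU-T P.862 implementation
from pystoi import stoi  # short-time objective intelligibility

def si_sdr(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Scale-invariant signal-to-distortion ratio in dB."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to factor out gain differences.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    noise = estimate - target
    return float(10 * np.log10(np.sum(target**2) / np.sum(noise**2)))

clean, fs = sf.read("clean.wav")       # reference signal (16 kHz assumed)
enhanced, _ = sf.read("enhanced.wav")  # output of the enhancement system

print("PESQ  :", pesq(fs, clean, enhanced, "wb"))  # wideband mode
print("STOI  :", stoi(clean, enhanced, fs, extended=False))
print("SI-SDR:", si_sdr(clean, enhanced))
```

A non-intrusive metric, by contrast, takes only `enhanced` as input, which is what makes the models below deployable where no reference exists.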
Non-intrusive (DNN-based) metrics include:
| Metric | Output Range | System-level SRCC | Description |
|---|---|---|---|
| DNSMOS | [1,5] | ≈0.94 | DNN ensemble, predicts overall and dimension-level scores |
| NISQA | [1,5] | ≈0.90 | Deep CNN-attention, multi-dimension |
| SSL-MOS, UTMOS | [1,5] | ≈0.88–0.93 | Large SSL backbone, fine-tuned on MOS data |
Empirical studies and challenges confirm that modern non-intrusive metrics, when fine-tuned on in-domain enhanced speech, achieve system-level correlations with human judgments comparable to classical intrusive benchmarks (Huang, 1 Aug 2025, Mittag et al., 2021).
3. Learning-Based and Multimodal Assessment Models
Deep learning and large-volume multidimensional MOS datasets have catalyzed a new generation of SQA models with superior correlation, extensibility, and interpretability.
3.1. Architecture Paradigms
- CNN/self-attention hybrid regression: e.g., NISQA (Mittag et al., 2021), CCATMos (Liu et al., 2022); a minimal architectural sketch follows this list
- Self-supervised foundation model backbones: SSL-MOS, UTMOS, MOSA-Net, S3QA (Zezario et al., 2021, Ogg et al., 2 Jun 2025, Huang, 1 Aug 2025)
- Multitask/mixture-of-experts heads: MOSA-Net (Zezario et al., 2021), MoE approaches (Hu et al., 8 Jul 2025)
- Preference-pairwise frameworks: UPPSQA, UrgentMOS, integrating direct CMOS learning (Shi et al., 2 Jun 2025, Wang et al., 26 Jan 2026)
- Auditory LLMs and multimodal QA: SpeechQualityLLM, SALMONN, Qwen-Audio, QualiSpeech Auditory-LLM, combining audio encoders with LLMs, supporting natural-language output (Monjur et al., 9 Dec 2025, Wang et al., 2024, Wang et al., 26 Mar 2025)
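To make the first paradigm concrete, the following is an illustrative PyTorch sketch of a CNN/self-attention MOS regressor; the layer sizes, pooling, and head layout are assumptions for exposition, not the published NISQA or CCATMos configurations.

```python
# Illustrative sketch of the CNN/self-attention paradigm (all hyperparameters
# are placeholders, not a published configuration).
import torch
import torch.nn as nn

class CnnAttnMosRegressor(nn.Module):
    def __init__(self, n_mels: int = 48, d_model: int = 128, n_dims: int = 5):
        super().__init__()
        # Per-frame CNN: collapse the mel axis into a d_model-dim embedding.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, d_model, kernel_size=(n_mels, 1)), nn.ReLU(),
        )
        # Self-attention over time captures long-range artifacts
        # (dropouts, musical noise) that frame-wise pooling would miss.
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # One output per quality dimension: overall, noisiness, coloration,
        # discontinuity, loudness.
        self.head = nn.Linear(d_model, n_dims)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames)
        x = self.cnn(mel.unsqueeze(1))    # (batch, d_model, 1, frames)
        x = x.squeeze(2).transpose(1, 2)  # (batch, frames, d_model)
        x = self.encoder(x).mean(dim=1)   # average-pool over time
        return self.head(x)               # (batch, n_dims) MOS estimates

scores = CnnAttnMosRegressor()(torch.randn(2, 48, 400))  # (2, 5)
```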
3.2. Core Properties
- Single-ended and double-ended evaluation: Operation with or without references, adaptable to data and deployment context
- Multidimensional output: Joint prediction of MOS and perceptual dimensions
- Natural-language interface: Queryable assessments, explanatory rationales, user-defined listener simulation (Monjur et al., 9 Dec 2025, Wang et al., 26 Mar 2025, Wang et al., 2024)
3.3. Example Model: SpeechQualityLLM
- Front-end: Audio Spectrogram Transformer (AST) or Whisper encoder; log-Mel input, pooled then linearly projected to match LLM hidden size
- Back-end: Llama 3.1-8B-Instruct, LoRA-adapter fine-tuning, predicting templated QA pairs
- Supervision: Template-based QA generation covering overall/dimensional MOS, categorical ratings, rationales
- Evaluation: On NISQA, the full-reference AST variant (fine-tuned) achieves MAE = 0.41 and Pearson r = 0.86 for overall MOS (Monjur et al., 9 Dec 2025)
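A schematic of the encoder-to-LLM coupling described above is given below; the chunk-pooling strategy, token count, and encoder dimension are illustrative assumptions rather than the released implementation (the projection targets the 4096-dimensional hidden state of Llama 3.1-8B).

```python
# Schematic of the pooled-projection front-end described above; pooling
# choice and dimensions are illustrative assumptions, not the released code.
import torch
import torch.nn as nn

class AudioToLlmAdapter(nn.Module):
    def __init__(self, enc_dim: int = 768, llm_hidden: int = 4096,
                 n_tokens: int = 8):
        super().__init__()
        self.n_tokens = n_tokens
        # Linear projection from encoder features to the LLM embedding size;
        # typically trained jointly with the LoRA adapters.
        self.proj = nn.Linear(enc_dim, llm_hidden)

    def forward(self, enc_feats: torch.Tensor) -> torch.Tensor:
        # enc_feats: (batch, frames, enc_dim) from AST or the Whisper encoder.
        b, t, d = enc_feats.shape
        # Pool frames into a fixed number of audio "tokens" by averaging
        # contiguous chunks, then project into the LLM embedding space.
        chunks = enc_feats[:, : t - t % self.n_tokens].reshape(
            b, self.n_tokens, -1, d).mean(dim=2)
        return self.proj(chunks)  # (batch, n_tokens, llm_hidden)

# The projected tokens are prepended to the embedded text of a templated
# question, e.g. "What is the overall quality of this clip on a 1-5 scale?"
audio_tokens = AudioToLlmAdapter()(torch.randn(2, 998, 768))  # (2, 8, 4096)
```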
4. Evaluation Protocols and Benchmarking
The gold standard for calibration and benchmarking remains large-scale subjective listening tests using ITU-T P.800/P.804/P.808 protocols or their crowdsourcing extensions (Naderi et al., 2023, Huang, 1 Aug 2025). Enhanced-speech system development and research typically comprise the following workflow:
- Reference subjective MOS and sub-dimension ratings: Collected via controlled or crowdsourced protocols, e.g., crowd-based P.804 (Naderi et al., 2023)
- Model training and calibration: DNN models trained/fine-tuned on a small curated set of in-domain MOS/dimension annotations for target enhancement artifacts (Huang, 1 Aug 2025, Mittag et al., 2021)
- Quantitative metrics: MAE, RMSE, Pearson’s r, Spearman’s ρ; preference accuracy (for pairwise tasks); see the computation sketch after this list
- System and utterance-level evaluation: Emphasizing both bulk statistical correlation and detection of edge-case/artifact failures (Hu et al., 8 Jul 2025)
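These quantitative metrics can be computed with SciPy. The sketch below assumes arrays of utterance-level predictions, subjective MOS labels, and system identifiers as inputs, and shows the per-system averaging behind the system-level SRCC figures quoted earlier.

```python
# Sketch of the standard quantitative evaluation: utterance-level and
# system-level agreement between predicted and subjective MOS.
# The `system_ids` array mapping utterances to systems is an assumed input.
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate(pred: np.ndarray, mos: np.ndarray, system_ids: np.ndarray):
    mae = np.mean(np.abs(pred - mos))
    rmse = np.sqrt(np.mean((pred - mos) ** 2))
    # Utterance-level correlations.
    utt_r, _ = pearsonr(pred, mos)
    utt_rho, _ = spearmanr(pred, mos)
    # System-level: average per system before correlating, which removes
    # per-utterance rater noise and is what challenge rankings report.
    systems = np.unique(system_ids)
    sys_pred = np.array([pred[system_ids == s].mean() for s in systems])
    sys_mos = np.array([mos[system_ids == s].mean() for s in systems])
    sys_rho, _ = spearmanr(sys_pred, sys_mos)
    return {"MAE": mae, "RMSE": rmse, "utt_r": utt_r,
            "utt_rho": utt_rho, "sys_SRCC": sys_rho}
```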
Recent challenges—including VoiceMOS, ConferencingSpeech, and DNS Challenge—provide standardized data partitions and evaluation protocols, underpinning fair and rigorous assessment of algorithmic progress (Huang, 1 Aug 2025).
5. Integration with Speech Enhancement Pipelines
Modern enhancement evaluation integrates SQA models for both offline and real-time tasks:
- Model selection and bulk QA: Use non-intrusive SQA (e.g., NISQA, SSL-MOS, DNSMOS, UrgentMOS) to score enhanced outputs, replacing or augmenting human listening (Mittag et al., 2021, Wang et al., 26 Jan 2026)
- Diagnostic feedback: Auditory LLMs (SpeechQualityLLM, QualiSpeech, SALMONN) deliver natural-language explanations detailing residual artifacts and their locations, assisting rapid debugging (Monjur et al., 9 Dec 2025, Wang et al., 26 Mar 2025, Wang et al., 2024)
- Training signal integration: SQA models used as differentiable/perceptually calibrated losses to directly guide enhancement optimization, e.g., Quality-Net, MOS-based joint losses, or multi-metric SQA supervision as in UrgentMOS, MOSA-Net (Fu et al., 2019, Nayem et al., 2023, Wang et al., 13 Jun 2025, Zezario et al., 2021); see the sketch at the end of this section
- Self-supervised reference-free approaches: VQScore uses only clean speech for VQ-VAE codebook construction, quantifying enhancement quality by code-space similarity (Fu et al., 2024)
Such integration typically achieves significant gains in subjective and objective metrics relative to MSE/SI-SDR-only training (Fu et al., 2019, Wang et al., 26 Jan 2026, Nayem et al., 2023).
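As an illustration of training-signal integration, the sketch below adds a perceptual term from a frozen, differentiable SQA predictor to a standard signal-level loss; the predictor interface and the weight `lam` are assumptions in the spirit of Quality-Net-style supervision, not a specific published recipe.

```python
# Sketch of quality-aware training: a frozen SQA predictor contributes a
# perceptual term alongside the usual signal-level loss. The `sqa_model`
# interface and weight `lam` are illustrative assumptions.
import torch

def quality_aware_loss(enhanced: torch.Tensor, clean: torch.Tensor,
                       sqa_model: torch.nn.Module, lam: float = 0.1):
    # Signal-level term (SI-SDR is a common alternative to plain MSE).
    mse = torch.mean((enhanced - clean) ** 2)
    # Perceptual term: reward high predicted MOS (the scale tops out at 5).
    # The SQA model's parameters are frozen, so gradients reach the
    # enhancement network only through `enhanced`.
    pred_mos = sqa_model(enhanced)
    return mse + lam * (5.0 - pred_mos).mean()

# Typical setup: freeze the quality predictor once, before training.
# for p in sqa_model.parameters():
#     p.requires_grad_(False)
```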
6. Recent Advances and Open Challenges
Several meta-trends are shaping the current SQA research agenda:
- Preference and ranking-based evaluation: Pairwise and CCR protocols (UPPSQA, UrgentMOS) enable fine-grained benchmarking in scenarios with subtle system differences (Shi et al., 2 Jun 2025, Wang et al., 26 Jan 2026)
- Cross-domain and partially labeled data utilization: Multi-metric frameworks handle heterogeneous datasets with missing metrics via masked multi-task loss (Wang et al., 26 Jan 2026); see the sketch after this list
- Self-supervised and foundation-model-based methods: Embedding distances (e.g., S3QA) or VQ similarity (VQScore) leverage vast unlabeled data, offering robust reference-free metrics that correlate with MOS, SNR, and WER (Ogg et al., 2 Jun 2025, Fu et al., 2024)
- Explainability and natural-language feedback: Auditory LLMs and prompt-based QA enable accessible and actionable interpretability at scale (Monjur et al., 9 Dec 2025, Wang et al., 26 Mar 2025, Wang et al., 2024)
- Model limitations: Latency and resource cost for large models, need for in-domain calibration, difficulties in rater adaptation, and challenges in achieving utterance-level MOS accuracy remain active areas of investigation (Hu et al., 8 Jul 2025, Wang et al., 26 Jan 2026, Monjur et al., 9 Dec 2025)
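A masked multi-task loss of the kind used for partially labeled corpora can be sketched in a few lines; marking absent targets with NaN is an implementation assumption for illustration.

```python
# Sketch of a masked multi-task loss for corpora where some target metrics
# (e.g., PESQ on one corpus, MOS on another) are missing; NaN marks absence.
import torch

def masked_multimetric_loss(pred: torch.Tensor,
                            target: torch.Tensor) -> torch.Tensor:
    # pred, target: (batch, n_metrics); missing targets are NaN.
    mask = ~torch.isnan(target)
    diff = (pred - torch.nan_to_num(target)) ** 2
    # Average only over the labels that actually exist in this batch.
    return (diff * mask).sum() / mask.sum().clamp(min=1)
```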
The consensus is that combining intrusive, non-intrusive, and multimodal evaluation—anchored to rigorous subjective protocols—delivers the most robust practical and scientific insight.
7. Toolkits, Open Data, and Best Practices
Widespread adoption is facilitated by open-source toolkits and datasets:
| Toolkit | Source/URL | Capabilities |
|---|---|---|
| MOSNet | github.com/lochenchou/MOSNet | BLSTM-based, trainable on MOS |
| DNSMOS | github.com/microsoft/DNS-Challenge | Pre-trained, multi-dim non-intrusive |
| NISQA | github.com/gabrielmittag/NISQA | Multidimensional, trainable, MOS |
| SSL-MOS | github.com/nii-yamagishilab/mos-finetune-ssl | SSL backbone, fine-tunable |
| UTMOS | github.com/sarulab-speech/UTMOS22 | SOTA, robust cross-domain |
| VERSA | github.com/shinjiwlab/versa | 65+ metrics, easy benchmarking |
| QualiSpeech | huggingface.co/datasets/tsinghua-ee/... | 11-aspect annotated, NL QA benchmark |
Best practices: Always curate a small in-domain listening test subset for calibration, fine-tune DNN-based metrics on this set for both absolute and pairwise tasks, and use hybrid pipelines (scalar plus natural-language output) for maximal diagnostic and interpretive power. Establish robust screening, randomization, and gold/trap questions for any new crowdsourced evaluation (Naderi et al., 2023, Wang et al., 26 Mar 2025, Monjur et al., 9 Dec 2025, Huang, 1 Aug 2025).
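As a lightweight complement to full fine-tuning, raw metric outputs can be calibrated against the curated in-domain subset with a low-order polynomial mapping, in the spirit of ITU-T P.1401; the unconstrained fit below is a simplification (P.1401 additionally enforces monotonicity), and all values are synthetic placeholders.

```python
# Sketch of lightweight score calibration on the curated in-domain subset:
# a third-order polynomial mapping from raw metric output to subjective MOS.
# All values below are synthetic placeholders.
import numpy as np

raw = np.array([1.2, 1.9, 2.4, 3.0, 3.6, 4.1, 4.4])  # raw metric scores
mos = np.array([1.5, 2.1, 2.4, 3.2, 3.5, 4.2, 4.6])  # matching subjective MOS

coeffs = np.polyfit(raw, mos, deg=3)  # least-squares polynomial fit
calibrate = np.poly1d(coeffs)
print(calibrate(3.3))                 # calibrated MOS for a new raw score
```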
References:
- SpeechQualityLLM: LLM-Based Multimodal Assessment of Speech Quality (Monjur et al., 9 Dec 2025)
- Universal Preference-Score-based Pairwise Speech Quality Assessment (Shi et al., 2 Jun 2025)
- UrgentMOS: Unified Multi-Metric and Preference Learning for Robust Speech Quality Assessment (Wang et al., 26 Jan 2026)
- Speech Quality Assessment Model Based on Mixture of Experts (Hu et al., 8 Jul 2025)
- Self-Supervised Speech Quality Assessment (S3QA) (Ogg et al., 2 Jun 2025)
- Attention-based Speech Enhancement Using Human Quality Perception Modelling (Nayem et al., 2023)
- Learning with Learned Loss Function: Speech Enhancement with Quality-Net (Fu et al., 2019)
- QualiSpeech: A Speech Quality Assessment Dataset with Natural Language Reasoning and Descriptions (Wang et al., 26 Mar 2025)
- JSQA: Speech Quality Assessment with Perceptually-Inspired Contrastive Pretraining (Fan et al., 15 Jul 2025)
- Residual-Guided Non-Intrusive Speech Quality Assessment (Ye et al., 2022)
- Quality-Net: An End-to-End Non-intrusive Speech Quality Assessment Model based on BLSTM (Fu et al., 2018)
- CCATMos: Convolutional Context-aware Transformer Network (Liu et al., 2022)
- InQSS: a speech intelligibility and quality assessment model using a multi-task learning network (Chen et al., 2021)
- Deep Learning-based Non-Intrusive Multi-Objective Speech Assessment Model (Zezario et al., 2021)
- Advancing Speech Quality Assessment Through Scientific Challenges and Open-source Activities (Huang, 1 Aug 2025)
- Enabling Auditory LLMs for Automatic Speech Quality Evaluation (Wang et al., 2024)
- NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction (Mittag et al., 2021)
- Improving Speech Enhancement with Multi-Metric Supervision from Learned Quality Assessment (Wang et al., 13 Jun 2025)
- Self-Supervised Speech Quality Estimation and Enhancement Using Only Clean Speech (Fu et al., 2024)
- Multi-dimensional Speech Quality Assessment in Crowdsourcing (Naderi et al., 2023)