
MedSpeak: Medical ASR Correction

Updated 8 February 2026
  • MedSpeak is a knowledge graph–aided framework that fuses semantic and phonetic relationships to systematically correct automatic speech recognition errors in the medical domain.
  • It employs a five-stage modular pipeline—combining noisy ASR input, KG retrieval, phonetic encoding, and LLM reasoning—to refine transcripts and improve clinical QA outcomes.
  • The framework achieves state-of-the-art performance by significantly reducing word error rates and delivering near-oracle accuracy in specialized medical spoken question-answering scenarios.

MedSpeak is a knowledge graph–aided error correction and question-answering (QA) framework designed to address the persistent failure modes of automatic speech recognition (ASR) in the medical domain. By leveraging a structured medical knowledge graph (KG) with explicit semantic and phonetic relationships, MedSpeak refines noisy ASR transcripts and, by incorporating LLM-driven reasoning, improves downstream medical spoken QA. The approach systematically resolves domain-specific misrecognitions, particularly for specialized terminology and phonetic confusions, and delivers state-of-the-art accuracy for both ASR correction and clinical multiple-choice QA scenarios (Song et al., 1 Feb 2026).

1. Background and Motivation

Spoken medical QA systems, including LLM-based clinical dialogue agents, rely heavily on ASR outputs as their textual substrate. However, general-purpose ASR models exhibit substantially higher error rates on medical terms compared to non-domain speech, leading to:

  • Substitution errors between clinically distinct entities (e.g., "hypoplasia" misrecognized as "hyperplasia").
  • Phonetic confusion between near-homophones with different semantics ("hypertension" vs. "hypotension").
  • Propagation of recognition errors into downstream QA, resulting in wrong diagnoses or decision support.

Prior attempts at fine-tuning ASR on limited medical audio have yielded only modest improvements, remaining data-hungry and incapable of systematically resolving phonetic near-miss errors. Retrieval-augmented pipelines and KG-based techniques typically neglect phonetic ambiguities, instead relying on textual snippets or semantic edges alone, which is insufficient for high-stakes domains with dense, confusable terminology. MedSpeak's innovation is the integration of both semantic and phonetic relationships within a unified KG–LLM correction pipeline, targeting domain-specific ASR error modes (Song et al., 1 Feb 2026).

2. System Architecture

MedSpeak is realized as a modular five-stage pipeline:

  1. Noisy ASR Input: Initial transcript $\hat t$ produced by a state-of-the-art ASR model (e.g., Whisper) from raw medical audio.
  2. Knowledge Graph Integration: Identification of candidate medical terms in $\hat t$; retrieval of semantic ($\mathcal{K}_{\mathrm{sem}}$) and phonetic ($\mathcal{K}_{\mathrm{phon}}$) subgraphs for correction.
  3. Phonetic Encoder: Embedding of both original and candidate tokens via a phonetic encoder, parameterized to capture spelling–pronunciation correspondences (e.g., Double Metaphone plus the CMUdict lexicon).
  4. LLM-Based Reasoning: Concatenation of the system instruction and user block (noisy transcript, answer options, retrieved KG context) into a prompt for a fine-tuned LLaMA-derived LLM.
  5. Final Prediction: Joint output of the corrected transcript $\tilde t$ and QA answer $o^*$ by the LLM.

Pipeline summary: Audio → ASR → Noisy Transcript → KG Retrieval → Encoder + Scoring → Candidate Reranking → Corrected Transcript → LLM (with options + KG) → Answer & Final Transcript (Song et al., 1 Feb 2026).
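The staged flow above can be sketched as a chain of pluggable components. The function names and interfaces below are illustrative assumptions, not the authors' implementation; trivial stand-ins replace the real ASR, KG, and LLM modules.

```python
from dataclasses import dataclass

@dataclass
class MedSpeakOutput:
    corrected_transcript: str
    answer: str  # one of "A"/"B"/"C"/"D"

def medspeak_pipeline(audio, options, asr, retrieve_kg, rerank, llm) -> MedSpeakOutput:
    noisy = asr(audio)                      # 1. noisy ASR transcript t_hat
    kg_context = retrieve_kg(noisy)         # 2. semantic + phonetic subgraph retrieval
    candidates = rerank(noisy, kg_context)  # 3. phonetic encoding + candidate scoring
    corrected, answer = llm(noisy, options, kg_context, candidates)  # 4-5. LLM reasoning
    return MedSpeakOutput(corrected, answer)

# Demo with trivial stand-in components (hypothetical behavior):
out = medspeak_pipeline(
    audio="<waveform>",
    options={"A": "hypoplasia", "B": "hyperplasia"},
    asr=lambda a: "cerebral hyperplasia",              # simulated phonetic ASR error
    retrieve_kg=lambda t: ["hypoplasia [phonetic]"],
    rerank=lambda t, kg: ["hypoplasia"],
    llm=lambda t, o, kg, c: ("cerebral hypoplasia", "A"),
)
```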

3. Medical Knowledge Graph Representation

The MedSpeak KG is formally defined as $G = (V, E)$:

  • $V$: medical concepts (e.g., "hypoplasia," "cerebral atrophy").
  • $E$: edges typed by relation ("classifies", "due_to", "phonetic").

Each node $v \in V$ is assigned an embedding $h_v \in \mathbb{R}^d$ (pretrained via translational KG objectives such as TransE, or via a GAT over UMLS). Edges are likewise embedded, either as learned relation-specific matrices $W_r \in \mathbb{R}^{d \times d}$ or as vector representations. For semantic proximity of terms $w_i, w_j$ within the KG, MedSpeak defines $s_{\mathrm{sem}}(w_i, w_j) = \sigma\bigl(h_{w_i}^{\top} W_r h_{w_j}\bigr)$, where $\sigma(\cdot)$ is the sigmoid and $W_r$ is relation-specific (Song et al., 1 Feb 2026).
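As a concrete sketch of the bilinear scorer, the snippet below uses random vectors in place of pretrained TransE/GAT embeddings; the dimensions and values are illustrative, not from the paper.

```python
import numpy as np

def s_sem(h_i: np.ndarray, h_j: np.ndarray, W_r: np.ndarray) -> float:
    """Semantic proximity: sigmoid(h_i^T W_r h_j) with a relation-specific W_r."""
    z = float(h_i @ W_r @ h_j)
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative embeddings for two concepts and one "due_to" relation matrix:
rng = np.random.default_rng(0)
d = 8
h_hypoplasia = rng.normal(size=d)
h_atrophy = rng.normal(size=d)
W_due_to = rng.normal(size=(d, d)) / np.sqrt(d)

score = s_sem(h_hypoplasia, h_atrophy, W_due_to)  # a probability-like value in (0, 1)
```

The sigmoid squashes the bilinear form into $(0, 1)$, so scores from different relation types are directly comparable when merged downstream.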

4. Phonetic Error Correction via KG

A distinguishing component of MedSpeak is the explicit modeling of phonetic confusability. Each word $w$ and candidate term $t$ are mapped to phonetic embeddings by $f_{\mathrm{phon}}$ (e.g., Double Metaphone features encoded as vectors), with distance $d_{\mathrm{phon}}(w, t) = \| f_{\mathrm{phon}}(w) - f_{\mathrm{phon}}(t) \|_2$. Candidates are ranked for possible correction by a joint score $s(w, t) = \lambda\, s_{\mathrm{sem}}(w, t) + (1 - \lambda)\bigl(1 - d_{\mathrm{phon}}(w, t)\bigr)$, with $\lambda \in [0, 1]$ controlling the semantic vs. phonetic tradeoff. This formulation ensures prioritization of candidates that are both semantically proximate and phonetically plausible, outperforming prior approaches that neglect either axis (Song et al., 1 Feb 2026).
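A minimal sketch of the joint ranking, assuming $d_{\mathrm{phon}}$ has been normalized into $[0, 1]$ so that $1 - d_{\mathrm{phon}}$ is a similarity. The numeric scores below are invented for illustration, not taken from the paper.

```python
def joint_score(s_sem: float, d_phon: float, lam: float = 0.5) -> float:
    """s(w, t) = lam * s_sem + (1 - lam) * (1 - d_phon), with d_phon in [0, 1]."""
    return lam * s_sem + (1.0 - lam) * (1.0 - d_phon)

def rank_candidates(candidates, sem_scores, phon_dists, lam=0.5):
    """Sort candidate corrections by descending joint score."""
    scored = [(t, joint_score(sem_scores[t], phon_dists[t], lam)) for t in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Correcting the misrecognized word "hyperplasia" (illustrative scores):
cands = ["hypoplasia", "hypertension"]
sem = {"hypoplasia": 0.9, "hypertension": 0.4}    # semantic proximity to context
phon = {"hypoplasia": 0.1, "hypertension": 0.7}   # normalized phonetic distance
ranking = rank_candidates(cands, sem, phon)
# "hypoplasia" wins: it is both semantically close and phonetically plausible
```

Setting $\lambda$ near 1 recovers a purely semantic reranker; near 0, a purely phonetic one.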

5. LLM Integration and Reasoning for Spoken QA

The LLM module is explicitly prompted to generate corrected transcripts alongside multiple-choice answers, using the following protocol:

  • System instruction: "You must output exactly two lines: Corrected Text: … Correct Option: A/B/C/D."
  • User context: comprises the noisy ASR output, the choice set $\{A, B, C, D\}$, and the truncated KG context (containing candidate nodes and edges derived as above).

The LLM (LLaMA-derived, with in-domain fine-tuning) receives this structured prompt, enabling joint editing of the ASR transcript and downstream QA by leveraging both in-context semantic/phonetic cues and the candidate answer set (Song et al., 1 Feb 2026).
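A sketch of how such a prompt might be assembled. Beyond the two-line output instruction quoted above, the field names and layout are assumptions, not the paper's exact template.

```python
def build_medspeak_prompt(noisy_transcript: str, options: dict, kg_context: str) -> str:
    """Concatenate the system instruction with the user block (hypothetical layout)."""
    system = (
        "You must output exactly two lines: "
        "Corrected Text: ... Correct Option: A/B/C/D."
    )
    opts = "\n".join(f"{key}. {text}" for key, text in sorted(options.items()))
    user = (
        f"ASR transcript: {noisy_transcript}\n"
        f"Options:\n{opts}\n"
        f"KG context: {kg_context}"
    )
    return system + "\n\n" + user

prompt = build_medspeak_prompt(
    "patient shows cerebral hyperplasia",  # simulated phonetic ASR error
    {"A": "hypoplasia", "B": "hyperplasia", "C": "hypertension", "D": "hypotension"},
    "hypoplasia --phonetic--> hyperplasia; hypoplasia --due_to--> genetic factors",
)
```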

Training is end-to-end, with a causal language modeling objective $\mathcal{L}(\theta) = -\sum_{i=1}^{|y|} \log P_\theta(y_i \mid x, y_{<i})$, where $y$ is the tokenized two-line gold output and $x$ is the concatenated instruction and user input, thus supervising both transcript correction and QA reasoning (Song et al., 1 Feb 2026).
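The objective reduces to summing negative log-probabilities of the gold output tokens under the model. A toy numeric sketch, with per-token probabilities invented for illustration:

```python
import math

def causal_lm_loss(token_probs):
    """L(theta) = -sum_i log P(y_i | x, y_<i), given the model's probability
    assigned to each gold token of the two-line target (illustrative values)."""
    return -sum(math.log(p) for p in token_probs)

# Hypothetical per-token probabilities for the gold "Corrected Text / Correct Option" output:
token_probs = [0.9, 0.8, 0.95, 0.7]
loss = causal_lm_loss(token_probs)  # lower as the model grows more confident in the gold tokens
```

Because both output lines sit in the same target sequence $y$, a single loss jointly supervises transcript editing and answer selection.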

6. Evaluation and Benchmarks

Experiments are conducted on 47 hours of TTS-synthesized speech from MMLU Medical, MedQA (USMLE-based), and MedMCQA. Key metrics:

Model                QA Accuracy (%)   WER (%)
ZS-ASR               50.2              77.2
FT+Whisp             83.7              35.8
MedSpeak             93.4              29.9
FT-LLM (Oracle GT)   92.5              n/a

MedSpeak achieves nearly oracle-level QA accuracy (93.4% vs 92.5% for ground-truth LLM), despite being fed noisy ASR. The phonetic+semantic KG correction yields 5–7 percentage point WER reduction in challenging domains. The framework surpasses prior best systems that use only fine-tuned LLMs or retrieval-augmented prompts without explicit KG or phonetic integration (Song et al., 1 Feb 2026).

7. Limitations and Future Directions

Analysis highlights residual errors on ultra-rare medical terms absent from the KG and long-tail phonetic errors. Notable limitations:

  • Static KG: fails to capture emerging or novel clinical terminology (e.g., new drugs).
  • KG truncation: constrained by input window, limiting the breadth of context the LLM can exploit.

Promising extensions include:

  • Continual KG updates from real-time literature scraping.
  • Joint training of ASR and KG representation layers.
  • Dynamic (learnable) KG subgraph selection within a strict token budget.
  • Multimodal augmentation (histopathology, imaging, labs) for further error correction and reasoning (Song et al., 1 Feb 2026).

8. Significance and Impact

MedSpeak constitutes the first framework that jointly exploits a structured medical KG (encoding both semantic and phonetic associations) and a fine-tuned LLM for spoken QA error correction. This system sets a new state-of-the-art in both word error rate and medical QA accuracy for the domain, with a workflow and modular design adaptable to other high-risk speech-driven verticals requiring extreme robustness to domain-specific ASR failures (Song et al., 1 Feb 2026).
