MedSpeak: Medical ASR Correction
- MedSpeak is a knowledge graph–aided framework that fuses semantic and phonetic relationships to systematically correct automatic speech recognition errors in the medical domain.
- It employs a five-stage modular pipeline—combining noisy ASR input, KG retrieval, phonetic encoding, and LLM reasoning—to refine transcripts and improve clinical QA outcomes.
- The framework achieves state-of-the-art performance by significantly reducing word error rates and delivering near-oracle accuracy in specialized medical spoken question-answering scenarios.
MedSpeak is a knowledge graph–aided error correction and question-answering (QA) framework designed to address the persistent failure modes of automatic speech recognition (ASR) in the medical domain. By leveraging a structured medical knowledge graph (KG) with explicit semantic and phonetic relationships, MedSpeak refines noisy ASR transcripts and improves downstream medical spoken QA by incorporating LLM-driven reasoning. The approach systematically resolves domain-specific misrecognitions, particularly for specialized terminology and phonetic confusions, and delivers state-of-the-art accuracy for both ASR correction and clinical multiple-choice QA scenarios (Song et al., 1 Feb 2026).
1. Background and Motivation
Spoken medical QA systems, including LLM-based clinical dialogue agents, rely heavily on ASR outputs as their textual substrate. However, general-purpose ASR models exhibit substantially higher error rates on medical terms compared to non-domain speech, leading to:
- Substitution errors between clinically distinct entities (e.g., "hypoplasia" misrecognized as "hyperplasia").
- Phonetic confusion between near-homophones with different semantics ("hypertension" vs. "hypotension").
- Propagation of recognition errors into downstream QA, resulting in wrong diagnoses or decision support.
Prior attempts at fine-tuning ASR on limited medical audio have yielded only modest improvements, remaining data-hungry and incapable of systematically resolving phonetic near-miss errors. Retrieval-augmented pipelines and KG-based techniques typically neglect phonetic ambiguities, relying on textual snippets or semantic edges alone, which is insufficient for high-stakes domains with dense, confusable terminology. MedSpeak's innovation is the integration of both semantic and phonetic relationships within a unified KG–LLM correction pipeline, targeting domain-specific ASR error modes (Song et al., 1 Feb 2026).
2. System Architecture
MedSpeak is realized as a modular five-stage pipeline:
- Noisy ASR Input: Initial transcript $\hat{T}$ produced by a state-of-the-art ASR model (e.g., Whisper) from raw medical audio.
- Knowledge Graph Integration: Identification of candidate medical terms in $\hat{T}$; retrieval of semantic ($\mathcal{G}_s$) and phonetic ($\mathcal{G}_p$) subgraphs for correction.
- Phonetic Encoder: Embedding of both original and candidate tokens via a phonetic encoder, parameterized to capture spelling–pronunciation correspondences (e.g., Double Metaphone+CMUdict lexicon).
- LLM-Based Reasoning: Concatenation of system instructions, user block (noisy transcript, options, retrieved KG context) as prompt; input to a fine-tuned LLaMA-derived LLM.
- Final Prediction: Joint output of the corrected transcript ($T^*$) and QA answer ($a^*$) by the LLM.
Pipeline summary: Audio → ASR → Noisy Transcript → KG Retrieval → Encoder + Scoring → Candidate Reranking → Corrected Transcript → LLM (with options + KG) → Answer & Final Transcript (Song et al., 1 Feb 2026).
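The staged flow above can be sketched as plain function composition. All component names here are illustrative stand-ins, not the authors' API; each stage is injected as a callable so the sketch stays self-contained:

```python
def medspeak_pipeline(audio, asr, retrieve, rank, build_prompt, llm, options):
    """Minimal sketch of the five-stage MedSpeak flow (stage names assumed)."""
    noisy = asr(audio)                             # 1. noisy ASR transcript
    candidates = retrieve(noisy)                   # 2. semantic + phonetic KG retrieval
    ranked = rank(noisy, candidates)               # 3. phonetic/semantic scoring + reranking
    prompt = build_prompt(noisy, options, ranked)  # 4. structured LLM prompt
    return llm(prompt)                             # 5. (corrected transcript, QA answer)
```

With stub callables, the composition returns the joint (transcript, answer) pair, mirroring the final-prediction stage.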
3. Medical Knowledge Graph Representation
The MedSpeak KG is formally defined as $\mathcal{G} = (\mathcal{V}, \mathcal{E})$:
- $\mathcal{V}$: medical concepts (e.g., "hypoplasia," "cerebral atrophy").
- $\mathcal{E}$: edges typed by relation ("classifies", "due_to", "phonetic").
Each node $v \in \mathcal{V}$ is assigned an embedding $\mathbf{e}_v$ (pretrained via KG objectives such as TransE or GAT over UMLS). Edges are likewise embedded, either as learned matrices $\mathbf{W}_r$ per relation $r$ or via vector representations. For semantic proximity of terms $u, v$ within the KG, MedSpeak defines

$$s_{\mathrm{sem}}(u, v) = \sigma\!\left(\mathbf{e}_u^{\top} \mathbf{W}_r\, \mathbf{e}_v\right),$$

where $\sigma$ is the sigmoid and $\mathbf{W}_r$ is relation-specific (Song et al., 1 Feb 2026).
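A minimal sketch of this proximity score: a sigmoid over a relation-specific bilinear form, as described above. Function and symbol names are assumptions, and plain lists stand in for trained embeddings:

```python
import math

def semantic_proximity(e_u, e_v, W_r):
    """sigma(e_u^T W_r e_v): higher means concepts u, v are closer under relation r."""
    bilinear = sum(e_u[i] * W_r[i][j] * e_v[j]
                   for i in range(len(e_u))
                   for j in range(len(e_v)))
    return 1.0 / (1.0 + math.exp(-bilinear))
```

A zero bilinear form yields the neutral score 0.5; trained embeddings push confusable-but-related pairs toward 1.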
4. Phonetic Error Correction via KG
A distinguishing component of MedSpeak is the explicit modeling of phonetic confusability. Each word $w$ and candidate term $c$ are mapped to phonetic embeddings by an encoder $\phi(\cdot)$ (e.g., Double Metaphone features, encoded as vectors):

$$s_{\mathrm{phon}}(w, c) = \cos\!\left(\phi(w), \phi(c)\right).$$

Candidates are ranked for possible correction by a joint score

$$s(w, c) = \lambda\, s_{\mathrm{sem}}(w, c) + (1 - \lambda)\, s_{\mathrm{phon}}(w, c),$$

with $\lambda \in [0, 1]$ controlling the semantic vs. phonetic tradeoff. This formulation prioritizes candidates that are both semantically proximate and phonetically plausible, outperforming prior approaches that neglect either axis (Song et al., 1 Feb 2026).
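A toy version of the joint semantic–phonetic ranking score. Since Double Metaphone is not in the Python standard library, a character-bigram cosine stands in for the phonetic similarity here; that substitution, and all names below, are illustrative assumptions:

```python
import math
from collections import Counter

def char_bigrams(word):
    # Boundary-marked character bigrams as a cheap stand-in phonetic signature.
    w = f"^{word.lower()}$"
    return Counter(w[i:i + 2] for i in range(len(w) - 1))

def cosine(a, b):
    num = sum(a[k] * b[k] for k in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def joint_score(word, cand, sem_score, lam=0.5):
    # lam weights semantic proximity against phonetic similarity.
    return lam * sem_score + (1 - lam) * cosine(char_bigrams(word), char_bigrams(cand))
```

Even this crude signature ranks "hypotension" far above an unrelated term as a correction candidate for "hypertension", illustrating why the phonetic axis matters.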
5. LLM Integration and Reasoning for Spoken QA
The LLM module is explicitly prompted to generate corrected transcripts alongside multiple-choice answers, using the following protocol:
- System instruction: "You must output exactly two lines: Corrected Text: … Correct Option: A/B/C/D."
- User context: comprises the noisy ASR output $\hat{T}$, choice set $\mathcal{O} = \{o_A, o_B, o_C, o_D\}$, and the truncated KG context (containing candidate nodes and edges derived as above).
The LLM (LLaMA-derived, with in-domain fine-tuning) receives this structured prompt, enabling joint editing of the ASR transcript and downstream QA by leveraging both in-context semantic/phonetic cues and the candidate answer set (Song et al., 1 Feb 2026).
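A sketch of the prompt assembly, using the two-line output contract quoted above; the exact field layout of the user block is an assumption:

```python
def build_prompt(noisy_transcript, options, kg_context):
    """Assemble the system + user blocks fed to the fine-tuned LLM (layout assumed)."""
    system = ("You must output exactly two lines:\n"
              "Corrected Text: ...\n"
              "Correct Option: A/B/C/D.")
    opts = "\n".join(f"{label}. {text}" for label, text in zip("ABCD", options))
    user = (f"ASR transcript: {noisy_transcript}\n"
            f"Options:\n{opts}\n"
            f"KG context: {kg_context}")
    return f"{system}\n\n{user}"
```

The rigid two-line contract makes the model's output trivially parseable into the corrected transcript and the chosen option.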
Training is end-to-end, with a causal language modeling objective

$$\mathcal{L} = -\sum_{t=1}^{|y|} \log p_\theta\!\left(y_t \mid y_{<t}, x\right),$$

where $y$ is the tokenized, two-line gold output and $x$ is the concatenated instruction and user input, thus supervising both transcript correction and QA reasoning (Song et al., 1 Feb 2026).
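One standard way to supervise only the gold output under a causal LM objective is to mask the prompt tokens out of the loss. The sketch below uses the common -100 ignore-index convention from cross-entropy implementations; this is an assumed implementation detail, not taken from the paper:

```python
IGNORE_INDEX = -100  # conventional "ignore" label id in cross-entropy losses

def make_labels(prompt_ids, target_ids):
    """Loss covers only the gold two-line output y; the prompt x is masked out."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(target_ids)
```

The resulting label sequence aligns token-for-token with the concatenated input, so the gradient flows only through the corrected transcript and the answer line.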
6. Evaluation and Benchmarks
Experiments are conducted on 47 hours of TTS-synthesized speech from MMLU Medical, MedQA (USMLE-based), and MedMCQA. Key metrics:
| Model | QA Accuracy (%) | WER (%) |
|---|---|---|
| ZS-ASR | 50.2 | 77.2 |
| FT+Whisp | 83.7 | 35.8 |
| MedSpeak | 93.4 | 29.9 |
| FT-LLM (Oracle GT) | 92.5 | — |
MedSpeak achieves nearly oracle-level QA accuracy (93.4% vs 92.5% for ground-truth LLM), despite being fed noisy ASR. The phonetic+semantic KG correction yields 5–7 percentage point WER reduction in challenging domains. The framework surpasses prior best systems that use only fine-tuned LLMs or retrieval-augmented prompts without explicit KG or phonetic integration (Song et al., 1 Feb 2026).
7. Limitations and Future Directions
Analysis highlights residual errors on ultra-rare medical terms absent from the KG and long-tail phonetic errors. Notable limitations:
- Static KG: fails to capture emerging or novel clinical terminology (e.g., new drugs).
- KG truncation: constrained by input window, limiting the breadth of context the LLM can exploit.
Promising extensions include:
- Continual KG updates from real-time literature scraping.
- Joint training of ASR and KG representation layers.
- Dynamic (learnable) KG subgraph selection within a strict token budget.
- Multimodal augmentation (histopathology, imaging, labs) for further error correction and reasoning (Song et al., 1 Feb 2026).
8. Significance and Impact
MedSpeak constitutes the first framework that jointly exploits a structured medical KG (encoding both semantic and phonetic associations) and a fine-tuned LLM for spoken QA error correction. This system sets a new state-of-the-art in both word error rate and medical QA accuracy for the domain, with a workflow and modular design adaptable to other high-risk speech-driven verticals requiring extreme robustness to domain-specific ASR failures (Song et al., 1 Feb 2026).