TICL: Text-Embedding KNN for SICL
- The paper introduces retrieval-augmented adaptation by using text-embedding KNN to select semantically relevant in-context demonstrations, leading to substantial ASR improvements.
- TICL applies a four-stage semantic retrieval process that achieves relative WER reductions of up to 84.7% across diverse tasks by matching demonstrations to the test utterance's likely lexical content.
- The method leverages pseudo-labeling, robust text encoders, and context assembly for plug-and-play, zero-shot adaptation without requiring any model fine-tuning.
Text-Embedding KNN for SICL (TICL) is a retrieval-augmented adaptation strategy for Speech In-Context Learning (SICL) in large multimodal models. SICL generalizes the in-context learning paradigm from text to speech by conditioning inference not only on test audio but also on a sequence of demonstration pairs, each consisting of a reference speech utterance and its transcript. TICL provides a lightweight mechanism to select in-context examples based on their lexical proximity to the (unknown) target transcript, thereby improving ASR robustness and domain adaptation in zero- or few-shot settings (Zheng et al., 16 Sep 2025).
1. Speech In-Context Learning and Demonstration Selection
SICL for automatic speech recognition in large multimodal models is formalized as conditional generation. Given a frozen model Λ capable of ingesting audio and text, and a test utterance $x$, the goal is to generate the transcript

$$\hat{y} = \arg\max_{y}\; p_{\Lambda}\!\left(y \mid \mathcal{C},\, a(x)\right),$$

where $a(x)$ is the encoded representation of $x$ and $\mathcal{C} = (d_1, \ldots, d_K)$ is a context of $K$ demonstration turns. Each $d_k$ may include a text prompt $t_k$ (possibly empty), an audio encoding $a(x_k)$, and a gold transcription $y_k$.
In SICL, model parameters are frozen at inference; all adaptation arises from context selection. The demonstration pool quality is critical: semantically irrelevant or lexically distant examples can degrade performance. Empirical studies reveal that random selection leads to performance volatility, motivating retrieval-based selection (Zheng et al., 16 Sep 2025).
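Concretely, the conditioning above amounts to prepending demonstration turns to the test audio before decoding. The following sketch illustrates that input assembly; the `Demo` structure and `encode_text` hook are illustrative assumptions for exposition, not the paper's API:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Demo:
    prompt: Optional[str]    # optional text prompt t_k (may be empty)
    audio_tokens: List[int]  # encoded audio a(x_k)
    transcript: str          # gold transcription y_k


def build_sicl_input(demos: List[Demo],
                     test_audio_tokens: List[int],
                     encode_text: Callable[[str], List[int]]) -> List[int]:
    """Concatenate K demonstration turns ahead of the test audio.

    The frozen model then decodes argmax_y p(y | C, a(x)); since no
    weights change, all adaptation comes from which demos populate C.
    """
    seq: List[int] = []
    for d in demos:
        if d.prompt:                       # skip empty prompts
            seq += encode_text(d.prompt)
        seq += d.audio_tokens              # demo audio
        seq += encode_text(d.transcript)   # demo gold transcript
    seq += test_audio_tokens               # test utterance comes last
    return seq
```

The key property is that the model only ever sees a longer input sequence; selection of `demos` is the entire adaptation mechanism.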
2. TICL Pipeline: Four-Stage Semantic Retrieval
TICL selects in-context demonstrations through a principled four-step process designed to approximate lexical proximity to the unknown test target:
- Pseudo-labeling: A frozen ASR model (e.g., Whisper-Large-v3-turbo) generates a noisy "pseudo-transcript" for the test utterance. This provides a candidate surrogate for the unknown gold transcript.
- Text Embedding: The pseudo-label $\tilde{y}$ is embedded using a pretrained sentence encoder $f$ (e.g., all-mpnet-base-v2 for English, or paraphrase-multilingual-mpnet-base-v2 in multilingual scenarios). Denote $e = f(\tilde{y})$ and $e_i = f(y_i)$ for each pool transcript $y_i$. Embeddings are $\ell_2$-normalized: $\bar{e} = e / \lVert e \rVert_2$ and $\bar{e}_i = e_i / \lVert e_i \rVert_2$.
- K-Nearest Neighbor Retrieval: Compute Euclidean distances in the normalized embedding space, $d_i = \lVert \bar{e} - \bar{e}_i \rVert_2$, and select the indices of the $K$ nearest demonstrations, i.e., those with the $K$ smallest $d_i$.
This prioritizes context transcripts that are lexically close to the pseudo-label.
- Context Assembly and Decoding: The speech–text pairs for the selected indices are retrieved. Their tokenized representations (alongside any text prompts, if present) are concatenated and prepended to the test audio tokens to form the final input sequence for Λ, which then autoregressively decodes the output transcript.
This approach directly activates the latent modeling capacity of Λ for the domain at hand, without any fine-tuning or weight updates (Zheng et al., 16 Sep 2025).
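The embedding and retrieval stages (steps two and three above) reduce to normalization and a K-nearest-neighbor lookup. A minimal NumPy sketch, with `embed` standing in for a pretrained sentence encoder such as all-mpnet-base-v2:

```python
import numpy as np


def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale vectors to unit Euclidean norm (last axis)."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def ticl_select(pseudo_label: str, pool_transcripts, embed, k: int = 4):
    """Stages 2-3 of TICL: embed the pseudo-label and the pool
    transcripts, then return indices of the k nearest neighbors in
    normalized embedding space under Euclidean distance.

    `embed` is any str -> vector map; in practice it would be a
    frozen sentence encoder's encode() call.
    """
    q = l2_normalize(embed(pseudo_label))                     # query e-bar
    pool = l2_normalize(np.stack([embed(t) for t in pool_transcripts]))
    dists = np.linalg.norm(pool - q, axis=1)                  # d_i
    return np.argsort(dists)[:k].tolist()                     # K smallest
```

On unit vectors, Euclidean distance is a monotone function of cosine similarity, so this ranking matches a cosine-based retrieval.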
3. Quantitative Outcomes and Empirical Robustness
TICL yields substantial improvements over zero-shot and random-context SICL baselines across diverse ASR tasks:
| Task | Model | K | Baseline WER | WER (TICL) | Relative WER ↓ |
|---|---|---|---|---|---|
| L2-Arctic (Accent) | Qwen2-Audio | 4 | 11.06% | 1.41% | 84.7% |
| GLOBE-V2 (Accented) | Multiple | 4 | varies | varies | up to 84.7% |
| CommonVoice (Multilingual) | Multiple | 4 | varies | varies | up to 84.6% |
| ENNI (Children speech) | Multiple | 4 | varies | varies | 5.8–47.3% |
WER is computed as

$$\mathrm{WER} = \frac{S + D + I}{N},$$

with $S$ = substitutions, $D$ = deletions, $I$ = insertions, and $N$ = reference word count. Relative reduction is

$$\Delta\mathrm{WER}_{\mathrm{rel}} = \frac{\mathrm{WER}_{0} - \mathrm{WER}_{K}}{\mathrm{WER}_{0}} \times 100\%,$$

where $\mathrm{WER}_{0}$ denotes the zero-shot ($K = 0$) condition (Zheng et al., 16 Sep 2025).
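Both metrics can be computed directly; the sketch below uses a standard word-level edit distance (an illustrative implementation, not the paper's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate (S + D + I) / N via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits turning ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # all deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)


def relative_wer_reduction(wer_zero_shot: float, wer_k: float) -> float:
    """Relative WER reduction in percent versus the K = 0 condition."""
    return 100.0 * (wer_zero_shot - wer_k) / wer_zero_shot
```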
Ablation studies attribute robustness to two axes:
- Pseudo-labeler quality: Even low-fidelity pseudo-labelers (e.g., Whisper-tiny with WER ≈ 13.1%) enable large relative WER reductions (67.8%). Higher-quality pseudo-labels yield diminishing returns, reflecting the resilience of semantic retrieval to moderate transcription noise.
- Number of demonstrations (K): Most of the benefit is achieved at small K (the reported experiments use K = 4); larger K often degrades performance due to context-length limits and the diminishing relevance of additional examples.
4. Limitations and Failure Modes
TICL's reliance on embedding space similarity may be misled by rare or compound terms where both pseudo-labels and pretrained sentence encoders are error-prone. For example, when the test utterance contains words unseen or infrequent in the pseudo-labeler training data, semantic retrieval can return lexically "related" but pragmatically irrelevant exemplars. As a result, performance gains may be attenuated for such edge cases (Zheng et al., 16 Sep 2025).
The method does not consider acoustic similarity, which can be critical in domains with high variance (e.g., children's speech, pathological/disordered speech, or noisy environments). In such cases, purely lexical retrieval can pull in context examples with mismatched speaker or channel properties, blunting the in-context adaptation effect.
5. Relation to Weighted and Multi-Stage Retrieval
Extensions such as TICL+ integrate an acoustic reranking stage after the semantic KNN selection. The resulting two-stage pipeline further narrows the candidate set with acoustic similarity (e.g., via embeddings from a frozen speech encoder), prioritizing demonstration examples that are both lexically proximate and sound similar to the test utterance. Empirically, this approach achieves up to 53.3% additional relative WER reduction over zero-shot and up to 37.6% over baseline TICL in challenging children's speech ASR (Zheng et al., 20 Dec 2025). Multi-stage retrieval mitigates the error propagation from pseudo-labeling and ensures domain-appropriate context alignment.
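Assuming a shortlist-then-rerank design (the exact TICL+ interface is not detailed here, so the parameters below are illustrative), the two-stage selection can be sketched as:

```python
import numpy as np


def two_stage_select(sem_dists, ac_sims, m: int, k: int):
    """Two-stage retrieval sketch: semantic KNN, then acoustic rerank.

    sem_dists: semantic distances to each pool item (lower = closer),
               e.g. from text-embedding KNN as in TICL
    ac_sims:   acoustic similarities (higher = more similar), e.g.
               cosine scores from a frozen speech encoder
    m:         shortlist size after the semantic stage (m >= k)
    k:         final number of in-context demonstrations
    """
    shortlist = np.argsort(sem_dists)[:m]                  # stage 1: lexical
    order = np.argsort(-np.asarray(ac_sims)[shortlist])    # stage 2: acoustic
    return shortlist[order][:k].tolist()
```

The shortlist bounds how far acoustic reranking can stray from lexically relevant candidates, which is one way to keep both constraints active.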
6. Practical Recommendations and Outlook
TICL demonstrates powerful, plug-and-play adaptation for off-the-shelf ASR-capable large multimodal models. Required components are a frozen speech model, a text encoder, and a pseudo-labeler—no model fine-tuning is necessary at inference. For best results:
- Use robust, multilingual text encoders and ASR pseudo-labelers;
- Restrict the number of demonstrations K to avoid prompt-length issues (the reported experiments use K = 4);
- In domains with high acoustic heterogeneity, combine semantic and acoustic KNN (cf. TICL+).
Ongoing work explores subword-level retrieval, dynamic pseudo-label selection, and deeper analysis of the model's attention mechanisms during in-context adaptation (Zheng et al., 16 Sep 2025). For scenarios such as speech domain transfer, multi-lingual adaptation, and child or accented ASR, embedding-based KNN retrieval markedly lowers the barrier to effective test-time adaptation in the speech-text joint space.