
Test-Time Speech ICL (SICL)

Updated 3 February 2026
  • Test-Time Speech-based In-Context Learning is a method that adapts speech models at inference using inline speech–label examples, eliminating the need for gradient updates.
  • It leverages a prompt-conditioning paradigm from large language models, enabling diverse architectures to support ASR, SLU, and speech translation tasks.
  • Empirical results demonstrate significant word error rate reductions and robust performance across varied domains using techniques like TICL, acoustic reranking, and Bayesian selection.

Test-Time Speech-based In-Context Learning (SICL) enables speech and speech-text foundation models to adapt to new or low-resource speech recognition and spoken language understanding tasks at inference, using a small number of inline speech–label demonstrations (“in-context examples”) rather than gradient-based updates. SICL leverages the prompt-conditioning paradigm originating in LLMs, extending it to speech inputs and outputs. The SICL protocol has been instantiated in a range of architectural and methodological variants, supporting robust, training-free adaptation for automatic speech recognition (ASR), spoken language understanding (SLU), and speech translation across diverse domains and conditions (Agrawal et al., 12 May 2025, Wang et al., 2023, Roll et al., 20 May 2025, Zheng et al., 16 Sep 2025, Zheng et al., 20 Dec 2025, Everson et al., 2024, Wang et al., 2024, Pan et al., 2023, Chen et al., 2023, Zheng et al., 26 Jan 2026, Yen et al., 2024).

1. Formalization and Core Mechanism

SICL assumes a pretrained speech-text or multimodal model Λ whose inputs can encode sequences of audio tokens (from audio encoders) and/or text. At inference, given a test utterance X and a context set C = {(X_1, y_1), ..., (X_K, y_K)} (where X_i is audio and y_i its label/transcript), SICL conditions the model on C and X and decodes the target output Y:

Ŷ = argmax_Y P(Y | C, X, Λ)

The context set C is prepended or otherwise incorporated into the encoder (for audio) and decoder (for text), with varying prompt formats depending on the backbone model. SICL does not update model parameters at test time. Instead, it relies on the model's emergent ability to use the demonstration examples as conditioning, biasing predictions toward the target domain's linguistic or acoustic properties.
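The input-assembly step of this protocol can be sketched in a few lines. The `Demo` type and `build_sicl_inputs` helper below are hypothetical (real backbones such as Whisper or Phi-4-MM use model-specific prompt tokens); the sketch only illustrates that adaptation comes entirely from the conditioning inputs, never from weight updates:

```python
from dataclasses import dataclass

@dataclass
class Demo:
    audio: list       # placeholder for audio features X_i
    transcript: str   # label/transcript y_i

def build_sicl_inputs(context, test_audio):
    """Concatenate context audio ahead of the test utterance for the encoder,
    and join context transcripts into a decoder text prompt. No model
    parameters are touched."""
    encoder_audio = []
    for demo in context:
        encoder_audio.extend(demo.audio)
    encoder_audio.extend(test_audio)
    decoder_prompt = " ".join(d.transcript for d in context)
    return encoder_audio, decoder_prompt

context = [Demo([0.1, 0.2], "hello world"), Demo([0.3], "good morning")]
enc, dec = build_sicl_inputs(context, [0.9])
# A frozen model would then decode Y-hat = argmax_Y P(Y | C, X, Lambda)
# conditioned on (enc, dec).
```

The frozen backbone sees the demonstrations only through these concatenated inputs, which is what biases its predictions toward the target domain.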

For example, in Whisper-v3 and Phi-4-MM, context demonstration audio is concatenated onto the encoder input as multiple utterances and their transcripts are added to the decoder's prompt tokens (Wang et al., 2023, Zheng et al., 16 Sep 2025). In hybrid speech-text transformer models (e.g., SALMONN-13B, COSMIC), the speech encoder and cross-modal “Q-former” or adapters project audio into the language decoder's attention space, letting the autoregressive decoder access both the test query and the demonstration set (Agrawal et al., 12 May 2025, Pan et al., 2023).

2. Architectures and Prompt Engineering

SICL Instantiations:

  • Audio-Text Multimodal Transformers: SALMONN-13B (Agrawal et al., 12 May 2025), COSMIC (Pan et al., 2023), and similar models employ a speech encoder (e.g., Conv+Transformer or CTC-based), a modality adapter (e.g., Q-former), and a LLaMA/decoder backbone. At inference, the query audio and context demonstrations (speech and, possibly, text form) are inserted using special prompt tokens. The decoder performs cross-attention over both modalities.
  • End-to-End ASR Models: Encoder-decoder models such as Whisper (Wang et al., 2023) and SICL-AED (Yen et al., 2024) support SICL by concatenating context utterances at the encoder input and context transcripts at the decoder input, without architectural modification.
  • Model Prompt Construction:
    • Context examples are formatted either as full audio–transcript pairs, transcript-only demonstrations, or (in retrieval-based variants) a sequence of semantic “nearest-neighbor” pairs. Practical pipelines frequently incorporate embedding-based selection for the most relevant demonstrations (Zheng et al., 16 Sep 2025, Zheng et al., 20 Dec 2025, Wang et al., 2024).
    • Prompt structure in models such as Phi-4-MM interleaves user/assistant role tokens, audio markers, transcript lines, and task-specific instructions (Roll et al., 20 May 2025).

Representative Prompt Template (Phi-4-MM, Test-Time ICL) (Roll et al., 20 May 2025):

<|user|> I will give you N examples from [same/different] speaker. <|end|>
<|assistant|> Okay, I’m ready. <|end|>
<|user|><|audio_1|> Transcription: ... <|end|><|assistant|> Transcription: ... <|end|>
...
<|user|><|audio_target|> Transcribe the audio clip into text. <|end|><|assistant|>
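A builder for this kind of prompt can be sketched as follows. The function name and arguments are illustrative, not part of Phi-4-MM's API; each `<|audio_i|>` marker stands in for a demonstration waveform inserted by the real pipeline:

```python
def build_icl_prompt(transcripts, same_speaker=True):
    """Assemble a Phi-4-MM-style test-time ICL prompt from K context
    transcripts, following the role/audio-marker template above."""
    speaker = "same" if same_speaker else "different"
    lines = [
        f"<|user|> I will give you {len(transcripts)} examples from "
        f"{speaker} speaker. <|end|>",
        "<|assistant|> Okay, I'm ready. <|end|>",
    ]
    for i, text in enumerate(transcripts, start=1):
        lines.append(
            f"<|user|><|audio_{i}|> Transcription: {text} <|end|>"
            f"<|assistant|> Transcription: {text} <|end|>"
        )
    # Final turn: the target audio plus the task instruction, leaving the
    # assistant turn open for the model to complete with the transcription.
    lines.append(
        "<|user|><|audio_target|> Transcribe the audio clip into text. "
        "<|end|><|assistant|>"
    )
    return "\n".join(lines)

prompt = build_icl_prompt(["hi there", "see you soon"])
```

Leaving the last `<|assistant|>` turn open is what prompts the model to emit the target transcription as its continuation.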

3. Example Selection and Retrieval Strategies

The quality of in-context examples is a primary determinant of adaptation efficacy. Several pipelines have been developed to select highly relevant demonstrations:

  • Text-Embedding KNN (TICL): Uses a frozen sentence encoder (e.g., paraphrase-mpnet models) to embed pseudo-transcripts of candidate pool utterances and retrieves the K nearest examples by Euclidean or cosine distance to the test utterance's pseudo-transcript embedding (Zheng et al., 16 Sep 2025). This approach is robust to pseudo-label noise and outperforms speaker- or acoustic-content-based retrieval across ASR and multilingual tasks.
  • Acoustic Reranking (TICL+): Augments semantic retrieval with acoustic similarity. After an initial semantic filter, candidates are reranked by distance in the embedding space of a frozen speech encoder (e.g., Whisper). This enhances performance for high-variability speech (e.g., children’s speech in MyST, OGI) (Zheng et al., 20 Dec 2025).
  • Bayesian Example Selection (ByCS): Implements inverse-inference scoring: each candidate is evaluated by how well its label can be recovered by the model when the roles are swapped, i.e., the candidate is treated as the test example and the current test utterance as the demonstration. The candidates with the highest mutual information (measured via, e.g., Jaccard overlap) are selected (Wang et al., 2024).
  • Active and kNN-based Selection in Whisper-SICL: Frame-level mean-pooled audio embeddings are used to compute kNN distances for selecting context examples, sometimes further filtered by dialect or speaker (Wang et al., 2023).
Method        Selection Modalities           Empirical Gains
TICL          Textual semantic similarity    Up to 84.7% rel. WER reduction (Zheng et al., 16 Sep 2025)
TICL+         Text + acoustic similarity     Up to 53.3% rel. WER reduction (Zheng et al., 20 Dec 2025)
ByCS          Inverse inference likelihood   ~10% absolute WER reduction (Wang et al., 2024)
Whisper kNN   Audio embedding similarity     32.3–36.4% rel. WER reduction (Wang et al., 2023)
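The semantic retrieval step shared by these pipelines can be illustrated with a small sketch. In the real TICL pipeline the embeddings come from a frozen sentence encoder applied to pseudo-transcripts; here they are stubbed with random vectors, so only the nearest-neighbor selection logic is shown:

```python
import numpy as np

def cosine_distance(a, b):
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def select_demos(test_emb, pool_embs, k=4):
    """Return indices of the k candidate demonstrations whose embeddings
    are nearest (by cosine distance) to the test embedding."""
    dists = [cosine_distance(test_emb, e) for e in pool_embs]
    return np.argsort(dists)[:k].tolist()

rng = np.random.default_rng(0)
pool = rng.normal(size=(10, 8))              # stand-in for pool embeddings
test = pool[3] + 0.01 * rng.normal(size=8)   # test utterance near pool item 3
chosen = select_demos(test, pool, k=4)       # pool item 3 should rank first
```

Swapping the stubbed embeddings for sentence-encoder outputs (TICL) or frozen speech-encoder outputs (the TICL+ reranking stage) recovers the retrieval variants in the table above.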

4. Empirical Results and Task Coverage

SICL demonstrates strong performance improvements as the number and quality of in-context examples increase, across a diverse range of tasks:

  • Automatic Speech Recognition (ASR): On English-accented (GLOBE-V2), children’s (MyST, OGI Kids, RSR), and multilingual (CommonVoice, L2-Arctic) datasets, SICL with well-chosen demonstrations yields large relative WER reductions (25–85% compared to zero-shot), with effects saturating at K ≈ 4 (Zheng et al., 16 Sep 2025, Zheng et al., 20 Dec 2025, Roll et al., 20 May 2025).
  • Spoken Language Understanding (SLU): In cross-task and unseen-task settings (e.g., train on dialogue sentiment, test on dialogue acts), robust fine-tuning with randomized class labels achieves up to 95.3% relative macro-F1 improvement in mismatched few-shot SLU (Agrawal et al., 12 May 2025).
  • Children’s Speech: TICL+ produces up to 53.3% relative WER reduction on challenging noisy children’s speech datasets relative to zero-shot; acoustic reranking is particularly effective when transcript-based retrieval alone is weak (Zheng et al., 20 Dec 2025).
  • Long-Form Decoding and Speaker Adaptation: SICL-AED yields an 8.6% relative WER reduction over utterance-level baselines, and with in-context fine-tuning achieves entity recall gains of 64% for contextual biasing (Yen et al., 2024).
  • Extensible Modalities and Tasks: SICL is effective for cross-domain adaptation in SLU, entity recognition, intent classification, speech-to-text translation, and contextual keyword boosting (Pan et al., 2023, Chen et al., 2023, Everson et al., 2024).

5. Fine-Tuning Strategies for Robust SICL

While vanilla SICL is training-free at inference, certain techniques can prime models for downstream generalization:

  • Random Label Fine-Tuning: During supervised fine-tuning, class label definitions are permuted at every minibatch, forcing the model to “read” class semantics rather than rely on idiosyncratic label–definition mappings. At test time, the real class definitions are restored. In unseen-task transfer, this method outperforms both matched FT and symbol-mapped FT in zero- and few-shot SLU, delivering +95.3% and +64.3% mean relative F1 gains on dialogue act and NER transfer, respectively (Agrawal et al., 12 May 2025).
  • Post-Training SICL Adaptation (SICL-AT): Trains LoRA adapters episodically on high-resource ICL-formatted speech tasks only, then freezes all weights for evaluation on low-resource or out-of-domain tasks. SICL-AT generalizes more effectively than direct fine-tuning, yielding consistent WER and accuracy gains in children’s ASR and speech reasoning, as well as BLEU improvements for speech translation (Zheng et al., 26 Jan 2026).
  • Speech-Supervised In-Context Training (Speech ICT, SALM): In multitask sequence-to-sequence models, a fraction of training examples are interleaved with text keyword "hints"—instructing the model to extract these on top of ASR/AST outputs, thereby teaching in-context text conditioning at inference (Chen et al., 2023).
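The random-label fine-tuning idea above reduces to a per-minibatch permutation of the class-to-definition map. A toy sketch (class names and definitions invented for illustration):

```python
import random

def permuted_label_map(class_names, definitions, rng):
    """Shuffle which definition is assigned to each class name, as done at
    every minibatch during random-label fine-tuning; the model must then
    read the definitions in the prompt instead of memorizing a fixed map."""
    shuffled = list(definitions)
    rng.shuffle(shuffled)
    return dict(zip(class_names, shuffled))

classes = ["positive", "negative", "neutral"]     # illustrative label set
defs = ["approval or praise", "criticism or complaint", "no clear stance"]
rng = random.Random(0)

# One fresh permutation per training minibatch...
train_maps = [permuted_label_map(classes, defs, rng) for _ in range(3)]
# ...and the true (unshuffled) class->definition map restored at test time.
test_map = dict(zip(classes, defs))
```

Each training prompt embeds a different map, so the only reliable signal is the definition text itself, which is what transfers to unseen tasks.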

6. Limitations and Open Challenges

SICL empirically enables robust, training-free adaptation in many settings, but several limitations and open research questions remain:

  • Zero-Shot Upper-Bounds: In challenging unseen-task SLU transfer, mismatched zero-shot performance remains far below the matched upper-bound (e.g., macro-F1 ~1–3% vs. ~38–61%) (Agrawal et al., 12 May 2025).
  • Rare and Out-of-Vocabulary Classes: Rare labels, technical terms, and code-switched or highly inflected morphological forms may not be recovered well, especially when semantic retrieval is misled due to pseudo-label noise or embedding drift (Zheng et al., 16 Sep 2025, Zheng et al., 20 Dec 2025).
  • Context Scaling and Diminishing Returns: Performance improvements plateau or even degrade for K > 4; longer multimodal prompts can exhaust context windows and reduce effective modeling (Zheng et al., 16 Sep 2025, Roll et al., 20 May 2025).
  • Quality Sensitivity in Retrieval: Poorly chosen or noisy demonstrations can reduce SICL gains; active or learned selection (e.g., by ByCS) is under active investigation (Wang et al., 2024).
  • Streaming and Partial-Label Adaptation: SICL, as commonly instantiated, requires full context assembly and labeled demonstrations at inference, limiting its utility in streaming, partial-supervision, or real-time settings (Roll et al., 20 May 2025).
  • Limited Model Coverage: Most results pertain to a few leading models (Whisper, Phi-4-MM, SALMONN/LLAMA-family, Qwen2-Audio). Transferability to other architectures, particularly beyond speech-text, is underexplored.

7. Future Directions

Emergent research avenues for SICL include:

  • Active and Adaptive Example Selection: Jointly optimizing pseudo-labeling, multi-modal retrieval, and weighting (potentially via a learned scoring function or end-to-end training) (Zheng et al., 20 Dec 2025, Wang et al., 2024).
  • Instruction-Tuned and Adapter-Based Architectures: Embedding in-context learning priors deeper into speech/text foundation models, leveraging instruction-tuning and modular adaptation (Pan et al., 2023, Zheng et al., 26 Jan 2026).
  • Cross-Lingual and Code-Switched Adaptation: Extending SICL to handle spontaneous, highly variable, and multi-lingual conversational speech, with zero/few-shot supervision (Roll et al., 20 May 2025, Zheng et al., 16 Sep 2025).
  • Integration of Error Lattices and Uncertainty: Feeding word confusion networks from ASR lattices into LLM prompts yields modest but robust SLU gains, suggesting further work leveraging structured speech ambiguities (Everson et al., 2024).
  • Extending to Generative and Reasoning Tasks: Moving beyond recognition/classification to open-ended spoken QA, summarization, and dialog act synthesis (Agrawal et al., 12 May 2025).
  • Efficient Long-Form and Memory-Augmented Decoding: Document-level and utterance-level attention architectures reduce computational cost and memory requirements while leveraging extended context for both adaptation and entity recall (Yen et al., 2024).

Test-Time Speech-based In-Context Learning (SICL) thus represents a rapidly evolving paradigm for efficient, flexible, and domain-agnostic speech foundation model adaptation. It couples architectural innovations with advances in retrieval, prompt formatting, and fine-tuning, delivering substantial real-world gains across ASR, SLU, and multimodal understanding tasks (Agrawal et al., 12 May 2025, Wang et al., 2023, Roll et al., 20 May 2025, Zheng et al., 20 Dec 2025, Zheng et al., 16 Sep 2025, Wang et al., 2024, Pan et al., 2023, Chen et al., 2023, Zheng et al., 26 Jan 2026, Yen et al., 2024, Everson et al., 2024).
