In-Context Distillation (ICD) Overview
- In-Context Distillation (ICD) is a method that extends traditional knowledge distillation by capturing not only per-sample predictions but also the contextual relationships among samples.
- ICD employs techniques like neighborhood retrieval in vision, prompt internalization in language models, and online approaches to transfer few-shot learning and contextual competence from large teachers to smaller students.
- Empirical results demonstrate ICD’s benefits across modalities with improved metrics and theoretical guarantees, while highlighting trade-offs such as dependency on high-quality teacher context and careful demonstration selection.
In-Context Distillation (ICD) is a family of methods that extend standard knowledge distillation by internalizing not just samplewise predictions, but also the informative structure and behavioral gains afforded by context, demonstrations, or relational neighborhoods. The goal is for models (often small or memory-constrained) to capture the teacher’s contextual abilities while operating without explicit in-context exemplars at inference. ICD principles have been deployed for classification, language modeling, vision-language, and tabular tasks, and especially for transferring few-shot and in-context learning competence from large models to smaller ones.
1. Core Concepts and Theoretical Foundations
Classical knowledge distillation (KD) regularizes the student model by aligning its prediction with the teacher’s output for each input sample (samplewise KD), typically minimizing a divergence between the teacher’s softened softmax outputs and the student’s predictions. However, this approach neglects relationships among samples and the benefits conferred by grouped exemplars (e.g., context demonstrations or similarity neighborhoods).
ICD generalizes KD by systematically capturing and internalizing knowledge over in-context groupings. This is grounded in the observation that KD is equivalent to learned label smoothing (LSR): where LSR trains against the uniformly smoothed target $\tilde{y} = (1-\alpha)\,y + \alpha\,u$, KD substitutes the teacher’s softened distribution $p^{t}$ for the uniform $u$,

$$\tilde{y} = (1-\alpha)\,y + \alpha\,p^{t},$$

which can be viewed as minimizing the cross-entropy with a smoothed, learned target. ICD extends this by aggregating $p^{t}$ over retrieved in-context neighbors, providing a richer, data-dependent smoothing target. Theoretical analyses formalize context aggregation as a regularization that captures inter-class and intra-class correlations, yielding Rademacher complexity-based generalization bounds and explicit performance guarantees linking the student’s bias to the MMD (maximum mean discrepancy) of the selected context (Zhu et al., 13 Jan 2025, Li et al., 13 Jun 2025).
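The difference among the LSR, KD, and ICD targets can be made concrete with a small numerical sketch; the mixing weight `alpha` and the neighbor-averaging rule are illustrative assumptions, not the exact formulation of any cited paper:

```python
import numpy as np

def smoothed_target(y_onehot, smoother, alpha=0.1):
    """Convex combination of the hard label and a smoothing distribution.

    LSR uses a uniform smoother; classical KD uses the teacher's softmax for
    this sample; ICD uses an aggregate over retrieved in-context neighbors.
    (Illustrative sketch; `alpha` and the aggregation rule are assumptions.)
    """
    return (1.0 - alpha) * y_onehot + alpha * smoother

# Hard label for class 0 over 4 classes.
y = np.array([1.0, 0.0, 0.0, 0.0])

# LSR: uniform smoothing.
lsr = smoothed_target(y, np.full(4, 0.25))

# KD: the teacher's own softmax for this sample.
kd = smoothed_target(y, np.array([0.8, 0.1, 0.05, 0.05]))

# ICD: average the teacher's softmax over K retrieved neighbors,
# giving a data-dependent smoothing target.
neighbor_probs = np.array([[0.7, 0.2, 0.05, 0.05],
                           [0.6, 0.3, 0.05, 0.05]])
icd = smoothed_target(y, neighbor_probs.mean(axis=0))
```

The only difference between the three variants is the smoothing distribution: uniform (LSR), a single teacher prediction (KD), or a neighborhood aggregate (ICD).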
2. ICD Methodologies Across Modalities
ICD methodologies vary across domains but share essential steps:
- Neighborhood/Retrieval-Based ICD: In computer vision, representations from a teacher’s memory bank are retrieved based on similarity (cosine or otherwise), aggregated and used as the soft label for distillation. The IC-KD framework formalizes this, constructing a feature memory bank, retrieving positive (same-class) and negative (different-class) neighborhoods, and applying class-conditional aggregation for positive in-context distillation (PICD) and contrastive losses for negative in-context distillation (NICD) (Zhu et al., 13 Jan 2025).
- Prompt or Demonstration Internalization in LLMs: Large LLMs often require in-context examples for few-shot learning (ICL). ICD synthesizes these context effects into the student’s parameters. During training, the teacher is fed a prompt with support examples and a query, while the student is trained to predict the output on the query alone, using KL divergence or cross-entropy between teacher and student outputs (Snell et al., 2022, Upadhayayaya et al., 2024, Duan et al., 2024, Huang et al., 2022). More advanced strategies combine in-context objectives and language modeling losses, and use both hard and soft target supervision.
- Online and Efficient ICD: Recent work proposes online retrieval-augmented ICD, especially for resource-constrained settings (e.g., small vision-LLMs), combining demonstration retrieval, uncertainty-based teacher querying, and iterative demonstration pool augmentation at test time, all without explicit retraining of the student model (Kang et al., 20 Oct 2025).
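A minimal test-time loop for online retrieval-augmented ICD might look like the following; the cosine retrieval, entropy threshold `tau`, and pool-augmentation rule are assumptions for illustration, not the exact procedure of Kang et al.:

```python
import math

def entropy(probs):
    """Shannon entropy of a discrete distribution (uncertainty proxy)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def online_icd_step(query_emb, demo_pool, student, teacher, tau=0.5, k=2):
    """One test-time step of online retrieval-augmented ICD (sketch).

    `student`/`teacher` are callables returning class probabilities; the
    uncertainty threshold `tau` and cosine retrieval are assumptions.
    """
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    # Retrieve the k most similar demonstrations from the pool.
    demos = sorted(demo_pool, key=lambda d: cos(query_emb, d["emb"]),
                   reverse=True)[:k]

    pred = student(query_emb, demos)
    if entropy(pred) > tau:              # uncertainty-based teacher querying
        pred = teacher(query_emb, demos)
        # Iterative pool augmentation with the teacher-labeled query.
        demo_pool.append({"emb": query_emb, "label": pred})
    return pred
```

No student retraining occurs: the pool grows only when the student is uncertain, which is what keeps teacher annotation cost low.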
| Variant | Retrieval/Context Mechanism | Loss/Objective |
|---|---|---|
| IC-KD (Vision) | Feature memory, top-K/N neighbors | KL for positives (PICD), cosine-contrast for negatives (NICD) |
| LLM ICD (NLP) | Prompt context in teacher | KL (teacher in-context vs. student query-only), sometimes CE |
| Tabular ICD | Optimized distilled data subset | Negative log-likelihood over the training set via TabPFN context (Ma et al., 2024) |
3. Distillation Algorithmic Frameworks
Vision/KD Context-Aware Recipe (Zhu et al., 13 Jan 2025)
- Build a complete teacher feature memory bank.
- For each query, form a positive set (top-K neighbors by feature similarity with matching labels) and a negative set (top-N neighbors by feature similarity with mismatched labels).
- Aggregate the teacher’s softmax outputs over the positive set into a single soft target.
- Losses:
- PICD: KL divergence from the aggregated soft target to the student’s prediction.
- NICD: cosine-based contrast with the negatives.
- Final: weighted sum with standard KD and CE.
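The recipe above can be sketched with simplified loss forms; the exact PICD/NICD objectives and weights in IC-KD differ in detail, so treat these as illustrative assumptions:

```python
import numpy as np

def picd_loss(agg_teacher_probs, student_logits):
    """KL(aggregated teacher target || student softmax) — PICD sketch.

    `agg_teacher_probs` is the average of teacher softmax outputs over the
    positive (same-class) neighborhood, assumed strictly positive.
    """
    s = np.exp(student_logits - student_logits.max())
    s = s / s.sum()
    p = agg_teacher_probs
    return float(np.sum(p * (np.log(p) - np.log(s))))

def nicd_loss(student_feat, negative_feats):
    """Mean cosine similarity to negative-neighborhood features — an
    illustrative stand-in for the paper's contrastive NICD form (lower
    similarity to negatives is better, so this is minimized)."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return float(np.mean([cos(student_feat, n) for n in negative_feats]))

# Final objective (weights lambda_p, lambda_n are assumptions):
# total = ce + kd + lambda_p * picd + lambda_n * nicd
```

When the student already matches the aggregated target, the PICD term vanishes; the NICD term keeps pushing its features away from different-class neighborhoods.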
LLM/Prompt-Based ICD (Huang et al., 2022, Duan et al., 2024)
- Sample a batch of queries; for each, select k-shot support examples.
- Teacher: runs on (support + query).
- Student: runs on query alone (or minimally prefixed).
- Minimize combined KL (teacher vs. student, over soft labels) and CE (student vs. hard label), with careful weighting.
- Learning objective often interpolates between hard label and distillation, possibly with temperature.
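The steps above can be condensed into a single combined objective; the temperature `T` and interpolation weight `alpha` are assumed hyperparameters, and the logits stand in for full teacher/student forward passes:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def icd_loss(teacher_logits_ctx, student_logits_query, hard_label,
             alpha=0.5, T=2.0):
    """Combined prompt-internalization ICD objective (sketch).

    `teacher_logits_ctx` come from the teacher run on (support + query);
    `student_logits_query` from the student on the query alone. `alpha`
    interpolates between distillation and hard-label terms; `T` softens
    the distributions. Both are assumptions, not fixed by the papers.
    """
    pt = softmax(teacher_logits_ctx / T)     # soft targets with temperature
    ps = softmax(student_logits_query / T)
    kl = float(np.sum(pt * (np.log(pt) - np.log(ps))))          # distillation
    ce = float(-np.log(softmax(student_logits_query)[hard_label]))  # hard label
    return alpha * kl + (1.0 - alpha) * ce
```

If the student on the bare query already reproduces the teacher's in-context distribution, the KL term is zero and only the hard-label term remains.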
4. Technical Results and Empirical Impact
Vision: IC-KD delivers improvements across offline/online/teacher-free KD paradigms (e.g., +1.7% top-1 accuracy over CRD on CIFAR-100; mIoU 76.89% vs. prior best 76.38% on Cityscapes) (Zhu et al., 13 Jan 2025).
NLP/LLM:
- ICD on OPT-125M achieves up to 50% higher out-of-domain accuracy vs. in-context learning or pattern-based fine-tuning, with 60% reduced memory (Duan et al., 2024).
- ICD matches or exceeds vanilla ICL in-domain accuracy, and consistently outperforms ICL out-of-domain, though full few-shot fine-tuning remains superior with large datasets (Upadhayayaya et al., 2024).
- Context distillation outperforms direct gradient-based fine-tuning (a 9 percentage-point gain on SPIDER Text-to-SQL adaptation) and allows training to internalize more context than fits in the window (Snell et al., 2022).
- Reasoning distillation (ReDis) combines rule generation and rule following, surpassing even the teacher (e.g., +23.2% relative gain over GPT-4o on 1D-ARC), and accelerating hypothesis search (Sadeq et al., 14 Apr 2025).
5. Theoretical Guarantees and Best Practices
- Generalization Bounds: ICD generalization error is controlled by the empirical loss on the prompt demonstrations plus a Rademacher complexity term scaling as $\mathcal{O}(1/\sqrt{N})$ in the number of demonstrations $N$, with constants depending on norm and feature bounds. More demonstrations therefore reduce variance (Li et al., 13 Jun 2025).
- Bias–MMD Relationship: The student’s bias after distillation is linearly bounded by the MMD between the prompt (demonstrations) and the target/query distribution, indicating that careful prompt selection (minimizing MMD) reduces bias (Li et al., 13 Jun 2025).
- Prompt Engineering Implications: demonstration retrieval should minimize the MMD between demonstrations and queries, and increasing the number of demonstrations $N$ reduces generalization error. Feature normalization and prompt "sharpness" mediate the trade-off between learning strength and sensitivity.
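MMD-guided demonstration selection can be sketched as follows, assuming an RBF kernel and a simple plug-in MMD estimate (`gamma` and the candidate-set framing are assumptions):

```python
import numpy as np

def rbf_mmd2(X, Y, gamma=1.0):
    """Squared MMD between demonstration features X and query features Y
    under an RBF kernel (plug-in estimate; `gamma` is an assumption)."""
    def k(A, B):
        # Pairwise squared distances via broadcasting, then RBF kernel.
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return float(k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean())

def select_prompt(candidate_sets, queries):
    """Pick the demonstration set with minimal MMD to the query
    distribution, per the bias bound: smaller MMD, smaller student bias."""
    return min(candidate_sets,
               key=lambda S: rbf_mmd2(np.asarray(S), queries))
```

Candidate sets whose feature distribution matches the queries score near zero; out-of-distribution candidates are penalized and skipped.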
6. Limitations, Trade-Offs, and Future Directions
Limitations:
- Student performance is bounded by teacher contextual accuracy. Poor teacher context or demonstration selection can propagate errors (Snell et al., 2022, Li et al., 13 Jun 2025).
- For large students, ICD performance can stagnate or degrade, suggesting architectural challenges in internalization at scale (Upadhayayaya et al., 2024).
- All variants depend on having access to a teacher with strong in-context ability and, often, sufficient compute for teacher runs.
Trade-offs:
- ICD dramatically reduces inference memory and compute versus prompt-based or demonstration-heavy methods, at the cost of a distillation phase.
- Empirically, annotation-efficient online ICD can close 90% of the teacher–student gap at one-sixth the annotation cost, but its effectiveness saturates with very large demonstration pools or student models (Kang et al., 20 Oct 2025).
Future directions include: recursive distillation (self-improving students), robust teacher selection, adaptive sample weighting, more efficient parameter-efficient tuning, combining ICD with rule- or RL-based feedback, and extending to domains with more complex relational or structured demonstrations.
References:
- (Zhu et al., 13 Jan 2025)
- (Li et al., 13 Jun 2025)
- (Snell et al., 2022)
- (Upadhayayaya et al., 2024)
- (Duan et al., 2024)
- (Huang et al., 2022)
- (Sadeq et al., 14 Apr 2025)
- (Kang et al., 20 Oct 2025)
- (Ma et al., 2024)