In-Context Learning in Transformers

Updated 26 February 2026

In-context learning in transformers is a paradigm where models adapt to new tasks using demonstration examples without weight updates, enabled by self-attention mechanisms.
Training proceeds through phase transitions and eigenmode-specific dynamics, and the learned capability extends from intra-task to intra-problem generalization.
Empirical and theoretical analyses reveal limits in inter-problem generalization, highlighting the critical role of prompt design and pretraining diversity.

In-context learning (ICL) in transformers is the phenomenon whereby a model can adapt to new tasks solely by conditioning on a sequence of demonstration examples provided in its prompt, without any updates to its weights. This mechanism is central to the versatility of LLMs and vision transformers, enabling rapid task adaptation and algorithm emulation in-context, but its theoretical underpinnings, practical capabilities, and limitations remain an active field of research.

1. Mechanistic Foundations and Analytical Dynamics

ICL emerges in transformer models through the forward propagation of information in the context window, where self-attention layers allow hidden states—particularly the final token or position—to aggregate and process the information from all earlier (input, label) pairs. In the case of linear regression tasks and linear transformers, the ICL process admits an exact analytical characterization. Consider a one-layer, single-head linear transformer with parameter matrices $p_2(t)$ (output/value block) and $q_1(t)$ (query/key block). Training with stochastic gradient descent (SGD) on a context of $N$ examples induces per-eigenmode updates:

$\Delta p_\alpha = -\eta P s_\alpha^2 [p_\alpha q_\alpha^2 s_\alpha^\infty - q_\alpha]$

$\Delta q_\alpha = -\eta P s_\alpha^2 [q_\alpha p_\alpha^2 s_\alpha^\infty - p_\alpha]$

where $\alpha$ indexes the eigenmodes of the data covariance, $s_\alpha$ their eigenvalues, and $P$ the number of tasks. In continuous time, this leads to decoupled dynamics for each $\alpha$:

$\tau_\alpha \frac{d p_\alpha}{dt} = q_\alpha [1 - p_\alpha q_\alpha s_\alpha^\infty], \qquad \tau_\alpha \frac{d q_\alpha}{dt} = p_\alpha [1 - p_\alpha q_\alpha s_\alpha^\infty], \quad \tau_\alpha = (\eta P s_\alpha^2)^{-1}$

Defining the product $u_\alpha = p_\alpha q_\alpha$ and taking a balanced initialization ($p_\alpha = q_\alpha$), one obtains a logistic equation, $\tau_\alpha \frac{d u_\alpha}{dt} = 2 u_\alpha [1 - u_\alpha s_\alpha^\infty]$, with closed-form solution, revealing how different modes (directions of data variance) are learned on separated timescales, with loss plateaus followed by rapid transitions as slower modes activate. The system respects a conservation law, $\frac{d}{dt}\left(p_\alpha^2 - q_\alpha^2\right) = 0$, reflecting an underlying scaling symmetry that limits the effective parameter dynamics to a low-dimensional manifold (Mainali et al., 17 Apr 2025).

Although the architecture is linear in its attention, the learning dynamics are intrinsically nonlinear, involving products of the weight blocks (e.g., $p_\alpha q_\alpha^2$ and $q_\alpha p_\alpha^2$).
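The staggered, mode-by-mode learning described above can be checked numerically. The following is a minimal sketch that integrates the continuous-time per-mode equations with forward Euler. The function name `simulate_modes`, the eigenvalues, the initial values, and in particular the choice $s_\alpha^\infty = s_\alpha$ are assumptions made purely for illustration; they are not taken from the cited work.

```python
import numpy as np

def simulate_modes(s_inf, tau, p0, q0, dt=1e-3, steps=40000):
    """Forward-Euler integration of
        tau * dp/dt = q * (1 - p*q*s_inf),   tau * dq/dt = p * (1 - p*q*s_inf)
    for several independent eigenmodes at once."""
    s_inf = np.asarray(s_inf, dtype=float)
    tau = np.asarray(tau, dtype=float)
    p = np.full_like(s_inf, p0)
    q = np.full_like(s_inf, q0)
    progress = np.empty((steps, s_inf.size))
    for step in range(steps):
        gap = 1.0 - p * q * s_inf          # distance from the fixed point p*q = 1/s_inf
        p, q = p + dt * q * gap / tau, q + dt * p * gap / tau
        progress[step] = p * q * s_inf     # goes from ~0 to 1 as the mode is learned
    return p, q, progress

if __name__ == "__main__":
    s = np.array([2.0, 1.0, 0.5])          # hypothetical covariance eigenvalues
    eta_P = 1.0                            # hypothetical value of eta * P
    tau = 1.0 / (eta_P * s**2)             # tau_alpha = (eta P s_alpha^2)^{-1}
    # ASSUMPTION: s_inf is set equal to s only to have concrete numbers.
    p, q, progress = simulate_modes(s_inf=s, tau=tau, p0=2e-2, q0=1e-2)
    half_times = np.argmax(progress > 0.5, axis=0) * 1e-3
    print("time to reach half of final value per mode:", half_times)
    print("p^2 - q^2 at the end:", p**2 - q**2)
```

Modes with larger eigenvalues have smaller $\tau_\alpha$ and leave their plateau first, while $p_\alpha^2 - q_\alpha^2$ stays close to its initial value, up to the discretization error of the Euler scheme.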

2. Task Generalization Frameworks and Empirical Boundaries

The ability of ICL in transformers to generalize is fundamentally shaped by the diversity and structure of pretraining or finetuning data. A systematic task-centric framework distinguishes three levels:

Intra-task generalization: Can the model generalize to new parameterizations of a known function class after seeing multiple in-context examples from that class?
Intra-problem generalization: Can the model, having seen a sample of function classes (e.g., combinations, compositions), generalize to new combinations within a known family?
Inter-problem generalization: Can the model generalize across distinct problem families (e.g., from sinusoids to their products or compositions) solely through in-context adaptation?

Empirical studies show a strong ICL capability for intra-task and intra-problem settings: for example, GPT-2 models finetuned on a single convex sum can generalize via ICL to all pairwise sums, and performance improves monotonically with exposure to more composite classes. By contrast, inter-problem generalization fails: even LLMs pretrained on trillions of tokens do not generalize to novel composite operations not seen in training; a small number of novel composite examples during finetuning is essential for ICL (Zhang et al., 19 Mar 2025).

ICL performance and generalization are further enhanced by mixing simple and complex tasks during training, carefully crafting prompts to cover potential deployment domains, and monitoring distributional alignment between in-context examples and test conditions.
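To make the three generalization levels concrete, the sketch below samples in-context prompts for each of them. It is an illustration only: the base functions, the helper names `convex_sum_prompt` and `product_prompt`, and the sampling ranges are hypothetical and do not reproduce the benchmark of the cited study.

```python
import numpy as np

# Hypothetical base functions, chosen only for illustration.
BASE = {"sin": np.sin, "cos": np.cos, "tanh": np.tanh}

def convex_sum_prompt(names, rng, n_examples=8):
    """Sample (xs, ys, weights) from a random convex combination of base functions."""
    weights = rng.dirichlet(np.ones(len(names)))   # non-negative, sums to 1
    xs = rng.uniform(-2.0, 2.0, size=n_examples)
    ys = sum(w * BASE[name](xs) for w, name in zip(weights, names))
    return xs, ys, weights

def product_prompt(names, rng, n_examples=8):
    """Sample (xs, ys) from the product of base functions (a different problem family)."""
    xs = rng.uniform(-2.0, 2.0, size=n_examples)
    ys = np.prod([BASE[name](xs) for name in names], axis=0)
    return xs, ys

rng = np.random.default_rng(0)
seen_pair = ("sin", "cos")                                # assumed to appear in training
intra_task = convex_sum_prompt(seen_pair, rng)            # same class, new weights
intra_problem = convex_sum_prompt(("sin", "tanh"), rng)   # unseen pair, same family
inter_problem = product_prompt(seen_pair, rng)            # unseen family
```

Under this setup, the first two prompt types correspond to the settings where ICL is reported to work, and the third to the setting where it is reported to fail without some exposure during finetuning.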

3. Induction Heads, Emergent Circuitry, and Phase Transitions

A key mechanistic structure in transformer ICL is the "induction head," a circuit discovered in two-layer attention-only models trained on minimal ICL tasks. The parameter space of such models, despite being high-dimensional, collapses during training to a finite-dimensional subspace (19-dimensional for a disentangled architecture with block-structured weights), with only three directions (coefficients of specific blocks) governing the emergence of the induction head.

Learning proceeds in phases: the output readout is learned first, followed by the alignment of a second attention head to label positions, and finally the sharpening of an "item-matching" attention head. The time to induction head emergence is quadratic in the context length, i.e., it grows as $N^2$ with the context length $N$, set by the weakest signals to be extracted and amplified in the data (Musat et al., 2 Nov 2025). This mirrors the separation of timescales and staged, mode-specific learning phases also seen in linear models (Mainali et al., 17 Apr 2025).

The induction head precisely implements copy-and-retrieve logic, enabling the classical "induction" behavior in ICL: associating and retrieving the correct label for a repeated input in the prompt.
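The copy-and-retrieve behavior can be illustrated with a single softmax attention step. This is a functional caricature rather than the two-layer circuit itself: the function `induction_retrieve`, the sharpness parameter `beta`, and the use of orthonormal item embeddings are assumptions made for the sketch.

```python
import numpy as np

def induction_retrieve(items, labels, query, beta=50.0):
    """Attend to prompt items that match `query` and return the attention-weighted label."""
    scores = beta * (items @ query)     # item-matching scores, one per prompt position
    scores = scores - scores.max()      # subtract the max for numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum()
    return attn @ labels, attn

rng = np.random.default_rng(0)
# ASSUMPTION: item embeddings are orthonormal, so a repeated item matches only itself.
basis, _ = np.linalg.qr(rng.normal(size=(8, 8)))
items = basis[:6]                           # six prompt items in 8 dimensions
labels = np.eye(3)[[0, 2, 1, 1, 0, 2]]      # one-hot labels for three classes
query = items[2]                            # the query repeats the third prompt item

prediction, attn = induction_retrieve(items, labels, query)
print("attended position:", attn.argmax())
print("retrieved label  :", prediction.round(3))
```

A larger `beta` corresponds to a sharper item-matching head; with small `beta` the output blends the labels of all prompt items instead of retrieving one.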

4. Macroscopic Diagnostics: Spectral and Subspace Dynamics

Beyond microscopic weight trajectories, two macroscopic metrics provide interpretable signatures of emergent ICL:

Spectral Rank Dynamics (EffRank):

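$\mathrm{EffRank}(W) = \exp\left(-\sum_i \tilde{\sigma}_i \log \tilde{\sigma}_i\right), \qquad \tilde{\sigma}_i = \sigma_i / \sum_j \sigma_j$

computes a soft measure of the active dimensionality of a weight matrix $W$ with singular values $\sigma_i$ (e.g., Query-Key or Output-Value); the form shown is the standard entropy-based effective rank over the normalized singular values $\tilde{\sigma}_i$. During training, EffRank typically shows successive "dips" marking the sequential activation of data modes, with dips aligning to sudden performance improvements (the "cliff" in ICL learning curves).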
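A minimal sketch of how such a spectral diagnostic can be computed for a weight matrix follows. It assumes the standard entropy-based definition of effective rank (the exponential of the Shannon entropy of the normalized singular values); whether the cited work uses exactly this normalization is an assumption, and `effective_rank` is a name chosen here.

```python
import numpy as np

def effective_rank(W, eps=1e-12):
    """Entropy-based effective rank: exp of the entropy of normalized singular values."""
    sigma = np.linalg.svd(W, compute_uv=False)
    probs = sigma / (sigma.sum() + eps)
    probs = probs[probs > eps]              # drop numerically zero modes (0 * log 0 = 0)
    return float(np.exp(-(probs * np.log(probs)).sum()))

print(effective_rank(np.eye(4)))                          # four equally active modes
print(effective_rank(np.outer([1.0, 2.0, 3.0], [1.0, 0.0, 1.0])))  # one active mode
```

Logging this value for the Query-Key and Output-Value matrices at each checkpoint gives the kind of curve in which dips can be looked for.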

Subspace Stability (SubDist):

$\mathrm{SubDist}(t) = \lVert P_k(t) - P_k^{\ast} \rVert_F$

where $P_k(t)$ projects onto the top-$k$ singular subspace at time $t$ and $P_k^{\ast}$ onto the same subspace at convergence (written here with the Frobenius norm), measures when the principal subspace stabilizes, often occurring before mode amplitudes saturate.

These macroscopic measures reveal that subspace alignment and spectral mode activation underlie the abrupt functional transitions seen in networks trained on regression and modular arithmetic, providing an interpretable "training fingerprint" of capability emergence. For instance, in grokking phenomena (delayed generalization), subspace stability collapses only when test error sharply drops, not at the train loss minima (Mainali et al., 17 Apr 2025).
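The subspace diagnostic can be sketched in the same way. The code below assumes that the distance is the Frobenius norm of the difference between projectors onto the top-$k$ left singular subspaces; the choice of norm, the use of left singular vectors, and the function names are assumptions made for this illustration.

```python
import numpy as np

def top_k_projector(W, k):
    """Orthogonal projector onto the span of the top-k left singular vectors of W."""
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    U_k = U[:, :k]
    return U_k @ U_k.T

def subspace_distance(W_t, W_final, k):
    """Frobenius distance between the top-k projectors of a checkpoint and the final weights."""
    diff = top_k_projector(W_t, k) - top_k_projector(W_final, k)
    return float(np.linalg.norm(diff, ord="fro"))

W_final = np.diag([3.0, 2.0, 0.0, 0.0])
W_early = np.diag([0.0, 0.0, 3.0, 2.0])     # top-2 subspace orthogonal to the final one
print(subspace_distance(W_early, W_final, k=2))           # maximal for k = 2
print(subspace_distance(0.1 * W_final, W_final, k=2))     # same subspace, small amplitude
```

The second call shows why this quantity can stabilize before the mode amplitudes do: rescaling the weights changes the singular values but not the subspace.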

5. Extensions to Unstructured Data, Non-Linear and Vision Domains

ICL is robustly realized in structured prompt settings (with clear $(x_i, y_i)$ pairs), but in practice, transformers process unstructured or interleaved data. For shallow architectures, a single attention layer cannot reliably pair inputs and outputs; a two-layer transformer with masking (enforcing causal attention) can implement in-context linear regression, with performance further improved by adding positional encoding, which enables the network to align adjacent tokens and supports generalization in unstructured input regimes (Xing et al., 2024).
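The role of causal masking and positional encoding in pairing adjacent tokens can be illustrated with a hand-built attention pattern. The sketch below is not the construction from the cited work: the token layout, the one-hot positional encodings, the sharpness `beta`, and all function names are assumptions chosen to show how a first layer could attach each input to the label token that follows it.

```python
import numpy as np

def interleave(xs, ys):
    """Lay out pairs as separate tokens: [x_1, 0], [0, y_1], [x_2, 0], [0, y_2], ..."""
    n, d = xs.shape
    tokens = np.zeros((2 * n, d + 1))
    tokens[0::2, :d] = xs       # input tokens carry x in the first d slots
    tokens[1::2, d] = ys        # label tokens carry y in the last slot
    return tokens

def previous_token_attention(length, beta=30.0):
    """Causal softmax attention in which position i looks at position i - 1."""
    keys = np.eye(length)                   # one-hot positional encodings
    queries = np.eye(length, k=-1)          # row i is the encoding of position i - 1
    scores = beta * queries @ keys.T
    mask = np.tril(np.ones((length, length), dtype=bool))    # causal mask
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    return attn / attn.sum(axis=1, keepdims=True)

def pair_tokens(xs, ys):
    """Residual attention step; label positions end up holding (x_i, y_i) together."""
    tokens = interleave(xs, ys)
    attn = previous_token_attention(len(tokens))
    return tokens + attn @ tokens, attn

rng = np.random.default_rng(0)
xs, ys = rng.normal(size=(4, 3)), rng.normal(size=4)
paired, attn = pair_tokens(xs, ys)
print(np.allclose(paired[1::2, :3], xs), np.allclose(paired[1::2, 3], ys))
```

Only the label positions (odd indices) are meant to be read by a subsequent layer; input positions also receive the preceding label here, which a trained model would have to ignore or suppress.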

This analysis generalizes to vision transformers, where prompt examples are image-target pairs and the transformer, trained on randomly sampled function targets (linear, CNNs, ViTs), can match or outperform baseline methods in both linear and non-linear settings, provided the architecture is sufficiently deep and prompt design is calibrated to the input and output structure (Zhao et al., 27 May 2025).

6. Limitations, Technical Debt, and Theoretical Boundaries

ICL in transformers is not universally optimal. While Bayes-optimal rates can be approached in the few-shot regime, a technical debt appears as the context length increases: the excess risk of the model (relative to the Bayes estimator) plateaus and cannot be driven to zero by simply providing more in-context examples, even as sample complexity is increased (Joo et al., 7 Feb 2025). This phenomenon is inherent to the paradigm of "learning in activations" without weight updates, and scaling the model or context window does not fundamentally resolve it.

Moreover, the emergence of genuine ICL (as distinct from task memorization) depends critically on pretraining task diversity: to achieve generalization in a $d$-dimensional regression setting, both the context length $\ell$ and the number of unique tasks $k$ in pretraining must scale at least linearly in $d$, with the total number of pretraining examples $n$ needing to scale as $d^2$. Below this regime, the model merely memorizes or interpolates previously seen tasks (Lu et al., 2024). These sample and task complexity thresholds are not mitigated by scaling input data or model size in isolation.

7. Implications for Practical Model Development and Open Directions

Curriculum and Data Design: Prompt and pretraining curricula that maximize task diversity and mix task complexities expedite in-context generalization and mitigate learning plateaus. Low-complexity warm-ups, mixed-task batches, and auxiliary supervision on "weights components" of the hidden state all accelerate ICL skill acquisition and eliminate the plateaus seen under naive training (Fu et al., 2023).
Interpretable Monitoring: Spectral and subspace diagnostics (EffRank, SubDist) provide actionable monitoring of capability emergence, allowing for targeted behavioral evaluation and interpretability during critical training phases (Mainali et al., 17 Apr 2025).
Theoretical Generalization: Modern universal approximation results show that deep, sufficiently wide transformers can in-context emulate nonparametric function classes, with error rates matching minimax-optimal predictors in smooth function classes, provided that pretraining is sufficiently rich and lower layers have learned the appropriate basis representations (Kim et al., 2024, Li et al., 5 Jun 2025).
Intrinsic Barriers and Architectural Limits: Not all tasks are amenable to ICL (e.g., parity functions, unseen composition families); explicit positional encoding is necessary for universal approximation with finite vocabulary models (Ma et al., 9 Nov 2025), and model class constraints can inhibit certain forms of compositional generalization or logic reasoning unless the training data structurally exposes such operations (Wibisono et al., 2024).
Future Directions: Open research fronts include adaptive mechanisms for on-the-fly parameter updating during inference (hybrid ICL–meta-learning), curriculum discovery algorithms for unsupervised stabilization of "weights components," rigorous quantification of phase transitions in more complex architectures, and systematic study of the interplay between representation geometry and emergent learning plateaus.

References

Exact learning dynamics and macroscopic measures: (Mainali et al., 17 Apr 2025)
Generalization limits and task-centric framework: (Zhang et al., 19 Mar 2025)
Induction head emergence and phase transitions: (Musat et al., 2 Nov 2025)
Shallow transformer adaptation to unstructured prompts: (Xing et al., 2024)
Technical debt and efficiency loss in ICL: (Joo et al., 7 Feb 2025)
Pretraining task diversity and asymptotics: (Lu et al., 2024)
Spectral and subspace diagnostics: (Mainali et al., 17 Apr 2025)
Universal approximation and function class coverage: (Li et al., 5 Jun 2025, Kim et al., 2024)
Vocabulary models and positional encoding: (Ma et al., 9 Nov 2025)
Learning plateaus and regularization: (Fu et al., 2023)
Vision transformer ICL case study: (Zhao et al., 27 May 2025)