In-Context Learning for Nonparametric Regression
- In-context nonparametric regression is a paradigm where transformers act as meta-learners, inferring predictors from a few labeled examples without updating their parameters.
- Transformer architectures simulate classical estimators—using linear attention and ReLU networks—to approximate local polynomial and kernel regression methods efficiently.
- Theoretical guarantees show that with appropriate pretraining and dynamic prior adaptation, these models achieve minimax rate convergence while reducing mean squared error.
In-context learning (ICL) for nonparametric regression refers to the ability of sequence models—most notably transformers—to perform regression on an unseen task by conditioning exclusively on a context window of labeled input–output pairs, without updating model parameters. In this paradigm, the model acts as a meta-learner: it is pretrained across a distribution of nonparametric regression tasks (function classes of bounded complexity/smoothness) and, at test time, must infer an appropriate predictor from a handful of in-context examples. Modern theoretical work rigorously quantifies the statistical optimality and computational efficiency of ICL for nonparametric regression, demonstrating that properly architected and pretrained transformers achieve minimax rates of convergence that match or improve upon classical estimators, even for highly complex function classes such as Besov, Hölder, and general Lipschitz spaces (Kim et al., 2024, Ching et al., 21 Jan 2026, Li et al., 28 Jul 2025).
1. Problem Formulation and Task Distribution
The canonical setup for in-context nonparametric regression considers a family of tasks drawn i.i.d. from a task meta-distribution. Each task is defined by a regression function $f$ sampled from a general nonparametric class (e.g., $\beta$-Hölder balls or Besov spaces), and a context of $n$ i.i.d. labeled pairs $(x_i, y_i)$ with additive noise, $y_i = f(x_i) + \varepsilon_i$. The learner is given this context and a new query $x_{\mathrm{query}}$ and is required to predict $f(x_{\mathrm{query}})$. The minimax mean squared error for estimating $f$ at a new point, across all possible functions in the class, given $n$ context points, is known to be of order $n^{-2\beta/(2\beta+d)}$ for $\beta$-smooth regression in $d$ dimensions (Kim et al., 2024, Ching et al., 21 Jan 2026).
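This task-sampling setup can be sketched in a few lines. A minimal NumPy illustration, with a random Fourier series standing in for a draw from a smoothness ball (the coefficient decay rate `smooth_decay` plays the role of $\beta$; all names and constants here are illustrative, not from the cited papers):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task(d=1, n_freq=8, smooth_decay=2.0):
    """Sample a random smooth function f as a stand-in for a draw from a
    Holder/Besov ball: a random Fourier series whose coefficients decay
    with frequency, so larger smooth_decay means a smoother function."""
    freqs = rng.integers(1, 6, size=(n_freq, d))
    coefs = rng.normal(size=n_freq) / (1 + np.linalg.norm(freqs, axis=1)) ** smooth_decay
    phases = rng.uniform(0, 2 * np.pi, size=n_freq)
    def f(x):
        # x has shape (n, d); returns shape (n,)
        return (coefs * np.cos(2 * np.pi * x @ freqs.T + phases)).sum(axis=-1)
    return f

def sample_context(f, n, d=1, noise=0.1):
    """n i.i.d. labeled pairs (x_i, y_i) with y_i = f(x_i) + eps_i."""
    x = rng.uniform(size=(n, d))
    y = f(x) + noise * rng.normal(size=n)
    return x, y

f = sample_task()
x_ctx, y_ctx = sample_context(f, n=32)
x_query = rng.uniform(size=(1, 1))
print(x_ctx.shape, y_ctx.shape, f(x_query))
```

At pretraining time, many such tasks are sampled independently; at test time, a single fresh `(x_ctx, y_ctx, x_query)` triple forms the prompt.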
Throughout pretraining, the transformer receives batched prompts containing the context pairs $(x_i, y_i)_{i=1}^{n}$ and the query $x_{\mathrm{query}}$ on tasks indexed by $t = 1, \dots, T$, and is trained via empirical risk minimization (MSE) to predict $f_t(x_{\mathrm{query}})$. At test time, performance is assessed by the expectation
$$\mathbb{E}\big[\big(\hat{f}(x_{\mathrm{query}}) - f(x_{\mathrm{query}})\big)^2\big],$$
where $\hat{f}$ is the in-context predictor, and the expectation is over the task-generating process and training noise.
2. Transformer Architecture and Expressivity
Transformers employed for nonparametric in-context regression use a sequence model endowed with either a single or few linear attention heads stacked atop a deep neural feature extractor (typically a ReLU network), together with shallow feed-forward networks (FFNs) in each layer (Kim et al., 2024, Ching et al., 21 Jan 2026, Li et al., 28 Jul 2025). The sequence length is $n+1$: $n$ context tokens and $1$ query token. Each token encodes both input features and, for context tokens, outputs; the query token contains only input data.
A canonical embedding maps each context token $(x_i, y_i)$ (or the bare input $x_{\mathrm{query}}$ for the query) into a high-dimensional vector within the model’s state. No explicit positional encoding is strictly necessary, as indicator variables or learned position-specific fields suffice. Each transformer block consists of:
- Linear attention: Implements a softmax-weighted or approximate ridge-like operation associating the query token with a function of the context tokens’ representations and values.
- ReLU-FFN layers: Used for constructing nonlinear feature maps and enabling efficient approximation of high-degree polynomials.
- Sparse and weight-sharing structure: To minimize parameter count, state-of-the-art constructions use only a small number of layers and total parameters, leveraging repeated composition and the approximation properties of deep networks (Ching et al., 21 Jan 2026).
This design allows the transformer to efficiently simulate local polynomial regression solvers or kernel estimators within its forward pass, a crucial ingredient for achieving statistical optimality in nonparametric classes.
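As a concrete sketch of the attention-as-kernel-smoother view (a toy NumPy illustration, not the cited construction): a single softmax head whose logits are negative scaled squared distances between query and context inputs, with the labels as values, reproduces Nadaraya–Watson kernel regression with a Gaussian kernel.

```python
import numpy as np

def attention_kernel_smoother(x_ctx, y_ctx, x_query, bandwidth=0.05):
    """One softmax attention head acting as a Nadaraya-Watson estimator:
    keys are the context inputs, values are the labels, and the softmax
    over -||x_q - x_i||^2 / (2 h^2) is exactly a Gaussian kernel weight."""
    # squared distances between each query and each context input
    d2 = ((x_query[:, None, :] - x_ctx[None, :, :]) ** 2).sum(-1)
    logits = -d2 / (2 * bandwidth ** 2)
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)   # softmax attention weights
    return w @ y_ctx                    # attention output = prediction

# toy check: recovering a smooth function from a noisy context
rng = np.random.default_rng(1)
x = rng.uniform(size=(200, 1))
y = np.sin(2 * np.pi * x[:, 0]) + 0.05 * rng.normal(size=200)
xq = np.array([[0.25]])
# close to sin(pi/2) = 1, up to smoothing bias
print(attention_kernel_smoother(x, y, xq, bandwidth=0.05))
```

The bandwidth plays the role of the attention temperature; the constructions in the cited papers go further by implementing local polynomial corrections rather than plain kernel averaging.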
3. Approximation of Nonparametric Estimators within Transformers
The transformer is explicitly constructed to approximate classical nonparametric estimators. For $\beta$-smooth regression—in particular, local polynomial estimators of order $\lfloor \beta \rfloor$—the model carries out the following steps internally (Ching et al., 21 Jan 2026):
- Centering/Scaling: Compute $(x_i - x_{\mathrm{query}})/h$ for an appropriate bandwidth $h$.
- Kernel weighting: Implement kernel functions to localize the regression.
- Feature construction: Apply polynomial feature maps (monomials up to order $\lfloor \beta \rfloor$), efficiently approximated via ReLU networks, to encode the local polynomial basis.
- Weighted least-squares solve: Employ linear attention and FFN modules to perform gradient steps toward the solution of the kernel-weighted local polynomial regression at the query point.
The approximation error in each step is controlled by the depth and width of the network, and all steps can be implemented efficiently within the transformer's layers with provably vanishing error (Ching et al., 21 Jan 2026). For Besov classes, the oracle construction encodes the basis functions as features, with attention implementing a ridge estimator on these bases (Kim et al., 2024).
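The four steps above can be mirrored directly by the classical estimator the transformer is built to approximate. A hedged 1-D NumPy sketch (Gaussian kernel, local linear fit; the function, bandwidth, and noise level are illustrative choices):

```python
import numpy as np

def local_poly_predict(x_ctx, y_ctx, xq, h=0.1, order=1):
    """Local polynomial regression at a single query point xq (1-D inputs),
    mirroring the steps in the text: centering/scaling, kernel weighting,
    polynomial feature construction, and a weighted least-squares solve."""
    u = (x_ctx - xq) / h                                # centering/scaling
    w = np.exp(-0.5 * u ** 2)                           # Gaussian kernel weights
    Phi = np.vander(u, N=order + 1, increasing=True)    # monomials 1, u, ..., u^order
    W = np.diag(w)
    # weighted least squares: beta = (Phi' W Phi)^{-1} Phi' W y
    beta = np.linalg.solve(Phi.T @ W @ Phi, Phi.T @ W @ y_ctx)
    return beta[0]                                      # fitted value at u = 0

rng = np.random.default_rng(2)
x = rng.uniform(size=300)
y = np.sin(2 * np.pi * x) + 0.05 * rng.normal(size=300)
print(local_poly_predict(x, y, 0.25, h=0.08, order=1))  # near sin(pi/2) = 1
```

In the transformer constructions, the least-squares solve is replaced by a fixed number of gradient steps carried out by attention and FFN modules, rather than an exact matrix inversion.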
4. Theoretical Guarantees: Minimax Rates, Risk Decomposition, and Lower Bounds
The central theoretical results establish that transformers, as in-context learners, achieve the minimax rate of convergence for nonparametric regression, with rigorous upper and lower error bounds (Kim et al., 2024, Ching et al., 21 Jan 2026). Specifically,
- Error decomposition: The total risk decomposes into (i) approximation error, (ii) the in-context generalization gap (scarcity of in-context samples), and (iii) the pretraining generalization gap (finitely many pretraining tasks), schematically $\mathrm{MSE} \lesssim \varepsilon_{\mathrm{approx}} + \varepsilon_{\mathrm{context}} + \varepsilon_{\mathrm{pretrain}}$.
By properly selecting the number of features and increasing the number of pretraining tasks $T$, the overall MSE matches the minimax rate $n^{-2\beta/(2\beta+d)}$.
- Parameter and sample-optimality: Transformers can reach this rate with a parameter count and number of pretraining sequences that are exponentially smaller than in earlier constructions, which required polynomial parameter and sample complexity (Ching et al., 21 Jan 2026).
- Minimax lower bounds: Via Fano-type information-theoretic arguments, it is shown that no meta-learning method—whether a transformer or any other architecture—can surpass the minimax rate given the available number of in-context and pretraining samples, for general smoothness classes (Kim et al., 2024, Ching et al., 21 Jan 2026).
For function classes beyond Hölder or Besov, such as L-Lipschitz or piecewise smooth functions, similar minimax rates and error decompositions hold, with the smoothness parameter adapted accordingly (Kim et al., 2024, Li et al., 28 Jul 2025).
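The minimax rate itself arises from the classical bias–variance balance for a bandwidth-$h$ local estimator. A standard calculation (not specific to the cited constructions): the squared bias of a $\beta$-smooth local fit scales as $h^{2\beta}$, while the variance scales inversely with the $\sim n h^{d}$ effective samples in the local window, and equating the two gives the optimal bandwidth and rate:

```latex
\mathrm{MSE}(h) \;\lesssim\; \underbrace{h^{2\beta}}_{\text{bias}^2}
\;+\; \underbrace{\frac{1}{n h^{d}}}_{\text{variance}},
\qquad
h^{*} \asymp n^{-1/(2\beta+d)}
\;\Longrightarrow\;
\mathrm{MSE}(h^{*}) \asymp n^{-2\beta/(2\beta+d)}.
```

The lower-bound arguments show that no estimator, in-context or classical, can beat this balance over the full smoothness class.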
5. Training Dynamics and Attention Behavior in Nonlinear Function Classes
In the setting where underlying tasks are nonlinear $L$-Lipschitz functions, a one-layer transformer trained by gradient descent exhibits a two-phase attention dynamic during pretraining (Li et al., 28 Jul 2025):
- Phase I ("feature separation"): Rapid growth of the attention score between the query and the prompt tokens carrying the most relevant features. This phase is governed by the Lipschitz constant $L$ and the intrinsic feature separation of the data.
- Phase II ("steady alignment"): Progressive convergence of attention to the target feature; off-diagonal/non-relevant attention decays more slowly. Final error and convergence time depend on whether $L$ is below or above a threshold determined by the data geometry.
Explicit time bounds are provided:
- When $L$ lies below the threshold, convergence is fast.
- When $L$ exceeds the threshold, the convergence time incurs additional dependence on $L$ and the feature geometry.
At convergence, the query prediction is controlled by attention to the correct support, and the resulting prediction error is explicitly bounded for any unseen task in the class. This demonstrates that transformers can interpolate unseen nonlinear functions via ICL, with rigorous attention dynamics tracking and quantifiable adaptation time (Li et al., 28 Jul 2025).
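A caricature of the converged attention map (toy NumPy code, not the analyzed model): with the softmax concentrated on the context tokens nearest the query, the prediction inherits their labels, so the error on an unseen task scales linearly with the Lipschitz constant—labels (and hence errors) scale with $L$ while the attention weights do not.

```python
import numpy as np

def attn_predict(x_ctx, y_ctx, xq, temp=50.0):
    """Converged-attention caricature: softmax concentrated on the
    context tokens nearest the query (1-D inputs)."""
    logits = -temp * np.abs(x_ctx - xq)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ y_ctx

errs = {}
for L in (1.0, 5.0):
    f = lambda x: L * np.abs(x - 0.5)       # an L-Lipschitz target
    x = np.linspace(0, 1, 21)               # context grid
    y = f(x)
    xq = 0.37                               # unseen query off the grid
    errs[L] = abs(attn_predict(x, y, xq) - f(xq))
print(errs)  # error grows proportionally with L
```

Because the weights are independent of $L$ here, the two errors differ by exactly the ratio of Lipschitz constants, illustrating the linear-in-$L$ error bound.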
6. Meta-In-Context Learning and Dynamic Prior Adaptation
For large pretrained LLMs such as GPT-3/4, meta-in-context learning further enhances in-context regression by reshaping the model’s implicit task priors through exposure to multiple related tasks in the prompt window (Coda-Forno et al., 2023). Empirically:
- Hierarchical adaptation: When provided with sequences of regression tasks, the model’s predictions for new tasks reflect a shift in its prior over function parameters (e.g., slope and intercept in 1D linear regression), even without parameter updates. The effective prior for the next task is given by Bayesian integration over previously observed tasks.
- Empirical gains: In-context predictors exhibit reduced mean squared error (on the order of $20\%$ or more) and improved calibration on real-world regression benchmarks after meta-in-context adaptation. Gains appear greatest when past tasks are similar to the current one.
- Saturation: Improvements plateau after 2–3 tasks with 5 examples each; further tasks yield diminishing returns due to context window limitations.
This suggests that in-context learning in LLMs implements a nonparametric Bayesian regressor whose prior itself can be updated online by recursive in-context learning, without finetuning or explicit meta-learning machinery.
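The Bayesian-integration view can be made concrete with a conjugate linear-Gaussian model (an illustrative stand-in for the LLM's implicit computation; the prior scale, noise level, and task drift below are arbitrary choices): each task's posterior over (slope, intercept) becomes the prior for the next task, so the prior drifts toward the shared task structure.

```python
import numpy as np

def posterior(mu0, S0, X, y, noise_var=0.25):
    """Conjugate Bayesian update for y = X w + eps: returns the posterior
    mean and covariance over w = (slope, intercept) given prior N(mu0, S0)."""
    S0inv = np.linalg.inv(S0)
    Sn = np.linalg.inv(S0inv + X.T @ X / noise_var)
    mun = Sn @ (S0inv @ mu0 + X.T @ y / noise_var)
    return mun, Sn

rng = np.random.default_rng(3)
mu, S = np.zeros(2), 10.0 * np.eye(2)            # broad initial prior
for _ in range(3):                                # three related tasks in sequence
    w_true = np.array([2.0, -1.0]) + 0.1 * rng.normal(size=2)
    x = rng.uniform(-1, 1, size=5)
    X = np.column_stack([x, np.ones_like(x)])     # design: [slope, intercept]
    y = X @ w_true + 0.5 * rng.normal(size=5)
    mu, S = posterior(mu, S, X, y)                # posterior becomes next prior
print(mu)  # prior mean has drifted toward the shared (slope, intercept)
```

The saturation effect in the text corresponds to the posterior covariance shrinking: after a few tasks, additional tasks move the prior only slightly.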
7. Extensions: High-Dimensional, Sequential, and Anisotropic Settings
The ICL paradigm and its minimax guarantees extend beyond conventional Euclidean domains:
- Anisotropic smoothness: For regression functions with heterogeneous smoothness $\beta_1, \dots, \beta_d$ across dimensions, the minimax rate is governed by the harmonic mean $\bar{\beta} = d\big(\sum_{j=1}^{d} 1/\beta_j\big)^{-1}$, with transformers achieving $n^{-2\bar{\beta}/(2\bar{\beta}+d)}$ up to logarithmic factors (Kim et al., 2024).
- Piecewise and mixed-smooth functions: For sequential or infinite-dimensional data, minimax rates are achieved relative to the “maximal” or harmonic smoothness parameter, with only mild adaptation required in model construction.
- Task diversity: The number of pretraining tasks required to close the pretraining gap scales with the square of the effective feature dimension. Diversity in pretraining is identified as a critical ingredient for strong ICL capabilities in both synthetic and real-world settings (Kim et al., 2024, Ching et al., 21 Jan 2026).
These advances collectively establish that simple, scalable transformer architectures, when appropriately pretrained and architected, act as minimax-optimal nonparametric in-context learners—efficiently embedding classical regression solvers within their forward passes and dynamically adapting to unseen tasks (Kim et al., 2024, Ching et al., 21 Jan 2026, Li et al., 28 Jul 2025, Coda-Forno et al., 2023).