
In-Context Learning for Nonparametric Regression

Updated 22 January 2026
  • In-context nonparametric regression is a paradigm where transformers act as meta-learners, inferring predictors from a few labeled examples without updating their parameters.
  • Transformer architectures simulate classical estimators—using linear attention and ReLU networks—to approximate local polynomial and kernel regression methods efficiently.
  • Theoretical guarantees show that with appropriate pretraining and dynamic prior adaptation, these models converge at minimax rates while reducing mean squared error.

In-context learning (ICL) for nonparametric regression refers to the ability of sequence models—most notably transformers—to perform regression on an unseen task by conditioning exclusively on a context window of labeled input–output pairs, without updating model parameters. In this paradigm, the model acts as a meta-learner: it is pretrained across a distribution of nonparametric regression tasks (function classes constrained only by smoothness) and, at test time, must infer an appropriate predictor from a handful of in-context examples. Modern theoretical work rigorously quantifies the statistical optimality and computational efficiency of ICL for nonparametric regression, demonstrating that properly architected and pretrained transformers achieve minimax-optimal rates of convergence, matching classical estimators even for highly complex function classes such as Besov, Hölder, and general Lipschitz spaces (Kim et al., 2024, Ching et al., 21 Jan 2026, Li et al., 28 Jul 2025).

1. Problem Formulation and Task Distribution

The canonical setup for in-context nonparametric regression considers a family of tasks drawn i.i.d. from a task meta-distribution. Each task is defined by a regression function $m$ (or $F_\beta$) sampled from a general nonparametric class (e.g., $\alpha$-Hölder balls $H(d,\alpha,M)$ or Besov spaces), and a context of $n$ i.i.d. labeled pairs $(x_i, y_i)$ with additive noise. The learner is given this context and a new query $x_{n+1}$ and is required to predict $y_{n+1}$. The minimax mean squared error for estimating $m$ at a new point, across all possible functions in the class, given $n$ context points, is known to be $n^{-2\alpha/(2\alpha+d)}$ for $\alpha$-smooth regression in $d$ dimensions (Kim et al., 2024, Ching et al., 21 Jan 2026).

Throughout pretraining, the transformer receives batched prompts containing $(x_1, y_1), \ldots, (x_n, y_n)$ and $x_{n+1}$ on tasks indexed by $\gamma = 1, \ldots, \Gamma$ and is trained via empirical risk minimization (MSE) to predict $y_{n+1}$. At test time, performance is assessed by the expectation

$$R(f) = \mathbb{E}\left[(Y_{n+1} - f(D_n, X_{n+1}))^2\right]$$

where $f$ is the in-context predictor, and the expectation is over the task-generating process and observation noise.
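As a concrete illustration, the task distribution and risk above can be sketched in a small simulation. The Fourier-series task sampler and the nearest-neighbour baseline predictor below are illustrative assumptions for the sketch, not the construction from the cited papers:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task(n=20, noise=0.1):
    """Sample one regression task: a random smooth function m,
    n noisy context pairs, and a held-out query point.
    (A random finite Fourier series stands in for drawing m
    from a smoothness ball.)"""
    freqs = np.arange(1, 6)
    coefs = rng.normal(size=5) / freqs**2          # coefficient decay => smoothness
    m = lambda x: np.sin(np.outer(x, freqs)) @ coefs
    x = rng.uniform(0, 1, size=n + 1)              # n context points + 1 query
    y = m(x) + noise * rng.normal(size=n + 1)
    return (x[:n], y[:n]), (x[n], y[n])

def empirical_risk(predictor, num_tasks=200):
    """Monte-Carlo estimate of R(f) = E[(Y_{n+1} - f(D_n, X_{n+1}))^2]."""
    errs = []
    for _ in range(num_tasks):
        (xc, yc), (xq, yq) = sample_task()
        errs.append((yq - predictor(xc, yc, xq)) ** 2)
    return float(np.mean(errs))

def knn_predict(xc, yc, xq, k=3):
    """Baseline in-context predictor: average of the 3 nearest context labels."""
    idx = np.argsort(np.abs(xc - xq))[:k]
    return yc[idx].mean()

print(empirical_risk(knn_predict))
```

Any in-context predictor—a pretrained transformer included—can be dropped in for `knn_predict` and scored under the same risk.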

2. Transformer Architecture and Expressivity

Transformers employed for nonparametric in-context regression use a sequence model endowed with one or a few linear attention heads stacked atop a deep neural feature extractor (typically a ReLU network), together with shallow feed-forward networks (FFNs) in each layer (Kim et al., 2024, Ching et al., 21 Jan 2026, Li et al., 28 Jul 2025). The sequence length is $n+1$: $n$ context tokens and one query token. Each token encodes both input features and, for context tokens, outputs; the query token contains only input data.

A canonical embedding maps each token $(x_i, y_i)$ (or $(x_{n+1}, 0)$ for the query) into a high-dimensional vector within the model’s state. No explicit positional encoding is strictly necessary, as indicator variables or learned position-specific fields suffice. Each transformer block consists of:

  • Linear attention: Implements a softmax-weighted or approximate ridge-like operation associating the query token with a function of the context tokens’ representations and values.
  • ReLU-FFN layers: Used for constructing nonlinear feature maps and enabling efficient approximation of high-degree polynomials.
  • Sparse and weight-sharing structure: To minimize parameter count, state-of-the-art constructions use only $O(\log n)$ layers and total parameters, leveraging repeated composition and the approximation properties of deep networks (Ching et al., 21 Jan 2026).

This design allows the transformer to efficiently simulate local polynomial regression solvers or kernel estimators within its forward pass, a crucial ingredient for achieving statistical optimality in nonparametric classes.
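A minimal sketch of this simulation idea: a single softmax-attention head whose keys are the context inputs and whose values are the context outputs reproduces the Nadaraya–Watson kernel regression estimator in one forward pass (the Gaussian-kernel scoring and bandwidth here are assumptions for the illustration):

```python
import numpy as np

def attention_kernel_regression(x_ctx, y_ctx, x_query, bandwidth=0.1):
    """One softmax-attention head: keys = context inputs, values = context
    outputs, query = test input. With scores -(x_i - x_q)^2 / (2 h^2) the
    attention output is exactly Nadaraya-Watson regression with a
    Gaussian kernel of bandwidth h."""
    scores = -((x_ctx - x_query) ** 2) / (2.0 * bandwidth ** 2)
    weights = np.exp(scores - scores.max())    # numerically stable softmax
    weights /= weights.sum()
    return float(weights @ y_ctx)              # kernel-weighted average of labels

# Usage: smooth 1-D regression from 50 context points.
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x)
pred = attention_kernel_regression(x, y, 0.25)
print(pred)  # close to sin(pi/2) = 1, shrunk slightly by kernel smoothing
```

The local-polynomial constructions in the cited papers refine exactly this mechanism with polynomial feature maps and a least-squares solve.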

3. Approximation of Nonparametric Estimators within Transformers

The transformer is explicitly constructed to approximate classical nonparametric estimators. For $\alpha$-smooth regression—in particular, local polynomial estimators of order $p = \lceil\alpha\rceil$—the model carries out the following steps internally (Ching et al., 21 Jan 2026):

  1. Centering/Scaling: Compute $(X_i - X_{n+1})/h$ for an appropriate bandwidth $h$.
  2. Kernel weighting: Implement kernel functions $K_h(X_i - X_{n+1})$ to localize the regression.
  3. Feature construction: Apply polynomial feature maps (monomials up to order $p$), efficiently approximated via ReLU networks, to encode $P_h(X_i - X_{n+1})$.
  4. Weighted least-squares solve: Employ linear attention and FFN modules to perform gradient steps toward the solution of the kernel-weighted local polynomial regression at the query point.

The approximation error in each step is controlled by the depth and width of the network, and all steps can be efficiently implemented in $O(\log n)$ layers with provably vanishing error (Ching et al., 21 Jan 2026). For Besov classes, the oracle construction encodes the basis functions $\{\psi_j\}$ as features, and attention implements a ridge estimator over these bases (Kim et al., 2024).
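The four steps above correspond to the classical estimator itself, which can be written directly (a 1-D sketch; the Epanechnikov kernel and the direct least-squares solve are choices made for the illustration, whereas the transformer approximates the solve via gradient steps):

```python
import numpy as np

def local_polynomial_predict(x_ctx, y_ctx, x_query, p=2, h=0.3):
    """1-D local polynomial regression of order p at x_query, spelling out
    the four steps the transformer construction emulates."""
    # 1. Centering/scaling
    u = (x_ctx - x_query) / h
    # 2. Kernel weighting (Epanechnikov kernel)
    w = np.maximum(0.0, 1.0 - u ** 2)
    # 3. Polynomial feature map: monomials 1, u, ..., u^p
    Phi = np.vander(u, N=p + 1, increasing=True)
    # 4. Kernel-weighted least squares; the intercept coefficient is the
    #    fitted value of the regression function at x_query.
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(Phi * sw[:, None], y_ctx * sw, rcond=None)
    return float(beta[0])

# A quadratic target lies in the order-2 model, so it is recovered exactly.
x = np.linspace(0.0, 1.0, 30)
pred = local_polynomial_predict(x, x ** 2, 0.5)
print(pred)  # 0.25, i.e. 0.5^2
```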

4. Theoretical Guarantees: Minimax Rates, Risk Decomposition, and Lower Bounds

The central theoretical results establish that transformers, as in-context learners, achieve the minimax rate of convergence for nonparametric regression, with rigorous upper and lower error bounds (Kim et al., 2024, Ching et al., 21 Jan 2026). Specifically,

  • Error decomposition: The total risk consists of (i) approximation error, (ii) in-context generalization gap (scarcity of in-context samples), and (iii) pretraining generalization gap (finite pretraining tasks):

$$\overline{R}(\hat{\Theta}) \lesssim N^{-2\alpha/d} \;(\text{approx.}) + \frac{N \log N}{n} \;(\text{in-context}) + \frac{N^2 \log N}{T} \;(\text{pretraining})$$

By properly selecting the number of features $N \asymp n^{d/(2\alpha+d)}$ and increasing the number of pretraining tasks $T \gg n^{(2\alpha+2d)/(2\alpha+d)}$, the overall MSE matches the minimax rate $n^{-2\alpha/(2\alpha+d)}$.
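The exponent arithmetic behind this balancing can be checked mechanically: with $N = n^{d/(2\alpha+d)}$, the approximation term $N^{-2\alpha/d}$ and the in-context term $N/n$ carry the same exponent of $n$, which is exactly the minimax exponent (log factors ignored in this sketch):

```python
from fractions import Fraction

def balanced_exponents(alpha, d):
    """Exponents of n for each term in the error decomposition when
    N = n^{d/(2*alpha+d)}, using exact rational arithmetic."""
    a, d = Fraction(alpha), Fraction(d)
    N_exp = d / (2 * a + d)              # N = n^{N_exp}
    approx_exp = -(2 * a / d) * N_exp    # exponent of N^{-2*alpha/d}
    incontext_exp = N_exp - 1            # exponent of N / n
    minimax_exp = -2 * a / (2 * a + d)   # target minimax exponent
    return approx_exp, incontext_exp, minimax_exp

# Example: alpha = 2, d = 1 gives -4/5 for all three exponents.
print(balanced_exponents(2, 1))
```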

  • Parameter and sample optimality: Transformers can reach this rate using only $O(\log n)$ parameters and $O(n^{2\alpha/(2\alpha+d)} \log^3 n)$ pretraining sequences, an exponential improvement in both regimes over earlier constructions that required polynomial parameter and sample complexity (Ching et al., 21 Jan 2026).
  • Minimax lower bounds: Via Fano-type information-theoretic arguments, it is shown that no meta-learning method—whether a transformer or any other architecture—can surpass the minimax rate given the available number of in-context and pretraining samples, for general smoothness classes (Kim et al., 2024, Ching et al., 21 Jan 2026).

For function classes beyond Hölder or Besov, such as $L$-Lipschitz or piecewise smooth functions, similar minimax rates and error decompositions hold, with the smoothness parameter $\alpha$ adapted accordingly (Kim et al., 2024, Li et al., 28 Jul 2025).

5. Training Dynamics and Attention Behavior in Nonlinear Function Classes

In the setting where underlying tasks are nonlinear $L$-Lipschitz functions, a one-layer transformer trained by gradient descent exhibits a two-phase attention dynamic during pretraining (Li et al., 28 Jul 2025):

  • Phase I (“feature separation”): Rapid growth in attention score between the query and the prompt tokens with highly relevant features. This phase is governed by the Lipschitz constant $L$ and the intrinsic feature separation $\Delta$.
  • Phase II (“steady alignment”): Progressive convergence of attention to the target feature; off-diagonal/non-relevant attention decays more slowly. Final error and convergence time depend on whether $L$ is below or above a threshold $L_0$ determined by the data geometry.

Explicit time bounds are provided:

  • For $L \leq L_0$, convergence is fast: $T = O(K \log(K/\varepsilon)/(L^2 \Delta^2 \delta^2))$.
  • For $L \geq L_0$, the convergence time includes additional dependence on $\varepsilon^{-1}$.

At convergence, the query prediction is controlled by attention to the correct support; the error is at most $O(\varepsilon)$ for any unseen $f \in F_L$. This demonstrates that transformers can interpolate unseen nonlinear functions via ICL, with rigorous tracking of attention dynamics and quantifiable adaptation time (Li et al., 28 Jul 2025).

6. Meta-In-Context Learning and Dynamic Prior Adaptation

For large pretrained LLMs such as GPT-3/4, meta-in-context learning further enhances in-context regression by reshaping the model’s implicit task priors through exposure to multiple related tasks in the prompt window (Coda-Forno et al., 2023). Empirically:

  • Hierarchical adaptation: When provided with sequences of regression tasks, the model’s predictions for new tasks reflect a shift in its prior over function parameters (e.g., slope and intercept in 1D linear regression), even without parameter updates. The effective prior for the next task is given by Bayesian integration over previous observed tasks.
  • Empirical gains: In-context predictors exhibit reduced mean squared error (by 20–30%) and improved calibration on real-world regression benchmarks after meta-in-context adaptation. Gains appear greatest when past tasks are similar to the current one.
  • Saturation: Improvements plateau after 2–3 tasks with 5 examples each; further tasks yield diminishing returns due to context window limitations.

This suggests that in-context learning in LLMs implements a nonparametric Bayesian regressor whose prior itself can be updated online by recursive in-context learning, without finetuning or explicit meta-learning code.
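The Bayesian-integration view of this prior shift can be made concrete with conjugate Bayesian linear regression over (intercept, slope): the posterior after each task serves as the prior for the next. This is a sketch of the mechanism, not the LLM's internal computation; the task parameters, noise level, and prior scale below are assumptions:

```python
import numpy as np

def posterior_update(mu, Sigma, X, y, noise_var=0.25):
    """Conjugate Gaussian update for w = (intercept, slope) in a
    linear-Gaussian model; the returned posterior acts as the prior
    for the next task, mimicking the hierarchical prior shift."""
    Phi = np.column_stack([np.ones(len(X)), X])        # design matrix [1, x]
    Sigma_inv = np.linalg.inv(Sigma)
    Sigma_post = np.linalg.inv(Sigma_inv + Phi.T @ Phi / noise_var)
    mu_post = Sigma_post @ (Sigma_inv @ mu + Phi.T @ y / noise_var)
    return mu_post, Sigma_post

rng = np.random.default_rng(1)
mu, Sigma = np.zeros(2), 10.0 * np.eye(2)              # broad initial prior
for _ in range(3):                                     # three related tasks
    X = rng.uniform(-1, 1, 5)
    y = 1.0 + 2.0 * X + 0.5 * rng.normal(size=5)       # true intercept 1, slope 2
    mu, Sigma = posterior_update(mu, Sigma, X, y)
print(mu)  # prior mean has drifted toward (1, 2) across tasks
```

Each loop iteration plays the role of one in-context task; after a few tasks the prior concentrates near the shared task parameters, matching the saturation behaviour reported above.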

7. Extensions: High-Dimensional, Sequential, and Anisotropic Settings

The ICL paradigm and its minimax guarantees extend beyond conventional Euclidean domains:

  • Anisotropic smoothness: For regression functions with heterogeneous smoothness $\{\alpha_i\}$ across dimensions, the minimax rate is governed by the harmonic mean $\tilde{\alpha} = (\sum_i 1/\alpha_i)^{-1}$, with transformers achieving $n^{-2\tilde{\alpha}/(2\tilde{\alpha}+1)}$ up to logarithmic factors (Kim et al., 2024).
  • Piecewise and mixed-smooth functions: For sequential or infinite-dimensional data, minimax rates are achieved relative to the “maximal” or harmonic smoothness parameter, with only mild adaptation required in model construction.
  • Task diversity: The number of pretraining tasks required to close the pretraining gap scales with the square of the effective feature dimension. Diversity in pretraining is identified as a critical ingredient for strong ICL capabilities in both synthetic and real-world settings (Kim et al., 2024, Ching et al., 21 Jan 2026).
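The anisotropic rate above reduces to the familiar isotropic one when all coordinates share the same smoothness, which a few lines of arithmetic confirm (a sketch; function name is illustrative):

```python
def anisotropic_rate_exponent(alphas):
    """Exponent of n in the anisotropic minimax rate n^{-2a/(2a+1)},
    where a = (sum_i 1/alpha_i)^{-1} is the harmonic-type mean of the
    per-coordinate smoothness values."""
    a_tilde = 1.0 / sum(1.0 / a for a in alphas)
    return -2.0 * a_tilde / (2.0 * a_tilde + 1.0)

# Isotropic sanity check: d copies of alpha give the usual -2a/(2a+d).
# For alpha = 2, d = 2: -2*2/(2*2+2) = -2/3.
print(anisotropic_rate_exponent([2.0, 2.0]))
```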

These advances collectively establish that simple, scalable transformer architectures, when appropriately pretrained and architected, act as minimax-optimal nonparametric in-context learners—efficiently embedding classical regression solvers within their forward passes and dynamically adapting to unseen tasks (Kim et al., 2024, Ching et al., 21 Jan 2026, Li et al., 28 Jul 2025, Coda-Forno et al., 2023).
