- The paper demonstrates that a Transformer with Θ(log n) parameters attains the minimax-optimal rate O(n^(−2α/(2α+d))) for nonparametric regression.
- It approximates local polynomial regression via kernel-weighted polynomial expansions and iterative gradient descent implemented through Transformer blocks.
- The approach drastically reduces sample complexity and parameter scaling compared to prior methods, enhancing efficiency in in-context learning.
Efficient and Minimax-Optimal In-Context Nonparametric Regression with Transformers
Introduction
This paper rigorously investigates the statistical efficiency and minimax optimality of Transformers for in-context learning (ICL) in nonparametric regression, focusing on regression functions with α-Hölder smoothness. The authors provide sharp theoretical guarantees, demonstrating that a Transformer with just Θ(log n) parameters and Ω(n^{2α/(2α+d)} log³ n) pretraining sequences can attain the minimax-optimal rate O(n^{−2α/(2α+d)}) for the mean squared error, where n is the number of in-context examples and d is the covariate dimension. These parameter and pretraining requirements are significantly less restrictive than prior art, which typically assumes polynomial growth in n for both quantities.
The central approach leverages the ability of Transformers to efficiently approximate local polynomial regression estimators via kernel-weighted polynomial expansions and performs iterative optimization using gradient descent implemented within Transformer blocks.
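The target estimator, local polynomial regression solved by gradient descent on a kernel-weighted least-squares objective, can be sketched in a few lines. This is an illustrative NumPy sketch, not the paper's Transformer construction; the Gaussian kernel, bandwidth, and step counts are arbitrary choices:

```python
import numpy as np

def local_poly_predict(X, y, x_query, h=0.3, degree=1, steps=200, lr=0.1):
    """Kernel-weighted polynomial regression at x_query, fit by gradient descent.

    Illustrative only: builds monomial features of the centered covariates,
    weights samples by a Gaussian kernel, and runs plain GD on the weighted
    least-squares objective.  X has shape (n, 1) here for simplicity.
    """
    z = X - x_query                                   # center at the query point
    w = np.exp(-np.sum(z**2, axis=1) / (2 * h**2))    # kernel weights
    # monomial basis 1, z, z^2, ..., z^degree
    Phi = np.concatenate([z**k for k in range(degree + 1)], axis=1)
    theta = np.zeros(Phi.shape[1])
    for _ in range(steps):
        # gradient of (1/n) * sum_i w_i (phi_i^T theta - y_i)^2
        grad = 2 * Phi.T @ (w * (Phi @ theta - y)) / len(y)
        theta -= lr * grad
    return theta[0]                                   # intercept = estimate of f(x_query)
```

The intercept of the locally fitted polynomial is the regression estimate at the query, which is exactly the quantity the paper's Transformer construction reproduces in-context.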
Problem Setting
The problem is formalized as in-context nonparametric regression, where the data is generated as follows: for each pretraining sequence, a regression function is drawn from a Hölder class H(d, α, M), and a sample of size n+1 with covariates and additive noise is produced. Given n in-context labeled examples, the model predicts the response for a query covariate. Both empirical and population risks are defined over these pretraining episodes, and the Transformer architecture is adapted accordingly.
The Transformer class considered consists of a stack of single-head linear attention layers followed by feedforward ReLU neural networks, taking as input the concatenated n-shot context examples and the query, with a straightforward positional marking in the embedding to distinguish the query.
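The input format described above might look as follows. This is a hypothetical layout for illustration; the paper's exact embedding dimensions and flag placement may differ:

```python
import numpy as np

def build_icl_prompt(X, y, x_query):
    """Embed n context pairs plus one query as a token matrix.

    Hypothetical layout: each token is [x_i, y_i, flag], where flag = 1
    marks the query token and the query's label slot is left at zero.
    """
    n, d = X.shape
    tokens = np.zeros((n + 1, d + 2))
    tokens[:n, :d] = X             # context covariates
    tokens[:n, d] = y              # context labels
    tokens[n, :d] = x_query        # query covariate; label slot stays 0
    tokens[n, d + 1] = 1.0         # positional flag distinguishing the query
    return tokens
```

The flag column is the "straightforward positional marking" from the text: a single indicator suffices, so no sinusoidal or learned positional encoding is needed.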
Theoretical Contributions
Comparison with Prior Work
Previous analyses (e.g., "Transformers are Minimax Optimal Nonparametric In-Context Learners" [kim2024transformers], "Understanding In-Context Learning on Structured Manifolds: Bridging Attention to Kernel Methods" [shen2025understanding]) proved minimax rates are achievable via Transformers but required either Ω(n) or Ω(n^{d/(2α+d)}) parameters, leading to stronger assumptions on the number of pretraining sequences.
This work establishes that minimax-optimal convergence in population risk can be achieved with
- Model size: Θ(log n) parameters, an exponential improvement over prior constructions
- Pretraining sequences: only Ω(n^{2α/(2α+d)} log³ n), i.e., minimax-order up to polylogarithmic factors, versus the polynomial-in-n requirements of prior work
- No low-dimensional-structure assumptions to sidestep the curse of dimensionality, and no restrictive parameter scaling
Approximation of Local Polynomial Regression
The main result demonstrates that Transformers approximate the truncated local polynomial estimator to within O(1/n) in population risk. The construction involves:
- Embedding context and query with minor augmentation, omitting complex positional encodings
- Using three Transformer blocks to center and scale inputs and compute kernel weights
- Constructing kernel-weighted polynomial basis and responses via ReLU FFN layers, which approximate polynomials exponentially fast in the number of layers
- Iteratively applying gradient steps (each implemented as a Transformer block) to solve the kernel-weighted least squares problem
- Achieving surrogate gradient descent convergence with exponentially fewer steps due to strong convexity in the local polynomial objective
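The last two bullets rest on a standard fact: on a strongly convex objective, gradient descent contracts the error by a fixed factor per step, so accuracy 1/n is reached in only O(log n) steps. A minimal sketch of this geometric decay, where the quadratic below is an arbitrary stand-in for the kernel-weighted least-squares problem (the paper realizes each step as a Transformer block):

```python
import numpy as np

# Strongly convex quadratic 0.5 * theta^T A theta - b^T theta
A = np.array([[2.0, 0.3], [0.3, 1.0]])   # positive definite Hessian
b = np.array([1.0, -1.0])
theta_star = np.linalg.solve(A, b)       # exact minimizer

theta = np.zeros(2)
lr = 0.4                                  # < 2 / lambda_max(A), so GD is stable
errors = []
for _ in range(50):
    theta = theta - lr * (A @ theta - b)  # one gradient step
    errors.append(np.linalg.norm(theta - theta_star))

# each step shrinks the error by roughly the same constant factor,
# so error <= 1/n is reached after O(log n) steps
ratios = [errors[t + 1] / errors[t] for t in range(10, 20)]
```

The per-step contraction factor is max over eigenvalues λ of |1 − lr·λ|, which is bounded away from 1 under strong convexity; this is what lets the construction get away with logarithmic depth.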
Consequently, minimax-optimal rates ~ n^{−2α/(2α+d)} are achieved for α-Hölder smooth regression with only logarithmic depth.
The architecture explicitly designs blocks such that the initial network computes the centered covariates and kernel weights, and sequential blocks expand these into kernel-weighted monomial features and responses. Subsequent blocks execute gradient descent for the weighted least-squares optimization—which is provably implementable via linear or ReLU attention—effectively learning the coefficients of the local polynomial estimator at test time.
This approach is applicable to a broad class of attention mechanisms, including softmax and ReLU, with mild modifications. The proofs invoke empirical process theory and precise covering number bounds to control approximation and statistical risks.
Statistical Guarantees and Sample Complexity
Theoretical results include:
- Minimax-optimal mean squared error for ICL using Transformers: for any α > 0 and regression functions in H(d, α, M), the constructed Transformer achieves error within C n^{−2α/(2α+d)}, with C depending only on distributional parameters and the architecture.
- Sample complexity of pretraining sequences: Γ ≳ n^{2α/(2α+d)} log³ n suffices, a notable reduction compared to other nonparametric deep learning approaches.
- Parameter efficiency: just Θ(log n) parameters are required for optimal adaptation, an exponential gap versus prior constructions and standard feedforward networks (which require Ω(n^{d/(2α+d)})).
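To make the parameter gap concrete, the stated scalings can be evaluated numerically. This is purely arithmetic illustration; the choice of n, α, d is arbitrary and constants are ignored:

```python
import math

def rates(n, alpha, d):
    """Evaluate the scalings stated in the summary (constants ignored)."""
    minimax_mse = n ** (-2 * alpha / (2 * alpha + d))   # O(n^{-2a/(2a+d)})
    params_transformer = math.log(n)                     # Theta(log n)
    params_ffn = n ** (d / (2 * alpha + d))              # Omega(n^{d/(2a+d)})
    return minimax_mse, params_transformer, params_ffn

# e.g. alpha = 1, d = 8, n = 10^4: log n is about 9.2,
# while n^{d/(2a+d)} = n^{0.8} is about 1585 parameters
mse, p_tf, p_ffn = rates(n=10_000, alpha=1.0, d=8)
```

The gap widens with n: the feedforward count grows polynomially while the Transformer count grows only logarithmically, which is the "exponential gap" referenced above.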
The results hold for empirical risk minimizers as well as any Transformer that achieves close-to-optimal training loss within the parameterized class, confirming robustness to practical training considerations.
Architecture and Proof Strategy
The authors provide a detailed constructive proof, showing how fundamental components (embedding, attention, FFN) can simulate every stage of the local polynomial regression pipeline. They analyze the approximation power, the effect of parameter quantization, and the associated empirical process covering numbers, ensuring statistical error decays at the minimax rate with mild requirements.
Key technical arguments involve:
- Exponential-depth-to-approximation tradeoffs leveraging FFN polynomial approximation theory [LuShenYangZhang2020]
- Direct implementation of gradient descent within the attention mechanism [bai2023transformers, von2023transformers]
- Strong convexity and empirical process control for global convergence and generalization bounds
Implications and Future Directions
This work provides a rigorous foundation for the sample and parameter efficiency of in-context learning with Transformers in nonparametric regression. It bridges classical local polynomial methods and modern sequence models, framing attention as meta-optimization over kernel expansions.
The demonstrated parameter efficiency and statistical optimality fundamentally inform the theoretical limits for ICL in tabular and function approximation tasks, indicating that, under sufficient smoothness, large-scale overparameterization is not necessary for optimal generalization in the in-context setting.
Future research directions suggested include:
- Extending the analysis to next-token prediction, online, or dependent settings—critical for understanding generalization in autoregressive LLMs for sequential tasks
- Exploiting or learning low-dimensional structures (e.g., sparsity or manifold assumptions) to further mitigate the curse of dimensionality and reduce sample complexity
- Studying optimization landscape and generalization under practical gradient-based training (as opposed to empirical risk minimizers)
- Designing efficient prompts and architectural inductive biases informed by the theory to improve adaptivity in real-world LLMs and tabular foundation models
Conclusion
This paper rigorously establishes that Transformers can serve as minimax-optimal, exponentially parameter-efficient nonparametric regression estimators in the in-context learning setting, with sharply reduced pretraining requirements compared to prior work. By structurally aligning the attention and FFN blocks with the steps of local polynomial regression, the results advance both the statistical and algorithmic understanding of deep sequence models as universal meta-learners for smooth functions.