Understanding In-Context Learning for Nonlinear Regression with Transformers: Attention as Featurizer
Published 6 May 2026 in cs.LG and math.NA | (2605.05176v1)
Abstract: Pre-trained transformers are able to learn from examples provided as part of the prompt without any weight updates, a remarkable ability known as in-context learning (ICL). Despite its demonstrated efficacy across various domains, the theoretical understanding of ICL is still developing. Whereas most existing theory has focused on linear models, we study ICL in the nonlinear regression setting. Through the interaction mechanism in attention, we explicitly construct transformer networks to realize nonlinear features, such as polynomial or spline bases, which span a wide class of functions. Based on this construction, we establish a framework to analyze end-to-end in-context nonlinear regression with the constructed features. Our theory provides finite-sample generalization error bounds in terms of context length and training set size. We numerically validate the theory on synthetic regression tasks.
The paper demonstrates that transformer attention can explicitly construct nonlinear feature maps (polynomial and spline) without approximation error.
The paper provides finite-sample generalization error bounds that decompose bias and variance, linking performance to context length and sample size.
The paper validates through empirical tests that shallow, wide transformers achieve competitive regression performance with explicit featurization.
Explicit Featurization in Transformer In-Context Nonlinear Regression
Problem Setting and Prior Frameworks
The paper "Understanding In-Context Learning for Nonlinear Regression with Transformers: Attention as Featurizer" (2605.05176) investigates the theoretical and empirical properties of in-context learning (ICL) with transformers in nonlinear regression, a setting that earlier theory, largely restricted to linear models, did not cover. ICL refers to a pre-trained transformer's capacity to learn a task from examples supplied in the prompt at inference time, without further parameter updates. While linear in-context regression has been rigorously analyzed via implicit optimization and kernel methods, the extension to nonlinear regression raises two questions: how transformers construct expressive feature spaces for regression, and what generalization guarantees apply.
Architectural Construction: Attention as Explicit Featurizer
Unlike prior modular approaches that embed separate FFN-based feature extractors, this work demonstrates that transformer attention itself, through its interaction mechanism and ReLU activation, can explicitly construct nonlinear feature maps (namely, polynomial and spline bases) without approximation error. The authors develop shallow, wide transformers (depth $O(\log d)$, width $O(dn)$) in which early blocks build the feature matrix in parallel via multi-head attention, implementing arithmetic operations through the Interaction Lemma. The prompt is mapped to a Vandermonde or B-spline feature matrix; a final linear attention layer then solves an ordinary least squares (OLS) regression in context, yielding predictions for queries conditioned solely on the provided context.
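To make the logarithmic depth plausible, the sketch below builds the monomials x, x^2, ..., x^d by repeated doubling, so that each round multiplies pairs of already-available powers in parallel. It is an illustrative assumption, not the paper's Interaction Lemma, whose exact statement is not reproduced here. The helper names `relu_product` and `monomials_by_doubling` are hypothetical, and the product is recovered exactly from two ReLU branches using the identity relu(z) - relu(-z) = z.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_product(a, b):
    # Hypothetical stand-in for one attention interaction: the bilinear
    # query-key score a*b is passed through two ReLU branches, and
    # relu(z) - relu(-z) == z recovers the product with no error.
    z = a * b
    return relu(z) - relu(-z)

def monomials_by_doubling(x, d):
    """Return (features, rounds): features[:, k-1] = x**k for k = 1..d."""
    powers = {1: np.asarray(x, dtype=float)}
    top, rounds = 1, 0
    while top < d:
        # One round: every new power k in (top, 2*top] is a product of two
        # powers that already exist, so all of them can be formed in parallel.
        new_top = min(2 * top, d)
        new = {k: relu_product(powers[top], powers[k - top])
               for k in range(top + 1, new_top + 1)}
        powers.update(new)
        top = new_top
        rounds += 1
    features = np.stack([powers[k] for k in range(1, d + 1)], axis=-1)
    return features, rounds

x = np.linspace(-1.0, 1.0, 7)
features, rounds = monomials_by_doubling(x, d=5)
```

Under this scheme the number of sequential rounds is ceil(log2 d), while the number of products formed per round grows with d, which mirrors the shallow-but-wide trade-off described above.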
This architecture is universal for function classes approximable by polynomials or splines, and the feature construction does not incur error, a departure from deep FFN approximations whose error scales with depth and accuracy targets.
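The computation that the constructed network is described as realizing can be sketched directly in NumPy: map the context inputs to a Vandermonde feature matrix, solve OLS on the context, and evaluate the fitted coefficients at the query. This is a minimal reference sketch of the target computation, not the transformer itself; the function names and the example cubic are assumptions chosen for illustration.

```python
import numpy as np

def vandermonde_features(x, degree):
    # Columns are 1, x, x^2, ..., x^degree.
    return np.vander(np.asarray(x, dtype=float), N=degree + 1, increasing=True)

def in_context_ols_predict(x_ctx, y_ctx, x_query, degree):
    # Fit OLS on the context pairs only, then predict at the query points.
    Phi = vandermonde_features(x_ctx, degree)
    coef, *_ = np.linalg.lstsq(Phi, np.asarray(y_ctx, dtype=float), rcond=None)
    return vandermonde_features(x_query, degree) @ coef

rng = np.random.default_rng(0)
x_ctx = rng.uniform(-1.0, 1.0, size=20)
true_coef = np.array([0.5, -1.0, 2.0, 0.3])          # hypothetical cubic
y_ctx = vandermonde_features(x_ctx, 3) @ true_coef   # noiseless labels
x_query = np.array([-0.4, 0.0, 0.7])
y_pred = in_context_ols_predict(x_ctx, y_ctx, x_query, degree=3)
y_true = vandermonde_features(x_query, 3) @ true_coef
```

When the target lies in the span of the features and the labels are noiseless, the prediction matches the target up to floating-point error, which is the sense in which exact featurization leaves only the regression step to analyze.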
Generalization Error Bounds and Scaling Laws
The central theoretical results are finite-sample generalization error bounds for transformers in nonlinear in-context regression. For function classes F with polynomial or spline approximability, and contexts of length n with training sample size L, the generalization error admits a bias-variance decomposition:
Approximation error: O(nlogd) for polynomials and O(nlogm) for linear splines, reflecting the rate of feature basis approximation of the function class.
Statistical error: O(Lnetwork complexity), which depends polynomially on context length and feature basis—specifically, the number of attention heads required scales with n, a distinction from linear ICL where a single head suffices for arbitrary context lengths.
The theoretical bounds are corroborated by empirical results: test loss (MSE) decays with context length at a rate proportional to $n^{-1}$, in agreement with the upper bounds. In some cases the empirical convergence is faster than the bounds predict, which remains an open question for future refinement.
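The dependence on context length can be probed with a small simulation. The sketch below uses plain OLS on polynomial features as a stand-in for the in-context solver (it does not train a transformer) and estimates the log-log slope of excess risk against n. The degree, noise level, trial count, and context lengths are arbitrary assumptions.

```python
import numpy as np

def excess_risk(n, degree, noise_std, trials, rng):
    # Average squared error of the OLS fit against the true function,
    # evaluated on a fixed grid of query points.
    grid = np.linspace(-1.0, 1.0, 64)
    Phi_grid = np.vander(grid, N=degree + 1, increasing=True)
    total = 0.0
    for _ in range(trials):
        coef = rng.normal(size=degree + 1)            # random target polynomial
        x = rng.uniform(-1.0, 1.0, size=n)            # context inputs
        Phi = np.vander(x, N=degree + 1, increasing=True)
        y = Phi @ coef + noise_std * rng.normal(size=n)
        coef_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        total += np.mean((Phi_grid @ (coef_hat - coef)) ** 2)
    return total / trials

rng = np.random.default_rng(0)
context_lengths = np.array([32, 64, 128, 256, 512])
risks = np.array([excess_risk(n, degree=3, noise_std=0.5, trials=300, rng=rng)
                  for n in context_lengths])
# Slope of log(risk) against log(n); a value near -1 indicates 1/n decay.
slope, intercept = np.polyfit(np.log(context_lengths), np.log(risks), 1)
print(f"estimated log-log slope: {slope:.2f}")
```

For well-specified OLS with a fixed number of features, classical theory predicts excess risk of order 1/n, so a slope close to -1 is the expected outcome of this check.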
Numerical Experiments and Ablations
Synthetic experiments on polynomial and spline regression tasks assess architecture choices. Attention-only transformers (without FFN) retain substantial regression capacity; ReLU and softmax attention perform comparably, while purely linear attention networks underperform on nonlinear tasks. Depth and width trade-offs are explored, confirming that parallel feature construction enables shallow designs given sufficient heads, although narrower, deeper variants also succeed. The scaling of attention heads with context length is needed mainly for the theoretical guarantees; in practice, fewer heads suffice.
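For reference, the three attention variants compared in the ablation differ only in how the query-key scores are turned into weights. The single-head sketch below is a generic formulation; the 1/n normalization for the ReLU and linear variants and the 1/sqrt(d) scaling for softmax are common conventions assumed here, not details taken from the paper.

```python
import numpy as np

def attention(Q, K, V, kind="softmax"):
    """Single-head attention with a selectable score nonlinearity."""
    scores = Q @ K.T                      # (num_queries, num_keys)
    n = K.shape[0]
    if kind == "softmax":
        scores = scores / np.sqrt(Q.shape[-1])
        scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
        weights = np.exp(scores)
        weights = weights / weights.sum(axis=-1, keepdims=True)
    elif kind == "relu":
        weights = np.maximum(scores, 0.0) / n   # assumed 1/n normalization
    elif kind == "linear":
        weights = scores / n                    # no nonlinearity at all
    else:
        raise ValueError(f"unknown attention kind: {kind}")
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(10, 8))
V = rng.normal(size=(10, 3))
outputs = {kind: attention(Q, K, V, kind) for kind in ("softmax", "relu", "linear")}
```

The linear variant is additive in the queries, so a stack of such layers without any other nonlinearity can only compose polynomial interactions of its inputs, which is consistent with its weaker results on nonlinear tasks.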
The spline regression extension leverages the piecewise linearity and improved conditioning inherent in B-splines, overcoming numerical instability issues in high-degree polynomial regression.
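The conditioning gap can be seen numerically by comparing the condition number of a high-degree Vandermonde matrix with that of a linear B-spline (hat function) design matrix of the same size. In the sketch below the hat functions are written with ReLU only, which is one way such a basis could be expressed; the knot placement, degree, and sample grid are assumptions for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def hat_basis(x, knots):
    # Linear B-splines on uniform knots, using only ReLU:
    # hat_j(x) = relu(1 - |x - t_j| / h), with |z| = relu(z) + relu(-z).
    x = np.asarray(x, dtype=float)[:, None]
    h = knots[1] - knots[0]
    z = (x - knots[None, :]) / h
    return relu(1.0 - (relu(z) + relu(-z)))

x = np.linspace(0.0, 1.0, 200)
degree = 15
V = np.vander(x, N=degree + 1, increasing=True)        # monomial features
B = hat_basis(x, np.linspace(0.0, 1.0, degree + 1))    # 16 hat functions
cond_poly = np.linalg.cond(V)
cond_spline = np.linalg.cond(B)
print(f"cond(Vandermonde) = {cond_poly:.2e}, cond(hat basis) = {cond_spline:.2e}")
```

Each hat function is supported on at most two adjacent knot intervals, so the spline design matrix is sparse and its columns overlap only with their neighbors, whereas high-degree monomials on a bounded interval are nearly collinear.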
Theoretical and Practical Implications
This work advances theoretical understanding of ICL in transformers for nonlinear regression:
Attention mechanisms as interpretable, explicit feature constructors: The constructed attention weights perform arithmetic describing the regression basis, rendering the transformer operation more mechanistically transparent.
Universal shallow architectures: For a broad class of nonlinear functions, shallow transformers suffice (depth independent of accuracy), distinguishing them from FFN-based universal approximators.
Statistical and approximation error decomposition: The analysis facilitates systematic trade-offs in designing architectures for ICL with explicit scaling laws.
Practically, explicit featurization through attention could inform efficient transformer design for regression and arithmetic-heavy applications, where interpretability and capacity to generalize from few-shot context are crucial. Extension to multivariate and vector-valued function regression is straightforward, with theoretical guarantees maintained.
Future Directions
Potential research directions include tightening the statistical error bounds to match the empirically observed scaling, extending explicit featurization to more general nonlinear feature classes (beyond polynomials and splines), and conducting mechanistic interpretability studies of pre-trained transformer weights. The findings may also influence efficient parameterization and head allocation in deployed transformer systems for regression and symbolic reasoning tasks.
Conclusion
This paper develops a constructive theoretical framework for nonlinear regression via transformer in-context learning, establishing explicit architectural means to realize nonlinear features and providing rigorous generalization error bounds decomposed into statistical and approximation components. The empirical findings validate the theory and highlight the sufficiency of attention mechanisms for feature construction, deepening the mechanistic understanding of transformer ICL and informing future modeling strategies in nonlinear and arithmetic-rich tasks.