
Transformers as Kernel Regressors

Updated 9 February 2026
  • Transformers as kernel regressors reinterpret attention as a nonparametric, kernel-based weighting mechanism for in-context learning.
  • This framework leverages classical statistical learning and RKHS principles to justify efficient and adaptive transformer architectures.
  • Extensions with sparse, learned, and adaptive kernels refine model inductive bias and scaling, enhancing robustness and performance.

Transformers, originally designed for large-scale sequence modeling, can be rigorously interpreted as kernel regression machines. In this framework, the attention mechanism—the architectural core of transformers—implements nonparametric regression by weighting “value” vectors according to learned similarity (“kernel”) functions over “queries” and “keys.” This connection unifies the empirical success of transformers in in-context learning (ICL) and sequence modeling with classical statistical learning theory, and has led to fundamental advances in understanding, modifying, and extending transformer architectures for efficient, robust, and theoretically principled learning.

1. Mathematical Foundations: Transformers and Kernel Regression

The core operation of transformer attention is the computation

$$\text{Attention}(q, \{k_j, v_j\}_{j=1}^n) = \sum_{j=1}^n \alpha_j(q)\, v_j$$

where the weights $\alpha_j(q)$ are typically softmax-normalized similarities between the query $q$ and the keys $k_j$:

$$\alpha_j(q) = \frac{\exp(q^\top k_j/\sqrt{d})}{\sum_{i=1}^n \exp(q^\top k_i/\sqrt{d})}$$

This is precisely a Nadaraya–Watson estimator, a canonical kernel regression formula, with exponential (Gaussian) kernel $K(q, k_j) = \exp(q^\top k_j/\sqrt{d})$. Thus, standard attention computes the estimated conditional mean

$$\widehat{y}(q) = \frac{\sum_{j} K(q, k_j)\, v_j}{\sum_{j} K(q, k_j)}$$

for any choice of (query, keys, values) (Han et al., 2022, Santos et al., 30 Jan 2026). Extensions with alternative kernel functions, such as compact-support or learned kernels, map directly onto modified attention mechanisms and control the sparsity, adaptivity, and locality of the resulting mixture (Santos et al., 30 Jan 2026, Aksenov et al., 2024).
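This equivalence can be checked numerically. The following minimal NumPy sketch (all shapes and names are illustrative) computes stabilized softmax attention and the Nadaraya–Watson ratio above, and confirms they coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 8
q = rng.standard_normal(d)           # query
K_mat = rng.standard_normal((n, d))  # keys k_1..k_n
V = rng.standard_normal((n, 3))      # values v_1..v_n

# Softmax attention: alpha_j proportional to exp(q^T k_j / sqrt(d))
scores = K_mat @ q / np.sqrt(d)
alpha = np.exp(scores - scores.max())   # subtract max for numerical stability
alpha /= alpha.sum()
attn_out = alpha @ V

# Nadaraya-Watson estimator with kernel K(q, k_j) = exp(q^T k_j / sqrt(d));
# the stabilizing constant cancels in the ratio
w = np.exp(scores - scores.max())
nw_out = (w @ V) / w.sum()

assert np.allclose(attn_out, nw_out)    # identical by construction
```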

2. Functional and Statistical Properties of Transformer Kernels

In the normalized attention scheme, the kernel $K(q, k)$ can be

  • fixed, as in softmax attention (equivalent to a Gaussian RBF kernel on the sphere for normalized keys/queries),
  • learned via parameterized random feature maps (Chowdhury et al., 2021), or
  • explicitly constructed as in sparse/compact-support kernels (e.g., Epanechnikov, biweight, or polynomial).

Transformers with softmax attention can be viewed as carrying out one or more functional gradient steps in the reproducing kernel Hilbert space (RKHS) defined by the chosen kernel, for tasks such as regression or classification (Dragutinović et al., 12 Oct 2025). This includes context-dependent or meta-learned kernel parameters, such as adaptive bandwidth or context-wise learning rate. In modern extensions, transformer kernels may deviate from positive-definite, symmetric (Mercer) kernels and instead realize non-Mercer, possibly asymmetric, or indefinite kernels, crucial for the expressivity of advanced models (Wright et al., 2021).

Furthermore, ReLU or linear attention variants replace the exponential kernel with other positive kernels, such as $\mathrm{ELU}(q) + 1$ (for nonnegativity), leading to efficient linear-complexity transformers that still act as kernel regressors (Katharopoulos et al., 2020). The explicit use of a kernel function, and whether the feature map is fixed, learned, or adaptive, directly shapes the inductive bias and practical performance.
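A minimal sketch of this kernel trick, using $\phi(x) = \mathrm{ELU}(x) + 1$ as in linear attention: the output computed from explicit pairwise weights matches the output computed from precomputed key/value summaries, which is what reduces the cost per query from $O(n)$ to $O(d)$ and the total from quadratic to linear in sequence length:

```python
import numpy as np

def elu1(x):
    # phi(x) = ELU(x) + 1, a positive feature map (Katharopoulos et al., 2020)
    return np.where(x > 0, x + 1.0, np.exp(x))

rng = np.random.default_rng(1)
n, d, dv = 16, 4, 3
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, dv))

phi_Q, phi_K = elu1(Q), elu1(K)

# Quadratic form: explicit weights A_ij = phi(q_i)^T phi(k_j), row-normalized
A = phi_Q @ phi_K.T
out_quadratic = (A @ V) / A.sum(axis=1, keepdims=True)

# Linear form: precompute S = sum_j phi(k_j) v_j^T and z = sum_j phi(k_j);
# by associativity each query then costs O(d * dv) instead of O(n)
S = phi_K.T @ V            # shape (d, dv)
z = phi_K.sum(axis=0)      # shape (d,)
out_linear = (phi_Q @ S) / (phi_Q @ z)[:, None]

assert np.allclose(out_quadratic, out_linear)
```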

3. In-Context Learning and Nonparametric Regression with Transformers

Transformers demonstrate efficient in-context learning by approximating minimax-optimal rates for nonparametric regression. For $\alpha$-Hölder smooth regression functions with $n$ input examples and $d$-dimensional covariates, a transformer can implement a local polynomial kernel regression:

$$w_* = \operatorname{argmin}_{w} \frac{1}{n} \sum_{i=1}^n K_h(X_i - x)\,\big[Y_i - w^\top P_h(X_i - x)\big]^2$$

with the output estimate given by the constant term of $w_*$, truncated to a bounded interval (Ching et al., 21 Jan 2026). By constructing transformer blocks that center and normalize input covariates, compute polynomial feature maps, and iteratively perform gradient descent on this weighted least squares objective, a transformer can reach the minimax rate $O(n^{-2\alpha/(2\alpha+d)})$ in mean squared error, matching classical nonparametric regression, yet with only $O(\log n)$ blocks and parameters.

Block-wise, the transformer pipeline involves:

  • Aligning (centering and normalizing) data entries using linear attention,
  • Approximating monomial features and kernel weights via ReLU MLPs,
  • Executing a sequence of gradient descent steps via additional attention blocks,
  • Extracting and truncating the regression estimate via a final feedforward network.

Theoretical results establish that the population risk of such a transformer matches that of the associated local polynomial estimator up to $O(1/n)$ error, under standard smoothness and boundedness conditions (Ching et al., 21 Jan 2026).
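As a reference point, the classical local polynomial estimator that the transformer construction emulates can be sketched directly. The Gaussian choice of $K_h$, the bandwidth, and the helper name below are illustrative assumptions, not the paper's exact construction:

```python
import numpy as np

def local_poly_estimate(X, Y, x, h, deg=1):
    """Local polynomial regression: solve the weighted least squares problem
    w* = argmin_w (1/n) sum_i K_h(X_i - x) [Y_i - w^T P_h(X_i - x)]^2
    with a Gaussian kernel K_h; the estimate at x is the constant term of w*."""
    u = (X - x) / h
    K = np.exp(-0.5 * np.sum(u**2, axis=1))        # kernel weights K_h(X_i - x)
    # Polynomial features P_h(X_i - x): [1, u, u^2, ...] per coordinate
    P = np.hstack([np.ones((len(X), 1))] + [u**p for p in range(1, deg + 1)])
    # Weighted least squares via sqrt-weight rescaling
    sw = np.sqrt(K)
    w_star, *_ = np.linalg.lstsq(P * sw[:, None], Y * sw, rcond=None)
    return w_star[0]                               # constant term = estimate at x

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(200, 1))
Y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(200)
est = local_poly_estimate(X, Y, x=np.array([0.3]), h=0.1)
```

Here `est` should lie close to the true value $\sin(0.9)$, up to the usual bias and variance of the estimator.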

4. Sparse, Adaptive, and Learned Kernels: Beyond Softmax Attention

The flexibility of the kernel-regression lens enables structured innovations in transformer attention:

  • Sparse attention: By choosing compact-support kernels, such as the Epanechnikov or biweight kernel $K(u) = [1 - \|u\|^2/h^2]_+^r$, attention weights can be set exactly to zero for distant entries, yielding explicit sparsity and improved scaling with context length (Santos et al., 30 Jan 2026). This includes connections to sparsemax and $\alpha$-entmax ($\alpha = 1 + 1/r$) attention.
  • Learned kernels: Kernelized Transformers parameterize the spectral distribution in random Fourier feature maps, allowing the kernel shape to be adapted to the training data through end-to-end learning. Bochner's theorem provides a systematic method to realize any shift-invariant positive-definite kernel using random spectral features, and performance gains have been demonstrated on tasks requiring long-range context (Chowdhury et al., 2021).
  • Adaptive bandwidth or margin: Mechanisms such as sparsemax, ReLUmax, or margin-anchored normalization learn the effective support size or threshold for attention within each context, balancing local and global information (Santos et al., 30 Jan 2026).
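The random-feature route to learned kernels can be illustrated with a fixed Gaussian spectral density; in a Kernelized Transformer the spectral samples (here `W`) would themselves be learned end-to-end rather than fixed, so this is an illustrative sketch of Bochner's construction, not the papers' exact parameterization:

```python
import numpy as np

rng = np.random.default_rng(3)
d, m = 4, 4096   # input dimension, number of random features

# Bochner's theorem: a shift-invariant positive-definite kernel is the Fourier
# transform of a spectral density p(w). Sampling w ~ p gives random features
# whose inner product approximates the kernel; making the spectral samples
# learnable parameters turns this into a learned kernel.
W = rng.standard_normal((m, d))        # spectral samples for the Gaussian kernel
b = rng.uniform(0, 2 * np.pi, m)       # random phases

def rff(x):
    return np.sqrt(2.0 / m) * np.cos(W @ x + b)

q = rng.standard_normal(d)
k = rng.standard_normal(d)
approx = rff(q) @ rff(k)
exact = np.exp(-0.5 * np.sum((q - k) ** 2))   # Gaussian RBF kernel
```

The approximation error decays as $O(1/\sqrt{m})$, so with $m = 4096$ features `approx` tracks `exact` closely.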

The table below summarizes several kernel choices and their equivalent attention mechanisms.

| Kernel Type | Attention Mechanism | Sparsity / Adaptivity |
|---|---|---|
| Gaussian (softmax) | Softmax attention | Dense, smooth |
| Epanechnikov ($r=1$) | Sparsemax / ReLU | Sparse, compact |
| Biweight ($r=2$) | 1.5-entmax | Sparser, smoother |
| Top-$k$ uniform | kNN attention | Hard $k$-sparse |
| Learned RFF kernel | Parameterized linear | Tunable, potentially dense |
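A compact-support kernel from the table can be sketched directly. With hand-picked keys (chosen for illustration), the Epanechnikov weights $[1 - \|u\|^2/h^2]_+$ are exactly zero for the two distant keys, so the normalized attention is hard-sparse:

```python
import numpy as np

q = np.array([0.0, 0.0])               # query
keys = np.array([[0.1, 0.0],           # near: inside the support radius h
                 [0.5, 0.5],           # near: inside the support
                 [3.0, 0.0],           # far: outside, weight exactly 0
                 [0.0, 4.0]])          # far: outside, weight exactly 0
h, r = 1.0, 1                          # bandwidth; r = 1 is Epanechnikov

# Compact-support kernel K(u) = [1 - ||u||^2 / h^2]_+^r
u2 = np.sum((keys - q) ** 2, axis=1)
w = np.clip(1.0 - u2 / h**2, 0.0, None) ** r
alpha = w / w.sum()                    # normalized attention weights

assert np.count_nonzero(alpha) == 2    # distant keys are hard-masked to zero
```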

5. Extensions: Nonlinear, Robust, and Efficient Transformer Kernel Machines

Nonlinear In-Context Learning

Transformers can also emulate kernel regression with nonlinear kernels, such as polynomial kernels. By integrating nonlinear feed-forward (GLU or bilinear) layers, transformers expand their feature representations to include arbitrary monomials. Stacking such layers with attention blocks enables block-coordinate descent in the corresponding RKHS, allowing for efficient in-context regression for a broader class of nonlinear functions (Sun et al., 30 Jan 2025).
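The role of explicit monomial features can be seen with the degree-2 polynomial kernel: a feature map containing all monomials up to order two reproduces $(1 + q^\top k)^2$ exactly. This is a standard kernel identity, shown here only to illustrate the kind of feature expansion a bilinear or GLU layer can expose to the attention blocks:

```python
import numpy as np
from itertools import combinations_with_replacement

def poly2_features(x):
    # Explicit feature map for the degree-2 polynomial kernel (1 + x^T y)^2:
    # [1, sqrt(2) x_i, x_i^2, sqrt(2) x_i x_j for i < j]
    feats = [1.0]
    feats += [np.sqrt(2) * xi for xi in x]
    for i, j in combinations_with_replacement(range(len(x)), 2):
        if i == j:
            feats.append(x[i] * x[j])
        else:
            feats.append(np.sqrt(2) * x[i] * x[j])
    return np.array(feats)

q = np.array([0.5, -1.0, 2.0])
k = np.array([1.5, 0.3, -0.7])
# Inner product in feature space equals the polynomial kernel
assert np.isclose(poly2_features(q) @ poly2_features(k), (1 + q @ k) ** 2)
```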

Robustness

Self-attention is naturally understood as kernel density estimation (KDE), and robust statistical techniques (M-estimation, scaled-projected KDE, median-of-means) have been integrated into transformers to mitigate data contamination. All such robust mechanisms retain the Nadaraya–Watson form, modifying only the calculation of weights or support (Han et al., 2022).
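As an illustration of the median-of-means idea (a hypothetical sketch, not the exact mechanism of the cited work): split the context into blocks, run standard Nadaraya–Watson attention per block, and take the coordinate-wise median, so a single corrupted value can sway at most one block:

```python
import numpy as np

def mom_attention(q, keys, vals, n_blocks=4, rng=None):
    """Hypothetical median-of-means variant of Nadaraya-Watson attention:
    partition the (key, value) pairs into blocks, compute softmax attention
    per block, and return the coordinate-wise median of the block outputs."""
    if rng is None:
        rng = np.random.default_rng(0)
    idx = rng.permutation(len(keys))
    outs = []
    for block in np.array_split(idx, n_blocks):
        s = keys[block] @ q / np.sqrt(len(q))
        a = np.exp(s - s.max())
        a /= a.sum()
        outs.append(a @ vals[block])
    return np.median(outs, axis=0)

rng = np.random.default_rng(5)
n, d = 32, 4
keys = rng.standard_normal((n, d))
vals = np.ones((n, 1))           # clean values are all 1
vals[0] = 100.0                  # one contaminated outlier value
q = rng.standard_normal(d)

out = mom_attention(q, keys, vals)
# Three of the four blocks contain only clean values and output exactly 1,
# so the median suppresses the contaminated block entirely
assert np.isclose(out[0], 1.0)
```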

Algorithmic Efficiency

Linear, performer, and fastfood attention replace the quadratic complexity of softmax attention with linear complexity by exploiting kernel feature map associativity and randomized projection techniques (Katharopoulos et al., 2020, Jenson et al., 2024). These architectures preserve core kernel regression properties while enabling inference on sequence lengths two orders of magnitude longer.

6. Theoretical Underpinnings: Banach Spaces, Universality, and Representer Theorems

Dot-product attention in transformers can be characterized as a kernel over pairs of Banach spaces (a reproducing kernel Banach space, RKBS), with an infinite-dimensional, non-Mercer kernel:

$$K(t, s) = \exp\!\left(\frac{\langle W_Q t,\, W_K s \rangle}{\sqrt{d}}\right)$$

The explicit feature map is the collection of all order-$n$ monomials in queries and keys, and the bilinear kernel need not be symmetric or positive-definite. Results establish a version of the representer theorem and a universal approximation theorem: attention layers can express any continuous bivariate function arbitrarily well on compact domains (Wright et al., 2021). Experimental evidence shows that such infinite-dimensional feature spaces confer generalization benefits, especially in complex multi-domain tasks.
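The infinite-dimensional feature map behind this kernel can be made explicit by Taylor-expanding the exponential:

```latex
K(t, s)
  = \exp\!\left(\frac{\langle W_Q t,\, W_K s \rangle}{\sqrt{d}}\right)
  = \sum_{n=0}^{\infty} \frac{1}{n!\, d^{\,n/2}}\,
    \langle W_Q t,\, W_K s \rangle^{n}.
```

Each term $\langle W_Q t, W_K s \rangle^n$ expands into products of all order-$n$ monomials of the transformed query and key coordinates, so the query-side and key-side feature maps live in an infinite-dimensional sequence space and, because $W_Q \neq W_K$ in general, need not coincide, which is exactly why the kernel can be asymmetric and non-Mercer.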

7. Practical Implications and Design Recommendations

The kernel regression view yields principled guidelines for transformer architecture and training:

  • Attention heads and blocks should be viewed as parameterizing both the kernel and the regression algorithm (e.g., gradient descent steps) performed in-context (Ching et al., 21 Jan 2026).
  • Adaptive or learned kernels (e.g., sparsemax, entmax, learned Fourier features) account for context size, data smoothness, and sparsity, improving in-context learning and generalization (Santos et al., 30 Jan 2026, Chowdhury et al., 2021, Aksenov et al., 2024).
  • Modifying the kernel function, not simply the key/query projections, offers a new axis for controlling model inductive bias.
  • The representation power, robustness, and compute/memory efficiency of transformers can be systematically derived from properties of classical kernel machines, as established in nonparametric regression, robust statistics, and spectral learning theory.

In summary, interpreting transformers as kernel regressors establishes a unifying mathematical and algorithmic foundation for understanding both attention and in-context learning, connecting modern deep learning to classical nonparametric estimation and providing a concrete framework for principled transformer architecture design (Ching et al., 21 Jan 2026, Dragutinović et al., 12 Oct 2025, Jenson et al., 2024, Katharopoulos et al., 2020, Chowdhury et al., 2021, Sun et al., 30 Jan 2025, Wright et al., 2021, Aksenov et al., 2024, Han et al., 2022, Santos et al., 30 Jan 2026).
