Single-Head Softmax-Attention Regressor

Updated 30 January 2026
  • The paper presents a formal characterization of the single-head model, establishing exact learnability via adaptive query protocols.
  • It details a two-phase recovery algorithm with O(d²) queries and extends to low-rank regimes using compressed sensing techniques.
  • The analysis connects the regressor to kernel methods like the Nadaraya–Watson estimator, showcasing its statistical optimality and limitations.

A single-head softmax-attention regressor denotes the most basic variant of regression modeling using the attention mechanism typical of Transformer architectures, with only one attention head and a softmax normalization. This architecture admits a precise mathematical characterization, exhibits exact learnability under controlled query protocols, provides nonparametric regression capabilities, and encapsulates both the expressivity and limitations inherent in softmax attention. The following sections present a comprehensive technical account.

1. Formal Definition and Model Architecture

A single-head softmax-attention regressor $f_{W,v}$ comprises:

  • A $d \times d$ “merged” query–key matrix $W \in \mathbb{R}^{d \times d}$.
  • A “merged” value/output vector $v \in \mathbb{R}^d$.

Given an input sequence $X = [x_1^\top; x_2^\top; \cdots; x_N^\top] \in \mathbb{R}^{N \times d}$, the last token $x_N$ serves as the query. The unnormalized scores are:

$$s_i(X; W) = x_i^\top W x_N, \quad i = 1, \ldots, N,$$

and the softmax attention weights:

$$\alpha_i = \frac{\exp(s_i)}{\sum_{j=1}^N \exp(s_j)}.$$

The scalar output is the weighted sum:

$$f_{W,v}(X) = \sum_{i=1}^N \alpha_i \, (v^\top x_i) = \alpha^\top X v,$$

or equivalently,

$$\alpha = \mathrm{softmax}(X W x_N), \qquad f_{W,v}(X) = \alpha^\top (X v).$$

This formalization coincides with the “dot-product attention + softmax” mechanism used in Transformers, but specialized to the regression setting (Bhattamishra et al., 23 Jan 2026).
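Concretely, the forward pass above takes only a few lines. The following is an illustrative NumPy sketch of the definition, not code from the cited paper:

```python
import numpy as np

def softmax_attention_regressor(X, W, v):
    """Single-head softmax-attention regressor; the last row of X is the query x_N."""
    q = X[-1]                       # query token x_N
    s = X @ W @ q                   # unnormalized scores s_i = x_i^T W x_N
    s = s - s.max()                 # stabilize the softmax numerically
    alpha = np.exp(s) / np.exp(s).sum()
    return float(alpha @ (X @ v))   # weighted sum of value projections v^T x_i
```

For a length-one sequence the softmax weight is 1, so the output reduces to $v^\top x_1$, a useful sanity check.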

2. Exact Learnability via Adaptive Querying

The learnability properties of single-head softmax-attention regressors have been rigorously characterized under a black-box oracle model. The core results include:

  • Two-Phase Recovery Algorithm: The parameters $(W, v)$ can be exactly recovered with $O(d^2)$ queries:

    1. Phase 1: Recover $v$ by applying $d$ singleton queries (e.g., $X = [e_i^\top]$) to read out each coordinate.
    2. Phase 2: Recover $W$ column by column: for each column $j$, use $d$ linearly independent probe vectors $u_1, \ldots, u_d$ in two-row sequences to linearly identify $w_j = W e_j$.
  • Query Complexity: $d$ queries for $v$, $d \times d$ queries for $W$; total $O(d^2)$.

  • Low-Rank Regime: If $\mathrm{rank}(W) \le r \ll d$, random rank-one probes enable compressed-sensing recovery with $O(rd)$ queries via nuclear-norm minimization (Bhattamishra et al., 23 Jan 2026).
  • Robustness to Noise: Under $\|W\|_F \le W$, $\|W\|_2 \le 1$, and $\min_i |v_i| \ge \mu > 0$, the same recovery procedure achieves $\varepsilon$-accurate estimation with a polynomial number of queries under value noise $|\mathrm{AVQ}(X; \tau) - f(X)| \le \tau$ (Bhattamishra et al., 23 Jan 2026).
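The two-phase protocol can be made concrete in the noiseless case. The sketch below assumes black-box oracle access to $f_{W,v}$ and uses the probe choice $u_i = e_j + e_i$, so that each two-row query reveals a single entry $W_{ij}$ through a log-odds identity; this particular probe choice and the function names are illustrative, not taken from the paper:

```python
import numpy as np

def recover_parameters(oracle, d):
    """Exactly recover (W, v) from a noiseless query oracle in d + d^2 queries."""
    I = np.eye(d)
    # Phase 1: a singleton sequence X = [e_i] forces alpha = (1,),
    # so the oracle returns v^T e_i = v_i directly.
    v = np.array([oracle(I[i][None, :]) for i in range(d)])
    # Phase 2: for the two-row sequence X = [e_j + e_i; e_j] (query e_j),
    # the output is v_j + alpha_1 * v_i, and log(alpha_1 / alpha_2)
    # equals the score gap e_i^T W e_j = W[i, j].
    W = np.empty((d, d))
    for j in range(d):
        for i in range(d):
            X = np.vstack([I[j] + I[i], I[j]])
            a1 = (oracle(X) - v[j]) / v[i]   # requires v_i != 0
            W[i, j] = np.log(a1 / (1.0 - a1))
    return W, v
```

The division by $v_i$ is why a margin condition like $\min_i |v_i| \ge \mu > 0$ appears in the robust version of the guarantee.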

3. Functional and Statistical Interpretations

The single-head softmax-attention regressor admits precise connections to kernel methods and nonparametric regression:

  • Nadaraya–Watson Estimator: The output

$$\widehat{f}_{\mathrm{soft}}(q) = \sum_{i=1}^n \frac{\exp(q^\top k_i / \tau)}{\sum_j \exp(q^\top k_j / \tau)} \, v_i$$

is a local constant estimator with an exponential kernel, coinciding with Nadaraya–Watson regression (Zuo et al., 1 Oct 2025).

  • Bias–Variance Tradeoff: As a local constant estimator, the bias scales as $O(h^2)$ and the variance as $O(1/(n h^d))$. The optimal MSE decays as $n^{-2/(d+2)}$, which can be improved by local linear extensions but not by the standard softmax regressor alone (Zuo et al., 1 Oct 2025).
  • Universal Approximation: A one-layer softmax-attention regressor with sufficient width and grid resolution implements a truncated piecewise-linear regressor to arbitrary precision; the approximation error is controlled by the anchor grid size and the softmax temperature (Hu et al., 22 Apr 2025).
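To make the kernel connection concrete: a softmax-attention readout over key–value pairs is exactly a Nadaraya–Watson weighted average with the kernel $\exp(q^\top k / \tau)$. A minimal sketch, with illustrative names:

```python
import numpy as np

def nadaraya_watson_exp(q, keys, values, tau=1.0):
    """Nadaraya-Watson estimate with the exponential kernel exp(q.k / tau)."""
    logits = keys @ q / tau
    w = np.exp(logits - logits.max())   # kernel weights, numerically stabilized
    return float(w @ values / w.sum())  # locally constant (weighted-average) fit
```

Because the weights are normalized to sum to one, the estimator reproduces any constant response exactly, the hallmark of a local constant fit.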

4. In-Context Learning and Weight Shifting Equivalence

The single-head softmax-attention regressor closely aligns with gradient descent on the normalized exponential regression problem:

$$x^* = \arg\min_{x \in \mathbb{R}^d} \left\| \alpha(x)^{-1} \exp(Ax) - b \right\|_2$$

where $\alpha(x) = \langle \exp(Ax), \mathbf{1}_n \rangle$ and $b \in \mathbb{R}^n$ is the vector of in-context labels (Li et al., 2023).

A single self-attention layer induces a data shift whose effect on the regression prediction is Lipschitz-bounded and nearly identical to a small gradient step in parameter space:

$$\|\Delta_b^{\mathrm{Att}}\|_2 \le M \,\|A_{t+1} - A_t\|, \qquad \|\Delta_b^{\mathrm{GD}}\|_2 \le M \,\|x_{t+1} - x_t\|_2$$

with $M$ polynomial in $n$, $R$, and $\exp(10R^2)$.

This establishes a quantitative sense in which in-context attention implements a “weight-shift” analogous to actual parameter updates via gradient descent (Li et al., 2023).

5. Extensions, Limitations, and Identifiability

While single-head softmax attention regressors enjoy exact identifiability and learnability, there are sharp limitations:

  • Multi-Head Identifiability Failure: For $H > 1$ heads, parameters cannot be uniquely recovered from value queries; for any $(W, v)$ and any probability vector $\lambda$, one may set $W^{(h)} = W$, $v^{(h)} = \lambda_h v$ and sum the heads' outputs to recover the same function (Bhattamishra et al., 23 Jan 2026). Guarantees analogous to single-head learning are impossible without additional constraints, e.g., orthogonality.
  • Expressivity vs. Learnability: While single-head softmax attention can represent complex Boolean functions (e.g., $k$-bit $\mathrm{AND}$/$\mathrm{OR}$ for $k = \Theta(d)$), learnability is contingent on supervised hints (“teacher forcing”). One gradient step suffices under intermediate supervision, but no polynomial-time algorithm learns these functions end-to-end without such hints (Hu et al., 26 May 2025).
  • Statistical Optimality: In specific regimes, e.g., single-location regression, softmax attention achieves Bayes-optimal risk, strictly outperforming linear attention and component-wise alternatives. Its advantage persists both at the population and finite-sample levels due to its global normalization and exponential selectivity (Duranthon et al., 26 Sep 2025).
  • Scaling and Practical Considerations: For regression with high-dimensional inputs or large context dimension $d$, query and computation costs scale quadratically unless low-rank structure is exploited. Memory-efficient primitives and blockwise computation extend practical applicability (Zuo et al., 1 Oct 2025).
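As an illustration of blockwise computation, the softmax-weighted average can be accumulated in one pass over score/value chunks without ever materializing the full score vector. This is an online-softmax-style sketch; the chunked interface is an assumption for illustration, not an API from the cited work:

```python
import numpy as np

def streaming_attention(score_chunks, value_chunks):
    """One-pass softmax-weighted average over chunks, keeping only O(1) state."""
    m = -np.inf   # running maximum of the scores seen so far
    z = 0.0       # running normalizer: sum of exp(s - m)
    acc = 0.0     # running weighted sum of values
    for s, u in zip(score_chunks, value_chunks):
        m_new = max(m, float(s.max()))
        scale = np.exp(m - m_new) if np.isfinite(m) else 0.0
        e = np.exp(s - m_new)
        z = z * scale + e.sum()        # rescale old normalizer to the new max
        acc = acc * scale + float(e @ u)
        m = m_new
    return acc / z
```

The rescaling by `exp(m - m_new)` keeps the running sums consistent with a single global softmax, so the chunked result matches the naive full-vector computation exactly (up to floating-point error).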

6. Comparative Analysis and Theoretical Guarantees

The following table summarizes key theoretical properties and scaling results:

| Property | Scaling / Guarantee | Reference |
| --- | --- | --- |
| Exact recovery (general $W, v$) | $O(d^2)$ queries | (Bhattamishra et al., 23 Jan 2026) |
| Low-rank recovery ($\mathrm{rank}(W) \le r$) | $O(rd)$ queries via compressed sensing | (Bhattamishra et al., 23 Jan 2026) |
| Robust recovery under noise | $\varepsilon$-accurate; polynomially many queries | (Bhattamishra et al., 23 Jan 2026) |
| Bias–variance MSE rate (softmax) | $n^{-2/(d+2)}$ (Nadaraya–Watson constant estimator) | (Zuo et al., 1 Oct 2025) |
| Universal approximation (trunc-ReLU) | Arbitrarily close via anchor grid and temperature tuning | (Hu et al., 22 Apr 2025) |
| Multi-head identifiability | Not possible from value queries alone | (Bhattamishra et al., 23 Jan 2026) |
| Finite-sample generalization | Softmax achieves Bayes risk; linear attention falls short | (Duranthon et al., 26 Sep 2025) |

7. Practical Implications and Open Directions

Single-head softmax-attention regressors represent an analytically tractable subclass of attention-based models, well suited for theoretical investigations of in-context regression, expressivity, statistical optimality, and identifiability. Their performance is dictated by the interplay of architecture (number of heads), context dimensionality, data noise, and activation choice.

  • Integration with Feedforward Networks and Deeper Architectures: When algorithms for learning ReLU FFNs become available, single-head methods can be extended to learn one-layer Transformers with single-head attention (Bhattamishra et al., 23 Jan 2026).
  • Nonparametric Kernel Extensions and Bias Reduction: Local linear variants (LLA) provably improve bias-order and overall MSE, suggesting an avenue for developing higher-order attention regressors with enhanced statistical efficiency (Zuo et al., 1 Oct 2025).
  • Theoretical Limits in Sample, Time Complexity, and Expressivity: The gap between expressivity and end-to-end learnability—articulated for Boolean functions—remains central to ongoing research on the fundamental capabilities and limits of minimalist attention architectures (Hu et al., 26 May 2025).
  • Identifiability and Structural Constraints in Multi-Head Attention: The non-uniqueness of multi-head parameterizations implies the necessity of additional conditions (such as orthogonality) for parameter recovery and robust regime analysis.

The single-head softmax-attention regressor thus anchors theoretical understanding of Transformer-based regression, bridging controlled algorithmic learning, nonparametric statistical foundations, and the practical limits imposed by model design.
