- The paper establishes that the forward pass of a suitably parameterized single-layer Linear Transformer exactly computes the OLS projection, with the construction obtained via spectral decomposition of the empirical covariance.
- It empirically validates that this parameter configuration is recovered by gradient-based training, matching the OLS solution on canonical linear regression tasks within 1000 epochs.
- The work highlights sensitivity to data distribution shifts, prompting architectural enhancements for robust, context-aware AI.
Analytical Equivalence and Theoretical Foundations
The paper "Ordinary Least Squares is a Special Case of Transformer" (2604.13656) establishes a rigorous algebraic correspondence between the closed-form ordinary least squares (OLS) estimator and the forward pass of a single-layer Linear Transformer. The authors prove that the OLS projection X(XTX)−1XT can be realized by an exact parameter configuration within the Linear Transformer architecture, specifically through spectral decomposition of the empirical covariance matrix and strategic construction of the query, key, and value matrices.
The key insight is that, omitting nonlinearities and normalization (such as Softmax), linear attention computes a direct projection that matches the OLS solution in one pass, not through iterative optimization. Through spectral decomposition of $X^\top X$, the authors construct $L = VA^{-1/2}$ (so that $LL^\top = (X^\top X)^{-1}$) such that the OLS prediction decomposes as $\hat{Y} = XLP$, making the forward pass of a properly parameterized Linear Transformer mathematically isomorphic to OLS. This is formalized by configuring $W_Q = W_K = W_V = L$, $W_{\mathrm{FFN}} = I$, and $W_p = P$, where $P$ encodes task-specific information. Thus, under these choices, attention is not merely an approximation or simulation of an optimization process but an explicit analytical projection operator.
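The following sketch illustrates the core identity under stated assumptions (NumPy, random illustrative data, and the simplification that the attention scores are applied directly to the targets $Y$; the paper's exact routing through $W_V$ and $W_p$ may differ): with $W_Q = W_K = L = VA^{-1/2}$, the unnormalized score matrix equals the OLS hat matrix, so a single forward pass yields the OLS prediction.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 3
X = rng.normal(size=(n, d))
Y = rng.normal(size=(n, 1))

# Spectral decomposition of the empirical covariance: X^T X = V A V^T
A_eig, V = np.linalg.eigh(X.T @ X)      # A_eig: eigenvalues, V: orthonormal eigenvectors
L = V @ np.diag(A_eig ** -0.5)          # L = V A^{-1/2}, so L L^T = (X^T X)^{-1}

# Linear attention without Softmax, with shared projection W_Q = W_K = L
Q = X @ L
K = X @ L
scores = Q @ K.T                        # = X (X^T X)^{-1} X^T, the OLS hat matrix

# One forward pass applies the scores to the (task-specific) values and recovers OLS
Y_hat_attention = scores @ Y
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
assert np.allclose(Y_hat_attention, X @ beta_ols)
```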
Empirical Validation: Functional and Structural Convergence
The authors conduct empirical experiments on canonical linear regression (e.g., $y = 2x + \epsilon$) to validate the reachability and stability of this parameter configuration under gradient-based optimization. Unlike heavily overparameterized models, in which structural emergence is often masked, the OLS-Transformer recovers the exact inverse covariance structure required for OLS: the training loss and the structural alignment of $L$ with its analytical optimum $(X^\top X)^{-1/2}$ converge rapidly and in tandem. The model output aligns with the OLS benchmark within 1000 epochs, demonstrating both functional convergence (vanishing relative distance to the OLS solution) and structural emergence of the theoretically optimal memory term.
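A minimal training sketch, assuming a one-dimensional parameterization and an illustrative learning rate (neither taken from the paper), shows how plain gradient descent drives a scalar memory parameter $l$ toward the analytical optimum $(X^\top X)^{-1/2}$ on data generated from $y = 2x + \epsilon$:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-1.0, 1.0, 100)
y = 2.0 * x + 0.1 * rng.normal(size=x.shape)    # canonical task y = 2x + eps

xtx = float(x @ x)                              # scalar X^T X
l_opt = xtx ** -0.5                             # analytical optimum (X^T X)^{-1/2}

# 1-D OLS-Transformer forward pass: y_hat = l^2 * X X^T y (scores applied to targets)
l = 0.05                                        # arbitrary initialization
lr = 2e-3                                       # illustrative learning rate
s = x * (x @ y)                                 # precompute X (X^T y)
for epoch in range(1000):
    y_hat = l**2 * s
    grad = np.mean(2.0 * (y_hat - y) * s) * 2.0 * l   # d(MSE)/dl via the chain rule
    l -= lr * grad

print(abs(l), l_opt)                # structural convergence: |l| approaches (X^T X)^{-1/2}
print(np.mean((l**2 * s - y)**2))   # functional convergence: loss near the OLS residual
```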
Memory Decoupling: Slow and Fast Mechanisms
A theoretical contribution is the decoupling of slow and fast memory within the OLS-Transformer architecture. Here, slow memory corresponds to parameters learned during training (the matrix $L = VA^{-1/2}$, encoding the inverse covariance), while fast memory arises from the real-time construction of attention scores at inference. This dichotomy echoes Hebbian-like principles and aligns with associative memory models: slow memory encodes enduring statistical structure from the training set, while fast memory allows instantaneous contextual adaptation.
Crucially, the OLS-Transformer is sensitive to shifts in data distribution, in contrast to static OLS predictions. If the inference context matches the training distribution, the predictions recover the OLS solution exactly; otherwise, they are linearly distorted by the mismatch in covariance structure. This exposes both the model's contextual capability and its inherent susceptibility to distribution shift, underscoring the need for subsequent architectural enhancements for robust generalization.
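The following sketch illustrates this sensitivity under simplifying assumptions (the slow memory $L$ is computed analytically rather than learned, and the inference context supplies both inputs and targets): when the context equals the training sample the forward pass reproduces OLS exactly, whereas a context with rescaled covariance distorts the prediction.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 200, 2
beta_true = np.array([[2.0], [-1.0]])        # illustrative ground-truth coefficients
X_train = rng.normal(size=(n, d))            # training distribution
Y_train = X_train @ beta_true

# Slow memory: L = V A^{-1/2}, fixed from the TRAINING covariance
A_eig, V = np.linalg.eigh(X_train.T @ X_train)
L = V @ np.diag(A_eig ** -0.5)

def forward(X_ctx, Y_ctx):
    """Fast memory: attention scores built on the fly from the inference context."""
    Q, K = X_ctx @ L, X_ctx @ L
    return (Q @ K.T) @ Y_ctx

# Context drawn from the training data: the forward pass matches OLS exactly
beta = np.linalg.solve(X_train.T @ X_train, X_train.T @ Y_train)
print(np.allclose(forward(X_train, Y_train), X_train @ beta))           # True

# Shifted context (rescaled covariance): predictions are linearly distorted
X_shift = rng.normal(size=(n, d)) @ np.diag([3.0, 0.5])
Y_shift = X_shift @ beta_true
beta_s = np.linalg.solve(X_shift.T @ X_shift, X_shift.T @ Y_shift)
print(np.max(np.abs(forward(X_shift, Y_shift) - X_shift @ beta_s)))     # large mismatch
```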
The transition from the OLS-Transformer to the standard Transformer encompasses several modifications: nonlinear activations, separate Q/K/V parameterization, Softmax attention, multi-head structure, and positional encodings. The most significant is the adoption of Softmax for attention score normalization. This shift is interpreted as an evolution of the Hopfield energy function, from quadratic (linear attention, OLS) to exponential (Softmax), corresponding to a leap from linear to exponential memory capacity in associative memory networks.
The connection between Transformer attention, Hopfield networks, and dense associative memory is made explicit: the original Hopfield network had limited storage capacity, which was later expanded via polynomial and exponential energy functions. Softmax attention realizes continuous-state Hopfield networks with exponential capacity, so standard Transformers inherit and vastly extend the associative memory capabilities already embedded in the linear projection.
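A schematic illustration of this correspondence (not the paper's construction; the inverse temperature $\beta$, dimensions, and pattern count are arbitrary choices) contrasts one Softmax-attention Hopfield update with its unnormalized, quadratic-energy counterpart: the Softmax retrieval snaps a corrupted query back onto the stored pattern, while the linear score leaves it blurred across memories.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                     # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(4)
d, m = 64, 200                          # pattern dimension and number of stored patterns
patterns = rng.normal(size=(m, d))      # stored memories (rows)

# A corrupted version of one stored pattern acts as the query
target = patterns[7]
query = target + 0.3 * rng.normal(size=d)

# Modern (continuous) Hopfield / Softmax-attention retrieval:
# one update step xi <- patterns^T softmax(beta * patterns @ xi)
beta = 4.0
retrieved = patterns.T @ softmax(beta * patterns @ query)

# Linear (quadratic-energy) retrieval: unnormalized scores, no Softmax
linear = patterns.T @ (patterns @ query) / m

print(np.argmax(patterns @ retrieved) == 7)   # Softmax retrieval locks onto pattern 7
print(np.linalg.norm(retrieved - target) < np.linalg.norm(linear - target))
```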
Implications and Theoretical Significance
By constructing the OLS-Transformer, the paper demonstrates that Transformer architectures possess innate analytical grounding in classical statistical inference, not merely functioning as universal function approximators or simulation engines. The result demystifies the "black-box" nature of attention and positions context-aware reasoning and memory as emergent phenomena rooted in basic statistical operators.
However, the pronounced sensitivity of the OLS-Transformer to data distribution shifts signals the necessity for nonlinear, probabilistic, and more flexible designs in modern Transformers. These augmentations (Softmax, multi-head attention, deeper stacking) are not arbitrary but theoretically motivated evolutionary steps addressing robustness and expressivity in practical AI.
Speculation on Future Directions
The algebraic framework advanced here invites systematic exploration of attention mechanisms parameterized by higher-order or exponential energy functions—beyond linear and Softmax variants—for improved trade-offs among computational efficiency, memory density, and interpretability. This direction aims to establish principled design paradigms for AI architectures, leveraging foundational statistical principles and associative memory theory.
Conclusion
The formal isomorphism between OLS and Linear Transformer, as demonstrated, provides a theoretical foundation for interpreting Transformer architectures as explicit statistical projectors. The result clarifies the context-sensitive and memory-rich capabilities of modern deep architectures as directly grounded in classical inference methods. Evolutionary extensions—nonlinear activations, probabilistic attention, and associative memory—are shown to be critical for robustness and generalization under complex data conditions. The work lays the groundwork for designing interpretable and theoretically principled AI systems that balance efficiency, capacity, and adaptability.