
Linear Attention in Transformers

Updated 23 January 2026
  • Linear attention is a family of Transformer-style mechanisms that reduce quadratic complexity to linear by using kernel feature maps instead of softmax.
  • These methods address challenges such as non-injectivity and low-rank outputs through injective modifications and rank augmentation techniques.
  • They enable efficient, scalable modeling across NLP, vision, and scientific computing with notable speed-ups and memory savings.

Linear attention refers to a family of Transformer-style attention mechanisms that achieve linear computational and memory complexity in the sequence length, contrasting with the quadratic complexity of standard softmax attention. Linear attention methods replace the softmax-based similarity measure with a composition of kernelized feature maps, structural low-rank approximations, or recurrent algebraic formulations, allowing scalable modeling of long sequences. The domain has evolved rapidly, addressing key theoretical and practical deficits to close the empirical gap with softmax-based attention in vision, language, and scientific computing.

1. Mathematical Formulation and Core Algorithms

In standard self-attention, the attention output for a query-key-value triple $(Q, K, V)$ is computed via

$$A_{\rm soft}(Q, K, V) = \operatorname{Softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V,$$

where $Q, K, V \in \mathbb{R}^{N \times d}$ and $N$ is the number of tokens. This requires explicit formation of the $N \times N$ attention matrix, resulting in $O(N^2 d)$ time and $O(N^2)$ memory.

Linear attention replaces the softmax kernel with a non-negative feature map $\phi:\mathbb{R}^d \to \mathbb{R}^r$ (frequently $r=d$), yielding

$$A_{\rm lin}(Q, K, V) = \phi(Q) \left( \phi(K)^\top V \right)$$

or, in normalized form,

$$O_i = \frac{\phi(Q_i) \sum_{j=1}^N \phi(K_j)^\top V_j}{\phi(Q_i) \sum_{j=1}^N \phi(K_j)^\top}.$$

This exploits the associativity of matrix multiplication to avoid constructing large intermediate matrices, reducing complexity to $O(N d^2)$ in typical settings (Han et al., 2024, Li et al., 2020).
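The associativity argument can be checked directly. The following NumPy sketch (random non-negative arrays standing in for $\phi(Q)$ and $\phi(K)$, with illustrative sizes) computes the same output both ways, once through the $N \times N$ matrix and once through the $d \times d$ "KV buffer":

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 512, 64  # sequence length, head dimension (illustrative)

# Random non-negative features standing in for phi(Q) and phi(K).
phi_Q = rng.random((N, d))
phi_K = rng.random((N, d))
V = rng.standard_normal((N, d))

# Quadratic order: form the N x N matrix first -- O(N^2 d) time, O(N^2) memory.
out_quadratic = (phi_Q @ phi_K.T) @ V

# Linear order: contract K and V first -- O(N d^2) time, O(d^2) extra memory.
kv_buffer = phi_K.T @ V          # (d, d) "KV buffer"
out_linear = phi_Q @ kv_buffer   # (N, d)

assert np.allclose(out_quadratic, out_linear)
```

Both paths yield identical results up to floating-point rounding; only the cost differs.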

Specific instantiations include:

  • Kernel-based linear attention: the softmax kernel $e^{q^\top k}$ is approximated via kernel feature maps, such as $1 + \tilde{q}^\top \tilde{k}$ from a first-order Taylor expansion with $L_2$ normalization (Li et al., 2020).
  • Feature map selection: choices like ReLU, ELU+1, or random feature projections (Performer/FAVOR+) are prominent (Zheng, 27 Jan 2025, Han et al., 2024).
  • Depthwise convolutional augmentation: To restore expressiveness lost by low-rank kernelization, many models integrate depthwise convolutional branches (Han et al., 2023, Han et al., 2024).
  • Low-rank intermediates: Agent Attention parameterizes attention via a bottleneck set of "agent" tokens, yielding a form $O = \phi_{(q)}(Q)[\phi_{(k)}(K)]^\top V$ with $n \ll N$ (Han et al., 2023).
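As an illustration of the feature-map choices above, two common maps can be written in a few lines of NumPy (the function names here are ours, chosen for clarity, not from any specific paper's code):

```python
import numpy as np

def elu_plus_one(x):
    # ELU(x) + 1: smooth and strictly positive everywhere.
    # For x > 0 this is x + 1; for x <= 0 it is exp(x), which is > 0.
    return np.where(x > 0, x + 1.0, np.exp(x))

def relu_map(x):
    # ReLU: non-negative, but can zero out entries (and entire rows).
    return np.maximum(x, 0.0)

rng = np.random.default_rng(1)
Q = rng.standard_normal((8, 4))
assert (elu_plus_one(Q) > 0).all()   # strictly positive
assert (relu_map(Q) >= 0).all()      # only non-negative
```

Strict positivity (as with ELU+1) guarantees nonzero normalization denominators, which is one reason it is often preferred over plain ReLU.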

2. Theoretical Properties, Limitations, and Remedies

Linear attention methods exhibit unique structural properties relative to softmax attention, presenting both theoretical strengths and limitations:

Non-injectivity and rank-deficiency:

  • The mapping $Q \mapsto$ attention is not injective for generic linear kernel maps: distinct queries can induce identical output distributions due to scale invariance (Han et al., 2024). Theoretical proofs confirm that softmax attention is injective under mild rank assumptions, while kernelized attention is not.
  • Linear attention suffers from low-rank output: the "KV buffer" formed by $\sum_j \kappa(K_j)^\top V_j$ has rank at most $d$ and is often empirically much lower, suppressing feature diversity in outputs (Fan et al., 2024).
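The rank bound is easy to verify empirically. In this NumPy sketch (ReLU features, illustrative sizes), the implicit $N \times N$ attention matrix factors through $d$-dimensional features, so its rank can never exceed $d$:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 256, 32
relu = lambda x: np.maximum(x, 0.0)
phi_Q = relu(rng.standard_normal((N, d)))
phi_K = relu(rng.standard_normal((N, d)))
V = rng.standard_normal((N, d))

# The summed "KV buffer" is only d x d, so rank <= d regardless of N.
kv_buffer = phi_K.T @ V
assert np.linalg.matrix_rank(kv_buffer) <= d

# The implicit N x N attention matrix is a product of N x d factors,
# so its rank is also bounded by d << N.
attn = phi_Q @ phi_K.T
assert np.linalg.matrix_rank(attn) <= d
```

Softmax attention has no such cap: the elementwise exponential breaks the low-rank factorization, which is precisely the expressiveness that rank-augmentation schemes try to recover.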

Flatness and focus deficiency:

  • Linear attention weight distributions tend to be smoother and flatter than those of softmax, weakening the ability to focus sharply on the most relevant tokens (Han et al., 2023).

Magnitude neglect:

  • Linear attention is invariant to query magnitude; scaling $Q$ leaves attention weights unchanged, contrary to softmax, where increasing $\|Q\|$ sharpens the distribution (Fan et al., 1 Jul 2025).
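This invariance follows from the positive homogeneity of maps like ReLU, and can be demonstrated directly. In the NumPy sketch below (illustrative sizes; `linear_weights` is our own helper computing the normalized weights for one query), scaling the query leaves linear-attention weights untouched while softmax sharpens:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(3)
d = 16
q = rng.standard_normal(d)
K = rng.standard_normal((32, d))
relu = lambda x: np.maximum(x, 0.0)

def linear_weights(query):
    # Normalized linear-attention weights for a single query.
    s = relu(query) @ relu(K).T
    return s / s.sum()

w1 = linear_weights(q)
w2 = linear_weights(3.0 * q)      # scale the query by 3
assert np.allclose(w1, w2)        # linear attention: weights unchanged

s1 = softmax(q @ K.T)
s2 = softmax((3.0 * q) @ K.T)
assert s2.max() > s1.max()        # softmax: distribution sharpens
```

Since ReLU satisfies $\mathrm{ReLU}(cx) = c\,\mathrm{ReLU}(x)$ for $c > 0$, the scale cancels in the normalization; in softmax it acts as an inverse temperature instead.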

Remedies include:

  • Injective Linear Attention (InLine): A zero-sum normalization makes the $Q \mapsto$ attention function injective, restoring uniqueness of outputs (Han et al., 2024).
  • Rank-Augmented Linear Attention (RALA): A channelwise modulation and weighting scheme for the KV buffer breaks degeneracy and elevates output rank, empirically matching softmax in expressiveness (Fan et al., 2024).
  • Concentration modules (LCM, DWC): Lightweight depthwise convolution over outputs restores local peakiness and token-specific diversity lost by naive kernelization (Zheng, 27 Jan 2025, Han et al., 2023).
  • Magnitude-aware normalization: MALA introduces a $\beta$ scaling and $\gamma$ shift to the attention computation, admitting dynamic adaptivity to query scales and mimicking the behavior of softmax (Fan et al., 1 Jul 2025).

3. Algorithmic Variants and Hardware Efficiency

Multiple design strategies operationalize linear attention:

  • Prefix-sum and running-state formulations: causal linear attention admits a recurrent form that accumulates a running state $S_t = S_{t-1} + \phi(K_t)^\top V_t$, giving constant per-token decoding cost.
  • Sparse LinAttn: hybrid designs combine sparse attention patterns with linear attention to recover local precision at low cost.
  • Optimized kernels:
    • CUDA and Triton implementations fuse prefix-scan logic with hardware-efficient reductions, cutting latency by 3.3× and peak memory by 3.6× relative to prior implementations in LLMs such as Pythia-1.4B (Gerami et al., 24 Oct 2025).
  • Augmentation for speculative/parallel decoding:
    • Depthwise-convolutional branches, grouped prefix-sum states, and blockwise computation enable linear attention to interoperate with speculative decoding algorithms and maintain causality/performance in LLMs (You et al., 2024).
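The prefix-sum/running-state idea from the list above can be sketched as follows: a NumPy toy (illustrative sizes, with an assumed strictly positive feature map so denominators stay safe) shows that token-by-token recurrent decoding reproduces the masked quadratic computation exactly:

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 128, 16
phi = lambda x: np.maximum(x, 0.0) + 1e-6   # strictly positive feature map (assumed)
Q = phi(rng.standard_normal((N, d)))
K = phi(rng.standard_normal((N, d)))
V = rng.standard_normal((N, d))

# Recurrent form: carry S (d x d) and z (d,), update per token in O(d^2).
S = np.zeros((d, d))
z = np.zeros(d)
out_recurrent = np.empty((N, d))
for t in range(N):
    S += np.outer(K[t], V[t])       # S_t = S_{t-1} + phi(k_t)^T v_t
    z += K[t]                       # z_t = z_{t-1} + phi(k_t)
    out_recurrent[t] = (Q[t] @ S) / (Q[t] @ z)

# Reference: masked quadratic computation of the same causal attention.
scores = (Q @ K.T) * np.tril(np.ones((N, N)))
out_reference = (scores @ V) / scores.sum(axis=1, keepdims=True)

assert np.allclose(out_recurrent, out_reference)
```

The constant-size state $(S, z)$ is what replaces the growing KV cache of softmax attention during autoregressive decoding; fused CUDA/Triton kernels essentially parallelize this same scan.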

4. Linear Attention in Computer Vision and Language

Linear attention mechanisms have been deployed in major Transformer architectures for both vision and language:

  • Vision Transformers (ViT, DeiT, Swin, PVT):
    • L$^2$ViT alternates windowed softmax attention with enhanced linear attention blocks, leveraging local concentration modules to balance global and local context (Zheng, 27 Jan 2025).
    • Agent Attention factors global attention through a handful of agent tokens, preserving global modeling at $O(Nnd)$ cost (Han et al., 2023).
    • Focused Linear Attention (FLatten) sharpens kernel feature angles and adds depthwise convs, empirically closing the gap in ImageNet classification and COCO detection (Han et al., 2023).
  • LLMs:
    • Recent distillation protocols (RADLADS) rapidly convert softmax-based LLMs to RWKV-style linear decoders at all scales, matching or closely tracking original teacher accuracy while reducing parameter, memory, and inference requirements (Goldstein et al., 5 May 2025).
    • MetaLA unifies linear attention architectures under a single theoretical framework, proposing an "optimal" design based on minimal parameterization, dynamic memory, and static expressivity (Chou et al., 2024).

Across both domains, linear attention allows modeling large contexts (up to 128K tokens (Zhang et al., 2023)), faster inference, and significant peak memory reductions—often with minimal or no loss in core downstream metrics.

5. Quality-Efficiency Trade-offs and Empirical Results

Empirical evaluations systematically benchmark linear attention against softmax attention across computer vision, language modeling, and scientific domains:

Computer Vision:

| Model | Top-1 (%) | Params (M) | FLOPs (G) | Relative to Softmax |
|---|---|---|---|---|
| Agent-DeiT-T | 74.9 | — | 1.2 | +2.7 p.p. |
| L$^2$ViT-Base | 84.4 | 89 | 15.9 | +0.9 p.p. vs Swin-B |
| RAVLT-S (RALA) | 84.4 | 26 | 4.6 | ≥ Swin-B |
| FLatten-DeiT-Tiny | 74.1 | 6.1 | 1.1 | +1.9 p.p. |

Ranking and ablation studies repeatedly show that modern rank-augmented, injective, or convolutionally-modulated linear attentions can close the performance gap to softmax or surpass it, with consistent reductions in inference cost and memory (Fan et al., 2024, Han et al., 2023, Han et al., 2023).

Language Modeling:

Distilled linear-attention LLMs (e.g., RADLADS conversions of softmax teachers) match or closely track the original models' accuracy at reduced memory and inference cost, and orthogonal-memory designs extend usable context to 128K tokens (Goldstein et al., 5 May 2025, Zhang et al., 2023).

6. Applications Beyond NLP and Computer Vision

Linear attention has enabled new advances in domains beyond traditional machine learning benchmarks:

  • Neural operators for PDEs: Linear kernelization generalizes earlier "Physics-Attention" structures, achieving both state-of-the-art accuracy and 30–40% reductions in compute and parameter count in PDE surrogates (e.g., Airfoil, AirfRANS, Shape-Net Car) (Hu et al., 9 Nov 2025).
  • Learned image compression: Bi-RWKV blocks with linear attention, spatial-channel mixing, and convolutional shifts yield superior BD-rate reductions on Kodak, Tecnick, and CLIC, outperforming other learned compressors at significantly lower memory (Feng et al., 9 Feb 2025).
  • Unbounded-context modeling: Orthogonal memory decomposition (LAVO) allows linear scaling to 128K-token language modeling, preserving extrapolation and matching or exceeding competing methods' perplexity (Zhang et al., 2023).

7. Outlook and Open Directions

Ongoing research addresses several open problems and avenues in linear attention:

  • Expressivity: Enhanced kernels (e.g., higher-order moments (Zhang et al., 31 Oct 2025)), rank-boosting (Fan et al., 2024), or hybrid sparse-dense mixtures (Lee et al., 2023) further improve the functional richness of linear attention.
  • Stability: Careful normalization (e.g., the MALA $\beta,\gamma$ scheme (Fan et al., 1 Jul 2025)) and nonnegative feature mappings are essential to avoid pathological degeneracies.
  • Efficiency and hardware optimization: Blockwise, fused-kernel and chunk-parallel training/inference are critical for realizing the theoretical gains of linear attention on modern accelerators (Gerami et al., 24 Oct 2025, Zhang et al., 31 Oct 2025).
  • Theoretical characterization: Training dynamics and fixed-point analyses show that parametrization choices (merged vs. separate Q/K) critically affect optimization pathologies and rates of in-context learning (Zhang et al., 27 Jan 2025).
  • Generalization across modalities: Linear attention has demonstrated competitive performance in speech, dense prediction, time-series, and scientific modeling, suggesting a broad applicability when engineered with the required domain-specific augmentations (Fan et al., 1 Jul 2025, Zhang et al., 2023).

The ongoing evolution of linear attention situates it as a key enabler for scaling sequence models across disciplines while maintaining computational tractability (Han et al., 2023, Fan et al., 2024, Fan et al., 1 Jul 2025).
