
Linear Attention Variants Overview

Updated 9 February 2026
  • Linear attention variants are efficient mechanisms that use kernel approximations to replace quadratic softmax, enabling scalable long-range modeling.
  • They employ techniques such as kernelization, gating, and hierarchical memory to achieve linear or subquadratic runtimes while reducing memory usage.
  • These methods have practical applications in NLP, vision, and scientific computing, balancing speed and efficiency with some trade-offs in expressiveness.

Linear attention variants constitute a broad and rapidly evolving class of attention mechanisms designed to circumvent the quadratic complexity bottleneck of softmax-based attention, enabling efficient modeling of long-range dependencies in sequence models, transformers, and neural operators. These methods replace or approximate the standard attention kernel or memory layout, yielding subquadratic (often strictly linear) runtime and memory with respect to sequence length. Designs range from fixed or kernelized low-rank factorization and data-dependent gating to hierarchical structure, agent-based mediation, and log-linear memory layouts. This article systematically catalogs principal linear attention designs, their mathematical foundations, empirical impact, theoretical limits, and open questions, with references to the recent literature.

1. Canonical Linear Attention: Kernelization and Feature Maps

The core principle of linear attention is to replace quadratic dot-product attention $\mathrm{softmax}(QK^\top/\sqrt{d})$, which requires explicit computation and storage of the $N \times N$ attention matrix, with a kernel function that admits an associative factorization. This is typically achieved via a feature map $\phi: \mathbb{R}^d \rightarrow \mathbb{R}^r$ such that

$$\exp(Q_i K_j^\top/\sqrt{d}) \approx \phi(Q_i)^\top \phi(K_j),$$

yielding the attention output, per token $i$,

$$A(Q,K,V)_i = \frac{\phi(Q_i) \sum_{j=1}^N \phi(K_j)^\top V_j}{\phi(Q_i) \sum_{m=1}^N \phi(K_m)^\top}.$$

Reordering allows for $O(Nd^2)$ computation, since $\sum_j \phi(K_j)^\top V_j$ can be precomputed once and shared across all queries (Fan et al., 1 Jul 2025, Nahshan et al., 2023).
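The reordering above can be sketched directly in NumPy. The elu+1 feature map used here is one illustrative choice from the literature, not the only option:

```python
import numpy as np

def phi(x):
    # elu(x) + 1: a simple positive feature map (illustrative choice).
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # O(N d^2): the key-value summary S and normalizer z are computed once
    # and shared across every query, avoiding the N x N attention matrix.
    S = phi(K).T @ V                     # (r, d_v) = sum_j phi(K_j)^T V_j
    z = phi(K).sum(axis=0)               # (r,)     = sum_m phi(K_m)
    return (phi(Q) @ S) / (phi(Q) @ z)[:, None]

def softmax_attention(Q, K, V):
    # O(N^2 d) reference implementation, for comparison.
    A = np.exp(Q @ K.T / np.sqrt(Q.shape[1]))
    return (A / A.sum(axis=1, keepdims=True)) @ V
```

Both functions return convex combinations of the rows of V, but only the linear version avoids materializing the full attention matrix.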

This design encompasses the original kernel-factorized models, including the Linear Transformer (simple fixed feature maps) and Performer (random-feature approximations).

Primitive linear attention, however, suffers in expressivity and concentration relative to softmax, often yielding under-concentrated attention maps and degraded accuracy.

2. Gated, Focused, and Magnitude-Aware Linear Attention

Several improvements address the limitations of kernel linear attention by enhancing selectivity, contextualization, or distributional fidelity.

Focused Linear Attention (FLA) introduces a learned “focus” kernel via

$$\phi_p(x) = f_p(\mathrm{ReLU}(x)), \qquad f_p(x)_i = \|x\|_2 \frac{x_i^p}{\|x^p\|_2},$$

sharpening similarity structure, combined with a depthwise convolutional bias and a gated MLP applied post-attention. FLA is adopted in speech separation models (FLA-SepReformer and FLA-TFLocoformer), demonstrating 1.5–2.3× speedup and up to a 5× reduction in memory usage (depending on model size), while retaining near SOTA SI-SNRi/SDRi (Wang et al., 27 Aug 2025).
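The focus map can be written in a few lines of NumPy; the exponent `p` and the small denominator guard are illustrative assumptions, not values from the paper:

```python
import numpy as np

def focused_map(x, p=3):
    # phi_p(x) = f_p(ReLU(x)), with f_p(x)_i = ||x||_2 * x_i^p / ||x^p||_2.
    # Raising coordinates to the p-th power sharpens the distribution while
    # the outer scaling preserves the vector norm. eps is an illustrative guard.
    x = np.maximum(x, 0.0)
    xp = x ** p
    eps = 1e-12
    return (np.linalg.norm(x, axis=-1, keepdims=True)
            * xp / (np.linalg.norm(xp, axis=-1, keepdims=True) + eps))
```

Because the map is monotone on nonnegative inputs, it preserves the ranking of coordinates while concentrating mass on the largest ones.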

Magnitude-Aware Linear Attention (MALA) targets the “magnitude neglect” property of conventional variants (the cancellation of $\|\phi(Q_i)\|$ in both numerator and denominator), which leads to flat attention scaling under query-norm changes:

$$A(Q,K,V)_i = \frac{\vec\alpha_i S}{\vec\alpha_i z}, \qquad \vec\alpha_i = \phi(Q_i)/\|\phi(Q_i)\|.$$

MALA introduces an additive normalization and a per-query scale parameter $\beta$, restoring softmax-like sharpness dynamics:

$$Y_i = \beta \phi(Q_i) S - \gamma \sum_j V_j,$$

where $\beta = 1 + 1/(\phi(Q_i) \sum_m \phi(K_m)^\top)$ and $\gamma = (\phi(Q_i) \sum_m \phi(K_m)^\top)/N$. This correction brings MALA’s attention distribution much closer to softmax, empirically closing the accuracy gap on vision and language tasks (Fan et al., 1 Jul 2025).
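A minimal sketch of the MALA output, following the formulas above, assuming the feature-mapped queries and keys are already computed (shapes and variable names are illustrative):

```python
import numpy as np

def mala_output(phiQ, phiK, V):
    # Y_i = beta * phi(Q_i) S - gamma * sum_j V_j, with
    # beta = 1 + 1/(phi(Q_i) sum_m phi(K_m)^T),
    # gamma = (phi(Q_i) sum_m phi(K_m)^T) / N.
    # Shapes: phiQ (N, r), phiK (N, r), V (N, d_v). A sketch, not the
    # authors' reference implementation.
    N = V.shape[0]
    S = phiK.T @ V                # (r, d_v) = sum_j phi(K_j)^T V_j
    z = phiK.sum(axis=0)          # (r,)     = sum_m phi(K_m)
    qz = phiQ @ z                 # (N,) per-query scalar phi(Q_i) sum_m phi(K_m)^T
    beta = 1.0 + 1.0 / qz
    gamma = qz / N
    return beta[:, None] * (phiQ @ S) - gamma[:, None] * V.sum(axis=0)
```

A quick sanity check of the formulas: when V is constant, the additive term cancels the rescaling and the output equals that constant, as for any properly normalized attention.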

Linear Log-Normal Attention (LLN) reconstructs the distributional statistics of softmax attention by matching the log-normal shape of its entries through exponential feature maps:

$$\Phi_{\mathcal{Q}}(q) = \exp(\alpha q), \qquad \Phi_{\mathcal{K}}(k) = \exp(\beta k),$$

with hyperparameters $(\alpha, \beta)$ chosen via variance matching. LLN achieves attention-concentration behavior closely paralleling softmax but with $O(Nd^2)$ complexity, outperforming other linearized alternatives on GLUE, LRA, and imaging benchmarks, especially when combined with local softmax on blocks (Nahshan et al., 2023).

3. Advanced Mechanisms: Memory Expansion, Agents, and Hierarchy

Linear attention mechanisms can be significantly extended using hierarchical, agent, and higher-order structures.

Log-Linear Attention replaces the single fixed-size hidden state of a standard linear recurrent update with $O(\log T)$ states at different temporal resolutions. Each token update distributes context into multiple Fenwick tree–style aggregation buckets, resulting in per-token compute and memory of $O(\log T)$ and aggregate compute of $O(T \log T)$. Applied to architectures like Mamba-2 and Gated DeltaNet, log-linear variants dramatically improve long-range recall and out-of-distribution generalization (Guo et al., 5 Jun 2025).
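The Fenwick-style decomposition behind the $O(\log T)$ state count can be illustrated with the standard prefix partition; this sketch shows only the bucketing, not the full recurrent update:

```python
def fenwick_buckets(t):
    # Partition the prefix [0, t) into dyadic segments, exactly as a
    # Fenwick (binary indexed) tree decomposes a prefix query. The number
    # of segments equals the popcount of t, which is at most ceil(log2 t) + 1,
    # so a token at position t reads O(log t) aggregated states.
    buckets, hi = [], t
    while hi > 0:
        lo = hi - (hi & -hi)      # drop the lowest set bit of hi
        buckets.append((lo, hi))  # segment [lo, hi) of length (hi & -hi)
        hi = lo
    return buckets
```

For example, the prefix of length 13 = 8 + 4 + 1 splits into three segments, one per set bit of 13.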

Higher-Order Linear Attention (HLA) generalizes linear attention by maintaining streaming sufficient statistics for higher-degree polynomials of queries and keys, e.g., second-order moments:

$$o_t = q_t^\top S_t^K C_t^{QV}, \qquad S_t^K = \sum_{i \le t} k_i k_i^\top, \qquad C_t^{QV} = \sum_{i \le t} q_i v_i^\top.$$

This enables expressivity beyond simple pairwise mixing, at only $O(d^2 + d d_v)$ per-token cost. Masked and higher-order variants are constructed via associative scan algorithms (Zhang et al., 31 Oct 2025).
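A streaming sketch of these second-order statistics, following the formulas above (unmasked and unnormalized, with illustrative shapes):

```python
import numpy as np

def hla_stream(Q, K, V):
    # Maintain S_t^K = sum_{i<=t} k_i k_i^T and C_t^{QV} = sum_{i<=t} q_i v_i^T
    # as running sums, emitting o_t = q_t^T S_t^K C_t^{QV} each step.
    # Per-token cost is O(d^2 + d d_v); no N x N matrix is ever formed.
    d, dv = K.shape[1], V.shape[1]
    S = np.zeros((d, d))
    C = np.zeros((d, dv))
    outs = []
    for q, k, v in zip(Q, K, V):
        S += np.outer(k, k)
        C += np.outer(q, v)
        outs.append(q @ S @ C)
    return np.stack(outs)
```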

Agent-based Mechanisms (LANO) utilize a small number $M \ll N$ of “agent” tokens that mediate global information exchange among all $N$ sequence positions. LANO constructs a two-stage attention effect: softmax attention first aggregates from all keys to each agent, then from the agents back to the sequence queries. This scheme matches softmax-level expressivity, supports universal approximation of integral operators, and achieves significant performance gains (19.5% average error reduction) on PDE benchmarks (Zhong et al., 19 Oct 2025).
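The two-stage mediation can be sketched as follows; the agent matrix `A` is assumed to hold learned agent tokens, and the naming is illustrative, not LANO's reference code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def agent_attention(Q, K, V, A):
    # Stage 1: M agents (A: (M, d)) aggregate over all N keys via softmax.
    # Stage 2: the N queries read from the M agent summaries.
    # Cost is O(N M d) rather than O(N^2 d) since M << N.
    agent_vals = softmax(A @ K.T) @ V      # (M, d_v)
    return softmax(Q @ A.T) @ agent_vals   # (N, d_v)
```

Each stage produces convex combinations, so the composite is itself a (low-rank, data-dependent) convex mixing of the values.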

4. Learnable Feature Maps and Hybrid Variants

A recent thrust in linear attention research is to eliminate the reliance on fixed kernel feature maps by fully learning $\phi$.

LUNA (Linear Universal Neural Attention) parameterizes the kernel feature map with a composition of MLP “channel” functions, linear projections, and token-wise envelopes. This ensures the resulting kernel remains positive-definite and adaptive to the data geometry, overcoming the representational limitations of static random features. LUNA matches or exceeds prior efficient transformer accuracy on LRA and, after a brief fine-tuning phase, recovers >99% of a pretrained BERT’s or ViT’s accuracy in post-hoc conversion experiments. Theoretical guarantees underpin LUNA, establishing both universal approximation and explicit generalization error bounds (Shahbazi et al., 8 Dec 2025).

Hybrid Linear–Full Attention strategies, wherein linear attention layers are interleaved with occasional softmax layers, allow for near-quadratic recall at a fraction of the compute and memory cost. Systematic analysis confirms that, for language modeling, quality is stable across wide hybridization ratios, but recall (diagnosed via “RULER” tasks) saturates only when at least one full softmax layer per 3–6 linear layers is included (e.g., the HGRN-2 hybrid at a 6:1 ratio or the GatedDeltaNet hybrid at 3:1) (Wang et al., 8 Jul 2025). However, there exists a provable expressiveness hierarchy: for multi-step function-composition tasks, full attention networks with $L+1$ layers strictly outperform any hybrid network with $L-1$ full attention layers and even exponentially many linear layers (Ye et al., 2 Feb 2026).
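As a toy illustration of such interleaving, a layer schedule that places one softmax layer after every `ratio` linear layers might look like this (purely illustrative naming; real hybrids tune placement per architecture):

```python
def hybrid_schedule(n_layers, ratio):
    # One full softmax layer per `ratio` linear layers, e.g. ratio=6
    # mirrors the 6:1 linear:full ratio discussed above.
    return ["softmax" if (i + 1) % (ratio + 1) == 0 else "linear"
            for i in range(n_layers)]
```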

5. Efficient and Interpretable Implementations

Hardware-efficient algorithms such as FLASHLINEARATTENTION and its generalization to matrix-gated recurrences (GLA), or Tiled Flash Linear Attention (TFLA), address not only asymptotic complexity but also practical wall-clock throughput and GPU memory use. By chunkwise and tiled parallelization, careful state materialization, and memory-movement minimization, these implementations achieve 2–4× speedup over even optimized softmax kernels (FlashAttention-2/3), especially on long sequences (up to 32K tokens) (Yang et al., 2023, Beck et al., 18 Mar 2025).

Distillation and interpretability are increasingly important. Distillation protocols (e.g., RADLADS) allow rapid conversion of pretrained softmax models into linear-attention decoders (e.g., RAD-RWKV6/7), with minimal performance loss and large speed gains at scale (Goldstein et al., 5 May 2025). Similarly, SEA (Sparse Linear Attention with Estimated Mask) uses efficient kernel estimation and learned sparse masking to offer both interpretability (full or approximate attention matrices) and memory/compute efficiency comparable to standard kernelized models (Lee et al., 2023).

6. Theoretical Guarantees, Limitations, and Design Criteria

Theoretical analysis clarifies both opportunities and inherent trade-offs:

  • Optimal Linear Approximations: MetaLA offers a unified framework for optimal linear approximation, showing that dynamic memory (gating), static approximation (via queries plus decay), and parameter minimality are jointly achievable via an RNN-like update: $S_t = \mathrm{diag}(\alpha_t) S_{t-1} + (1-\alpha_t)^\top v_t$.
  • Universal Approximation: Agent-based (LANO), learned-kernel (LUNA), and hierarchical mechanisms can, under suitable parameterization, approximate general integral operators or sequence mixing kernels to arbitrary precision (Zhong et al., 19 Oct 2025, Shahbazi et al., 8 Dec 2025).
  • Expressiveness Bounds: No linear attention model, regardless of width or depth, matches the compositional reasoning power of even a modestly deep full attention network (Ye et al., 2 Feb 2026).
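The MetaLA-style gated update in the first bullet can be sketched per step; this is a sketch of the recurrence only, not the full layer, and the shapes are illustrative assumptions:

```python
import numpy as np

def metala_step(S_prev, alpha_t, v_t):
    # S_t = diag(alpha_t) S_{t-1} + (1 - alpha_t)^T v_t.
    # alpha_t in (0,1)^d gates per-channel decay of the running state;
    # assumed shapes: S_prev (d, d_v), alpha_t (d,), v_t (d_v,).
    return alpha_t[:, None] * S_prev + np.outer(1.0 - alpha_t, v_t)
```

At alpha_t = 1 the state is retained unchanged; at alpha_t = 0 it is overwritten by the new value, with intermediate gates interpolating between the two.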

Principal limitations of most linear variants include:

  • Reduced ability to model sharply concentrated distributions (“spikiness”) compared to softmax, though MALA/LLN and some learnable kernel approaches mitigate this;
  • Insufficient memory or capacity for exact recall over very long or highly multi-hop contexts, unless augmented with hybrid, hierarchical, or log-linear structures (Wang et al., 8 Jul 2025, Guo et al., 5 Jun 2025);
  • Potential for instability or approximation error if the feature map is poorly chosen or learned without explicit PD constraints (Shahbazi et al., 8 Dec 2025, Nahshan et al., 2023).

7. Empirical Performance and Application Domains

Experimental evaluations consistently demonstrate that state-of-the-art linear attention variants (especially those employing gating, learnable kernels, or log-linear memory) achieve performance on par with, or exceeding, softmax attention on tasks such as speech separation (Wang et al., 27 Aug 2025), PDE solvers (Zhong et al., 19 Oct 2025, Hu et al., 9 Nov 2025), language modeling (Chou et al., 2024, Wang et al., 8 Jul 2025, Yang et al., 2023), image classification (Nahshan et al., 2023, Fan et al., 1 Jul 2025, Shahbazi et al., 8 Dec 2025), and time-series forecasting as structured VAR models (Lu et al., 11 Feb 2025). Architectures like FLA-TFLocoformer and FLA-SepReformer deliver 1.5–2.3× speedups with only 15–32% of the baseline memory, while LUNA, LANO, and MetaLA establish new state-of-the-art long-range modeling under compute-parity benchmarks.

A summary table of representative linear attention variants:

| Variant | Key Mechanism | Asymptotic Complexity | Distinctive Features |
| --- | --- | --- | --- |
| Linear Transformer, Performer | Kernel-based factorization | $O(Nd^2)$ | Simple fixed $\phi$; random features |
| FLA, GLA, RetNet | Data-dependent gating, low rank | $O(Nd^2)$ | Sharper selectivity, dynamic decay |
| MALA, LLN | Magnitude awareness, log-normal matching | $O(Nd^2)$ | Softmax-like scaling, statistical matching |
| LUNA, MetaLA | Learnable feature maps | $O(ND^2)$ | Task-adaptive kernel |
| HLA, Log-Linear | Higher-order / hierarchical states | $O(Nd^2)$, $O(N\log N)$ | Polynomial and multi-scale capacity |
| LANO (agent-based) | Two-stage attention with agents | $O(NMd)$, $M \ll N$ | Universal approximation |
| Hybrid Gen-2/3 | Interleaved with softmax layers | Mixed | SOTA recall at 3–6:1 linear:full |
| TFLA, FLASHLINEARATTENTION | Chunked/tiled hardware kernels | $O(Nd^2)$ (wall-clock optimized) | Peak throughput, low memory |

These variants are widely applied in domains requiring efficient long-sequence modeling, such as speech/audio (Wang et al., 27 Aug 2025), scientific computing (Zhong et al., 19 Oct 2025, Hu et al., 9 Nov 2025), large-scale NLP (Chou et al., 2024, Wang et al., 8 Jul 2025, Yang et al., 2023, Goldstein et al., 5 May 2025), vision (Nahshan et al., 2023, Fan et al., 1 Jul 2025, Shahbazi et al., 8 Dec 2025), and mixed-modality systems.


In sum, linear attention variants encompass a spectrum of design principles and hybridizations that trade quadratically expensive global mixing for algorithmic, memory, and runtime efficiency. Innovations in kernel design, gating, memory structures, learning paradigms, and hardware-aware implementation have together advanced linear attention close to softmax-level performance across a wide range of benchmarks, while theory provides both guidance and boundaries for their future evolution.
