
Linear Attention Variants Overview

Updated 9 February 2026
  • Linear attention variants are efficient mechanisms that use kernel approximations to replace quadratic softmax, enabling scalable long-range modeling.
  • They employ techniques such as kernelization, gating, and hierarchical memory to achieve linear or subquadratic runtimes while reducing memory usage.
  • These methods have practical applications in NLP, vision, and scientific computing, balancing speed and efficiency with some trade-offs in expressiveness.

Linear attention variants constitute a broad and rapidly evolving class of attention mechanisms designed to circumvent the quadratic complexity bottleneck of softmax-based attention, enabling efficient modeling of long-range dependencies in sequence models, transformers, and neural operators. These methods replace or approximate the standard attention kernel or memory layout, yielding subquadratic (often strictly linear) runtime and memory with respect to sequence length. Designs range from fixed or kernelized low-rank factorization and data-dependent gating to hierarchical structure, agent-based mediation, and log-linear memory layouts. This article systematically catalogs principal linear attention designs, their mathematical foundations, empirical impact, theoretical limits, and open questions, with references to the recent literature.

1. Canonical Linear Attention: Kernelization and Feature Maps

The core principle of linear attention is to replace quadratic dot-product attention $\mathrm{softmax}(QK^\top/\sqrt{d})$, which requires explicit computation and storage of the $N \times N$ attention matrix, with a kernel function that admits an associative factorization. This is typically achieved via a feature map $\phi: \mathbb{R}^d \rightarrow \mathbb{R}^r$ such that

$$\exp(Q_i K_j^\top/\sqrt{d}) \approx \phi(Q_i)^\top \phi(K_j),$$

yielding the attention output, per token $i$,

$$A(Q,K,V)_i = \frac{\phi(Q_i) \sum_{j=1}^N \phi(K_j)^\top V_j}{\phi(Q_i) \sum_{m=1}^N \phi(K_m)^\top}.$$

Reordering allows for $O(Nd^2)$ computation, since $\sum_j \phi(K_j)^\top V_j$ can be precomputed once and shared across all queries (Fan et al., 1 Jul 2025, Nahshan et al., 2023).
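The reordering above can be sketched directly in NumPy. The elu+1 feature map used here is one illustrative choice from the literature, not the only option:

```python
import numpy as np

def phi(x):
    # elu(x) + 1: a simple positive feature map (illustrative choice).
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # O(N d^2): the key-value summary S and normalizer z are computed once
    # and shared across every query, avoiding the N x N attention matrix.
    S = phi(K).T @ V                     # (r, d_v) = sum_j phi(K_j)^T V_j
    z = phi(K).sum(axis=0)               # (r,)     = sum_m phi(K_m)
    return (phi(Q) @ S) / (phi(Q) @ z)[:, None]

def softmax_attention(Q, K, V):
    # O(N^2 d) reference implementation, for comparison.
    A = np.exp(Q @ K.T / np.sqrt(Q.shape[1]))
    return (A / A.sum(axis=1, keepdims=True)) @ V
```

Both functions return convex combinations of the rows of V, but only the linear version avoids materializing the full attention matrix.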

This design encompasses the original kernel-factorized models, including the Linear Transformer (simple fixed feature maps) and Performer (random-feature approximations).

Primitive linear attention, however, suffers in expressivity and concentration relative to softmax, often yielding under-concentrated attention maps and degraded accuracy.

2. Gated, Focused, and Magnitude-Aware Linear Attention

Several improvements address the limitations of kernel linear attention by enhancing selectivity, contextualization, or distributional fidelity.

Focused Linear Attention (FLA) introduces a learned “focus” kernel via

$$\phi_p(x) = f_p(\mathrm{ReLU}(x)), \qquad f_p(x)_i = \|x\|_2 \frac{x_i^p}{\|x^p\|_2},$$

sharpening similarity structure, combined with a depthwise convolutional bias and a gated MLP applied post-attention. FLA is adopted in speech separation models (FLA-SepReformer and FLA-TFLocoformer), demonstrating 1.5–2.3× speedup and up to a 5× reduction in memory usage (depending on model size), while retaining near SOTA SI-SNRi/SDRi (Wang et al., 27 Aug 2025).
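The focus map can be written in a few lines of NumPy; the exponent `p` and the small denominator guard are illustrative assumptions, not values from the paper:

```python
import numpy as np

def focused_map(x, p=3):
    # phi_p(x) = f_p(ReLU(x)), with f_p(x)_i = ||x||_2 * x_i^p / ||x^p||_2.
    # Raising coordinates to the p-th power sharpens the distribution while
    # the outer scaling preserves the vector norm. eps is an illustrative guard.
    x = np.maximum(x, 0.0)
    xp = x ** p
    eps = 1e-12
    return (np.linalg.norm(x, axis=-1, keepdims=True)
            * xp / (np.linalg.norm(xp, axis=-1, keepdims=True) + eps))
```

Because the map is monotone on nonnegative inputs, it preserves the ranking of coordinates while concentrating mass on the largest ones.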

Magnitude-Aware Linear Attention (MALA) targets the “magnitude neglect” property of conventional variants (the cancellation of $\|\phi(Q_i)\|$ in both numerator and denominator), which leads to flat attention scaling under query-norm changes:

$$A(Q,K,V)_i = \frac{\vec\alpha_i S}{\vec\alpha_i z}, \qquad \vec\alpha_i = \phi(Q_i)/\|\phi(Q_i)\|.$$

MALA introduces an additive normalization and a per-query scale parameter $\beta$, restoring softmax-like sharpness dynamics:

$$Y_i = \beta \phi(Q_i) S - \gamma \sum_j V_j,$$

where $\beta = 1 + 1/(\phi(Q_i) \sum_m \phi(K_m)^\top)$ and $\gamma = (\phi(Q_i) \sum_m \phi(K_m)^\top)/N$. This correction brings MALA’s attention distribution much closer to softmax, empirically closing the accuracy gap on vision and language tasks (Fan et al., 1 Jul 2025).
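A minimal sketch of the MALA output, following the formulas above, assuming the feature-mapped queries and keys are already computed (shapes and variable names are illustrative):

```python
import numpy as np

def mala_output(phiQ, phiK, V):
    # Y_i = beta * phi(Q_i) S - gamma * sum_j V_j, with
    # beta = 1 + 1/(phi(Q_i) sum_m phi(K_m)^T),
    # gamma = (phi(Q_i) sum_m phi(K_m)^T) / N.
    # Shapes: phiQ (N, r), phiK (N, r), V (N, d_v). A sketch, not the
    # authors' reference implementation.
    N = V.shape[0]
    S = phiK.T @ V                # (r, d_v) = sum_j phi(K_j)^T V_j
    z = phiK.sum(axis=0)          # (r,)     = sum_m phi(K_m)
    qz = phiQ @ z                 # (N,) per-query scalar phi(Q_i) sum_m phi(K_m)^T
    beta = 1.0 + 1.0 / qz
    gamma = qz / N
    return beta[:, None] * (phiQ @ S) - gamma[:, None] * V.sum(axis=0)
```

A quick sanity check of the formulas: when V is constant, the additive term cancels the rescaling and the output equals that constant, as for any properly normalized attention.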

Linear Log-Normal Attention (LLN) reconstructs the distributional statistics of softmax attention by matching the log-normal shape of its entries through exponential feature maps:

$$\Phi_{\mathcal{Q}}(q) = \exp(\alpha q), \qquad \Phi_{\mathcal{K}}(k) = \exp(\beta k),$$

with hyperparameters $(\alpha, \beta)$ chosen via variance matching. LLN achieves attention-concentration behavior closely paralleling softmax but with $O(Nd^2)$ complexity, outperforming other linearized alternatives on GLUE, LRA, and imaging benchmarks, especially when combined with local softmax on blocks (Nahshan et al., 2023).

3. Advanced Mechanisms: Memory Expansion, Agents, and Hierarchy

Linear attention mechanisms can be significantly extended using hierarchical, agent, and higher-order structures.

Log-Linear Attention replaces the single fixed-size hidden state of a standard linear recurrent update with $O(\log T)$ states at different temporal resolutions. Each token update distributes context into multiple Fenwick tree–style aggregation buckets, resulting in per-token compute and memory of $O(\log T)$ and aggregate compute of $O(T \log T)$. Applied to architectures like Mamba-2 and Gated DeltaNet, log-linear variants dramatically improve long-range recall and out-of-distribution generalization (Guo et al., 5 Jun 2025).
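The Fenwick-style decomposition behind the $O(\log T)$ state count can be illustrated with the standard prefix partition; this sketch shows only the bucketing, not the full recurrent update:

```python
def fenwick_buckets(t):
    # Partition the prefix [0, t) into dyadic segments, exactly as a
    # Fenwick (binary indexed) tree decomposes a prefix query. The number
    # of segments equals the popcount of t, which is at most ceil(log2 t) + 1,
    # so a token at position t reads O(log t) aggregated states.
    buckets, hi = [], t
    while hi > 0:
        lo = hi - (hi & -hi)      # drop the lowest set bit of hi
        buckets.append((lo, hi))  # segment [lo, hi) of length (hi & -hi)
        hi = lo
    return buckets
```

For example, the prefix of length 13 = 8 + 4 + 1 splits into three segments, one per set bit of 13.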

Higher-Order Linear Attention (HLA) generalizes linear attention by maintaining streaming sufficient statistics for higher-degree polynomials of queries and keys, e.g., second-order moments:

$$o_t = q_t^\top S_t^K C_t^{QV}, \qquad S_t^K = \sum_{i \le t} k_i k_i^\top, \qquad C_t^{QV} = \sum_{i \le t} q_i v_i^\top.$$

This enables expressivity beyond simple pairwise mixing, at only $O(d^2 + d d_v)$ per-token cost. Masked and higher-order variants are constructed via associative scan algorithms (Zhang et al., 31 Oct 2025).
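A streaming sketch of these second-order statistics, following the formulas above (unmasked and unnormalized, with illustrative shapes):

```python
import numpy as np

def hla_stream(Q, K, V):
    # Maintain S_t^K = sum_{i<=t} k_i k_i^T and C_t^{QV} = sum_{i<=t} q_i v_i^T
    # as running sums, emitting o_t = q_t^T S_t^K C_t^{QV} each step.
    # Per-token cost is O(d^2 + d d_v); no N x N matrix is ever formed.
    d, dv = K.shape[1], V.shape[1]
    S = np.zeros((d, d))
    C = np.zeros((d, dv))
    outs = []
    for q, k, v in zip(Q, K, V):
        S += np.outer(k, k)
        C += np.outer(q, v)
        outs.append(q @ S @ C)
    return np.stack(outs)
```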

Agent-based Mechanisms (LANO) utilize a small number $M \ll N$ of “agent” tokens that mediate global information exchange among all $N$ sequence positions. LANO constructs a two-stage attention effect: softmax attention first aggregates from all keys to each agent, then from the agents back to the sequence queries. This scheme matches softmax-level expressivity, supports universal approximation of integral operators, and achieves significant performance gains (19.5% average error reduction) on PDE benchmarks (Zhong et al., 19 Oct 2025).
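The two-stage mediation can be sketched as follows; the agent matrix `A` is assumed to hold learned agent tokens, and the naming is illustrative, not LANO's reference code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def agent_attention(Q, K, V, A):
    # Stage 1: M agents (A: (M, d)) aggregate over all N keys via softmax.
    # Stage 2: the N queries read from the M agent summaries.
    # Cost is O(N M d) rather than O(N^2 d) since M << N.
    agent_vals = softmax(A @ K.T) @ V      # (M, d_v)
    return softmax(Q @ A.T) @ agent_vals   # (N, d_v)
```

Each stage produces convex combinations, so the composite is itself a (low-rank, data-dependent) convex mixing of the values.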

4. Learnable Feature Maps and Hybrid Variants

A recent thrust in linear attention research is to eliminate the reliance on fixed kernel feature maps by fully learning $\phi$.

LUNA (Linear Universal Neural Attention) parameterizes the kernel feature map with a composition of MLP “channel” functions, linear projections, and token-wise envelopes. This ensures the resulting kernel remains positive-definite and adaptive to the data geometry, overcoming the representational limitations of static random features. LUNA matches or exceeds prior efficient transformer accuracy on LRA and, after a brief fine-tuning phase, recovers >99% of a pretrained BERT’s or ViT’s accuracy in post-hoc conversion experiments. Theoretical guarantees underpin LUNA, establishing both universal approximation and explicit generalization error bounds (Shahbazi et al., 8 Dec 2025).

Hybrid Linear–Full Attention strategies, wherein linear attention layers are interleaved with occasional softmax layers, allow for near-quadratic recall at a fraction of the compute and memory cost. Systematic analysis confirms that, for language modeling, quality is stable across wide hybridization ratios, but recall (diagnosed via “RULER” tasks) saturates only when at least one full softmax layer per 3–6 linear layers is included (e.g., the HGRN-2 hybrid at a 6:1 ratio or the GatedDeltaNet hybrid at 3:1) (Wang et al., 8 Jul 2025). However, there exists a provable expressiveness hierarchy: for multi-step function-composition tasks, full attention networks with $L+1$ layers strictly outperform any hybrid network with $L-1$ full attention layers and even exponentially many linear layers (Ye et al., 2 Feb 2026).
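As a toy illustration of such interleaving, a layer schedule that places one softmax layer after every `ratio` linear layers might look like this (purely illustrative naming; real hybrids tune placement per architecture):

```python
def hybrid_schedule(n_layers, ratio):
    # One full softmax layer per `ratio` linear layers, e.g. ratio=6
    # mirrors the 6:1 linear:full ratio discussed above.
    return ["softmax" if (i + 1) % (ratio + 1) == 0 else "linear"
            for i in range(n_layers)]
```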

5. Efficient and Interpretable Implementations

Hardware-efficient algorithms such as FLASHLINEARATTENTION and its generalization to matrix-gated recurrences (GLA), or Tiled Flash Linear Attention (TFLA), address not only asymptotic complexity but also practical wall-clock throughput and GPU memory use. By chunkwise and tiled parallelization, careful state materialization, and memory-movement minimization, these implementations achieve 2–4× speedup over even optimized softmax kernels (FlashAttention-2/3), especially on long sequences (up to 32K tokens) (Yang et al., 2023, Beck et al., 18 Mar 2025).

Distillation and interpretability are increasingly important. Distillation protocols (e.g., RADLADS) allow rapid conversion of pretrained softmax models into linear-attention decoders (e.g., RAD-RWKV6/7), with minimal performance loss and large speed gains at scale (Goldstein et al., 5 May 2025). Similarly, SEA (Sparse Linear Attention with Estimated Mask) uses efficient kernel estimation and learned sparse masking to offer both interpretability (full or approximate attention matrices) and memory/compute efficiency comparable to standard kernelized models (Lee et al., 2023).

6. Theoretical Guarantees, Limitations, and Design Criteria

Theoretical analysis clarifies both opportunities and inherent trade-offs:

  • Optimal Linear Approximations: MetaLA offers a unified framework for optimal linear approximation, showing that dynamic memory (gating), static approximation (via queries plus decay), and parameter minimality are jointly achievable via an RNN-like update: $S_t = \mathrm{diag}(\alpha_t) S_{t-1} + (1-\alpha_t)^\top v_t$.
  • Universal Approximation: Agent-based (LANO), learned-kernel (LUNA), and hierarchical mechanisms can, under suitable parameterization, approximate general integral operators or sequence mixing kernels to arbitrary precision (Zhong et al., 19 Oct 2025, Shahbazi et al., 8 Dec 2025).
  • Expressiveness Bounds: No linear attention model, regardless of width or depth, matches the compositional reasoning power of even a modestly deep full attention network (Ye et al., 2 Feb 2026).
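The MetaLA-style gated update in the first bullet can be sketched per step; this is a sketch of the recurrence only, not the full layer, and the shapes are illustrative assumptions:

```python
import numpy as np

def metala_step(S_prev, alpha_t, v_t):
    # S_t = diag(alpha_t) S_{t-1} + (1 - alpha_t)^T v_t.
    # alpha_t in (0,1)^d gates per-channel decay of the running state;
    # assumed shapes: S_prev (d, d_v), alpha_t (d,), v_t (d_v,).
    return alpha_t[:, None] * S_prev + np.outer(1.0 - alpha_t, v_t)
```

At alpha_t = 1 the state is retained unchanged; at alpha_t = 0 it is overwritten by the new value, with intermediate gates interpolating between the two.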

Principal limitations of most linear variants include:

  • Reduced ability to model sharply concentrated distributions (“spikiness”) compared to softmax, though MALA/LLN and some learnable kernel approaches mitigate this;
  • Insufficient memory or capacity for exact recall over very long or highly multi-hop contexts, unless augmented with hybrid, hierarchical, or log-linear structures (Wang et al., 8 Jul 2025, Guo et al., 5 Jun 2025);
  • Potential for instability or approximation error if the feature map is poorly chosen or learned without explicit PD constraints (Shahbazi et al., 8 Dec 2025, Nahshan et al., 2023).

7. Empirical Performance and Application Domains

Experimental evaluations consistently demonstrate that state-of-the-art linear attention variants (especially those employing gating, learnable kernels, or log-linear memory) achieve performance on par with, or exceeding, softmax attention on tasks such as speech separation (Wang et al., 27 Aug 2025), PDE solvers (Zhong et al., 19 Oct 2025, Hu et al., 9 Nov 2025), language modeling (Chou et al., 2024, Wang et al., 8 Jul 2025, Yang et al., 2023), image classification (Nahshan et al., 2023, Fan et al., 1 Jul 2025, Shahbazi et al., 8 Dec 2025), and time-series forecasting as structured VAR models (Lu et al., 11 Feb 2025). Architectures like FLA-TFLocoformer and FLA-SepReformer deliver 1.5–2.3× speedups with only 15–32% of the baseline memory, while LUNA, LANO, and MetaLA establish new state-of-the-art long-range modeling under compute-parity benchmarks.

A summary table of representative linear attention variants:

| Variant | Key Mechanism | Asymptotic Complexity | Distinctive Features |
| --- | --- | --- | --- |
| Linear Transformer, Performer | Kernel-based factorization | $O(Nd^2)$ | Simple fixed $\phi$; random features |
| FLA, GLA, RetNet | Data-dependent gating, low rank | $O(Nd^2)$ | Sharper selectivity, dynamic decay |
| MALA, LLN | Magnitude awareness, log-normal matching | $O(Nd^2)$ | Softmax-like scaling, statistical matching |
| LUNA, MetaLA | Learnable feature maps | $O(ND^2)$ | Task-adaptive kernel |
| HLA, Log-Linear | Higher-order / hierarchical states | $O(Nd^2)$, $O(N\log N)$ | Polynomial and multi-scale capacity |
| LANO (agent-based) | Two-stage attention with agents | $O(NMd)$, $M \ll N$ | Universal approximation |
| Hybrid Gen-2/3 | Interleaved with softmax layers | Mixed | SOTA recall at 3–6:1 linear:full |
| TFLA, FLASHLINEARATTENTION | Chunked/tiled hardware kernels | $O(Nd^2)$ (wall-clock optimized) | Peak throughput, low memory |

These variants are widely applied in domains requiring efficient long-sequence modeling, such as speech/audio (Wang et al., 27 Aug 2025), scientific computing (Zhong et al., 19 Oct 2025, Hu et al., 9 Nov 2025), large-scale NLP (Chou et al., 2024, Wang et al., 8 Jul 2025, Yang et al., 2023, Goldstein et al., 5 May 2025), vision (Nahshan et al., 2023, Fan et al., 1 Jul 2025, Shahbazi et al., 8 Dec 2025), and mixed-modality systems.


In sum, linear attention variants encompass a spectrum of design principles and hybridizations that trade quadratically expensive global mixing for algorithmic, memory, and runtime efficiency. Innovations in kernel design, gating, memory structures, learning paradigms, and hardware-aware implementation have together advanced linear attention close to softmax-level performance across a wide range of benchmarks, while theory provides both guidance and boundaries for their future evolution.
