
Lazy Attention in Transformers

Updated 9 February 2026
  • Lazy Attention is a self-attention mechanism that refines standard Transformers via focused sparsity and amortized computation to mitigate inefficiency and attention misallocation.
  • Focused Lazy Attention employs learnable positional biases and Elastic-Softmax normalization to selectively concentrate attention, significantly reducing spurious activations and enabling true sparsity.
  • LazyFormer reuses computed attention across layers by amortizing its cost, achieving noticeable efficiency gains and maintaining robust performance on long-sequence tasks.

Lazy Attention refers to families of self-attention mechanisms in Transformer architectures that address limitations in standard attention—such as inefficiency, representational collapse, and attention sink—by two primary strategies: (1) focusing and sparsifying attention weights, and (2) amortizing or reusing attention computations across multiple layers, thereby decreasing computational cost. Two distinct methods known as Lazy Attention are prominent: a focused self-attention mechanism introduced by Fu et al. ("Attention Needs to Focus: A Unified Perspective on Attention Allocation," 2026) (Fu et al., 1 Jan 2026), and LazyFormer ("Self Attention with Lazy Update," 2021) (Ying et al., 2021), which amortizes attention computation.

1. Motivation and Background

Standard Transformer attention, based on the softmax-normalized dot-product, is critical for sequence modeling, but exhibits two primary failure modes:

  • Attention Overload: Occurs in dense contexts when attention is distributed broadly over many keys with similar weights, weakening positional discrimination and causing representational collapse. This is exacerbated when positional cues are diminished or removed.
  • Attention Underload (Attention Sink): When no key is semantically relevant, softmax normalization (which enforces weights sum to 1) creates spurious foci—often on the first token or a model-learned anchor (the "sink" token)—that are semantically uninformative and degrade flexibility.

These effects underscore fundamental limitations in standard attention normalization and allocation. Additionally, standard attention is computationally expensive: for a sequence of length $N$ and hidden size $d$, each of the $K$ layers incurs $O(N^2 d)$ cost from recomputing attention, for $O(K N^2 d)$ in total. This motivates methods that improve both computational and allocative efficiency.

2. Focused Lazy Attention: Mechanism and Mathematics

The focused Lazy Attention mechanism (Fu et al., 1 Jan 2026) addresses attention misallocation by:

  • Positional Discrimination: Incorporating learnable, distance-dependent biases per head and dimension-wise Rotary Positional Embeddings (RoPE) into attention logits. This augments the raw attention score $s_{ij}^{(h)}$ for head $h$ with:

$s_{ij}^{(h)} = \frac{(Q'_i^{(h)})^\top (K'_j^{(h)})}{\sqrt{d_h}} + b^{(h)}_{|i-j|}$

  • $Q'_i^{(h)}, K'_j^{(h)}$ are RoPE-rotated queries and keys.
  • $b^{(h)}_{|i-j|}$ are learnable biases, windowed for efficiency (e.g., $W=512$).
  • Elastic-Softmax Normalization: Replaces standard softmax with a two-stage normalization that introduces sparsity:
  1. Compute the standard causal softmax: $\tilde{\alpha}_{ij}^{(h)} = \exp(s_{ij}^{(h)}) / \sum_{k=1}^i \exp(s_{ik}^{(h)})$.
  2. Apply a learnable per-head offset $\tau^{(h)}$ and ReLU:

     $\alpha_{ij}^{(h)} = \operatorname{ReLU}\left( \tilde{\alpha}_{ij}^{(h)} + \frac{\tau^{(h)}}{i} \right)$

This relaxes the simplex constraint ($\sum_j \alpha_{ij} = 1$) to $\sum_j \alpha_{ij} \le 1$, allowing exact zeros, suppressing attention on irrelevant tokens, and yielding true sparsity.
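The two-stage normalization above can be sketched in NumPy; this is a minimal single-query illustration (function and variable names are ours, not the authors'):

```python
import numpy as np

def elastic_softmax(scores, tau, i):
    """Two-stage Elastic-Softmax over a causal prefix of i keys.

    scores : (i,) attention logits s_{i1..ii} for one query and head
    tau    : learnable per-head offset (e.g., initialized to -1.0)
    i      : number of visible keys
    """
    # Stage 1: standard softmax (numerically stabilized)
    z = np.exp(scores - scores.max())
    alpha_tilde = z / z.sum()
    # Stage 2: shift by tau/i, then ReLU -> exact zeros for weak entries
    return np.maximum(alpha_tilde + tau / i, 0.0)

# One semantically irrelevant key among four receives exactly zero weight,
# and the weights now sum to at most 1 rather than exactly 1.
w = elastic_softmax(np.array([2.0, 2.0, 2.0, -3.0]), tau=-1.0, i=4)
```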

Algorithmic Steps

For each multi-head attention layer, the process can be summarized as:

  1. Compute queries, keys, values and apply RoPE.
  2. Compute bias-augmented attention scores over a window.
  3. Normalize via Elastic-Softmax.
  4. Use the resultant sparse weights to compute the weighted sum over values.
  5. Concatenate across heads and apply output projections.

The full procedure incurs no additional asymptotic complexity compared to standard attention, remains compatible with efficient kernels like FlashAttention, and introduces modest (<10%) runtime overhead.
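Steps 1–4 above can be combined into a single-head NumPy sketch (RoPE rotation omitted for brevity; the window size, bias handling, and all names are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def focused_lazy_attention(Q, K, V, b, tau, W=4):
    """Single-head sketch: windowed distance biases + Elastic-Softmax.

    Q, K, V : (N, d) query/key/value matrices (RoPE already applied, ideally)
    b       : (W,) learnable distance biases b_{|i-j|}; zero outside the window
    tau     : per-head Elastic-Softmax offset
    """
    N, d = Q.shape
    out = np.zeros_like(V)
    for i in range(N):
        j = np.arange(i + 1)                         # causal key positions
        s = Q[i] @ K[j].T / np.sqrt(d)               # scaled dot-product scores
        dist = i - j
        s = s + np.where(dist < W, b[np.minimum(dist, W - 1)], 0.0)
        z = np.exp(s - s.max())
        a = np.maximum(z / z.sum() + tau / (i + 1), 0.0)  # Elastic-Softmax
        out[i] = a @ V[j]                            # sparse weighted sum
    return out

rng = np.random.default_rng(0)
N, d = 8, 16
Y = focused_lazy_attention(rng.standard_normal((N, d)),
                           rng.standard_normal((N, d)),
                           rng.standard_normal((N, d)),
                           b=np.zeros(4), tau=-1.0)
```

A production version would vectorize the per-row loop and fuse the two normalization passes into an attention kernel, as the paper notes is compatible with FlashAttention-style implementations.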

3. LazyFormer: Self-Attention with Amortized Update

LazyFormer (Ying et al., 2021) is a structural variant that amortizes attention cost across blocks of layers. This approach is orthogonal to attention focusing, instead targeting computational efficiency:

  • Lazy Block Structure: The Transformer is divided into $B=K/m$ lazy blocks, each with $m$ layers. The attention matrix $A^{(b)}$ is computed once in the first layer of each block and reused across the remaining $m-1$ layers.
  • Block-Level Computation:
    • First layer: Standard attention computation.
    • Subsequent layers: Retain the fixed attention distribution, update only values and projections. The main saving is that pairwise dot-products for the attention matrix are only computed once per block.

This yields an approximate $m$-fold reduction in the quadratic ($O(N^2 d)$) term of the computational cost, especially significant for long sequences.
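As a sanity check on the $m$-fold claim, a toy count of the score-matrix term (illustrative sizes; only the $N^2 d$ term is counted, ignoring projections and FFNs):

```python
# FLOP estimate for the attention score matrices: K layers vs K/m lazy blocks.
N, d, K, m = 2048, 768, 12, 2
standard = K * N**2 * d          # scores recomputed in every layer
lazy = (K // m) * N**2 * d       # scores computed once per block of m layers
assert standard == m * lazy      # the quadratic term shrinks m-fold
```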

Pseudocode (Abbreviated)

function LazyBlock(X, W^Q, W^K, {W^V_l, W^O_l, FFN_l}_{l=1..m}):
    Q = X @ W^Q
    K = X @ W^K
    A = softmax(Q @ K^T / sqrt(d_k))   # attention computed once per block
    H = X
    for l in 1..m:
        V_l = H @ W^V_l                # values still updated every layer
        S_l = A @ V_l                  # reuse the fixed attention matrix A
        Z   = LayerNorm(H + S_l @ W^O_l)
        H   = LayerNorm(Z + FFN_l(Z))
    return H
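The pseudocode above translates to a runnable single-head NumPy sketch; shapes, the ReLU FFN, and the lack of learnable LayerNorm parameters are simplifying assumptions:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Parameter-free LayerNorm over the feature dimension
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def lazy_block(X, Wq, Wk, Wv_list, Wo_list, ffn_list):
    """One lazy block: attention is computed once and reused for m layers.

    X              : (N, d) block input
    Wq, Wk         : (d, d) projections used only in the block's first layer
    Wv_list, Wo_list, ffn_list : length-m per-layer value/output projections and FFNs
    """
    N, d = X.shape
    S = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)
    S = np.exp(S - S.max(-1, keepdims=True))
    A = S / S.sum(-1, keepdims=True)           # computed once, reused below
    H = X
    for Wv, Wo, ffn in zip(Wv_list, Wo_list, ffn_list):
        Z = layer_norm(H + (A @ (H @ Wv)) @ Wo)
        H = layer_norm(Z + ffn(Z))
    return H

rng = np.random.default_rng(1)
N, d, m = 6, 8, 2
mats = lambda: rng.standard_normal((d, d)) / np.sqrt(d)
H = lazy_block(rng.standard_normal((N, d)), mats(), mats(),
               [mats() for _ in range(m)], [mats() for _ in range(m)],
               [lambda z: np.maximum(z, 0.0) for _ in range(m)])
```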

4. Empirical Evaluation

Focused Lazy Attention

Empirical results on the FineWeb-Edu corpus (10B/100B tokens, 340M/760M parameters) and multiple benchmarks demonstrate:

  • Sparsity: Average attention density reduced from ~94.5% (softmax) to ~40.4%, i.e., ~59.6% sparsity.
  • Sink Removal: Sink ratio (mass on first token) falls from ~5.5% to ~0.18%.
  • Performance: At 340M parameters, Lazy Attention achieves the highest average accuracy (35.28%) versus Transformer++ or Gated DeltaNet, with statistically significant improvements.
  • Length Robustness: Perplexity degradation under doubled context length is ~+0.3 (Lazy Attention) vs. +0.8 (standard Transformer), indicating more robust long-range modeling.
  • Ablations: Removing positional biases degrades long-context performance; removing Elastic-Softmax reduces sink sparsity benefit.
  • Offset Variants: The ReLU(softmax $+\,\tau/i$) variant with $\tau_{\text{init}}=-1$ yields the best perplexity–sparsity trade-off.

| Model          | Wiki ppl ↓ | LMB ppl ↓ | Acc ↑ | Density ↓ | Sink ↓ |
|----------------|------------|-----------|-------|-----------|--------|
| Transformer++  | 25.76      | 38.02     | 33.28 | 94.5%     | 5.46%  |
| Lazy Attention | 25.32      | 31.84     | 35.28 | 40.2%     | 0.18%  |

LazyFormer

Applied to BERT-Base scale (12 layers, 112M parameters):

  • Efficiency: $m=2$ blocks yield a $1.3\times$ speed-up with no loss in accuracy (GLUE-avg 83.7); widening layers to match wall-clock time gains about +1 GLUE point.
  • Sequence Lengths: With longer sequences ($N \approx 2048$), the wall-time speed-up approaches $2\times$.
  • Optimal Block Size: $m=2$ or $3$ strikes the best trade-off; larger $m$ degrades accuracy.

| Model     | Params | Speedup | GLUE-avg |
|-----------|--------|---------|----------|
| BERT-Base | 112M   | 1×      | 83.52    |
| M2×6-S    | 112M   | 1.3×    | 83.69    |
| M2×6      | 157M   | 1×      | 84.56    |

5. Theoretical Properties and Implementation Insights

Focused Lazy Attention

  • Exact Sparsity: ReLU(softmax $+\,\tau/i$) sets exactly those softmax entries below $-\tau/i$ to zero; with $\tau=-1$ at initialization, a uniform softmax output yields at least $(i-1)/i$ zeros.
  • Learnability: The offset $\tau$ is learnable per head, initialized to enforce high initial sparsity and adaptively relaxed during training.
  • Gradient Properties: The ReLU introduces a subgradient at the sparsification threshold, and $\tau$ adjusts to avoid dead regions.
  • Compatibility and Complexity: Retains $O(N^2)$ time and memory, compatible with sliding-window and fused kernels (e.g., FlashAttention).
  • Overhead: The two-pass implementation (required by offset + ReLU) plus elementwise filtering incurs only $\lesssim 10\%$ additional overhead.
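The exact-sparsity property can be verified numerically; in this toy check (values are illustrative), every softmax entry below the threshold $-\tau/i$ is mapped to an exact zero rather than a small positive weight:

```python
import numpy as np

i, tau = 8, -1.0
s = np.array([3.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])  # two relevant keys
z = np.exp(s - s.max())
alpha = z / z.sum()                          # standard softmax: all entries > 0
sparse = np.maximum(alpha + tau / i, 0.0)    # Elastic-Softmax with tau = -1

# Entries below the threshold -tau/i = 1/8 become exactly zero
assert np.all(sparse[alpha < -tau / i] == 0.0)
assert np.count_nonzero(sparse) == 2
```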

LazyFormer

  • Parameter Efficiency: Fewer $Q/K$ projections allow wider feed-forward and embedding dimensions for a given parameter budget.
  • Regularization: Eliminates dropout within attention (saving $O(N^2)$ computation) without degrading generalization.
  • Training Stability: Adam optimizer with standard settings suffices; widening throughput matches or outperforms BERT-Base given fixed wall-clock.

6. Context, Connections, and Significance

Focused Lazy Attention and LazyFormer represent two paradigms of “laziness” in attention:

  • The first (Focused Lazy Attention) addresses allocation pathology—ensuring that attention focus is driven by semantic and positional relevance, rather than forced by softmax normalization, with beneficial effects on sparsity, interpretability, and representational integrity (Fu et al., 1 Jan 2026).
  • The second (LazyFormer) targets layer-wise computational redundancy, reusing attention distributions and achieving efficiency gains without substantial accuracy loss (Ying et al., 2021).

Both are compatible with further hardware-optimized implementations and can be integrated with other structural variants and attention approximations. These approaches interpret “lazy attention” either as focused (i.e., sparsifying allocation) or as amortized (i.e., computationally lazy) self-attention, with distinct and sometimes orthogonal benefits.

A plausible implication is that future efficient LLMs may combine focused sparsification and amortized attention computation for optimal trade-offs among accuracy, interpretability, and efficiency.
