
Lazy Attention in Transformers

Updated 9 February 2026
  • Lazy Attention is a self-attention mechanism that refines standard Transformers via focused sparsity and amortized computation to mitigate inefficiency and attention misallocation.
  • Focused Lazy Attention employs learnable positional biases and Elastic-Softmax normalization to selectively concentrate attention, significantly reducing spurious activations and enabling true sparsity.
  • LazyFormer reuses computed attention across layers by amortizing its cost, achieving noticeable efficiency gains and maintaining robust performance on long-sequence tasks.

Lazy Attention refers to families of self-attention mechanisms in Transformer architectures that address limitations in standard attention—such as inefficiency, representational collapse, and attention sink—by two primary strategies: (1) focusing and sparsifying attention weights, and (2) amortizing or reusing attention computations across multiple layers, thereby decreasing computational cost. Two distinct methods known as Lazy Attention are prominent: a focused self-attention mechanism introduced by Fu et al. ("Attention Needs to Focus: A Unified Perspective on Attention Allocation," 2026) (Fu et al., 1 Jan 2026), and LazyFormer ("Self Attention with Lazy Update," 2021) (Ying et al., 2021), which amortizes attention computation.

1. Motivation and Background

Standard Transformer attention, based on the softmax-normalized dot-product, is critical for sequence modeling, but exhibits two primary failure modes:

  • Attention Overload: Occurs in dense contexts when attention is distributed broadly over many keys with similar weights, weakening positional discrimination and causing representational collapse. This is exacerbated when positional cues are diminished or removed.
  • Attention Underload (Attention Sink): When no key is semantically relevant, softmax normalization (which enforces weights sum to 1) creates spurious foci—often on the first token or a model-learned anchor (the "sink" token)—that are semantically uninformative and degrade flexibility.

These effects underscore fundamental limitations in standard attention normalization and allocation. Additionally, standard attention is computationally expensive: for a sequence of length $N$ and hidden size $d$, each of the $K$ layers incurs $O(N^2 d)$ cost from recomputing attention, for $O(K N^2 d)$ in total. This motivates methods that improve both computational and allocative efficiency.

2. Focused Lazy Attention: Mechanism and Mathematics

The focused Lazy Attention mechanism (Fu et al., 1 Jan 2026) addresses attention misallocation by:

  • Positional Discrimination: Incorporating learnable, distance-dependent biases per head and dimension-wise Rotary Positional Embeddings (RoPE) into attention logits. This augments the raw attention score $s_{ij}^{(h)}$ for head $h$ with:

$s_{ij}^{(h)} = \frac{(Q'_i^{(h)})^\top (K'_j^{(h)})}{\sqrt{d_h}} + b^{(h)}_{|i-j|}$

  • $Q'_i^{(h)}, K'_j^{(h)}$ are RoPE-rotated queries and keys.
  • $b^{(h)}_{|i-j|}$ are learnable biases, windowed for efficiency (e.g., $W=512$).
  • Elastic-Softmax Normalization: Replaces standard softmax with a two-stage normalization that introduces sparsity:
  1. Compute the standard causal softmax: $\tilde{\alpha}_{ij}^{(h)} = \exp(s_{ij}^{(h)}) / \sum_{k=1}^i \exp(s_{ik}^{(h)})$.
  2. Apply a learnable per-head offset $\tau^{(h)}$ and ReLU:

     $\alpha_{ij}^{(h)} = \operatorname{ReLU}\left( \tilde{\alpha}_{ij}^{(h)} + \frac{\tau^{(h)}}{i} \right)$

This relaxes the simplex constraint ($\sum_j \alpha_{ij} = 1$) to $\sum_j \alpha_{ij} \le 1$, allowing exact zeros, suppressing attention on irrelevant tokens, and yielding true sparsity.
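The two-stage normalization above can be sketched in NumPy; this is a minimal single-query illustration (function and variable names are ours, not the authors'):

```python
import numpy as np

def elastic_softmax(scores, tau, i):
    """Two-stage Elastic-Softmax over a causal prefix of i keys.

    scores : (i,) attention logits s_{i1..ii} for one query and head
    tau    : learnable per-head offset (e.g., initialized to -1.0)
    i      : number of visible keys
    """
    # Stage 1: standard softmax (numerically stabilized)
    z = np.exp(scores - scores.max())
    alpha_tilde = z / z.sum()
    # Stage 2: shift by tau/i, then ReLU -> exact zeros for weak entries
    return np.maximum(alpha_tilde + tau / i, 0.0)

# One semantically irrelevant key among four receives exactly zero weight,
# and the weights now sum to at most 1 rather than exactly 1.
w = elastic_softmax(np.array([2.0, 2.0, 2.0, -3.0]), tau=-1.0, i=4)
```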

Algorithmic Steps

For each multi-head attention layer, the process can be summarized as:

  1. Compute queries, keys, values and apply RoPE.
  2. Compute bias-augmented attention scores over a window.
  3. Normalize via Elastic-Softmax.
  4. Use the resultant sparse weights to compute the weighted sum over values.
  5. Concatenate across heads and apply output projections.

The full procedure incurs no additional asymptotic complexity compared to standard attention, remains compatible with efficient kernels like FlashAttention, and introduces modest (<10%) runtime overhead.
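Steps 1–4 above can be combined into a single-head NumPy sketch (RoPE rotation omitted for brevity; the window size, bias handling, and all names are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def focused_lazy_attention(Q, K, V, b, tau, W=4):
    """Single-head sketch: windowed distance biases + Elastic-Softmax.

    Q, K, V : (N, d) query/key/value matrices (RoPE already applied, ideally)
    b       : (W,) learnable distance biases b_{|i-j|}; zero outside the window
    tau     : per-head Elastic-Softmax offset
    """
    N, d = Q.shape
    out = np.zeros_like(V)
    for i in range(N):
        j = np.arange(i + 1)                         # causal key positions
        s = Q[i] @ K[j].T / np.sqrt(d)               # scaled dot-product scores
        dist = i - j
        s = s + np.where(dist < W, b[np.minimum(dist, W - 1)], 0.0)
        z = np.exp(s - s.max())
        a = np.maximum(z / z.sum() + tau / (i + 1), 0.0)  # Elastic-Softmax
        out[i] = a @ V[j]                            # sparse weighted sum
    return out

rng = np.random.default_rng(0)
N, d = 8, 16
Y = focused_lazy_attention(rng.standard_normal((N, d)),
                           rng.standard_normal((N, d)),
                           rng.standard_normal((N, d)),
                           b=np.zeros(4), tau=-1.0)
```

A production version would vectorize the per-row loop and fuse the two normalization passes into an attention kernel, as the paper notes is compatible with FlashAttention-style implementations.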

3. LazyFormer: Self-Attention with Amortized Update

LazyFormer (Ying et al., 2021) is a structural variant that amortizes attention cost across blocks of layers. This approach is orthogonal to attention focusing, instead targeting computational efficiency:

  • Lazy Block Structure: The Transformer is divided into $B=K/m$ lazy blocks, each with $m$ layers. The attention matrix $A^{(b)}$ is computed once in the first layer of each block and reused across the remaining $m-1$ layers.
  • Block-Level Computation:
    • First layer: Standard attention computation.
    • Subsequent layers: Retain the fixed attention distribution, update only values and projections. The main saving is that pairwise dot-products for the attention matrix are only computed once per block.

This yields an approximate $m$-fold reduction in the quadratic ($O(N^2 d)$) term of the computational cost, especially significant for long sequences.
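As a sanity check on the $m$-fold claim, a toy count of the score-matrix term (illustrative sizes; only the $N^2 d$ term is counted, ignoring projections and FFNs):

```python
# FLOP estimate for the attention score matrices: K layers vs K/m lazy blocks.
N, d, K, m = 2048, 768, 12, 2
standard = K * N**2 * d          # scores recomputed in every layer
lazy = (K // m) * N**2 * d       # scores computed once per block of m layers
assert standard == m * lazy      # the quadratic term shrinks m-fold
```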

Pseudocode (Abbreviated)

function LazyBlock(X, W^Q, W^K, {W^V_l, W^O_l, FFN_l}_{l=1..m}):
    Q = X @ W^Q
    K = X @ W^K
    A = softmax(Q @ K^T / sqrt(d_k))   # attention computed once per block
    H = X
    for l in 1..m:
        V_l = H @ W^V_l                # values still updated every layer
        S_l = A @ V_l                  # reuse the fixed attention matrix A
        Z   = LayerNorm(H + S_l @ W^O_l)
        H   = LayerNorm(Z + FFN_l(Z))
    return H
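The pseudocode above translates to a runnable single-head NumPy sketch; shapes, the ReLU FFN, and the lack of learnable LayerNorm parameters are simplifying assumptions:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Parameter-free LayerNorm over the feature dimension
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def lazy_block(X, Wq, Wk, Wv_list, Wo_list, ffn_list):
    """One lazy block: attention is computed once and reused for m layers.

    X              : (N, d) block input
    Wq, Wk         : (d, d) projections used only in the block's first layer
    Wv_list, Wo_list, ffn_list : length-m per-layer value/output projections and FFNs
    """
    N, d = X.shape
    S = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)
    S = np.exp(S - S.max(-1, keepdims=True))
    A = S / S.sum(-1, keepdims=True)           # computed once, reused below
    H = X
    for Wv, Wo, ffn in zip(Wv_list, Wo_list, ffn_list):
        Z = layer_norm(H + (A @ (H @ Wv)) @ Wo)
        H = layer_norm(Z + ffn(Z))
    return H

rng = np.random.default_rng(1)
N, d, m = 6, 8, 2
mats = lambda: rng.standard_normal((d, d)) / np.sqrt(d)
H = lazy_block(rng.standard_normal((N, d)), mats(), mats(),
               [mats() for _ in range(m)], [mats() for _ in range(m)],
               [lambda z: np.maximum(z, 0.0) for _ in range(m)])
```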

4. Empirical Evaluation

Focused Lazy Attention

Empirical results on the FineWeb-Edu corpus (10B/100B tokens, 340M/760M parameters) and multiple benchmarks demonstrate:

  • Sparsity: Average attention density reduced from ~94.5% (softmax) to ~40.4%, i.e., ~59.6% sparsity.
  • Sink Removal: Sink ratio (mass on first token) falls from ~5.5% to ~0.18%.
  • Performance: At 340M parameters, Lazy Attention achieves the highest average accuracy (35.28%) versus Transformer++ or Gated DeltaNet, with statistically significant improvements.
  • Length Robustness: Perplexity degradation under doubled context length is ~+0.3 (Lazy Attention) vs. +0.8 (standard Transformer), indicating more robust long-range modeling.
  • Ablations: Removing positional biases degrades long-context performance; removing Elastic-Softmax reduces sink sparsity benefit.
  • Offset Variants: The ReLU(softmax $+\,\tau/i$) variant with $\tau_{\text{init}}=-1$ yields the best perplexity–sparsity trade-off.

| Model          | Wiki ppl ↓ | LMB ppl ↓ | Acc ↑ | Density ↓ | Sink ↓ |
|----------------|------------|-----------|-------|-----------|--------|
| Transformer++  | 25.76      | 38.02     | 33.28 | 94.5%     | 5.46%  |
| Lazy Attention | 25.32      | 31.84     | 35.28 | 40.2%     | 0.18%  |

LazyFormer

Applied to BERT-Base scale (12 layers, 112M parameters):

  • Efficiency: $m=2$ blocks yield a $1.3\times$ speed-up with no loss in accuracy (GLUE-avg 83.7); widening layers to match wall-clock time gains about +1 GLUE point.
  • Sequence Lengths: With longer sequences ($N \approx 2048$), the wall-time speed-up approaches $2\times$.
  • Optimal Block Size: $m=2$ or $3$ strikes the best trade-off; larger $m$ degrades accuracy.

| Model     | Params | Speedup | GLUE-avg |
|-----------|--------|---------|----------|
| BERT-Base | 112M   | 1×      | 83.52    |
| M2×6-S    | 112M   | 1.3×    | 83.69    |
| M2×6      | 157M   | 1×      | 84.56    |

5. Theoretical Properties and Implementation Insights

Focused Lazy Attention

  • Exact Sparsity: ReLU(softmax $+\,\tau/i$) sets exactly those softmax entries below $-\tau/i$ to zero; with $\tau=-1$ at initialization, a uniform softmax output yields at least $(i-1)/i$ zeros.
  • Learnability: The offset $\tau$ is learnable per head, initialized to enforce high initial sparsity and adaptively relaxed during training.
  • Gradient Properties: The ReLU introduces a subgradient at the sparsification threshold, and $\tau$ adjusts to avoid dead regions.
  • Compatibility and Complexity: Retains $O(N^2)$ time and memory, compatible with sliding-window and fused kernels (e.g., FlashAttention).
  • Overhead: The two-pass implementation (required by offset + ReLU) plus elementwise filtering incurs only $\lesssim 10\%$ additional overhead.
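The exact-sparsity property can be verified numerically; in this toy check (values are illustrative), every softmax entry below the threshold $-\tau/i$ is mapped to an exact zero rather than a small positive weight:

```python
import numpy as np

i, tau = 8, -1.0
s = np.array([3.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])  # two relevant keys
z = np.exp(s - s.max())
alpha = z / z.sum()                          # standard softmax: all entries > 0
sparse = np.maximum(alpha + tau / i, 0.0)    # Elastic-Softmax with tau = -1

# Entries below the threshold -tau/i = 1/8 become exactly zero
assert np.all(sparse[alpha < -tau / i] == 0.0)
assert np.count_nonzero(sparse) == 2
```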

LazyFormer

  • Parameter Efficiency: Fewer $Q/K$ projections allow wider feed-forward and embedding dimensions for a given parameter budget.
  • Regularization: Eliminates dropout within attention (saving $O(N^2)$ computation) without degrading generalization.
  • Training Stability: Adam optimizer with standard settings suffices; widening throughput matches or outperforms BERT-Base given fixed wall-clock.

6. Context, Connections, and Significance

Focused Lazy Attention and LazyFormer represent two paradigms of “laziness” in attention:

  • The first (Focused Lazy Attention) addresses allocation pathology—ensuring that attention focus is driven by semantic and positional relevance, rather than forced by softmax normalization, with beneficial effects on sparsity, interpretability, and representational integrity (Fu et al., 1 Jan 2026).
  • The second (LazyFormer) targets layer-wise computational redundancy, reusing attention distributions and achieving efficiency gains without substantial accuracy loss (Ying et al., 2021).

Both are compatible with further hardware-optimized implementations and can be integrated with other structural variants and attention approximations. These approaches interpret “lazy attention” either as focused (i.e., sparsifying allocation) or as amortized (i.e., computationally lazy) self-attention, with distinct and sometimes orthogonal benefits.

A plausible implication is that future efficient LLMs may combine focused sparsification and amortized attention computation for optimal trade-offs among accuracy, interpretability, and efficiency.
