Sparse Row/Column/Global Attention in Transformers

Updated 10 February 2026
  • Sparse attention mechanisms are techniques that compute self-attention over selected token subsets using structured masks to reduce computational load.
  • They leverage row, column, and global patterns to efficiently capture both local and long-range dependencies in diverse data modalities.
  • These methods achieve near-linear scalability and significant memory savings with minimal performance loss by combining fixed and adaptive sparsity strategies.

Sparse Row, Column, and Global Attention mechanisms constitute a class of strategies aimed at reducing the computational and memory complexity of Transformer-style self-attention, while preserving essential long-range and structured dependencies. These methods utilize structured masks, adaptive selection, or data-driven pruning to restrict the set of token pairs over which attention is computed, often capitalizing on input regularities such as tabular, spatial, or sequential layouts. Sparse row, column, and global attention primitives form the foundation for efficient architectures in language modeling, vision, and tabular in-context learning.

1. Overview and Definitions

Sparse attention refers to any attention pattern in which, for a query position $i$, attention is computed only over a strict subset of the possible key positions $j$. Major sparsity patterns include:

  • Row-wise attention: Each query (row) attends to a limited set of keys, commonly realized by local windows, explicit top-$k$ selection, or masking non-relevant tokens (Zhao et al., 2019).
  • Column-wise attention: Each key (column) only receives attention from a selected subset of queries, which is less common but relevant in certain data layouts.
  • Global attention: Specific "global" tokens—such as special summaries or class tokens—are permitted full dense attention, i.e., they may attend to all positions and vice versa, even when the bulk of tokens are restricted (Bouadi et al., 4 Nov 2025).

Block-sparse attention generalizes these by combining local, global, and random patterns, forming a two-valued additive mask $M \in \{0, -\infty\}^{L \times L}$ applied to the attention score matrix to enforce the allowed pattern of computation (Bouadi et al., 4 Nov 2025, Wang et al., 8 Sep 2025).

2. Mask Construction and Algorithmic Formulation

Block-Sparse Masking

Masks are defined by a combination of the following building blocks (Bouadi et al., 4 Nov 2025):

  • Windowed (local) attention: For feature tokens $i, j \in F$, pairs with $|i - j| \leq w$ are allowed; all other pairs are masked out.
  • Global-token (global) attention: For special tokens $i \in S$ or $j \in S$, all-to-all connectivity is allowed.
  • Random links: For each feature token $i$, $r$ random feature tokens $j$ are selected for connectivity.
  • Self-attention: Diagonal entries remain allowed for proper gradient propagation.

The aggregate mask $M$ is the union of the windowed, global, and random patterns: under the $\{0, -\infty\}$ convention, a pair is allowed whenever any component mask allows it, i.e., $M$ is the element-wise maximum of the component masks. Formally,

$$M = \max\bigl(M^{\mathrm{win}},\ M^{\mathrm{glob}},\ M^{\mathrm{rand}}\bigr)$$

The masked self-attention is then

$$A = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right)$$

yielding nonzero weights only at the prescribed locations.
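The construction above can be sketched in a few lines of NumPy. This is a minimal illustrative implementation, not the kernel of any cited paper: it builds the additive mask under union semantics (a pair is allowed if any of the windowed, global-token, random, or self links permits it) and applies the masked softmax; the function names and default parameters are assumptions.

```python
import numpy as np

def block_sparse_mask(L, window=2, global_idx=(0,), n_random=1, seed=0):
    """Additive mask with 0 = allowed, -inf = blocked.

    A pair (i, j) is kept if any of the windowed, global-token,
    random-link, or self-attention patterns allows it.
    """
    rng = np.random.default_rng(seed)
    M = np.full((L, L), -np.inf)
    idx = np.arange(L)
    # local window |i - j| <= w (this also keeps the diagonal / self links)
    M[np.abs(idx[:, None] - idx[None, :]) <= window] = 0.0
    # global tokens: their entire row and column are open
    for g in global_idx:
        M[g, :] = 0.0
        M[:, g] = 0.0
    # r random links per query row
    for i in range(L):
        M[i, rng.choice(L, size=n_random, replace=False)] = 0.0
    return M

def masked_attention(Q, K, V, M):
    """A = softmax(Q K^T / sqrt(d_k) + M); blocked entries get exactly zero weight."""
    scores = Q @ K.T / np.sqrt(K.shape[-1]) + M
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize; each row has a finite entry
    w = np.exp(scores)                            # exp(-inf) = 0 for blocked pairs
    return (w / w.sum(axis=-1, keepdims=True)) @ V
```

Because every diagonal entry falls inside the window, each query row has at least one allowed key, so the softmax is always well defined.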

Row and Column Attention in 2D Grids

For data arranged in 2D (e.g., tabular, image), masking extends naturally:

  • Row-local mask: $M_{\mathrm{row}}^{(\mathrm{win})}((r,c),(r',c')) = 0$ if $r = r'$ and $|c - c'| \leq w_{\mathrm{col}}$.
  • Column-local mask: $M_{\mathrm{col}}^{(\mathrm{win})}((r,c),(r',c')) = 0$ if $c = c'$ and $|r - r'| \leq w_{\mathrm{row}}$.
  • Row/column global tokens: Special summary tokens connected to all cells in their row or column.

Masks are combined via element-wise maximum, so a pair is kept if either pattern allows it, supporting both highly local and structurally global dependencies (Bouadi et al., 4 Nov 2025, Han et al., 2022).
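The two windowed masks above can be built and merged directly; the following NumPy sketch uses the same $\{0, -\infty\}$ convention (the row-major flattening, function name, and default window sizes are illustrative assumptions):

```python
import numpy as np

def grid_masks(R, C, w_row=1, w_col=1):
    """Row-local and column-local masks for an R x C grid, flattened row-major.

    Returns an additive mask: 0 where the pair is allowed by either the
    row-window or the column-window pattern, -inf elsewhere.
    """
    rows = np.repeat(np.arange(R), C)  # row index of each flattened token
    cols = np.tile(np.arange(C), R)    # column index of each flattened token
    same_row = rows[:, None] == rows[None, :]
    same_col = cols[:, None] == cols[None, :]
    # row-local: same row, column offset within w_col
    M_row = np.where(same_row & (np.abs(cols[:, None] - cols[None, :]) <= w_col),
                     0.0, -np.inf)
    # column-local: same column, row offset within w_row
    M_col = np.where(same_col & (np.abs(rows[:, None] - rows[None, :]) <= w_row),
                     0.0, -np.inf)
    return np.maximum(M_row, M_col)  # union of the two patterns
```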

Algorithmic Pseudocode

Implementation involves explicit mask construction; for example, Orion-MSP iterates over the tokens, setting mask entries according to the local window, global tokens, random links, and self-attention, with batching over multiple sequences (Bouadi et al., 4 Nov 2025). For explicit top-$k$ sparsity, the $k$ largest valid values per query row are kept (Zhao et al., 2019).
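The top-$k$ variant admits an equally short sketch. As in the description above, the full score matrix is still computed, but only the $k$ largest entries per row survive the softmax; ties at the threshold may retain a few extra entries. This is an illustrative NumPy version, not the EST implementation:

```python
import numpy as np

def topk_row_attention(Q, K, V, k):
    """Keep only the k largest scores in each query row before the softmax."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # full O(L^2 d) scores
    kth = np.sort(scores, axis=-1)[:, -k][:, None]     # k-th largest per row
    masked = np.where(scores >= kth, scores, -np.inf)  # drop everything below it
    w = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V     # only ~k nonzeros per row
```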

3. Structured and Data-Driven Sparsity Patterns

Three families of sparsity emerge:

  • Fixed structured masks: Predefined patterns, such as local windows (row bands) or block structures, often leveraging regularities in the input (Bouadi et al., 4 Nov 2025, Wang et al., 8 Sep 2025).
  • Content-adaptive masks: Learned or dynamically computed patterns, e.g., deformable sampling (Han et al., 2022) or Sinkhorn-permuted block-local attention (Tay et al., 2020), which adapt the receptive field per input.
  • Data-informed global sparsity: Masks learned from averaged attention statistics across a dataset, then applied globally at inference and training—eliminating low-use attention edges (Rugina et al., 2020).

The field distinguishes between patterns that guarantee coverage (e.g., global tokens or random links, for long-range paths in graphs) and those that exploit data-dependent redundancy for maximal efficiency.
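The data-informed family can be sketched as follows. This is a simplified illustration of the idea of pruning by averaged attention statistics; the array shapes, the global threshold rule, and the `keep_fraction` parameter are assumptions, not the exact procedure of Rugina et al. (2020):

```python
import numpy as np

def prune_mask_from_stats(attn_maps, keep_fraction=0.1):
    """Keep only the globally highest-mass fraction of attention edges.

    attn_maps: array of shape (n_examples, L, L) holding softmax weights
    collected over a dataset. Returns an additive 0/-inf mask that is then
    reused at both training and inference time.
    """
    mean_attn = attn_maps.mean(axis=0)                  # averaged pattern (L, L)
    n_keep = max(1, int(keep_fraction * mean_attn.size))
    cutoff = np.sort(mean_attn, axis=None)[-n_keep]     # global threshold
    return np.where(mean_attn >= cutoff, 0.0, -np.inf)
```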

4. Complexity Analysis

Sparse row, column, and global attention mechanisms achieve substantial reductions in both compute and memory complexity versus dense self-attention, as detailed below:

| Method | Time complexity | Memory requirement | Comments |
| --- | --- | --- | --- |
| Dense (standard) | $O(L^2 d)$ | $O(L^2)$ | Full attention |
| Windowed (local) | $O(Lwd)$ | $O(Lw)$ | $w \ll L$ |
| Block-sparse global | $O(knd)$ with $k \ll n$ | $O((1-\rho)n^2)$ | $k$ active blocks per row, sparsity $\rho$ (Wang et al., 8 Sep 2025) |
| Top-$k$ row-wise | $O(L^2 d)$ for full $QK^T$; $O(Lkd)$ for $AV$ | $O(Lk)$ | Significantly reduced memory (Zhao et al., 2019) |
| Sparse non-local | $O(NKC)$ | $O(NK)$ | Each query samples $K$ keys (Liu et al., 2021) |
| Sinkhorn + block sort | $O(\ell b d + K N_B^2)$ | $O(b^2 + N_B^2)$ | $b$ = block size, $N_B = \ell/b$ (Tay et al., 2020) |
| 2D row/col attention | $O(RC(2w_{\mathrm{row}} + 2w_{\mathrm{col}} + \lvert S_{\mathrm{row}}\rvert + \lvert S_{\mathrm{col}}\rvert + r)d)$ | $O(\ldots)$ | Extends the pattern to tabular/image data (Bouadi et al., 4 Nov 2025) |

In the regime where $w, r, N_{\mathrm{special}} \ll L$, or for small $K$ in SSANet, the effective cost is near-linear in input size, in contrast to the quadratic scaling of dense attention.
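A back-of-the-envelope comparison makes the gap concrete. The sizes below are illustrative assumptions, not numbers from any cited paper:

```python
# Illustrative sizes (assumptions): L tokens, head dimension d, window w.
L, d, w = 4096, 64, 128
dense_flops = L * L * d            # O(L^2 d): every query scores every key
local_flops = L * (2 * w + 1) * d  # O(L w d): each query sees 2w + 1 keys
ratio = dense_flops / local_flops  # roughly L / (2w + 1), here about 16x
```

Doubling $L$ doubles the windowed cost but quadruples the dense cost, which is the near-linear-vs-quadratic contrast described above.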

5. Methodological Variants and Empirical Results

Block-Sparse Attention With Multi-Pattern Masks

Orion-MSP achieves efficient table-wide processing in tabular in-context learning through multi-scale, block-sparse attention, combining local (window), global, and random connectivity, supporting learnable hierarchical feature representations (Bouadi et al., 4 Nov 2025).

Row and Column Attention for Structured Vision Tasks

Laneformer applies exact row and column self-attention to pixel-level CNN features, achieving cost $O(H^2 + W^2)$ vs. $O(H^2 W^2)$ for global attention, with strong empirical accuracy on lane detection (77.1% F1 on CULane) and negligible latency penalty (Han et al., 2022).

Data-Informed Global Sparseness

The AP framework computes masks from average attention patterns post hoc, applying them globally. With up to 90% pruning in language modeling, perplexity increased by only 5% (24.16 → 26.01 on WikiText-103), and on SQuAD it reduces GPU memory usage by 27% at trivial F1 loss (Rugina et al., 2020).

Explicit Row-Wise Sparsity

EST uses hard top-$k$ masking per query row. With $k = 8$, it achieves nontrivial memory relief ($O(Lk)$) and slight performance improvements (e.g., En→De: BLEU 29.4 vs. 29.1 for dense), delivering roughly twice the inference speed of Sparsemax/Entmax attention (Zhao et al., 2019).

Adaptively Learned Block-Sparsity

Faster VGGT exploits pattern concentration in multi-view geometry transformers to select only the most informative $b \times b$ blocks; 75% block sparsity yields up to a $4\times$ raw attention speed-up while keeping task metrics within 2% of baseline (Wang et al., 8 Sep 2025).

Alternating Local-Global and Latent Variants

Alternating Sparse Attention alternates sliding-window (local, row-banded) and global (compression, selective) layers, enhanced with Multi-Head Latent Attention and Group-head Latent Attention. This halves the KV-cache relative to Native Sparse Attention and provides superior long-context retention (Hu et al., 2 Nov 2025).

6. Extensions, Implementation, and Hardware Considerations

Sparse attention is implemented via custom CUDA kernels (e.g., block-sparse FlashAttention2), often requiring block-major storage for Q/K/V and regular block or window patterns for maximal hardware efficiency (Wang et al., 8 Sep 2025, Bouadi et al., 4 Nov 2025). Batched mask construction supports scalable computation across multiple input tables or images. Dynamic content-adaptive sparsity (Sinkhorn, deformable, data-driven pruning) introduces additional softmax or sampling overhead but permits flexible receptive-field adaptation (Tay et al., 2020, Liu et al., 2021, Rugina et al., 2020).

7. Applications and Empirical Trade-Offs

Sparse row, column, and global attention mechanisms have achieved:

  • Near-linear scalability in high-dimensional tabular in-context learning (Bouadi et al., 4 Nov 2025)
  • Real-time lane detection with state-of-the-art accuracy at low FLOP cost (Han et al., 2022)
  • Dramatic inference and memory reductions in language modeling, vision, and multi-view geometry (Rugina et al., 2020, Wang et al., 8 Sep 2025)
  • Accurate concentration of attention for tasks requiring long-range dependencies via structured random and global links

A common observation is minimal degradation (often <2%) in primary task metrics even with aggressive sparsity (as in 90% attention-pruning in language modeling (Rugina et al., 2020), 75% block-sparsity in multi-view vision (Wang et al., 8 Sep 2025), or $k = 8$ top-$k$ selection (Zhao et al., 2019)). Sparser patterns tend to be more robust in self-attention than cross-attention settings, as the latter's variability hampers mask effectiveness (Rugina et al., 2020). Row/column and global token mechanisms are effective in cases with regular 2D structure or when certain tokens (e.g., class summaries) must aggregate global information.

