Sliding Tile Attention in Transformers
- Sliding Tile Attention (STA) is a hardware-aware block-sparse attention mechanism that reduces compute complexity by leveraging tile-level sparsity in multi-dimensional transformers.
- STA minimizes memory overhead and accelerates processing by attending only to key blocks, transforming quadratic complexity into a more efficient linear or subquadratic regime.
- Empirical results show STA achieves up to 46% end-to-end speedup in video and vision applications with minimal performance degradation.
Sliding Tile Attention (STA), also known as Attention Tile, is a hardware-aware, block-sparse attention mechanism designed as an efficient, expressive, locality-biased alternative to full attention in multi-dimensional transformer architectures, primarily video and high-resolution vision generative models. STA leverages block-structured sparsity at the tile level to minimize compute and memory overhead, transforming the dominant quadratic complexity of standard attention into a linear or subquadratic cost regime while preserving or closely matching task performance metrics. Empirical and analytical studies demonstrate that STA achieves substantial acceleration over both vanilla dense attention and previous sparse attention baselines, avoiding the computational pitfalls that make standard neighborhood or sliding-window methods inefficient on modern hardware (Ding et al., 10 Feb 2025, Zhang et al., 6 Feb 2025, Hassani et al., 23 Apr 2025).
1. Underlying Locality and Motivation
Contemporary generative video transformers (DiTs) encode a video into a sequence of $L$ spatio-temporal tokens and apply self-attention over all pairwise positions, yielding $O(L^2)$ time and memory complexity. Profiling, e.g., the HunyuanVideo model, shows attention may consume 800 out of 945 seconds of inference time for a single 5-second, 720p video. However, analyses of 3D attention maps from pretrained DiTs reveal a strong inductive bias: more than 70% of attention mass is concentrated within a cubic window enclosing just 15.5% of total tokens, regardless of prompt or diffusion timestep. This locality property persists across model layers and diffusion steps, and head specialization is observed: different attention heads focus on distinct spatial–temporal windows, but overall the local pattern is highly prompt-agnostic (Zhang et al., 6 Feb 2025).
2. Definition and Mathematical Formulation
STA arranges the attention computation into a block-sparse scheme. Consider a video with $F$ frames, $P$ spatial tokens per frame, and $N = F P$ total tokens:
- Full 3D attention forms an $N \times N$ matrix, naturally interpreted as an $F \times F$ grid of $P \times P$ tiles.
- Main diagonal tiles (within-frame) hold the largest scores, while off-diagonal tiles (across frames) decay in importance as temporal distance increases.
- The set of important tiles is highly input-independent.
STA enforces, for each query block $i$:
- Always attend to its own frame (main diagonal).
- Attend to a constant number of additional reference frames $\mathcal{R}_i$ (global frames), selected uniformly (or learned).
- All other tiles are masked.
Formally, with $Q, K, V \in \mathbb{R}^{N \times d_h}$ partitioned into $F$ blocks of size $P$ (each $Q_i, K_i, V_i \in \mathbb{R}^{P \times d_h}$), and a binary mask $M$ of block structure: $M_{ij} = 1$ if $j = i$ or $j \in \mathcal{R}_i$, and $M_{ij} = 0$ otherwise, where $\mathcal{R}_i$ is the set of global references.
Sparse attention is computed as $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(Q K^\top / \sqrt{d_h} + \tilde{M})\, V$, or, equivalently, per query block, $Y_i = \mathrm{softmax}\big([Q_i K_j^\top / \sqrt{d_h}]_{j \in \{i\} \cup \mathcal{R}_i}\big)\, [V_j]_{j \in \{i\} \cup \mathcal{R}_i}$,
with $\tilde{M}_{ij} = -\infty$ in masked entries and $0$ elsewhere (Ding et al., 10 Feb 2025).
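As a concrete illustration of the formulation above, the following NumPy sketch builds the additive block mask $\tilde{M}$ and evaluates the masked attention densely (helper names like `block_mask` and the `refs` dictionary are illustrative, not from the cited papers):

```python
import numpy as np

def block_mask(F, P, refs):
    """Additive mask M~: 0 on allowed tiles (the diagonal frame i and its
    global reference frames refs[i]), -inf everywhere else."""
    M = np.full((F * P, F * P), -np.inf)
    for i in range(F):
        for j in {i, *refs.get(i, ())}:
            M[i * P:(i + 1) * P, j * P:(j + 1) * P] = 0.0
    return M

def masked_attention(Q, K, V, M):
    """softmax(Q K^T / sqrt(d_h) + M~) V -- dense reference evaluation."""
    S = Q @ K.T / np.sqrt(Q.shape[-1]) + M
    A = np.exp(S - S.max(axis=-1, keepdims=True))  # row-wise stable softmax
    return (A / A.sum(axis=-1, keepdims=True)) @ V
```

A block-sparse kernel never materializes the masked tiles; this dense version only serves as a correctness reference.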
Generalizing to multi-dimensional problems, STA can be interpreted in the Generalized Neighborhood Attention (GNA) framework with parameterized window width $w$ and stride $s$, where all queries in a tile (a stride-$s$ group) share the same window of keys (Hassani et al., 23 Apr 2025).
3. Complexity and Hardware Efficiency
The complexity of dense attention is $O(N^2 d)$. For STA:
- Each of the $F$ query blocks attends to only $1 + |\mathcal{R}_i|$ key blocks of size $P$, so time is $O(N (1 + r) P d)$ with $r = \max_i |\mathcal{R}_i|$, which is linear in the number of frames when $r$ is small.
- Memory for attention scores scales as $O(N (1 + r) P)$ rather than $O(N^2)$.
For general sliding-tile schemes in higher dimensions, with tile size $T$, window $W$, and stride $S$, STA's per-token cost is proportional to the window volume $W$ rather than the sequence length $N$; but crucially, STA aligns the group/window structure to the compute kernel's matmul tiles, eliminating the fine-grained masking and "mixed block" inefficiencies of prior sliding-window attention. This enables perfectly block-sparse execution, where speedups match theoretical flop reductions, e.g., 91% sparsity translating into commensurate operator-level and end-to-end speedups on HunyuanVideo (Hassani et al., 23 Apr 2025).
Table: Operator Complexity Comparison
| Method | Time per Token | Memory Reads/Token |
|---|---|---|
| Dense Self-Attention | $2N d$ | $O(N d)$ |
| Sliding-Window | $2W d$ | $O(W d)$, plus mixed-block overhead |
| Blocked Attention | $2B d$ | $O(B d)$ |
| STA (window $W$, stride $S$) | $2W d$ | $O(W d)$ |
STA’s memory traffic on $K$/$V$ is reduced proportionally to the attended fraction $W/N$, and its use of dense, mask-free GEMM blocks yields near-maximal MFU (e.g., 58.79% MFU on H100 GPUs, matching 94% of dense attention’s measured math utilization) (Zhang et al., 6 Feb 2025, Hassani et al., 23 Apr 2025).
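The flop accounting above reduces to simple arithmetic. The helpers below (hypothetical names; the grid and window sizes are illustrative, not the papers' actual latent shapes) compute attention sparsity and the ideal flop-matched speedup for a cubic window over a 3D token grid:

```python
import math

def sta_sparsity(grid, window):
    """Fraction of the dense attention matrix skipped: each query attends
    only to the prod(window) tokens inside its tile's window."""
    return 1.0 - math.prod(window) / math.prod(grid)

def ideal_speedup(grid, window):
    """Attention speedup if execution time tracks the flop reduction exactly."""
    return math.prod(grid) / math.prod(window)

# Illustrative 3D latent grid (frames, height, width) and attention window:
grid, window = (16, 32, 32), (8, 16, 16)
print(f"sparsity = {sta_sparsity(grid, window):.1%}")        # 87.5%
print(f"ideal speedup = {ideal_speedup(grid, window):.0f}x")  # 8x
```

Because STA's blocks are perfectly dense, the measured speedup stays close to this ideal ratio, which is exactly the "speedups match theoretical flop reductions" property noted above.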
4. STA Algorithm, Implementation, and Tuning
Forward Pass
For a multi-head transformer layer using STA (Ding et al., 10 Feb 2025):
```
Input: X ∈ ℝ^{N×d}, mask ℳ specifying global frames per tile
1. Compute Q, K, V ← X W_Q, X W_K, X W_V ∈ ℝ^{N×d_h} (per head)
2. Reshape Q, K, V into [F, P, d_h]
3. For i in 1…F:
     K_i ← concat(K[j] for j in {i} ∪ ℛ_i)   # diagonal and global frames
     V_i ← concat(V[j] for j in {i} ∪ ℛ_i)
     S ← Q[i] @ K_i.T / sqrt(d_h)            # (P × |{i} ∪ ℛ_i|·P)
     Y_block[i] ← softmax_rowwise(S) @ V_i   # softmax over all allowed keys jointly
4. Concatenate Y_block[1…F] → Y ∈ ℝ^{N×d_h}
5. Concatenate heads, apply output projection for final Y
```
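A runnable single-head NumPy sketch of this forward pass (function and argument names are hypothetical, and multi-head handling is omitted for brevity):

```python
import numpy as np

def sta_forward(X, W_Q, W_K, W_V, F, P, refs):
    """Single-head STA sketch: each query frame i attends to frame i plus
    its global reference frames refs[i]. X has shape (F*P, d)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_h = Q.shape[-1]
    Qb, Kb, Vb = (Z.reshape(F, P, d_h) for Z in (Q, K, V))
    Y = np.empty((F, P, d_h))
    for i in range(F):
        cols = sorted({i, *refs.get(i, ())})         # allowed key frames
        Kc = np.concatenate([Kb[j] for j in cols])   # (|cols|*P, d_h)
        Vc = np.concatenate([Vb[j] for j in cols])
        S = Qb[i] @ Kc.T / np.sqrt(d_h)
        A = np.exp(S - S.max(axis=-1, keepdims=True))
        Y[i] = (A / A.sum(axis=-1, keepdims=True)) @ Vc
    return Y.reshape(F * P, d_h)
```

Note that the softmax normalizes jointly over all allowed key frames; fused kernels achieve the same result with online-softmax accumulation across blocks.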
Block/Mixed Mask Avoidance
Kernel implementations (e.g., ThunderKittens, CUTLASS FMHA for Blackwell) process token tiles by flattening the multi-dimensional token grid, preloading each tile’s keys/values into SRAM, and streaming dense tile-tile matmuls to compute warps. Query and key blocks are perfectly matched, so each threadblock operates exclusively on dense slices—eliminating mask logic and maximizing data reuse (Zhang et al., 6 Feb 2025, Hassani et al., 23 Apr 2025). This property distinguishes STA from token-wise sliding window attention, which produces many partially-sparse “mixed blocks” that cause compute and memory-copy inefficiencies.
Training-Free and Adaptive Mask Selection
A training-free mask search (Algorithm 1) is feasible: for each layer/head, several candidate (window, stride) sparsity patterns are evaluated on a set of validation samples; the sparsest setting that keeps the MSE drift below a threshold is chosen for deployment. For more aggressive sparsity, fine-tuning with distillation recovers any lost quality (Ding et al., 10 Feb 2025, Zhang et al., 6 Feb 2025).
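The selection step can be sketched as follows (a simplified stand-in for Algorithm 1; `select_sparsest` and the candidate tuples are hypothetical, and real candidates would come from evaluating (window, stride) patterns per layer/head on validation samples):

```python
import numpy as np

def select_sparsest(ref_out, candidates, mse_tol):
    """Pick, for one layer/head, the sparsest candidate whose output drifts
    from the full-attention reference by at most mse_tol (MSE).

    candidates: list of (sparsity, output) pairs, one per mask pattern.
    Falls back to the densest candidate if none passes the tolerance."""
    best = None
    for sparsity, out in candidates:
        mse = float(np.mean((out - ref_out) ** 2))
        if mse <= mse_tol and (best is None or sparsity > best[0]):
            best = (sparsity, mse)
    if best is None:
        # No pattern is accurate enough: keep the densest one available.
        sparsity, out = min(candidates, key=lambda c: c[0])
        best = (sparsity, float(np.mean((out - ref_out) ** 2)))
    return best[0]
```

Tightening `mse_tol` trades sparsity (speed) for fidelity, which is where the fine-tuning/distillation step mentioned above recovers quality at aggressive settings.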
5. Empirical Performance and Quality
Benchmarks on major generative models validate that STA achieves substantial speedups without compromising performance:
- Inference time (HunyuanVideo, 5s, 720p): STA reduces end-to-end latency from 945 s (FlashAttention-3) to 268 s after fine-tuning, with only a 0.09% drop in VBench total score (Zhang et al., 6 Feb 2025).
- Operator-level speedup matches flop-based predictions with measured-to-theoretical speedup ratios within 2%.
- On Open-Sora-Plan-1.2 (Efficient-vDiT), a 7.4–7.8x speedup for 29- and 93-frame 720p video is obtained with less than 1% degradation on VBench and CD-FVD (Ding et al., 10 Feb 2025).
- STA demonstrates transferability to 2D tasks (e.g., FLUX super-resolution) and other transformer domains, with 45% speedup and essentially unchanged SSIM (Zhang et al., 6 Feb 2025, Hassani et al., 23 Apr 2025).
- Quality-preserving properties are attributed to the data-independence and stability of tile selection: the top 90% of attention mass is shared across >90% of prompts (Ding et al., 10 Feb 2025).
6. Generalizations and Theoretical Context
STA is a special case of the Generalized Neighborhood Attention (GNA) framework, which encompasses standard sliding window (stride=1), blocked (stride=window), and intermediate strided layouts corresponding to tile-based attention (Hassani et al., 23 Apr 2025). GNA explicitly parameterizes window size and stride for each dimension of the token grid and provides a simulator for hardware-specific speedup estimation. Practical guidelines suggest selecting sparsity (window) for the desired inductive bias, matching tile sizes to hardware kernel granularity, and leveraging analytical roofline models for configuration.
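To make the window/stride parameterization concrete, the 1-D mask generator below (a hypothetical helper, not NATTEN's API) shows how stride=1 recovers token-wise sliding window, stride=window recovers blocked attention, and intermediate strides give the tile-based layouts in between:

```python
import numpy as np

def gna_mask_1d(n, window, stride):
    """1-D GNA attention mask sketch: queries are grouped into stride-sized
    tiles, and every query in a tile shares one window of `window` keys
    centered on the tile (clamped at sequence edges). Assumes window <= n."""
    M = np.zeros((n, n), dtype=bool)
    for q in range(n):
        anchor = (q // stride) * stride + stride // 2   # tile representative
        start = min(max(anchor - window // 2, 0), n - window)
        M[q, start:start + window] = True               # shared key window
    return M
```

Because all queries in a stride group share one key window, the resulting mask is block-aligned, which is exactly what lets the kernel skip masking logic entirely.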
Empirical results confirm that, with perfect block-aligned groupings and hardware support, block-sparse attention (i.e., STA/GNA) achieves operator and end-to-end speedups that are closely proportional to actual flop reductions and reach up to 46% end-to-end savings on foundational vision/generative models (HunyuanVideo, FLUX, Cosmos-7B) (Hassani et al., 23 Apr 2025). The simulator and Blackwell-optimized kernels are made available via NATTEN.
7. Integration Into Video Diffusion Pipelines and Future Directions
Efficient-vDiT integrates STA within a three-stage framework:
- Multi-Step Latent Consistency (MSLC) distillation reduces diffusion sampling steps (e.g., 100 to 20).
- Per-layer, per-head mask search maximizes sparsity while bounding error.
- Cross-model knowledge distillation aligns the outputs of sparse and full-attention student/teacher models.
Parallelism-friendly block structure enables distributed inference with further acceleration (3.91x on 4 GPUs), and the approach is applicable in both finetuning and direct application to pretrained models (Ding et al., 10 Feb 2025).
Limitations include the growing nontrivial costs of other layers (MLP, normalization, VAE decode), capping potential gains as attention accelerates, and the current focus on regular, cubic tiling. Open directions include autotuning window/stride per head, dynamic or data-driven tiling arrangements, and extending to longer, more complex input domains (Zhang et al., 6 Feb 2025, Ding et al., 10 Feb 2025, Hassani et al., 23 Apr 2025).