Quasi-Linear Contextual Expressivity
- Quasi-Linear Contextual Expressivity is defined via a sliced ReLU kernel that uses one-dimensional projections and sorting-based prefix sums.
- This approach achieves O(n log n) computational complexity while preserving universal approximation properties for modeling long contexts.
- Empirical results highlight improved accuracy over softmax in long-range tasks with applications in NLP, vision, and geometric learning.
Sliced ReLU attention is a differentiable attention mechanism that achieves quasi-linear computational complexity while retaining strong theoretical expressivity. It departs significantly from both softmax attention and earlier ReLU-based alternatives by operating on learned one-dimensional projections of key–query differences and leveraging efficient sorting-based algorithms for evaluation. The construction yields a non-symmetric kernel well suited for scaling to very long contexts, while preserving, via explicit proofs, key universal approximation properties previously established for softmax attention (Boufadène et al., 12 Dec 2025).
1. Formal Definition and Mathematical Structure
Given an input sequence $(x_1, \dots, x_n)$ and standard query ($q_i = W_Q x_i$), key ($k_j = W_K x_j$), and value ($v_j = W_V x_j$) projections, together with a learned one-dimensional projection $\theta : \mathbb{R}^d \to \mathbb{R}$ (often a small MLP), the sliced ReLU kernel is defined for tokens $x_i$ (query) and $x_j$ (key) as:

$$K(x_i, x_j) = \mathrm{ReLU}\big(\theta(q_i) - \theta(k_j)\big).$$

This kernel is non-symmetric; that is, $K(x_i, x_j) \neq K(x_j, x_i)$ except in degenerate cases.
The normalized attention output at query token $x_i$ is:

$$y_i = \frac{1}{n} \sum_{j=1}^{n} \mathrm{ReLU}\big(\theta(q_i) - \theta(k_j)\big)\,(v_j - \bar v), \qquad \bar v = \frac{1}{n} \sum_{j=1}^{n} v_j.$$

Centering $v_j$ by its empirical mean $\bar v$ ensures the output lies in the subspace on which the (generally asymmetric) ReLU kernel is conditionally positive definite. The attention kernel, when rewritten via $\mathrm{ReLU}(u) = \tfrac{1}{2}(u + |u|)$, can be interpreted as an asymmetric variant of a one-dimensional energy-distance kernel.
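To make the definition concrete, here is a minimal NumPy sketch of the kernel and the naive $O(n^2)$ centered output. All shapes are toy values, and the linear map standing in for the learned small MLP $\theta$ is an illustrative placeholder, not the paper's parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4

# Toy tokens with standard query/key/value projections.
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = rng.normal(size=(3, d, d))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Learned one-dimensional projection theta (linear placeholder for a small MLP).
theta = rng.normal(size=d)
s = Q @ theta                                   # s_i = theta(q_i)
t = K @ theta                                   # t_j = theta(k_j)

# Sliced ReLU kernel K_ij = ReLU(theta(q_i) - theta(k_j)); non-symmetric in general.
Kmat = np.maximum(s[:, None] - t[None, :], 0.0)

# Normalized output with empirically mean-centered values.
V_centered = V - V.mean(axis=0)
Y = Kmat @ V_centered / n

assert Kmat.shape == (n, n) and (Kmat >= 0).all()
assert Y.shape == (n, d)
```

This direct evaluation costs $O(n^2 d)$ and serves only as a reference definition; the sorting-based algorithm of the next section computes the same output in quasi-linear time.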
2. Quasi-linear Algorithm via Sorting
While a naïve implementation would require $O(n^2)$ pairwise computations, the structure of the one-dimensional projection enables an $O(n \log n)$ algorithm. After computing scalar projection scores $s_i = \theta(q_i)$, $t_j = \theta(k_j)$ and projected values (e.g., $t_j v_j$), the key scores are sorted. The raw attention output for each position, taken in sorted order, is:

$$\tilde y_i = \sum_{j :\, t_j \le s_i} (s_i - t_j)\, v_j = s_i \sum_{j :\, t_j \le s_i} v_j \;-\; \sum_{j :\, t_j \le s_i} t_j v_j.$$

These prefix sums over $v_j$ and $t_j v_j$ are computed in a single linear scan after sorting. The full algorithm sequentially computes and accumulates the necessary prefix sums, applies them in sorted order, and then unsorts the results to restore the original order. The resulting computational complexity is $O(n \log n)$ for both sorting and unsorting, and $O(nd)$ for all prefix sum operations.
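A sketch of the quasi-linear evaluation under the definitions above (function and variable names are illustrative, not from the paper); for brevity it uses `np.searchsorted` in place of a merged linear scan, which keeps the same $O(n \log n)$ cost, and checks the result against the naive $O(n^2)$ formula.

```python
import numpy as np

def sliced_relu_attention(s, t, v):
    """Quasi-linear sliced ReLU attention (sketch).

    s: (n,) query-side scores s_i = theta(q_i)
    t: (n,) key-side scores   t_j = theta(k_j)
    v: (n, d) value vectors (assumed already mean-centered)
    Returns y with y_i = (1/n) * sum_j ReLU(s_i - t_j) * v_j.
    """
    n, d = v.shape
    # Sort keys by score; prefix sums of v_j and t_j * v_j in that order.
    order = np.argsort(t)
    t_sorted = t[order]
    v_sorted = v[order]
    cum_v = np.cumsum(v_sorted, axis=0)                        # prefix sums of v_j
    cum_tv = np.cumsum(t_sorted[:, None] * v_sorted, axis=0)   # prefix sums of t_j * v_j
    # For each query score s_i, count keys with t_j <= s_i (O(n log n) total).
    counts = np.searchsorted(t_sorted, s, side="right")
    y = np.zeros((n, d))
    nz = counts > 0
    idx = counts[nz] - 1
    # ReLU(s_i - t_j) is nonzero only for t_j <= s_i, so
    # y_i = s_i * sum_{t_j <= s_i} v_j - sum_{t_j <= s_i} t_j * v_j.
    y[nz] = s[nz, None] * cum_v[idx] - cum_tv[idx]
    return y / n

# Reference check against the naive O(n^2) definition.
rng = np.random.default_rng(0)
n, d = 128, 8
s, t = rng.normal(size=n), rng.normal(size=n)
v = rng.normal(size=(n, d))
v -= v.mean(axis=0)
y_fast = sliced_relu_attention(s, t, v)
y_naive = np.maximum(s[:, None] - t[None, :], 0.0) @ v / n
assert np.allclose(y_fast, y_naive)
```

Tokens whose score falls below every key score receive a zero output row, since all ReLU weights vanish there; this is handled by the `counts > 0` mask rather than a sentinel prefix entry.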
3. Fundamental Theoretical Expressivity
3.1 Sequence-to-Sequence Disentangling
Sliced ReLU attention maintains the sequence-to-sequence disentangling property: for any pair of finite families of distinct sequences, there exists a composition of finitely many sliced ReLU attention layers that maps each source sequence exactly to a corresponding target sequence without mixing. Formally, for token dimension $d$ and any two families of $p$ distinct sequences of $n$ tokens each, there exists a composition of at most $2p(n+1)-1$ sliced ReLU attention layers effecting this exact mapping. The proof uses iterative one-dimensional projections and the construction of localized ReLU "bumps" for precise token manipulation, with the layer bound resulting from the necessary "splitting" and "matching" passes.
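The localized ReLU "bumps" in the proof can be illustrated with the standard piecewise-linear hat construction from three ReLUs; the specific coefficients below are a textbook example, not taken from the paper.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def bump(x, a, b):
    """Piecewise-linear hat supported on [a, b], peaking at the midpoint m.

    bump(x) = ReLU(x - a) - 2 ReLU(x - m) + ReLU(x - b), with m = (a + b) / 2.
    It vanishes for x <= a and x >= b, so it can act on tokens whose 1-D
    projection falls inside (a, b) without touching any others.
    """
    m = 0.5 * (a + b)
    return relu(x - a) - 2.0 * relu(x - m) + relu(x - b)

x = np.linspace(-2.0, 2.0, 401)
y = bump(x, a=-1.0, b=1.0)
assert np.allclose(y[x <= -1.0], 0.0)   # zero left of the support
assert np.allclose(y[x >= 1.0], 0.0)    # zero right of the support
assert np.isclose(y.max(), 1.0)         # peak height (b - a) / 2 = 1, at x = 0
```

Because the bump is exactly zero outside $[a, b]$, a layer built from it can move one token (isolated by its projection score) while leaving all other tokens fixed, which is the mechanism behind the splitting and matching passes.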
3.2 Contextual Universal Approximation in the Mean-Field Limit
The contextual universality theorem extends to the mean-field regime. If $F$ is any continuous map on token–distribution pairs (for the weak-star topology on measures), then given $\varepsilon > 0$, a finite-depth composition of sliced ReLU attention heads and affine MLPs can uniformly approximate $F$ within $\varepsilon$. The proof combines injectivity of the Radon transform, the construction of an algebra of functions from one-dimensional ReLU attention–MLP compositions, and the Stone–Weierstrass theorem. The resulting architecture class is shown to be dense in the space of continuous maps on token–distribution pairs.
4. Comparative Analysis with Other Attention Mechanisms
A comparison of several principal attention mechanisms highlights the unique positioning of sliced ReLU:
| Mechanism | Computational Cost | Symmetry | Global Expressivity |
|---|---|---|---|
| Softmax | $O(n^2)$ | Symmetric (modulo normalization) | Universal |
| Linear approximations (Performers, etc.) | $O(n)$ | Typically symmetric, sometimes approximate | Often compromised |
| Direct ReLU on dot-product | $O(n^2)$ | Non-symmetric | Less stable |
| SliceFormer/sorting-based sparse | $O(n \log n)$ | Variable | Partial (often non-differentiable) |
| Sliced ReLU attention | $O(n \log n)$ | Non-symmetric | Universal |
Sliced ReLU replaces the $d$-dimensional softmax geometry with efficient 1D projections. In contrast to random-feature or low-rank approximations, it remains exact, global, and strongly expressive. Unlike sorting-based sparse alternatives, the method maintains differentiability and precise global attention.
5. Empirical Evaluation
Empirical benchmarks on several tasks reveal the practical trade-offs of the approach:
- Long Range Arena (LRA): On tasks with up to 4K tokens, sliced ReLU achieves higher mean accuracy (62.9%) than softmax (59.8%) over five tasks, outperforming softmax on Retrieval and Pathfinder while underperforming slightly on ListOps and byte-level text classification. Throughput gains in long contexts (ListOps 2K) are reported for both inference and training, reflecting the computational advantage of the quasi-linear algorithm.
- Tiny ViT on CIFAR-10 and Tiny ImageNet: Sliced ReLU-based models, particularly the localized ReLU-bump variant, closely track the accuracy of softmax-based ViTs under identical architectural configurations. The plain sliced ReLU kernel performs slightly worse, but still matches or exceeds softmax in low-parameter regimes, despite an approximately 10% larger parameter budget due to the small MLP implementing the one-dimensional projection.
- ModelNet40 Point-Cloud Classification: In a Point Cloud Transformer setting (without local neighborhoods), softmax achieves 86.3% accuracy, the ReLU-bump kernel 85.4%, and the plain Sliced ReLU 76.2%. The result suggests that fine-grained geometric locality is better captured by more localized variants of the kernel.
A plausible implication is that sliced ReLU attention offers a favorable trade-off between accuracy, complexity, and global expressivity compared to existing mechanisms, especially in scenarios with very long input sequences.
6. Significance and Applications
Sliced ReLU attention provides a new mechanism for efficient, differentiable, and theoretically expressive sequence modeling. It is particularly suited for applications that require processing of very long contexts, due to its quasi-linear complexity and preserved universal approximation capabilities. The method enables scaling of attention-based models to longer sequences than previously tractable with conventional softmax, while avoiding the approximation pitfalls and differentiability breaks of prior linear or sparse methods.
Ongoing application areas include natural language modeling on long documents, vision transformers with large patch sets, and geometric learning tasks (e.g., point clouds), where computational and expressivity bottlenecks of previous attention mechanisms have proven limiting. Future directions may include further kernel variations, integration with adaptive projection mechanisms, and large-scale deployments in settings with massive token counts.
For additional structural, theoretical, or empirical details, see "Sliced ReLU attention: Quasi-linear contextual expressivity via sorting" (Boufadène et al., 12 Dec 2025).