
Quasi-Linear Contextual Expressivity

Updated 17 December 2025
  • Quasi-Linear Contextual Expressivity is defined via a sliced ReLU kernel that uses one-dimensional projections and sorting-based prefix sums.
  • This approach achieves O(n log n) computational complexity while preserving universal approximation properties for modeling long contexts.
  • Empirical results highlight improved accuracy over softmax in long-range tasks with applications in NLP, vision, and geometric learning.

Sliced ReLU attention is a differentiable attention mechanism that achieves quasi-linear computational complexity while retaining strong theoretical expressivity. It departs significantly from both softmax attention and earlier ReLU-based alternatives by operating on learned one-dimensional projections of key–query differences and leveraging efficient sorting-based algorithms for evaluation. The construction yields a non-symmetric kernel well suited for scaling to very long contexts, while preserving, via explicit proofs, key universal approximation properties previously established for softmax attention (Boufadène et al., 12 Dec 2025).

1. Formal Definition and Mathematical Structure

Given an input sequence $X = (x_1, \ldots, x_n) \subset \mathbb{R}^d$ and standard query ($Q$), key ($K$), and value ($V$) projections $Q, K, V \in \mathbb{R}^{d \times d}$, together with a learned one-dimensional projection $\Pi: \mathbb{R}^d \to \mathbb{R}$ (often a small MLP), the sliced ReLU kernel is defined for a query token $x_i$ and a key token $x_j$ via the scalar scores
$$s_i = \Pi(Q x_i), \qquad t_j = \Pi(K x_j)$$

$$K(s_i, t_j) = \operatorname{ReLU}(s_i - t_j) = \max\{s_i - t_j,\, 0\}$$

This kernel is non-symmetric: $\operatorname{ReLU}(s_i - t_j) \neq \operatorname{ReLU}(t_j - s_i)$ except in the degenerate case $s_i = t_j$.

The normalized attention output at $x_i$ is
$$\mathcal{A}_{\theta,\Pi}^{\operatorname{ReLU}}(x_i; X) = \sum_{j=1}^n \frac{\operatorname{ReLU}(\Pi Q x_i - \Pi K x_j)}{\sum_{l=1}^n \left|\Pi Q x_i - \Pi K x_l\right|} \left( V x_j - \frac{1}{n} \sum_{m=1}^n V x_m \right)$$
Centering $V x_j$ by its empirical mean ensures the output lies in the subspace on which the (generally asymmetric) ReLU kernel is conditionally positive definite. Rewriting the kernel via $\operatorname{ReLU}(x - y) = \frac{1}{2}|x - y| + \frac{1}{2}(x - y)$ shows that it can be interpreted as an asymmetric variant of a one-dimensional energy-distance kernel.
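A minimal NumPy sketch of the normalized attention above may clarify the mechanics (an illustrative implementation, not the authors' code; `pi` stands in for the learned MLP $\Pi$ and can be any map from projected tokens to scalar scores):

```python
import numpy as np

def sliced_relu_attention_naive(X, Q, K, V, pi):
    """Direct O(n^2) evaluation of normalized sliced ReLU attention.

    X: (n, d) tokens; Q, K, V: (d, d) projections;
    pi: maps an (n, d) array of projected tokens to (n,) scalar scores
    (a learned MLP in the paper; any callable works for illustration).
    """
    s = pi(X @ Q.T)                                 # s_i = Pi(Q x_i)
    t = pi(X @ K.T)                                 # t_j = Pi(K x_j)
    values = X @ V.T
    centered = values - values.mean(axis=0)         # V x_j - (1/n) sum_m V x_m
    diff = s[:, None] - t[None, :]                  # s_i - t_j for all pairs
    weights = np.maximum(diff, 0.0)                 # ReLU(s_i - t_j)
    norm = np.abs(diff).sum(axis=1, keepdims=True)  # sum_l |s_i - t_l|
    return (weights / norm) @ centered
```

This quadratic-cost form serves as a reference against which a sorting-based evaluation can be validated.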

2. Quasi-linear Algorithm via Sorting

While a naïve implementation would require $O(n^2)$ pairwise computations, the one-dimensional structure of the projection enables an $O(n \log n)$ algorithm. After computing the scalar projection scores and the centered values $\gamma_j = V x_j - \bar{V}$, the scores are sorted. Writing $z_{(1)} \le \cdots \le z_{(n)}$ for the sorted scores and $\gamma_{(j)}$ for the correspondingly permuted values, the raw attention output at sorted position $i$ is
$$\sum_{j=1}^n \operatorname{ReLU}(z_{(i)} - z_{(j)})\,\gamma_{(j)} = \left( \sum_{j \le i} \gamma_{(j)} \right) z_{(i)} - \sum_{j \le i} z_{(j)}\,\gamma_{(j)}$$
The prefix sums over $\gamma_{(j)}$ and $z_{(j)}\,\gamma_{(j)}$ are computed in a single linear scan after sorting. The full algorithm computes the necessary prefix sums, applies them in sorted order, and then unsorts the results to restore the original token order. The resulting complexity is $O(n \log n)$ for sorting and unsorting and $O(n)$ for all prefix-sum operations.
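The sorted prefix-sum scheme can be sketched as follows (a hedged illustration with invented names, not the authors' implementation). Rather than a single joint sort, this variant sorts the key scores once and locates each query score by binary search, which has the same $O(n \log n)$ cost:

```python
import numpy as np

def sliced_relu_attention_fast(s, t, gamma):
    """O(n log n) sliced ReLU attention via sorting and prefix sums.

    s: (n,) query scores; t: (n,) key scores;
    gamma: (n, d) centered values (V x_j minus their empirical mean).
    """
    n, d = gamma.shape
    order = np.argsort(t, kind="stable")
    t_sorted = t[order]
    g_sorted = gamma[order]

    # Prefix sums over sorted keys (index 0 holds the empty sum).
    P_g = np.vstack([np.zeros((1, d)), np.cumsum(g_sorted, axis=0)])
    P_tg = np.vstack([np.zeros((1, d)),
                      np.cumsum(t_sorted[:, None] * g_sorted, axis=0)])
    P_t = np.concatenate([[0.0], np.cumsum(t_sorted)])

    # k[i] = number of keys with t_j <= s_i (binary search per query).
    k = np.searchsorted(t_sorted, s, side="right")

    # Numerator: sum_j ReLU(s_i - t_j) gamma_j = s_i * P_g[k] - P_tg[k].
    raw = s[:, None] * P_g[k] - P_tg[k]

    # Denominator: sum_l |s_i - t_l| = (2k - n) s_i + sum_l t_l - 2 P_t[k].
    norm = (2 * k - n) * s + P_t[-1] - 2 * P_t[k]
    return raw / norm[:, None]
```

On random inputs the result agrees with the direct $O(n^2)$ double sum, which provides a convenient sanity check of the prefix-sum identity.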

3. Fundamental Theoretical Expressivity

3.1 Sequence-to-Sequence Disentangling

Sliced ReLU attention maintains the sequence-to-sequence disentangling property: for any pair of finite families of distinct sequences, there exists a composition of finitely many ReLU attention layers that maps each source sequence exactly to a corresponding target sequence without mixing. Formally, for dimension $d \geq 2$ and two families $\{\mathbf{x}_i\}_{i=1}^p, \{\mathbf{y}_i\}_{i=1}^p \subset \mathbb{R}^{d \times n}$, there exists a composition of at most $2p(n+1) - 1$ sliced ReLU attention layers effecting this exact mapping. The proof uses iterative one-dimensional projections and the construction of localized ReLU "bumps" for precise token manipulation, with the layer bound resulting from the necessary "splitting" and "matching" passes.
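The localized ReLU "bump" in the proof sketch can be illustrated by the standard fact that three shifted ReLUs combine into a triangular hat supported on an interval (a generic one-dimensional construction, not the paper's exact parameterization):

```python
import numpy as np

def relu_bump(x, a, b):
    """Triangular bump ReLU(x - a) - 2 ReLU(x - m) + ReLU(x - b), m = (a + b)/2.

    Vanishes identically outside [a, b] and peaks at height (b - a)/2 at the
    midpoint, so it responds only to scores falling in the chosen interval.
    """
    m = 0.5 * (a + b)
    relu = lambda z: np.maximum(z, 0.0)
    return relu(x - a) - 2.0 * relu(x - m) + relu(x - b)
```

Applied to the one-dimensional projected scores, such a bump lets a layer act on tokens whose score lies in a target interval while leaving all other tokens unchanged.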

3.2 Contextual Universal Approximation in the Mean-Field Limit

The contextual universality theorem extends to the mean-field regime. If $\Lambda^*: \mathbb{R}^d \times \mathcal{P}(\mathbb{R}^d) \to \mathbb{R}^{d'}$ is any continuous map on token–distribution pairs (with respect to the weak-star topology on measures), then for any $\varepsilon > 0$ a finite-depth composition of sliced ReLU attention heads and affine MLPs uniformly approximates $\Lambda^*$ within $\varepsilon$. The proof combines the injectivity of the Radon transform, the construction of an algebra from one-dimensional ReLU attention–MLP compositions, and the Stone–Weierstrass theorem. The resulting architecture is shown to be dense in $C(\mathbb{R}^d \times \mathcal{P}(\mathbb{R}^d))$.

4. Comparative Analysis with Other Attention Mechanisms

A comparison of several principal attention mechanisms highlights the unique positioning of sliced ReLU:

| Mechanism | Computational cost | Symmetry | Global expressivity |
|---|---|---|---|
| Softmax | $O(n^2)$ | Symmetric (modulo normalization) | Universal |
| Linear approximations (Performers, etc.) | $O(n)$ | Typically symmetric, sometimes approximate | Often compromised |
| Direct ReLU on dot product | $O(n^2)$ | Non-symmetric | Less stable |
| SliceFormer / sorting-based sparse | Variable | Often non-differentiable | Partial |
| Sliced ReLU attention | $O(n \log n)$ | Non-symmetric | Universal |

Sliced ReLU replaces the $d$-dimensional softmax geometry with efficient one-dimensional projections. In contrast to random-feature or low-rank approximations, it remains exact, global, and strongly expressive. Unlike sorting-based sparse alternatives, the method maintains differentiability and precise global attention.

5. Empirical Evaluation

Empirical benchmarks on several tasks reveal the practical trade-offs of the approach:

  • Long Range Arena (LRA): On tasks with up to 4K tokens, sliced ReLU achieves higher mean accuracy (62.9%) than softmax (59.8%) over five tasks, outperforming softmax on retrieval and Pathfinder while underperforming slightly on ListOps and byte-level text classification. Throughput gains in long contexts (ListOps at 2K tokens) range from $1.4\times$ (inference) to $4\times$ (training), reflecting the computational advantage of the quasi-linear algorithm.
  • Tiny ViT on CIFAR-10 and Tiny ImageNet: Sliced ReLU-based models, particularly the localized ReLU-bump variant, closely track the accuracy of softmax-based ViTs under identical architectural configurations. The plain sliced ReLU kernel performs slightly lower but still matches or exceeds softmax in low-parameter regimes, despite a roughly 10% larger parameter budget due to the MLP in $\Pi$.
  • ModelNet40 Point-Cloud Classification: In a Point Cloud Transformer setting (without local neighborhoods), softmax achieves 86.3% accuracy, the ReLU-bump kernel 85.4%, and the plain Sliced ReLU 76.2%. The result suggests that fine-grained geometric locality is better captured by more localized variants of the kernel.

A plausible implication is that sliced ReLU attention offers a favorable trade-off between accuracy, complexity, and global expressivity compared to existing mechanisms, especially in scenarios with very long input sequences.

6. Significance and Applications

Sliced ReLU attention provides a new mechanism for efficient, differentiable, and theoretically expressive sequence modeling. It is particularly suited for applications that require processing of very long contexts, due to its quasi-linear complexity and preserved universal approximation capabilities. The method enables scaling of attention-based models to longer sequences than previously tractable with conventional softmax, while avoiding the approximation pitfalls and differentiability breaks of prior linear or sparse methods.

Ongoing application areas include natural language modeling on long documents, vision transformers with large patch sets, and geometric learning tasks (e.g., point clouds), where computational and expressivity bottlenecks of previous attention mechanisms have proven limiting. Future directions may include further kernel variations, integration with adaptive projection mechanisms, and large-scale deployments in settings with massive token counts.


For additional structural, theoretical, or empirical details, see "Sliced ReLU attention: Quasi-linear contextual expressivity via sorting" (Boufadène et al., 12 Dec 2025).
