Clustered Proximity Attention in Transformers

Updated 3 February 2026
  • Clustered Proximity Attention (CPA) is a fast self-attention mechanism that leverages query/key clustering to achieve linear time and memory complexity in sequence and spatial tasks.
  • CPA employs techniques like LSH, K-means clustering, and polar partitioning to restrict attention computations to a small set of candidate keys while ensuring bounded approximation error.
  • Empirical results in ASR, NLU, and routing demonstrate that CPA balances speed and accuracy, offering significant memory savings and tunable trade-offs based on application needs.

Clustered Proximity Attention (CPA) is a class of fast, sparsity-inducing self-attention mechanisms for Transformers that reduce the quadratic time and memory complexity of standard softmax-attention to linear in the sequence or node count. CPA algorithms achieve this by leveraging query/key grouping—through clustering or locality-aware partitioning—thereby restricting each attention computation to a small set of relevant candidates, while maintaining empirical accuracy and bounded approximation error in sequence modeling and combinatorial optimization contexts (Vyas et al., 2020, Basharzad et al., 27 Jan 2026).

1. Foundations of Clustered Proximity Attention

CPA was first introduced in the context of sequence modeling as a linear-time approximation to standard self-attention. In the traditional formulation, given queries $Q \in \mathbb{R}^{N \times d}$, keys $K \in \mathbb{R}^{N \times d}$, and values $V \in \mathbb{R}^{N \times D_v}$ for a sequence of length $N$, full attention computes the matrix

$$A = \mathrm{softmax}\left(QK^\top / \sqrt{d}\right)$$

with the output $\hat{V} = AV$, requiring $O(N^2 d)$ compute and $O(N^2)$ memory. This scaling is prohibitive for long sequences or large graphs (Vyas et al., 2020).
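
For concreteness, a minimal NumPy sketch of this quadratic baseline; the $(N, N)$ score matrix is the bottleneck that CPA removes:

```python
import numpy as np

def full_attention(Q, K, V):
    """Standard softmax attention: O(N^2 d) compute, O(N^2) memory."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (N, N) scores: the quadratic bottleneck
    scores -= scores.max(axis=-1, keepdims=True)  # shift for numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)            # row-wise softmax
    return A @ V

rng = np.random.default_rng(0)
N, d, Dv = 128, 16, 32
Q, K, V = rng.normal(size=(N, d)), rng.normal(size=(N, d)), rng.normal(size=(N, Dv))
out = full_attention(Q, K, V)
print(out.shape)  # (128, 32)
```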

CPA circumvents this by dividing the queries into $C$ clusters ($C \ll N$), computing attention from cluster centroids to keys, and broadcasting the aggregate result back to cluster members. In spatial decision tasks such as vehicle routing, CPA instead uses geometric locality to form fixed-size spatial clusters and restricts each node's attention set to its cluster plus special tokens (e.g., the depot) (Basharzad et al., 27 Jan 2026).

2. Algorithms and Mathematical Formulations

The core CPA methodology differs slightly by domain and implementation. Two representative algorithms are as follows:

2.1. Sequence Modeling CPA

  • Clustering: Queries are assigned to clusters via Locality-Sensitive Hashing (LSH) to $B$-bit codes, followed by K-means clustering in Hamming space. Each query belongs to exactly one cluster, represented by a partitioning matrix $S \in \{0,1\}^{N \times C}$, with centroids $Q^c_j = (1/n_j)\sum_{i=1}^N S_{ij} Q_i$.
  • Attention calculation: Compute centroid-to-key attention $A^c = \mathrm{softmax}(Q^c K^\top/\sqrt{d})$, then aggregate values $\hat{V}^c = A^c V$.
  • Broadcasting: Each query $i$ inherits the output $\hat{V}_i = \sum_{j=1}^C S_{ij} \hat{V}^c_j$.
  • Complexity: $O(NCd + CND_v)$ for centroid attention and $O(NdB + NCL)$ for clustering, with $L$ the number of K-means iterations (Vyas et al., 2020).
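
The three steps above can be sketched in a few lines of NumPy. This is a simplified stand-in: it clusters with plain Euclidean k-means rather than the paper's LSH-plus-Hamming pipeline, and it omits top-$m$ refinement:

```python
import numpy as np

def clustered_attention(Q, K, V, C=8, iters=5, seed=0):
    """Cluster queries, attend once per centroid, broadcast to members.
    Plain Euclidean k-means stands in for LSH + Hamming-space k-means."""
    rng = np.random.default_rng(seed)
    N, d = Q.shape
    centers = Q[rng.choice(N, size=C, replace=False)]
    for _ in range(iters):                        # Lloyd iterations on the queries
        assign = np.argmin(((Q[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(C):
            if np.any(assign == j):
                centers[j] = Q[assign == j].mean(axis=0)
    scores = centers @ K.T / np.sqrt(d)           # (C, N) instead of (N, N)
    scores -= scores.max(axis=-1, keepdims=True)
    Ac = np.exp(scores)
    Ac /= Ac.sum(axis=-1, keepdims=True)          # centroid-to-key attention A^c
    Vc = Ac @ V                                   # (C, Dv) aggregated values
    return Vc[assign]                             # broadcast to member queries
```

With $C = 1$, every query receives the same output row, which makes the approximation's behavior at extreme compression easy to inspect.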

2.2. Geometric CPA for Vehicle Routing

  • Partitioning: Each node's coordinates are transformed to polar form $(d_i, \theta_i)$ around the depot and assigned a partitioning score $p_i = (1-\alpha)\,\bar{\theta}_i + \alpha\,\bar{d}_i$, where $\bar{\theta}_i$ and $\bar{d}_i$ are the normalized angle and radius and $\alpha$ is a mixing parameter. Customers are sorted by $p_i$ and cut into contiguous clusters of size $M$, giving $K = \lceil n/M \rceil$ clusters.
  • Attention masking: For each attention head, node $i$ attends only to its own cluster and the depot, reducing complexity to $O(nM)$ per layer for constant $M$.
  • Boundary smoothing: Optional jitter added to $p_i$ smooths cluster boundaries between partitioning rounds (Basharzad et al., 27 Jan 2026).
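
A sketch of the partitioning step, assuming 2-D customer coordinates and a single depot (the function and parameter names are illustrative):

```python
import numpy as np

def polar_partition(coords, depot, M=4, alpha=0.5, jitter=0.0, seed=0):
    """Score customers by a mix of normalized angle and radius around the
    depot, bucket-sort, and cut into contiguous clusters of size M."""
    rng = np.random.default_rng(seed)
    rel = coords - depot
    theta = np.arctan2(rel[:, 1], rel[:, 0])       # angle in [-pi, pi]
    r = np.linalg.norm(rel, axis=1)
    theta_bar = (theta + np.pi) / (2 * np.pi)      # normalized angle in [0, 1)
    r_bar = r / (r.max() + 1e-9)                   # normalized radius in [0, 1]
    p = (1 - alpha) * theta_bar + alpha * r_bar    # partitioning score p_i
    if jitter > 0:                                 # optional boundary smoothing
        p = p + rng.uniform(-jitter, jitter, size=p.shape)
    order = np.argsort(p)
    cluster = np.empty(len(coords), dtype=int)
    cluster[order] = np.arange(len(coords)) // M   # contiguous cuts of size M
    return cluster                                 # K = ceil(n / M) clusters
```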
| Variant / Domain | Clustering Mechanism | Attention Scope | Complexity per Layer |
|---|---|---|---|
| Sequence modeling (Vyas et al., 2020) | LSH + K-means in Hamming space | Cluster centroids, refined with top-$m$ keys | $O(N(d + D_v))$ |
| Spatial routing (Basharzad et al., 27 Jan 2026) | Polar partitioning + bucket sort | Per cluster (size $M$) plus depot | $O(nM)$ |

3. Error Analysis and Approximation Guarantees

CPA provides theoretical bounds on the approximation error induced by clustering:

  • If $\|Q_i - Q^c_j\|_2 \leq \epsilon$ for the cluster $j$ assigned to query $i$, then

$$\left\|\mathrm{softmax}(Q_i K^\top/\sqrt{d}) - \mathrm{softmax}(Q^c_j K^\top/\sqrt{d})\right\|_2 = O\!\left(\epsilon \|K\|_2/\sqrt{d}\right)$$

Thus the attention error is small for queries that lie close to their centroid (Vyas et al., 2020).
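
A quick numerical illustration of the bound: perturbing a single query by $\epsilon$ along a fixed unit direction and measuring the change in its softmax row should show the error shrinking roughly linearly with $\epsilon$:

```python
import numpy as np

def softmax_row(q, K):
    """Attention row softmax(q K^T / sqrt(d)) for a single query q."""
    s = q @ K.T / np.sqrt(K.shape[1])
    s -= s.max()
    e = np.exp(s)
    return e / e.sum()

rng = np.random.default_rng(1)
K = rng.normal(size=(64, 16))
q = rng.normal(size=16)
u = rng.normal(size=16)
u /= np.linalg.norm(u)                      # fixed unit perturbation direction
errs = [np.linalg.norm(softmax_row(q, K) - softmax_row(q + eps * u, K))
        for eps in (0.5, 0.05, 0.005)]
print(errs)                                 # error decreases as eps shrinks
```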

An improved variant, top-$m$ key refinement, selects for each cluster the $m$ keys with the highest centroid attention and computes exact per-query dot products for those keys. Letting $A^t$ be the refined attention and $A$ the full attention,

$$\|A^t_{i,\cdot} - A_{i,\cdot}\|_1 \leq \|A^c_{j,\cdot} - A_{i,\cdot}\|_1$$

i.e., refinement never increases the attention error in $L_1$ (Vyas et al., 2020).
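
A sketch of the refinement, assuming cluster assignments are given and every cluster is non-empty. The refined probabilities over the top-$m$ keys are rescaled to match the centroid's mass on those keys (an interpretation of the scheme, not the paper's exact implementation):

```python
import numpy as np

def topm_refined_attention(Q, K, V, assign, C, m=8):
    """Per cluster, pick the m keys with the highest centroid attention,
    recompute exact per-query probabilities on those keys (rescaled to the
    centroid's probability mass there), and keep the centroid row elsewhere.
    Assumes every cluster id 0..C-1 is non-empty."""
    N, d = Q.shape
    centers = np.stack([Q[assign == j].mean(axis=0) for j in range(C)])
    s = centers @ K.T / np.sqrt(d)
    s -= s.max(axis=-1, keepdims=True)
    Ac = np.exp(s)
    Ac /= Ac.sum(axis=-1, keepdims=True)            # centroid attention (C, N)
    out = np.empty((N, V.shape[1]))
    for i in range(N):
        j = assign[i]
        top = np.argsort(Ac[j])[-m:]                # top-m keys for cluster j
        mass = Ac[j, top].sum()                     # centroid mass to redistribute
        se = Q[i] @ K[top].T / np.sqrt(d)
        se -= se.max()
        pe = np.exp(se)
        pe = pe / pe.sum() * mass                   # exact probs, rescaled to mass
        row = Ac[j].copy()
        row[top] = pe                               # splice refined entries in
        out[i] = row @ V
    return out
```

As a sanity check, setting $m = N$ refines every key, so the result coincides with full attention.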

4. Implementation and Pseudocode

The CPA pipeline consolidates into the following steps (for sequence models):

  1. Project $Q$ to $B$-bit codes via LSH.
  2. Cluster in Hamming space to form $C$ clusters.
  3. Compute centroids $Q^c$ and centroid-to-key attention $A^c$.
  4. Broadcast the cluster attention outputs to all member queries.
  5. For top-$m$ key refinement, identify the top-$m$ keys per cluster and recompute exact attention to these keys per query.
  6. Aggregate the final output as a sum of centroid-based and refined per-key values.
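
Step 1 can be realized with sign random projections (SimHash); a sketch under that assumption, since the exact hashing scheme may differ:

```python
import numpy as np

def lsh_codes(Q, B=16, seed=0):
    """Sign random projections: map each query to a B-bit binary code.
    Queries separated by a small angle tend to share many bits."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(Q.shape[1], B))   # B random hyperplanes
    return (Q @ planes > 0).astype(np.uint8)    # (N, B) bit codes

def hamming(a, b):
    """Hamming distance between two bit codes."""
    return int(np.count_nonzero(a != b))
```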

For geometric CPA in routing, node partitioning uses polar-based bucketization, with each attention head assigned to one of several partitioning rounds (each with a different $\alpha$). Within each layer, projections and attention are computed as in standard Transformers but restricted to cluster-local keys and a single global token (the depot). Boundary smoothing randomizes cluster assignments at the edges, improving stability and performance (Basharzad et al., 27 Jan 2026).
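
The cluster-plus-depot restriction amounts to a boolean attention mask. A minimal sketch (token 0 is the depot; the helper name is illustrative):

```python
import numpy as np

def cluster_attention_mask(cluster):
    """cluster[i] is the cluster id of customer i; customers are tokens
    1..n and token 0 is the depot. True = attention allowed."""
    lab = np.concatenate(([-1], cluster))       # depot gets its own label
    mask = lab[:, None] == lab[None, :]         # same-cluster attention
    mask[:, 0] = True                           # every node attends to the depot
    mask[0, :] = True                           # the depot attends globally
    return mask
```

The resulting mask can be passed to any masked-attention kernel; with cluster size $M$, each customer row has $M + 1$ allowed entries (its cluster plus the depot).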

5. Empirical Results and Applications

Automatic Speech Recognition (ASR)

  • On WSJ and Switchboard, improved CPA (i-CPA) attains a 2× speed-up and up to 2% lower PER/WER relative to standard attention under equal FLOP or wall-clock budgets.
  • Convergence is also faster: roughly 50% fewer GPU hours than vanilla Transformers (Vyas et al., 2020).

Natural Language Understanding (Finetuned BERT)

  • Using $C = 25$ clusters (under 10% of the sequence length), i-CPA matches RoBERTa accuracy across GLUE and SQuAD, losing less than 1% F1 (Vyas et al., 2020).

Combinatorial and Vehicle Routing Problems

  • SEAFormer with CPA achieves $O(n)$ memory and computation, enabling training and inference on VRP instances with thousands of nodes (e.g., 5,000–7,000 customers) while matching state-of-the-art divide-and-conquer methods in solution quality. Memory savings of 85–92% are reported for large instances compared to full attention. On VRP-100 tasks, multiple partitioning rounds and boundary smoothing further reduce optimality gaps to 0.56% (Basharzad et al., 27 Jan 2026).

6. Trade-offs, Hyperparameters, and Practical Considerations

CPA exposes several hyperparameters:

  • Number of clusters $C$ or cluster size $M$: governs the memory/speed trade-off; larger clusters improve solution quality but increase per-layer cost.
  • Top-$m$ key refinement ($m$): raising $m$ reduces approximation error at extra cost.
  • Partitioning rounds $R$ and mixing parameter $\alpha$: multiple rounds yield more diverse local neighborhoods and better performance at marginally increased overhead; $\alpha$ interpolates between angular and radial clustering in spatial CPA.
  • Boundary smoothing width $w$: minor jitter in the partitioning score avoids hard allocation boundaries.
  • Implementation: sparse attention routines (e.g., FlashAttention) are used for cluster-based masking, with minimal changes to the standard Transformer projections.

Empirical ablations show that cluster size $M = 50$ with $R = 2$ partitioning rounds balances speed and quality in large-scale routing; multiple partitioning rounds and boundary smoothing halve optimality gaps at minimal cost (Basharzad et al., 27 Jan 2026).

7. Applications, Limitations, and Future Directions

CPA facilitates scalable Transformers in domains where full attention is prohibitive, including long-sequence ASR and NLU models and large combinatorial routing instances.

Limitations include slight quality drops at extreme compression ratios (very small $C$ or $M$) and task-specific clustering requirements. For best performance, hyperparameters may require tuning per domain and instance size. Extensions to non-Euclidean metrics or dynamic graphs remain open for further research.

CPA achieves substantial reductions in compute and memory overhead with well-bounded approximation error, and functions as a drop-in replacement for full attention in large-scale Transformer models, unlocking applications previously infeasible due to resource constraints (Vyas et al., 2020, Basharzad et al., 27 Jan 2026).
