Clustered Proximity Attention in Transformers

Updated 3 February 2026
  • Clustered Proximity Attention (CPA) is a fast self-attention mechanism that leverages query/key clustering to achieve linear time and memory complexity in sequence and spatial tasks.
  • CPA employs techniques like LSH, K-means clustering, and polar partitioning to restrict attention computations to a small set of candidate keys while ensuring bounded approximation error.
  • Empirical results in ASR, NLU, and routing demonstrate that CPA balances speed and accuracy, offering significant memory savings and tunable trade-offs based on application needs.

Clustered Proximity Attention (CPA) is a class of fast, sparsity-inducing self-attention mechanisms for Transformers that reduce the quadratic time and memory complexity of standard softmax-attention to linear in the sequence or node count. CPA algorithms achieve this by leveraging query/key grouping—through clustering or locality-aware partitioning—thereby restricting each attention computation to a small set of relevant candidates, while maintaining empirical accuracy and bounded approximation error in sequence modeling and combinatorial optimization contexts (Vyas et al., 2020, Basharzad et al., 27 Jan 2026).

1. Foundations of Clustered Proximity Attention

CPA was first introduced in the context of sequence modeling as a linear-time approximation to standard self-attention. In the traditional formulation, given queries $Q \in \mathbb{R}^{N \times d}$, keys $K \in \mathbb{R}^{N \times d}$, and values $V \in \mathbb{R}^{N \times D_v}$ for a sequence of length $N$, full attention computes the matrix

$$A = \mathrm{softmax}\left(QK^\top / \sqrt{d}\right)$$

with the output $\hat{V} = AV$, requiring $O(N^2 d)$ compute and $O(N^2)$ memory. This scaling is prohibitive for long sequences or large graphs (Vyas et al., 2020).
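
For concreteness, a minimal NumPy sketch of this quadratic baseline; the $(N, N)$ score matrix is the bottleneck that CPA removes:

```python
import numpy as np

def full_attention(Q, K, V):
    """Standard softmax attention: O(N^2 d) compute, O(N^2) memory."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (N, N) scores: the quadratic bottleneck
    scores -= scores.max(axis=-1, keepdims=True)  # shift for numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)            # row-wise softmax
    return A @ V

rng = np.random.default_rng(0)
N, d, Dv = 128, 16, 32
Q, K, V = rng.normal(size=(N, d)), rng.normal(size=(N, d)), rng.normal(size=(N, Dv))
out = full_attention(Q, K, V)
print(out.shape)  # (128, 32)
```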

CPA circumvents this by dividing the queries into $C$ clusters ($C \ll N$), computing attention from cluster centroids to keys, and broadcasting the aggregate result back to cluster members. In spatial decision tasks such as vehicle routing, CPA instead uses geometric locality to form fixed-size spatial clusters and restricts each node's attention set to its cluster plus special tokens (e.g., the depot) (Basharzad et al., 27 Jan 2026).

2. Algorithms and Mathematical Formulations

The core CPA methodology differs slightly by domain and implementation. Two representative algorithms are as follows:

2.1. Sequence Modeling CPA

  • Clustering: Queries are assigned to clusters via Locality-Sensitive Hashing (LSH) to $B$-bit codes, followed by K-means clustering in Hamming space. Each query belongs to exactly one cluster, represented by a partitioning matrix $S \in \{0,1\}^{N \times C}$, with centroids $Q^c_j = (1/n_j)\sum_{i=1}^N S_{ij} Q_i$.
  • Attention calculation: Compute centroid-to-key attention $A^c = \mathrm{softmax}(Q^c K^\top/\sqrt{d})$, then aggregate values $\hat{V}^c = A^c V$.
  • Broadcasting: Each query $i$ inherits the output $\hat{V}_i = \sum_{j=1}^C S_{ij} \hat{V}^c_j$.
  • Complexity: $O(NCd + CND_v)$ for centroid attention and $O(NdB + NCL)$ for clustering, with $L$ the number of K-means iterations (Vyas et al., 2020).
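
The three steps above can be sketched in a few lines of NumPy. This is a simplified stand-in: it clusters with plain Euclidean k-means rather than the paper's LSH-plus-Hamming pipeline, and it omits top-$m$ refinement:

```python
import numpy as np

def clustered_attention(Q, K, V, C=8, iters=5, seed=0):
    """Cluster queries, attend once per centroid, broadcast to members.
    Plain Euclidean k-means stands in for LSH + Hamming-space k-means."""
    rng = np.random.default_rng(seed)
    N, d = Q.shape
    centers = Q[rng.choice(N, size=C, replace=False)]
    for _ in range(iters):                        # Lloyd iterations on the queries
        assign = np.argmin(((Q[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(C):
            if np.any(assign == j):
                centers[j] = Q[assign == j].mean(axis=0)
    scores = centers @ K.T / np.sqrt(d)           # (C, N) instead of (N, N)
    scores -= scores.max(axis=-1, keepdims=True)
    Ac = np.exp(scores)
    Ac /= Ac.sum(axis=-1, keepdims=True)          # centroid-to-key attention A^c
    Vc = Ac @ V                                   # (C, Dv) aggregated values
    return Vc[assign]                             # broadcast to member queries
```

With $C = 1$, every query receives the same output row, which makes the approximation's behavior at extreme compression easy to inspect.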

2.2. Geometric CPA for Vehicle Routing

  • Partitioning: Each node's coordinates are transformed to polar form $(d_i, \theta_i)$ around the depot and assigned a partitioning score $p_i = (1-\alpha)\,\bar{\theta}_i + \alpha\,\bar{d}_i$, where $\bar{\theta}_i$ and $\bar{d}_i$ are the normalized angle and radius and $\alpha$ is a mixing parameter. Customers are sorted by $p_i$ and cut into contiguous clusters of size $M$, giving $K = \lceil n/M \rceil$ clusters.
  • Attention masking: For each attention head, node $i$ attends only to its own cluster and the depot, reducing complexity to $O(nM)$ per layer for constant $M$.
  • Boundary smoothing: Optional jitter added to $p_i$ smooths cluster boundaries between partitioning rounds (Basharzad et al., 27 Jan 2026).
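
A sketch of the partitioning step, assuming 2-D customer coordinates and a single depot (the function and parameter names are illustrative):

```python
import numpy as np

def polar_partition(coords, depot, M=4, alpha=0.5, jitter=0.0, seed=0):
    """Score customers by a mix of normalized angle and radius around the
    depot, bucket-sort, and cut into contiguous clusters of size M."""
    rng = np.random.default_rng(seed)
    rel = coords - depot
    theta = np.arctan2(rel[:, 1], rel[:, 0])       # angle in [-pi, pi]
    r = np.linalg.norm(rel, axis=1)
    theta_bar = (theta + np.pi) / (2 * np.pi)      # normalized angle in [0, 1)
    r_bar = r / (r.max() + 1e-9)                   # normalized radius in [0, 1]
    p = (1 - alpha) * theta_bar + alpha * r_bar    # partitioning score p_i
    if jitter > 0:                                 # optional boundary smoothing
        p = p + rng.uniform(-jitter, jitter, size=p.shape)
    order = np.argsort(p)
    cluster = np.empty(len(coords), dtype=int)
    cluster[order] = np.arange(len(coords)) // M   # contiguous cuts of size M
    return cluster                                 # K = ceil(n / M) clusters
```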
| Variant / Domain | Clustering Mechanism | Attention Scope | Complexity per Layer |
|---|---|---|---|
| Sequence modeling (Vyas et al., 2020) | LSH + K-means in Hamming space | Cluster centroids, refined with top-$m$ keys | $O(N(d + D_v))$ |
| Spatial routing (Basharzad et al., 27 Jan 2026) | Polar partitioning + bucket sort | Per cluster (size $M$) plus depot | $O(nM)$ |

3. Error Analysis and Approximation Guarantees

CPA provides theoretical bounds on the approximation error induced by clustering:

  • If $\|Q_i - Q^c_j\|_2 \leq \epsilon$ for the cluster $j$ assigned to query $i$, then

$$\left\|\mathrm{softmax}(Q_i K^\top/\sqrt{d}) - \mathrm{softmax}(Q^c_j K^\top/\sqrt{d})\right\|_2 = O\!\left(\epsilon \|K\|_2/\sqrt{d}\right)$$

Thus the attention error is small for queries that lie close to their centroid (Vyas et al., 2020).
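
A quick numerical illustration of the bound: perturbing a single query by $\epsilon$ along a fixed unit direction and measuring the change in its softmax row should show the error shrinking roughly linearly with $\epsilon$:

```python
import numpy as np

def softmax_row(q, K):
    """Attention row softmax(q K^T / sqrt(d)) for a single query q."""
    s = q @ K.T / np.sqrt(K.shape[1])
    s -= s.max()
    e = np.exp(s)
    return e / e.sum()

rng = np.random.default_rng(1)
K = rng.normal(size=(64, 16))
q = rng.normal(size=16)
u = rng.normal(size=16)
u /= np.linalg.norm(u)                      # fixed unit perturbation direction
errs = [np.linalg.norm(softmax_row(q, K) - softmax_row(q + eps * u, K))
        for eps in (0.5, 0.05, 0.005)]
print(errs)                                 # error decreases as eps shrinks
```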

An improved variant, top-$m$ key refinement, selects for each cluster the $m$ keys with the highest centroid attention and computes exact per-query dot products for those keys. Letting $A^t$ be the refined attention and $A$ the full attention,

$$\|A^t_{i,\cdot} - A_{i,\cdot}\|_1 \leq \|A^c_{j,\cdot} - A_{i,\cdot}\|_1$$

i.e., refinement never increases the attention error in $L_1$ (Vyas et al., 2020).
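
A sketch of the refinement, assuming cluster assignments are given and every cluster is non-empty. The refined probabilities over the top-$m$ keys are rescaled to match the centroid's mass on those keys (an interpretation of the scheme, not the paper's exact implementation):

```python
import numpy as np

def topm_refined_attention(Q, K, V, assign, C, m=8):
    """Per cluster, pick the m keys with the highest centroid attention,
    recompute exact per-query probabilities on those keys (rescaled to the
    centroid's probability mass there), and keep the centroid row elsewhere.
    Assumes every cluster id 0..C-1 is non-empty."""
    N, d = Q.shape
    centers = np.stack([Q[assign == j].mean(axis=0) for j in range(C)])
    s = centers @ K.T / np.sqrt(d)
    s -= s.max(axis=-1, keepdims=True)
    Ac = np.exp(s)
    Ac /= Ac.sum(axis=-1, keepdims=True)            # centroid attention (C, N)
    out = np.empty((N, V.shape[1]))
    for i in range(N):
        j = assign[i]
        top = np.argsort(Ac[j])[-m:]                # top-m keys for cluster j
        mass = Ac[j, top].sum()                     # centroid mass to redistribute
        se = Q[i] @ K[top].T / np.sqrt(d)
        se -= se.max()
        pe = np.exp(se)
        pe = pe / pe.sum() * mass                   # exact probs, rescaled to mass
        row = Ac[j].copy()
        row[top] = pe                               # splice refined entries in
        out[i] = row @ V
    return out
```

As a sanity check, setting $m = N$ refines every key, so the result coincides with full attention.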

4. Implementation and Pseudocode

The CPA pipeline consolidates into the following steps (for sequence models):

  1. Project $Q$ to $B$-bit codes via LSH.
  2. Cluster in Hamming space to form $C$ clusters.
  3. Compute centroids $Q^c$ and centroid-to-key attention $A^c$.
  4. Broadcast the cluster attention outputs to all member queries.
  5. For top-$m$ key refinement, identify the top-$m$ keys per cluster and recompute exact attention to these keys per query.
  6. Aggregate the final output as a sum of centroid-based and refined per-key values.
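
Step 1 can be realized with sign random projections (SimHash); a sketch under that assumption, since the exact hashing scheme may differ:

```python
import numpy as np

def lsh_codes(Q, B=16, seed=0):
    """Sign random projections: map each query to a B-bit binary code.
    Queries separated by a small angle tend to share many bits."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(Q.shape[1], B))   # B random hyperplanes
    return (Q @ planes > 0).astype(np.uint8)    # (N, B) bit codes

def hamming(a, b):
    """Hamming distance between two bit codes."""
    return int(np.count_nonzero(a != b))
```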

For geometric CPA in routing, node partitioning uses polar-based bucketization, with each attention head assigned to one of several partitioning rounds (each with a different $\alpha$). Within each layer, projections and attention are computed as in standard Transformers but restricted to cluster-local keys and a single global token (the depot). Boundary smoothing randomizes cluster assignments at the edges, improving stability and performance (Basharzad et al., 27 Jan 2026).
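
The cluster-plus-depot restriction amounts to a boolean attention mask. A minimal sketch (token 0 is the depot; the helper name is illustrative):

```python
import numpy as np

def cluster_attention_mask(cluster):
    """cluster[i] is the cluster id of customer i; customers are tokens
    1..n and token 0 is the depot. True = attention allowed."""
    lab = np.concatenate(([-1], cluster))       # depot gets its own label
    mask = lab[:, None] == lab[None, :]         # same-cluster attention
    mask[:, 0] = True                           # every node attends to the depot
    mask[0, :] = True                           # the depot attends globally
    return mask
```

The resulting mask can be passed to any masked-attention kernel; with cluster size $M$, each customer row has $M + 1$ allowed entries (its cluster plus the depot).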

5. Empirical Results and Applications

Automatic Speech Recognition (ASR)

  • On WSJ and Switchboard, improved CPA (i-CPA) attains a 2× speed-up and up to 2% lower PER/WER relative to standard attention under equal FLOP or wall-clock budgets.
  • Convergence is also faster: roughly 50% fewer GPU hours than vanilla Transformers (Vyas et al., 2020).

Natural Language Understanding (Finetuned BERT)

  • Using $C = 25$ clusters (under 10% of the sequence length), i-CPA matches RoBERTa accuracy across GLUE and SQuAD, losing less than 1% F1 (Vyas et al., 2020).

Combinatorial and Vehicle Routing Problems

  • SEAFormer with CPA achieves $O(n)$ memory and computation, enabling training and inference on VRP instances with thousands of nodes (e.g., 5,000–7,000 customers) while matching state-of-the-art divide-and-conquer methods in solution quality. Memory savings of 85–92% are reported for large instances compared to full attention. On VRP-100 tasks, multiple partitioning rounds and boundary smoothing further reduce optimality gaps to 0.56% (Basharzad et al., 27 Jan 2026).

6. Trade-offs, Hyperparameters, and Practical Considerations

CPA exposes several hyperparameters:

  • Number of clusters $C$ or cluster size $M$: governs the memory/speed trade-off; larger clusters improve solution quality but increase per-layer cost.
  • Top-$m$ key refinement ($m$): raising $m$ reduces approximation error at extra cost.
  • Partitioning rounds $R$ and mixing parameter $\alpha$: multiple rounds yield more diverse local neighborhoods and better performance at marginally increased overhead; $\alpha$ interpolates between angular and radial clustering in spatial CPA.
  • Boundary smoothing width $w$: minor jitter in the partitioning score avoids hard allocation boundaries.
  • Implementation: sparse attention routines (e.g., FlashAttention) are used for cluster-based masking, with minimal changes to the standard Transformer projections.

Empirical ablations show that cluster size $M = 50$ with $R = 2$ partitioning rounds balances speed and quality in large-scale routing; multiple partitioning rounds and boundary smoothing halve optimality gaps at minimal cost (Basharzad et al., 27 Jan 2026).

7. Applications, Limitations, and Future Directions

CPA facilitates scalable Transformers in domains where full attention is prohibitive, including long-sequence ASR and NLU models and large combinatorial routing instances.

Limitations include slight quality drops at extreme compression ratios (very small $C$ or $M$) and task-specific clustering requirements. For best performance, hyperparameters may require tuning per domain and instance size. Extensions to non-Euclidean metrics or dynamic graphs remain open for further research.

CPA achieves substantial reductions in compute and memory overhead with well-bounded approximation error, and functions as a drop-in replacement for full attention in large-scale Transformer models, unlocking applications previously infeasible due to resource constraints (Vyas et al., 2020, Basharzad et al., 27 Jan 2026).
