
Geometry Self-Attention

Updated 22 January 2026
  • Geometry Self-Attention is a family of mechanisms that incorporate geometric cues—such as spatial offsets, scale, and symmetry—into attention to better capture spatial relationships.
  • It augments standard dot-product attention by integrating learned geometry embeddings, relative positional biases, and equivariant transformations to align with intrinsic data structures.
  • GSA methods have demonstrated improved sample efficiency and generalization in tasks like image captioning, semantic segmentation, and operator learning on irregular domains.

Geometry Self-Attention (GSA) refers to the family of attention mechanisms that explicitly encode, exploit, or are constrained by the geometric structure of inputs, enabling neural networks—especially Transformers and related architectures—to reason over spatial, geometric, or symmetry-dependent relationships. GSA generalizes standard self-attention by either augmenting attention weights with geometric priors (based on position, depth, geodesic distance, or symmetry group), biasing the dot-product score, fusing geometric cues at multiple scales, or fully reformulating attention in non-Euclidean (e.g., Riemannian) spaces. Such mechanisms have been deployed in image captioning, vision, point-cloud processing, semantic segmentation, robotics, operator learning on irregular domains, and manifold-valued feature spaces, leading to improvements in sample efficiency, generalization, and prediction consistency.

1. Geometric Augmentation of Dot-Product Attention

A foundational approach to GSA augments scaled dot-product attention with learned functions of the relative geometric information between tokens. In image captioning, each visual region is associated with a bounding box (center $(x_i, y_i)$, width $w_i$, height $h_i$). GSA constructs a relative geometry embedding $f^g_{ij} = [\log(|x_i - x_j|/w_i),\ \log(|y_i - y_j|/h_i),\ \log(w_i/w_j),\ \log(h_i/h_j)]^T$ capturing spatial offset and relative scale. This embedding is projected through an MLP to $G_{ij}$, and a learned geometric bias $\phi_{ij}$ (in content-independent, query-dependent, or key-dependent variants) is added to the dot-product score, yielding

$$E_{ij} = Q_i \cdot K_j + \phi(G_{ij},\, Q'_i,\, K'_j)$$

where $Q'_i$ and $K'_j$ are geometric projections of the query and key. Attention weights are then computed as usual via softmax and used to aggregate values. The query-dependent $\phi$ yields the strongest task performance, as it lets the network select the spatial relations relevant to each query feature. Integration into Transformers occurs at the encoder self-attention blocks, requiring only minor additional computation and parameter overhead (e.g., $O(N^2 d_g)$ extra per head, with $d_g \ll d$) (Guo et al., 2020).
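
The biased attention above can be sketched in NumPy. The projection matrices `Wg` and `wq` below are illustrative stand-ins for the learned MLP and geometric query projection, not the paper's exact parameterization:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relative_geometry(boxes):
    """boxes: (N, 4) array of (x, y, w, h). Returns (N, N, 4) features f^g_{ij}."""
    x, y, w, h = boxes.T
    eps = 1e-6  # avoid log(0) for coincident centers
    return np.stack([
        np.log(np.abs(x[:, None] - x[None, :]) / w[:, None] + eps),
        np.log(np.abs(y[:, None] - y[None, :]) / h[:, None] + eps),
        np.log(w[:, None] / w[None, :]),
        np.log(h[:, None] / h[None, :]),
    ], axis=-1)

def geometry_attention(Q, K, V, boxes, Wg, wq):
    """Dot-product attention with a query-dependent geometric bias phi_ij.
    Wg: (4, d_g) stand-in for the MLP producing G_ij; wq: (d, d_g) query proj."""
    d = Q.shape[-1]
    G = relative_geometry(boxes) @ Wg        # (N, N, d_g) geometry embeddings
    Qp = Q @ wq                              # (N, d_g) geometric query projection
    phi = np.einsum('id,ijd->ij', Qp, G)     # query-dependent bias
    E = Q @ K.T / np.sqrt(d) + phi           # biased attention logits
    return softmax(E, axis=-1) @ V
```

The bias enters before the softmax, so geometric plausibility and content similarity compete in the same score.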

2. Group Symmetric and Equivariant Attention

A prominent strand of GSA explicitly enforces equivariance to spatial symmetries by encoding the action of finite transformation groups. For a group $G$ (e.g., translations, rotations, reflections), tokens are associated with positions $(\mathbf{x}, h)$ in the product space $\mathbb{R}^2 \times G$. Attention weights are modulated by positional encodings $\rho((\mathbf{x}_i, h_i), (\mathbf{x}_j, h_j))$ that are invariant under $G$: for all $g \in G$,

$$\rho(g \cdot (\mathbf{x}_i, h_i),\, g \cdot (\mathbf{x}_j, h_j)) = \rho((\mathbf{x}_i, h_i),\, (\mathbf{x}_j, h_j))$$

Because the positional encoding is invariant, the resulting self-attention module commutes with the group action and is therefore equivariant by construction. The corresponding architectures (GSA-Nets) use a standard Transformer structure (lifting block, GSA block, feed-forward, pooling) and demonstrate consistent sample-efficiency improvements across tasks by aligning the parameterization with the problem's symmetry group (Romero et al., 2020).
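
A minimal sketch of such an invariant relative encoding, using the cyclic rotation group $C_4$ acting on $\mathbb{R}^2 \times C_4$ as a simplified stand-in for the general finite groups in GSA-Nets: expressing the offset in token $i$'s frame and taking the relative group element makes $\rho$ invariant under any global roto-translation.

```python
import numpy as np

def rot(k):
    """Rotation matrix for element k of the cyclic group C4 (k * 90 degrees)."""
    t = k * np.pi / 2
    return np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])

def rho(xi, hi, xj, hj):
    """Invariant relative encoding on R^2 x C4: the spatial offset expressed
    in token i's frame, plus the relative group element."""
    rel_x = rot(hi).T @ (xj - xi)   # offset in i's local frame
    rel_h = (hj - hi) % 4           # relative rotation element
    return np.concatenate([rel_x, [rel_h]])

# Invariance check: apply a random group element g = (t, k) to both tokens.
rng = np.random.default_rng(1)
xi, xj = rng.standard_normal(2), rng.standard_normal(2)
hi, hj = 1, 3
t, k = rng.standard_normal(2), 2
g_xi, g_hi = rot(k) @ xi + t, (hi + k) % 4
g_xj, g_hj = rot(k) @ xj + t, (hj + k) % 4
assert np.allclose(rho(g_xi, g_hi, g_xj, g_hj), rho(xi, hi, xj, hj))
```

Because $\rho$ only sees relative quantities in the local frame, the modulated attention weights are unchanged when the whole input is transformed, which is exactly the invariance condition above.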

3. Geometry Priors in Attention Weights: Depth, Spatial, and Geodesic Cues

In RGB-D semantic segmentation, GSA fuses explicit depth and spatial relationships directly into the attention logits. For patches at grid positions $(i, j)$ and $(i', j')$, a depth prior $D_{ij} = |z_{ij} - z_{i'j'}|$ and a spatial prior $S_{ij} = |i - i'| + |j - j'|$ are combined (via a small learned fusion) to produce a geometry prior $G_{ij}$. This prior either adds a negative bias $-\alpha G_{ij}$ to the logits or applies a multiplicative decay $\beta^{G_{ij}}$ post-softmax, concentrating attention on geometrically proximate regions. Computational cost is reduced in high-resolution stages via separable vertical/horizontal modulation. Empirical results on NYU DepthV2 and SUN RGB-D confirm state-of-the-art performance and strong ablation sensitivity to the choice and fusion of geometric priors (Yin et al., 7 Apr 2025).
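
A minimal sketch of the additive-bias variant, with a fixed weighted sum standing in for the paper's learned fusion (`alpha` and `lam` are illustrative hyperparameters, not trained values):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def geometry_prior_attention(Q, K, V, depth, coords, alpha=1.0, lam=0.5):
    """Attention with a fused depth + spatial geometry prior as a negative
    logit bias. depth: (N,) patch depths; coords: (N, 2) patch grid positions."""
    d = Q.shape[-1]
    D = np.abs(depth[:, None] - depth[None, :])                  # depth prior
    S = np.abs(coords[:, None, :] - coords[None, :, :]).sum(-1)  # L1 spatial prior
    G = lam * D + (1 - lam) * S                                  # fixed-weight fusion
    E = Q @ K.T / np.sqrt(d) - alpha * G                         # negative bias
    return softmax(E, axis=-1) @ V
```

With content scores held equal, the bias makes each patch attend most strongly to its geometrically nearest neighbors; the multiplicative post-softmax decay from the text would instead rescale the normalized weights by $\beta^{G}$ and renormalize.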

For point clouds, GSA introduces sphere-mapping of local patches, converting 3D points to spherical coordinates relative to a patch center, with features extracted via shared MLPs and aggregated by max-pooling for orientation invariance. Non-Euclidean relationships are further captured by computing shortest-path geodesic distances $d_G$ between points on intra-patch graphs and injecting them as additive or multiplicative bias terms in the attention weights, thereby modulating information flow according to intrinsic surface geometry, which is critical for disordered, manifold-valued data (Wei et al., 2023).
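
The geodesic bias can be sketched as follows, with a dense Floyd-Warshall pass standing in for whatever shortest-path routine is used in practice; `gamma` is an illustrative scale for the additive variant:

```python
import numpy as np

def geodesic_bias_attention(Q, K, V, adj, gamma=0.5):
    """Attention biased by shortest-path distances on an intra-patch graph.
    adj: (N, N) edge lengths, np.inf where no edge exists."""
    N, d = Q.shape
    # Floyd-Warshall all-pairs shortest paths (approximates geodesic distance)
    dist = adj.copy()
    np.fill_diagonal(dist, 0.0)
    for k in range(N):
        dist = np.minimum(dist, dist[:, k:k+1] + dist[k:k+1, :])
    # Additive bias: unreachable pairs get -inf logits, i.e. zero attention
    E = Q @ K.T / np.sqrt(d) - gamma * dist
    e = np.exp(E - E.max(axis=-1, keepdims=True))
    A = e / e.sum(axis=-1, keepdims=True)
    return A @ V
```

Points that are close in 3D but far along the surface (e.g., opposite sides of a thin structure) receive a large $d_G$ and hence attenuated attention, which is the intended effect.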

4. Integration with Irregular and Large-Scale Geometries

Operator learning and physical PDE surrogate modeling on complex, irregular domains require GSA variants that both supply multi-scale geometric context and scale computationally to large $N$. In the GALE attention module of GeoTransolver, slices of latent physical-state tokens are augmented with geometry- and regime-conditioned embeddings, constructed via multi-scale ball queries that extract features from meshed geometries and boundary conditions. Attention in each block is a convex combination of slice-wise self-attention (physics-local) and cross-attention to the shared geometry/global context (physics-anchored), with the mixing parameterized by a data-dependent gating network. Explicit geometry pooling at multiple radii enables both near-boundary precision and non-local coupling. Ablations demonstrate gains in prediction accuracy, boundary fidelity, and robustness to out-of-distribution shifts in mesh and regime (Adams et al., 23 Dec 2025).
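
The gated convex combination can be sketched as follows; the weight names and the per-token sigmoid gate are illustrative assumptions, not GeoTransolver's exact parameterization:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def gated_geometry_attention(X, G, Wq, Wk, Wv, wgate):
    """Convex combination of slice-wise self-attention (physics-local) and
    cross-attention to geometry/context tokens G (physics-anchored).
    X: (N, d) state tokens; G: (M, d) geometry tokens; wgate: (d, 1)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    Kg, Vg = G @ Wk, G @ Wv
    self_out = attend(Q, K, V)                  # attention within the slice
    cross_out = attend(Q, Kg, Vg)               # attention to geometry context
    gate = 1.0 / (1.0 + np.exp(-(X @ wgate)))   # data-dependent gate in (0, 1)
    return gate * self_out + (1 - gate) * cross_out
```

Because the gate lies in $(0, 1)$, the output is a genuine convex combination per token, letting the network shift between local physics interactions and geometry anchoring depending on the input.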

For large-scale point sets, Ball Sparse Attention (BSA) constructs neighborhoods via Ball Trees (partitioned hyperspheres with a maximum leaf size $m$) and restricts self-attention to points in the same leaf, combining this with global branches (the compressed and selection branches from NSA) to obtain a global receptive field at sub-quadratic cost. Experimentally, BSA approaches full-attention accuracy while scaling up to $N = 65{,}536$ with runtimes $5\times$ faster than quadratic attention (Brita et al., 14 Jun 2025).
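
A sketch of the leaf-restricted branch only (BSA's global compressed/selection branches are omitted); the recursive median split below is a simple stand-in for a true Ball Tree:

```python
import numpy as np

def ball_partition(idx, pts, m):
    """Recursively split point indices along the max-spread axis (a simple
    ball-tree stand-in) until every leaf holds at most m points."""
    if len(idx) <= m:
        return [idx]
    axis = np.argmax(pts[idx].max(0) - pts[idx].min(0))
    order = idx[np.argsort(pts[idx, axis])]
    half = len(order) // 2
    return ball_partition(order[:half], pts, m) + ball_partition(order[half:], pts, m)

def ball_sparse_attention(Q, K, V, pts, m=16):
    """Restrict self-attention to points in the same leaf: roughly O(N * m)
    score computations instead of O(N^2)."""
    N, d = Q.shape
    out = np.zeros_like(V)
    for leaf in ball_partition(np.arange(N), pts, m):
        E = Q[leaf] @ K[leaf].T / np.sqrt(d)
        e = np.exp(E - E.max(-1, keepdims=True))
        out[leaf] = (e / e.sum(-1, keepdims=True)) @ V[leaf]
    return out
```

Each leaf's attention is dense but tiny, which is what makes the local branch cheap; the global branches then restore long-range interactions that the partition cuts off.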

5. GSA on Non-Euclidean and Manifold-Valued Features

In certain domains, such as SPD (symmetric positive-definite) matrices on Riemannian manifolds, GSA reformulates self-attention using manifold operations. Instead of vector dot-products, attention scoring is based on log-Euclidean distances between SPD matrices, and aggregation uses weighted Fréchet means computed via logarithmic and exponential maps. This approach preserves the intrinsic geometry of the feature representations, maintaining discriminative, non-diagonal structure even in deep networks and resisting the degradation that afflicts naive stacking. Empirical evaluations show that manifold-aware attention consistently improves performance on facial-emotion, hand-action, and skeleton-based action recognition datasets (Wang et al., 2023).
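
A sketch of manifold-aware attention over SPD tokens, using a closed-form log-Euclidean weighted mean as a stand-in for the iterative Fréchet mean described above:

```python
import numpy as np

def sym_logm(A):
    """Matrix logarithm of an SPD matrix via eigendecomposition."""
    w, U = np.linalg.eigh(A)
    return (U * np.log(w)) @ U.T

def sym_expm(A):
    """Matrix exponential of a symmetric matrix via eigendecomposition."""
    w, U = np.linalg.eigh(A)
    return (U * np.exp(w)) @ U.T

def spd_self_attention(mats, tau=1.0):
    """Self-attention over SPD tokens: scores from log-Euclidean distances,
    aggregation as a weighted mean in the log (tangent-space) domain.
    mats: (N, d, d) stack of SPD matrices; tau is a temperature."""
    logs = np.stack([sym_logm(M) for M in mats])           # map to tangent space
    dist = np.linalg.norm(logs[:, None] - logs[None, :], axis=(2, 3))
    E = -dist / tau                                        # closer => higher score
    e = np.exp(E - E.max(-1, keepdims=True))
    A = e / e.sum(-1, keepdims=True)                       # attention weights
    agg = np.einsum('ij,jab->iab', A, logs)                # weighted log-domain mean
    return np.stack([sym_expm(M) for M in agg])            # map back to the manifold
```

Because the weighted mean is taken in the log domain and mapped back with the exponential map, the outputs remain SPD by construction, so stacked layers never leave the manifold.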

6. Theoretical Foundations: Function Space Geometry and Identifiability

The geometry of the function space realized by self-attention modules has been elucidated via algebraic geometry, interpreting the set of input-output maps as a polynomial manifold parameterized by attention weight matrices. For unnormalized self-attention, the function class has nontrivial fiber dimension (continuous parameter symmetries), but introducing softmax normalization collapses fibers generically to singletons, thus ensuring identifiability. The dimension of the “neuromanifold” is determined by parameter counts, layer dimension, and symmetry reduction. Stratification into regular, singular, and boundary loci clarifies expressivity, sample complexity, and optimization behavior, providing a rigorous geometric foundation for GSA and its parameterizations (Henry et al., 2024).

7. Empirical Impact and Task-Specific Variants

Geometry-aware self-attention variants have yielded consistent improvements across modalities. In image captioning, GSA (with query-dependent geometric bias) adds 2.8 CIDEr (from 128.6 to 131.4) over the baseline SAN on MS-COCO; combining normalization and geometric bias (NG-SAN) achieves $+3.5$ CIDEr. Ablation across variant choices (content-independent, key-dependent, absolute encoding) confirms the benefit comes both from the architectural parameterization and from the flexible, explicit geometric interaction (Guo et al., 2020).

In group-equivariant self-attention, encoding rotation/reflection symmetry achieves notable classification gains on rotMNIST, CIFAR-10, and PatchCamelyon, up to 98.0% accuracy on rotMNIST with $R_{12}$ (order-12 rotation) symmetry. In monocular depth estimation, geometry-guided spatial-temporal attention reduces the temporal consistency metric (TCM) error by 60% relative to the MonoDepth2 baseline while also improving per-frame accuracy. Geometry-aware attention in multi-scale 3D and physical operator learning consistently delivers robust prediction improvements, more stable deep architectures, and scalable computation for large and irregular input domains (Romero et al., 2020, Ruhkamp et al., 2021, Wei et al., 2023, Yin et al., 7 Apr 2025, Adams et al., 23 Dec 2025, Brita et al., 14 Jun 2025).
