On-the-Fly Equivariant Attention

Updated 30 January 2026
  • The paper introduces a neural mechanism that computes attention weights per sample, ensuring equivariance to symmetry groups like SO(3) and SE(3).
  • It dynamically constructs filters, projection operators, and geometric encodings on the fly, enabling robust performance on 3D meshes, point clouds, and graphs.
  • Empirical tests show high accuracy and rapid convergence in tasks such as mesh segmentation and shape classification, outperforming non-equivariant approaches.

On-the-fly equivariant attention is a class of neural attention mechanisms that perform dynamic, sample-by-sample computation of attention weights while guaranteeing equivariance to specific symmetry groups (such as SO(3), SE(3), translation, permutation, and gauge symmetries). Rather than relying on hardcoded or precomputed features/kernels, these mechanisms construct all necessary filters, projection operators, and weighting functions "on the fly" at each layer—typically leveraging dynamic geometric information (node positions, local frames, angular information, or spherical harmonics). The result is a flexible, expressive, and symmetry-respecting architecture applicable to a wide range of domains, including 3D meshes, point clouds, manifolds, graphs, and images.

1. Fundamental Principles of Equivariant Attention

The essence of equivariant attention lies in enforcing symmetry constraints within the computation of attention maps and feature updates. For a transformation group G acting on the domain (e.g., rotations, translations, permutations), equivariance requires that the output of the attention layer transforms in a manner consistent with the input transformation:

F(T_g(x)) = T_g(F(x)),

where T_g is the action of g ∈ G and F represents the layer. In attention architectures, this is enforced by:

  • Using features or kernels that transform according to irreducible representations (irreps) of GG.
  • Constructing attention logits and value aggregations via group-invariant operations, such as inner products (scalar invariants), relative positions, or projections onto appropriate bases.
  • Dynamically assembling projection kernels or message-passing rules for each sample and edge, using geometric or group-theoretic information available "on the fly".
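These constraints can be checked numerically. The following is a minimal toy sketch (not the architecture of any cited paper): attention logits are built from pairwise distances, which are invariant under rotation and translation, and values are relative displacements, which rotate with the input. The layer is then equivariant to rotations and invariant to translations by construction:

```python
import numpy as np

def invariant_attention(x):
    # x: (N, 3) node positions, used both as geometry and as values.
    # Logits from pairwise squared distances: rotation/translation-invariant.
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2)
    w = w / w.sum(axis=1, keepdims=True)
    # Aggregate relative displacements: translation-invariant, rotation-equivariant.
    rel = x[None, :, :] - x[:, None, :]           # rel[i, j] = x_j - x_i
    return (w[:, :, None] * rel).sum(axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))      # random orthogonal matrix
if np.linalg.det(R) < 0:
    R[:, 0] *= -1                                  # force a proper rotation
t = rng.normal(size=3)
# Equivariance: F(x R^T + t) == F(x) R^T (output rotates with the input).
lhs = invariant_attention(x @ R.T + t)
rhs = invariant_attention(x) @ R.T
assert np.allclose(lhs, rhs, atol=1e-10)
```

The same pattern—invariant logits, covariant values—underlies the more elaborate steerable and gauge-equivariant constructions discussed below.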

A canonical modern implementation is found in Equivariant Mesh Attention Networks (EMAN), which achieves equivariance to SO(3), translation, scaling, node permutation, and local gauge transformations by integrating relative tangential features and a gauge-equivariant Q/K/V attention mechanism (Basu et al., 2022).

2. Construction of On-the-Fly Equivariant Attention Layers

A general equivariant attention layer consists of the following stages, each constructed to commute with the relevant group actions:

Feature Preparation: Use features that are either invariant or transform under a known representation. In EMAN, relative tangential features (RelTan) serve as the input, constructed as area-weighted projections onto tangent planes, ensuring rotation equivariance and translation/scale invariance.

Q/K/V Projections: Define parameterized, equivariant linear (or steerable) maps for the query, key, and value projections. These maps respect group actions via strict kernel constraints. For SO(2)-gauge equivariance in EMAN:

K_query = ρ_att(−g) K_query ρ_in(g),  ∀ g ∈ SO(2),

with similar constraints for K_key(θ) and K_value(θ) (Basu et al., 2022).
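A kernel constraint of this kind can be verified numerically in a small illustrative special case (an assumption, not EMAN's full parameterization): when both representations are the standard 2D rotation representation, kernels of the form aI + bJ (with J a 90° rotation) commute with every rotation and hence satisfy the constraint for all angles:

```python
import numpy as np

def rot(theta):
    # Standard SO(2) rotation representation.
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# Illustrative assumption: rho_att = rho_in = rot. Then any K = a*I + b*J
# (a rotation-scaling, i.e. multiplication by the complex number a + b*i)
# commutes with all rotations and satisfies K = rot(-theta) @ K @ rot(theta).
a, b = 1.3, -0.7
K = a * np.eye(2) + b * rot(np.pi / 2)
for theta in np.linspace(0.0, 2 * np.pi, 7):
    assert np.allclose(rot(-theta) @ K @ rot(theta), K, atol=1e-12)
```

The general solution spaces for such constraints (across arbitrary input/output representations) are what steerable-kernel bases parameterize.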

Relative Geometric Encoding: For mesh or manifold data, compute all geometric quantities (edge angles θ_{pq}, gauge parallel transport g_{q→p}, etc.) directly from the underlying geometry at each forward pass.

Attention Score Computation: Form inner products or other invariant operations between queries and keys (in the appropriate representation space), apply softmax within the neighborhood, and aggregate values. All aggregation, normalization, and reduction operations are constructed to be group-invariant.

Feature Update: The attention output is built as a weighted sum over neighbors, potentially with a normalization factor to recover convolutional operations in the special case of uniform weights.
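The Q/K/V, score, and aggregation stages can be sketched on invariant scalar features, where any linear projection preserves invariance. All names and shapes below are illustrative, not EMAN's actual parameterization:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 6, 4
# Invariant scalar inputs (e.g., norms of relative tangential features);
# linear maps on invariant scalars are themselves invariant.
f = rng.normal(size=(N, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
q, k, v = f @ Wq, f @ Wk, f @ Wv

logits = q @ k.T / np.sqrt(d)                     # invariant inner products
w = np.exp(logits - logits.max(axis=1, keepdims=True))
w /= w.sum(axis=1, keepdims=True)                 # softmax over the neighborhood
out = w @ v                                       # weighted sum over neighbors

assert out.shape == (N, d)
assert np.allclose(w.sum(axis=1), 1.0)
```

With uniform weights, the final weighted sum reduces to a normalized neighborhood average, recovering the convolutional special case mentioned above.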

The table below outlines core steps for one EMAN layer:

| Step | Operation | Symmetry Guaranteed |
| --- | --- | --- |
| RelTan feature | Projection and weighted sum of normalized tangential directions | SO(3)-equivariant; translation/scale-invariant |
| Q/K/V projection | Linear, group-constrained maps in local tangent/gauge frames | Gauge, permutation, and rotation equivariance |
| Score computation | Inner product in auxiliary representation, then softmax | Group-invariant normalization/aggregation |
| Value aggregation | Weighted sum and normalization in output representation | Output transforms equivariantly |
| Output transformation | Output vector in SO(2) or SO(3) representation | Consistent equivariance/invariance |

3. Proofs and Guarantees of Equivariance

Rigorous equivariance is established by checking the transformation properties of each component under group action. In the EMAN architecture (Basu et al., 2022):

  • Gauge Equivariance: Local gauge changes induce conjugation on all features and projections, but due to the kernel constraints, queries, keys, and values transform consistently, and the attention scores (dot products) remain invariant.
  • Global Rotations: All geometric quantities (angles, transported features) are either unchanged or rotate accordingly; the entire layer output is rotated by RR if the input is.
  • Translations & Scalings: As all features and local bases are relative and normalized, the computations are insensitive to global translation or uniform scaling.
  • Permutations: All operations are local sums over neighbors; the output at each node is independent of the ordering of its neighborhood, ensuring permutation equivariance.

Generalization to higher-order symmetries or more complex manifolds follows the same group-theoretic structure, as seen in steerable attention on SE(3) (Chatzipantazis et al., 2022), group convolutional attention (Romero et al., 2020), and Clebsch–Gordan kernel-based models (Howell et al., 28 Sep 2025).
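The permutation property, for instance, can be spot-checked numerically on a self-contained toy layer with distance-based logits (illustrative, not the EMAN layer itself): relabeling the input nodes relabels the output rows identically.

```python
import numpy as np

def attn(x):
    # Toy attention: distance-based logits, positions as values.
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2)
    w /= w.sum(axis=1, keepdims=True)
    return w @ x

rng = np.random.default_rng(3)
x = rng.normal(size=(6, 3))
perm = rng.permutation(6)
# Permutation equivariance: F(P x) == P F(x).
assert np.allclose(attn(x[perm]), attn(x)[perm])
```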

4. Computational Implementation and Efficiency

On-the-fly equivariant attention is implemented by dynamically computing all required features, projections, and kernels in each forward pass—there are no external precomputed tables or dependence on a fixed mesh or grid. Practical aspects include:

  • Batching and Matrix Multiplications: All per-node/edge computations are batched, typically realized as efficient GEMMs and small MLPs, fully compatible with modern accelerator architectures.
  • Local Neighborhoods: Computation is performed over precomputed local neighborhoods (e.g., via k-nearest neighbors), limiting asymptotic cost to O(Nk) for N nodes.
  • Layer Pseudocode: In EMAN, for each node, Q/K/Vs are computed, neighbor projections are stacked, attention scores are softmaxed, and the output is aggregated using batched matmuls.
  • Runtime/Memory Efficiency: EMAN runs at approximately twice the per-epoch runtime of GEM-CNN, but attains higher accuracy within a fixed budget, as convergence rate is improved.
  • No Data Augmentation Required: Full symmetry is hard-coded, rendering test-time augmentation unnecessary for robustness (Basu et al., 2022).
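A minimal sketch of such a layer, recomputing k-NN neighborhoods, geometric encodings, and attention weights within the forward pass (function name and the particular invariant used are illustrative assumptions):

```python
import numpy as np

def knn_attention_layer(x, feats, k=3):
    # Neighborhoods, geometric encodings, and attention weights are all
    # recomputed in this forward pass; nothing is tabulated offline.
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    nbrs = np.argsort(d2, axis=1)[:, 1:k + 1]        # k nearest neighbors, O(Nk) edges
    rel = x[nbrs] - x[:, None, :]                    # on-the-fly geometric encoding
    inv = np.linalg.norm(rel, axis=-1)               # invariant per-edge scalar
    w = np.exp(-inv)                                 # softmax of -inv over neighbors
    w /= w.sum(axis=1, keepdims=True)
    return (w[..., None] * feats[nbrs]).sum(axis=1)  # batched value aggregation

rng = np.random.default_rng(2)
x = rng.normal(size=(8, 3))
feats = rng.normal(size=(8, 5))
out = knn_attention_layer(x, feats)
assert out.shape == (8, 5)
```

In practice the dense distance matrix is replaced by spatial indexing, and the per-edge computations are fused into batched GEMMs as described above.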

5. Empirical Impact and Benchmark Performance

On-the-fly equivariant attention yields empirical improvements across diverse geometric learning tasks:

  • Mesh Segmentation & Shape Classification: On FAUST (segmentation) and TOSCA (shape classification), EMAN with RelTan features achieves 98.66% accuracy under all global symmetries on FAUST and 98.8% on TOSCA, compared to significant accuracy drops for models lacking full equivariance. Pure gauge-equivariant or vanilla attention models drop to 75–86% under similar test-time transformations.
  • Sample Efficiency and Generalization: Models achieve maximal accuracy without any data augmentation, indicating strong inductive bias and robustness.
  • Ablation Insights: Relative tangential features are necessary for full invariance to rotation/translation/scale. The equivariant attention layer further enhances representational power beyond convolution-only networks.
  • Trade-offs: Training and inference are modestly slower per epoch, but the compute required to reach a given accuracy is substantially lower due to rapid convergence (Basu et al., 2022).

6. Variants and Extensions

While EMAN is a representative prototype, on-the-fly equivariant attention is broadly instantiated in diverse frameworks:

  • Steerable Attention on SE(3): Use of spherical harmonics, Clebsch–Gordan tensors, and steerable kernels for point cloud and volumetric data (Chatzipantazis et al., 2022, Fuchs et al., 2020).
  • Group-Attentive Convolutions: Merging group convolutions with data-dependent soft masks, supporting attention for groups G = ℝ^d ⋊ H, both discrete and continuous (Romero et al., 2020).
  • Efficient Hardware-aware Implementations: Techniques such as axis-aligned sparsification and streaming-kernel reductions enable large-scale, memory-efficient training (e.g., with 20× TFLOPS improvement in E2Former-V2) (Huang et al., 23 Jan 2026).
  • Generalization to Arbitrary Manifolds: By adjusting the construction of local frames and relative features, the methodology extends to non-Euclidean and manifold-structured data.

7. Outlook and Ongoing Research

On-the-fly equivariant attention mechanisms serve as a unifying architecture in geometric deep learning, combining theoretical guarantees with robust empirical performance. The approach is compatible with modern scalable attention paradigms (e.g., FlashAttention-style streaming), supports full symmetry inductive biases without extensive augmentation, and demonstrates competitiveness on state-of-the-art physical and geometric benchmarks.

Ongoing work investigates scalability to higher-order and global attention, mitigation of computational bottlenecks via group-theoretic sparsity (e.g., Clebsch–Gordan and Wigner 6-j techniques), and extensions to increasingly broad symmetry groups and complex domains (Howell et al., 28 Sep 2025, Huang et al., 23 Jan 2026). The convergence of theoretical guarantees and hardware-aware design has made on-the-fly equivariant attention a viable backbone for symmetry-respecting learning in scientific and technical applications.

