Additive GRAPE: Group-Theoretic Positional Encoding

Updated 9 December 2025
  • Additive GRAPE is a positional encoding framework that leverages unipotent group actions in GL(d+1) to achieve exact relative-position invariance.
  • It generalizes ALiBi and FoX by using homogeneous coordinates and rank-1 nilpotent generators to compute content-adaptive additive biases efficiently.
  • The approach ensures O(d) computational overhead, streaming cacheability, and improved stability for long-context Transformer models.

Additive GRAPE (Group RepresentAtional Position Encoding, additive form), frequently denoted "GRAPE-A," is a positional encoding framework for long-context neural models based on unipotent group actions in the general linear group \mathrm{GL}(d+1). GRAPE-A generalizes and rigorously grounds linear logit-bias mechanisms such as ALiBi and the Forgetting Transformer (FoX), providing exact relative-position invariance, streaming cacheability, and support for rank-R and path-integral generalizations. Its construction leverages homogeneous coordinates, resulting in closed-form group actions that yield content-adaptive, translation-equivariant logit biases with negligible computational and memory overhead relative to standard multi-head attention (Zhang et al., 8 Dec 2025).

1. Group-Theoretic Foundations

GRAPE-A lifts each d-dimensional query/key x to homogeneous coordinates

\widehat x = \begin{bmatrix} x \\ 1 \end{bmatrix} \in \mathbb{R}^{d+1},

allowing affine translations to be represented as linear transformations. The central group-theoretic ingredient is a rank-1 nilpotent generator,

A = \begin{bmatrix} 0_{d \times d} & u \\ 0_{1 \times d} & 0 \end{bmatrix}, \qquad A^2 = 0,

where u \in \mathbb{R}^d is a learned translation vector. The corresponding one-parameter unipotent subgroup is given by the matrix exponential,

G(n) = \exp(n \omega A) = I + n \omega A = \begin{bmatrix} I_d & n\omega u \\ 0 & 1 \end{bmatrix} \in \mathrm{GL}(d+1),

which acts as a translation in the original feature space. This group action satisfies the composition and inversion laws G(n+m) = G(n)\,G(m) and G(-n) = G(n)^{-1}, enforcing exact relativity by design.
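The nilpotency and group laws above are easy to check numerically. The following sketch (illustrative dimensions and parameters, not the authors' code) builds the generator A and the subgroup G(n) with NumPy:

```python
import numpy as np

# Illustrative values only: u and omega would be learned in practice.
d = 4
rng = np.random.default_rng(0)
u = rng.normal(size=d)          # stand-in for the learned translation vector
omega = 0.1                     # scalar rate

A = np.zeros((d + 1, d + 1))
A[:d, d] = u                    # top-right block holds u; lower rows are zero

def G(n):
    """G(n) = exp(n*omega*A); the exponential truncates since A is nilpotent."""
    return np.eye(d + 1) + n * omega * A

assert np.allclose(A @ A, 0)                      # A^2 = 0 (rank-1 nilpotent)
assert np.allclose(G(3) @ G(5), G(8))             # composition: G(n+m) = G(n)G(m)
assert np.allclose(G(-5), np.linalg.inv(G(5)))    # inversion: G(-n) = G(n)^{-1}
```

Because A squares to zero, the matrix exponential is exactly linear in n, which is what makes the group action cheap to apply.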

2. Action on Attention Logits and Additive Bias

For input sequence positions i and j, the queries and keys are embedded as \widehat q_i = [q_i; 1] and \widehat k_j = [k_j; 1]. The transformed queries and keys are given by

\widetilde q_i = G(i)\, \widehat q_i, \qquad \widetilde k_j = \big(G(j)^{-1}\big)^{\top} \widehat k_j.

The resulting attention logit has the closed form

\widetilde q_i^{\top} \widetilde k_j = \widehat q_i^{\top}\, G(j-i)^{-\top}\, \widehat k_j.

Since A^2 = 0, one has

G(m)^{-\top} = \begin{bmatrix} I_d & 0 \\ -m\omega u^{\top} & 1 \end{bmatrix},

so that

\widetilde q_i^{\top} \widetilde k_j = q_i^{\top} k_j + 1 - m\,\omega\,(u^{\top} k_j),

with m = j - i. The softmax-invariant constant +1 may be discarded, yielding the pure additive bias

b(m) = -m\,\omega\,(u^{\top} k_j),

which is content-adaptive. Generalization to R pairwise commuting nilpotent generators A_r (with A_r A_s = 0) yields a sum of R rank-1 bias terms, each of the form above.
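As a sanity check on this derivation, the paired transforms can be verified against the closed-form bias directly; a minimal sketch with arbitrary illustrative values:

```python
import numpy as np

d = 4
rng = np.random.default_rng(1)
u, omega = rng.normal(size=d), 0.05        # illustrative stand-ins
q, k = rng.normal(size=d), rng.normal(size=d)
i, j = 3, 11
m = j - i

A = np.zeros((d + 1, d + 1))
A[:d, d] = u                               # rank-1 nilpotent generator
G = lambda n: np.eye(d + 1) + n * omega * A

q_hat = np.append(q, 1.0)                  # homogeneous lift [q; 1]
k_hat = np.append(k, 1.0)                  # homogeneous lift [k; 1]

# Paired transforms: q -> G(i) q_hat,  k -> (G(j)^{-1})^T k_hat.
logit = (G(i) @ q_hat) @ (np.linalg.inv(G(j)).T @ k_hat)
closed_form = q @ k + 1.0 - m * omega * (u @ k)
assert np.allclose(logit, closed_form)     # matches the derivation exactly
```

The +1 term is constant across keys in a softmax row, which is why it can be dropped without changing the attention weights.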

3. Recovery of ALiBi and Forgetting Transformer (FoX)

GRAPE-A subsumes the static linear bias of ALiBi and the adaptive cumulative gating of FoX as exact special cases:

  • For ALiBi, a (d+2)-dimensional lift is used:

\widehat q_i = [q_i;\, 1;\, 0], \qquad \widehat k_j = [k_j;\, 0;\, 1],

with generator A_h = \beta_h\, e_{d+2}\, e_{d+1}^{\top} for each attention head h, yielding \widetilde q_i^{\top} \widetilde k_j = q_i^{\top} k_j - m \beta_h, i.e., b(m) = -(j-i)\beta_h, reproducing ALiBi's static bias law.

  • For FoX, the additive bias is a sum of per-token log-forget gates f_{t,h}:

D_{ij,h} = \sum_{\ell = j+1}^{i} \log f_{\ell,h} \leq 0,

which is captured in GRAPE-A's path-integral framework by letting the logit bias be b_h(i,j) = D_{ij,h}, i.e., a sum of edge potentials \psi_h(\ell) = \log f_{\ell,h} along the token chain. This is realized by a rank-1 unipotent lift and a path product of transformations.
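Both special cases can be reproduced numerically. The sketch below (illustrative dimensions, slope, and gates; not the authors' code) checks the (d+2)-lift against ALiBi's static bias law and computes FoX's cumulative bias as a prefix sum:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4

# --- ALiBi as a special case: (d+2)-dimensional lift -----------------------
beta = 0.25                                 # illustrative per-head slope
q, k = rng.normal(size=d), rng.normal(size=d)
i, j = 9, 2

A = np.zeros((d + 2, d + 2))
A[d + 1, d] = beta                          # A_h = beta * e_{d+2} e_{d+1}^T
G = lambda n: np.eye(d + 2) + n * A         # unipotent; omega folded into beta

q_hat = np.concatenate([q, [1.0, 0.0]])     # [q; 1; 0]
k_hat = np.concatenate([k, [0.0, 1.0]])     # [k; 0; 1]
logit = (G(i) @ q_hat) @ (np.linalg.inv(G(j)).T @ k_hat)
assert np.allclose(logit, q @ k - (j - i) * beta)   # ALiBi's bias law

# --- FoX as a special case: path sum of log forget gates -------------------
T = 8
f = rng.uniform(0.8, 1.0, size=T)           # illustrative gates in (0, 1]
S = np.concatenate([[0.0], np.cumsum(np.log(f))])
D = lambda i, j: S[i + 1] - S[j + 1]        # D_ij = sum_{l=j+1}^{i} log f_l
assert np.allclose(D(6, 2), np.log(f[3:7]).sum())
assert D(6, 2) <= 0                          # gates <= 1, so the bias is <= 0
```

The prefix-sum form of D shows why FoX-style gating stays streaming-friendly: each new token only extends a running cumulative sum.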

4. Relative Position Law and Streaming Cacheability

GRAPE-A enforces an exact relative-position law:

\widetilde q_i^{\top} \widetilde k_j = \widehat q_i^{\top}\, G(j-i)^{-\top}\, \widehat k_j = F(q_i, k_j;\, j-i),

so that attention logits depend on positions only through j - i. This structure permits efficient streaming: each key k_j is cached as

\widetilde k_j^{\star} = G(j)^{-\top}\, \widehat k_j = (I - j\omega A^{\top})\, \widehat k_j

at prefetch, never requiring update. Queries are transformed on arrival as \widetilde q_i = G(i)\, \widehat q_i (O(d) cost), and logits are computed by a simple dot product \ell_{i,j} = \widetilde q_i^{\top} \widetilde k_j^{\star} (O(d) per head), facilitating streaming and memory-efficient inference.
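The streaming recipe can be sketched as follows, using the closed forms for the cached keys and transformed queries (hypothetical helper names; O(d) work per token):

```python
import numpy as np

d = 4
rng = np.random.default_rng(4)
u, omega = rng.normal(size=d), 0.05        # illustrative stand-ins

def cache_key(k, j):
    """Cache a key once at arrival: G(j)^{-T} [k; 1] = [k; 1 - j*omega*(u.k)]."""
    return np.append(k, 1.0 - j * omega * (u @ k))

def transform_query(q, i):
    """Transform a query at arrival: G(i) [q; 1] = [q + i*omega*u; 1]."""
    return np.append(q + i * omega * u, 1.0)

raw_keys = [rng.normal(size=d) for _ in range(5)]
cached = [cache_key(k, j) for j, k in enumerate(raw_keys)]   # never updated

i = 4
q = rng.normal(size=d)
q_t = transform_query(q, i)
logits = [q_t @ k_star for k_star in cached]                 # O(d) per key

# Each logit matches the closed form q.k + 1 - (j - i)*omega*(u.k) exactly.
for j, k in enumerate(raw_keys):
    assert np.allclose(logits[j], q @ k + 1 - (j - i) * omega * (u @ k))
```

Because cached keys are position-stamped once and never revisited, the cache behaves exactly like a standard key-value cache with one extra scalar per key.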

5. Computational Complexity and Memory Requirements

Per-head computational and memory overheads are summarized as follows:

Mechanism            Per-step compute   Memory cost   Flexibility
ALiBi                O(1)               O(1)          static slope
FoX                  O(1) per gate      cumulative    cumulative gating
GRAPE-A (rank-1)     O(d)               O(d)          content, multi-bias
GRAPE-A (R terms)    O(Rd)              O(Rd)         multi-component

Key and query transformations each incur O(d) flops, identical to ALiBi and FoX. Multiple nilpotent generators scale linearly with rank R. Only one learned u \in \mathbb{R}^d per head and one (d+1)-vector cache per key are stored, matching standard key-value caches. GRAPE-A thus generalizes prior mechanisms with full streaming support and O(d) overhead.

6. Stability, Equivariance, and Extensions

Each G(n) is unipotent with all eigenvalues equal to 1, so the spectral radius is 1, and the paired inverse-transpose scoring exactly cancels anisotropy. This guarantees that feature norms and volumes undergo no hidden contraction or expansion (the transformation contributes an additive bias only), in contrast to contextual affine schemes, which can introduce numerical drift. The translation-equivariant and extrapolative nature of GRAPE-A ensures exact handling of arbitrary relative positions j - i.

The path-integral extension (GRAPE-AP) allows endpoint-dependent edge potentials \psi_h(t, \ell) and causal additive biases b_h(t, j) = \sum_{\ell = j+1}^{t} \psi_h(t, \ell), generalizing FoX while retaining the strict relative law. Empirical results on FineWeb-Edu demonstrate GRAPE-A's stability in training, avoidance of the instabilities observed with pure RoPE, and consistent improvements in validation loss, particularly on long-context tasks.

Implementation is direct: \widetilde k_j^{\star} = k_j - j\omega u and \widetilde q_i = q_i + i\omega u, followed by a dot product and bias addition, requiring no complex matrix operations, special kernels, or departures from conventional Transformer design.
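A full GRAPE-A attention head then reduces to a few vectorized lines. The sketch below uses the (d+1)-dimensional homogeneous form from Section 2, so the additive bias emerges from a plain dot product (shapes and parameter values are illustrative, not the authors' configuration):

```python
import numpy as np

d, T = 8, 6
rng = np.random.default_rng(5)
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
u, omega = rng.normal(size=d), 0.02        # illustrative learned parameters
pos = np.arange(T)

# O(d) per token: lift to homogeneous coordinates and apply the shifts.
Qt = np.hstack([Q + pos[:, None] * omega * u, np.ones((T, 1))])   # [q + i*w*u; 1]
Kt = np.hstack([K, 1.0 - (pos * omega * (K @ u))[:, None]])       # [k; 1 - j*w*(u.k)]

logits = Qt @ Kt.T                          # q.k + 1 - (j-i)*omega*(u.k_j)
logits[np.triu_indices(T, k=1)] = -np.inf   # causal mask: no attending ahead
w = np.exp(logits - logits.max(axis=-1, keepdims=True))
w /= w.sum(axis=-1, keepdims=True)          # row-wise softmax
out = w @ V                                 # (T, d) attended output
```

No special kernels are needed: everything above is a standard masked-softmax attention with two extra vector shifts.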

7. Context and Implications

Additive GRAPE places static and adaptive logit-bias mechanisms on a firm group-theoretic foundation via unipotent subgroups of \mathrm{GL}(d+1). It recovers ALiBi and FoX exactly, supports content gating and multi-component biases, enforces strict relativity, remains computationally efficient, and achieves improved empirical stability and extrapolation performance. This suggests potential for further exploration of group actions in positional encoding, especially in domains requiring long-context or streaming architectures (Zhang et al., 8 Dec 2025).

References (1)
