Additive GRAPE: Group-Theoretic Positional Encoding

Updated 9 December 2025
  • Additive GRAPE is a positional encoding framework that leverages unipotent group actions in GL(d+1) to achieve exact relative-position invariance.
  • It generalizes ALiBi and FoX by using homogeneous coordinates and rank-1 nilpotent generators to compute content-adaptive additive biases efficiently.
  • The approach ensures O(d) computational overhead, streaming cacheability, and improved stability for long-context Transformer models.

Additive GRAPE (Group RepresentAtional Position Encoding, additive form), frequently denoted "GRAPE-A," is a positional encoding framework for long-context neural models based on unipotent group actions in the general linear group \mathrm{GL}(d+1). GRAPE-A generalizes and rigorously grounds linear logit-bias mechanisms such as ALiBi and the Forgetting Transformer (FoX), providing exact relative-position invariance, streaming cacheability, and support for rank-R and path-integral generalizations. Its construction leverages homogeneous coordinates, resulting in closed-form group actions that yield content-adaptive, translation-equivariant logit biases with negligible computational and memory overhead relative to standard multi-head attention (Zhang et al., 8 Dec 2025).

1. Group-Theoretic Foundations

GRAPE-A lifts each d-dimensional query/key x to homogeneous coordinates

\widehat x = \begin{bmatrix} x \\ 1 \end{bmatrix} \in \mathbb{R}^{d+1},

allowing affine translations to be represented as linear transformations. The central group-theoretic ingredient is a rank-1 nilpotent generator,

A = \begin{bmatrix} 0_{d \times d} & u \\ 0_{1 \times d} & 0 \end{bmatrix}, \qquad A^2 = 0,

where u \in \mathbb{R}^d is a learned translation vector. The corresponding one-parameter unipotent subgroup is given by the matrix exponential,

G(n) = \exp(n \omega A) = I + n \omega A = \begin{bmatrix} I_d & n\omega u \\ 0 & 1 \end{bmatrix} \in \mathrm{GL}(d+1),

which acts as a translation in the original feature space. This group action satisfies the composition and inversion laws G(n+m) = G(n)\,G(m) and G(-n) = G(n)^{-1}, enforcing exact relativity by design.
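The nilpotency and group laws above are easy to check numerically. The following sketch (illustrative dimensions and parameters, not the authors' code) builds the generator A and the subgroup G(n) with NumPy:

```python
import numpy as np

# Illustrative values only: u and omega would be learned in practice.
d = 4
rng = np.random.default_rng(0)
u = rng.normal(size=d)          # stand-in for the learned translation vector
omega = 0.1                     # scalar rate

A = np.zeros((d + 1, d + 1))
A[:d, d] = u                    # top-right block holds u; lower rows are zero

def G(n):
    """G(n) = exp(n*omega*A); the exponential truncates since A is nilpotent."""
    return np.eye(d + 1) + n * omega * A

assert np.allclose(A @ A, 0)                      # A^2 = 0 (rank-1 nilpotent)
assert np.allclose(G(3) @ G(5), G(8))             # composition: G(n+m) = G(n)G(m)
assert np.allclose(G(-5), np.linalg.inv(G(5)))    # inversion: G(-n) = G(n)^{-1}
```

Because A squares to zero, the matrix exponential is exactly linear in n, which is what makes the group action cheap to apply.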

2. Action on Attention Logits and Additive Bias

For input sequence positions i and j, the queries and keys are embedded as \widehat q_i = [q_i; 1] and \widehat k_j = [k_j; 1]. The transformed queries and keys are given by

\widetilde q_i = G(i)\, \widehat q_i, \qquad \widetilde k_j = \big(G(j)^{-1}\big)^{\top} \widehat k_j.

The resulting attention logit has the closed form

\widetilde q_i^{\top} \widetilde k_j = \widehat q_i^{\top}\, G(j-i)^{-\top}\, \widehat k_j.

Since A^2 = 0, one has

G(m)^{-\top} = \begin{bmatrix} I_d & 0 \\ -m\omega u^{\top} & 1 \end{bmatrix},

so that

\widetilde q_i^{\top} \widetilde k_j = q_i^{\top} k_j + 1 - m\,\omega\,(u^{\top} k_j),

with m = j - i. The softmax-invariant constant +1 may be discarded, yielding the pure additive bias

b(m) = -m\,\omega\,(u^{\top} k_j),

which is content-adaptive. Generalization to R pairwise commuting nilpotent generators A_r (with A_r A_s = 0) yields a sum of R rank-1 bias terms, each of the form above.
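As a sanity check on this derivation, the paired transforms can be verified against the closed-form bias directly; a minimal sketch with arbitrary illustrative values:

```python
import numpy as np

d = 4
rng = np.random.default_rng(1)
u, omega = rng.normal(size=d), 0.05        # illustrative stand-ins
q, k = rng.normal(size=d), rng.normal(size=d)
i, j = 3, 11
m = j - i

A = np.zeros((d + 1, d + 1))
A[:d, d] = u                               # rank-1 nilpotent generator
G = lambda n: np.eye(d + 1) + n * omega * A

q_hat = np.append(q, 1.0)                  # homogeneous lift [q; 1]
k_hat = np.append(k, 1.0)                  # homogeneous lift [k; 1]

# Paired transforms: q -> G(i) q_hat,  k -> (G(j)^{-1})^T k_hat.
logit = (G(i) @ q_hat) @ (np.linalg.inv(G(j)).T @ k_hat)
closed_form = q @ k + 1.0 - m * omega * (u @ k)
assert np.allclose(logit, closed_form)     # matches the derivation exactly
```

The +1 term is constant across keys in a softmax row, which is why it can be dropped without changing the attention weights.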

3. Recovery of ALiBi and Forgetting Transformer (FoX)

GRAPE-A subsumes the static linear bias of ALiBi and the adaptive cumulative gating of FoX as exact special cases:

  • For ALiBi, a (d+2)-dimensional lift is used:

\widehat q_i = [q_i;\, 1;\, 0], \qquad \widehat k_j = [k_j;\, 0;\, 1],

with generator A_h = \beta_h\, e_{d+2}\, e_{d+1}^{\top} for each attention head h, yielding \widetilde q_i^{\top} \widetilde k_j = q_i^{\top} k_j - m \beta_h, i.e., b(m) = -(j-i)\beta_h, reproducing ALiBi's static bias law.

  • For FoX, the additive bias is a sum of per-token log-forget gates f_{t,h}:

D_{ij,h} = \sum_{\ell = j+1}^{i} \log f_{\ell,h} \leq 0,

which is captured in GRAPE-A's path-integral framework by letting the logit bias be b_h(i,j) = D_{ij,h}, i.e., a sum of edge potentials \psi_h(\ell) = \log f_{\ell,h} along the token chain. This is realized by a rank-1 unipotent lift and a path product of transformations.
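Both special cases can be reproduced numerically. The sketch below (illustrative dimensions, slope, and gates; not the authors' code) checks the (d+2)-lift against ALiBi's static bias law and computes FoX's cumulative bias as a prefix sum:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4

# --- ALiBi as a special case: (d+2)-dimensional lift -----------------------
beta = 0.25                                 # illustrative per-head slope
q, k = rng.normal(size=d), rng.normal(size=d)
i, j = 9, 2

A = np.zeros((d + 2, d + 2))
A[d + 1, d] = beta                          # A_h = beta * e_{d+2} e_{d+1}^T
G = lambda n: np.eye(d + 2) + n * A         # unipotent; omega folded into beta

q_hat = np.concatenate([q, [1.0, 0.0]])     # [q; 1; 0]
k_hat = np.concatenate([k, [0.0, 1.0]])     # [k; 0; 1]
logit = (G(i) @ q_hat) @ (np.linalg.inv(G(j)).T @ k_hat)
assert np.allclose(logit, q @ k - (j - i) * beta)   # ALiBi's bias law

# --- FoX as a special case: path sum of log forget gates -------------------
T = 8
f = rng.uniform(0.8, 1.0, size=T)           # illustrative gates in (0, 1]
S = np.concatenate([[0.0], np.cumsum(np.log(f))])
D = lambda i, j: S[i + 1] - S[j + 1]        # D_ij = sum_{l=j+1}^{i} log f_l
assert np.allclose(D(6, 2), np.log(f[3:7]).sum())
assert D(6, 2) <= 0                          # gates <= 1, so the bias is <= 0
```

The prefix-sum form of D shows why FoX-style gating stays streaming-friendly: each new token only extends a running cumulative sum.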

4. Relative Position Law and Streaming Cacheability

GRAPE-A enforces an exact relative-position law:

\widetilde q_i^{\top} \widetilde k_j = \widehat q_i^{\top}\, G(j-i)^{-\top}\, \widehat k_j = F(q_i, k_j;\, j-i),

so that attention logits depend on positions only through j - i. This structure permits efficient streaming: each key k_j is cached as

\widetilde k_j^{\star} = G(j)^{-\top}\, \widehat k_j = (I - j\omega A^{\top})\, \widehat k_j

at prefetch, never requiring update. Queries are transformed on arrival as \widetilde q_i = G(i)\, \widehat q_i (O(d) cost), and logits are computed by a simple dot product \ell_{i,j} = \widetilde q_i^{\top} \widetilde k_j^{\star} (O(d) per head), facilitating streaming and memory-efficient inference.
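The streaming recipe can be sketched as follows, using the closed forms for the cached keys and transformed queries (hypothetical helper names; O(d) work per token):

```python
import numpy as np

d = 4
rng = np.random.default_rng(4)
u, omega = rng.normal(size=d), 0.05        # illustrative stand-ins

def cache_key(k, j):
    """Cache a key once at arrival: G(j)^{-T} [k; 1] = [k; 1 - j*omega*(u.k)]."""
    return np.append(k, 1.0 - j * omega * (u @ k))

def transform_query(q, i):
    """Transform a query at arrival: G(i) [q; 1] = [q + i*omega*u; 1]."""
    return np.append(q + i * omega * u, 1.0)

raw_keys = [rng.normal(size=d) for _ in range(5)]
cached = [cache_key(k, j) for j, k in enumerate(raw_keys)]   # never updated

i = 4
q = rng.normal(size=d)
q_t = transform_query(q, i)
logits = [q_t @ k_star for k_star in cached]                 # O(d) per key

# Each logit matches the closed form q.k + 1 - (j - i)*omega*(u.k) exactly.
for j, k in enumerate(raw_keys):
    assert np.allclose(logits[j], q @ k + 1 - (j - i) * omega * (u @ k))
```

Because cached keys are position-stamped once and never revisited, the cache behaves exactly like a standard key-value cache with one extra scalar per key.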

5. Computational Complexity and Memory Requirements

Per-head computational and memory overheads are summarized as follows:

Mechanism            Per-step compute   Memory cost   Flexibility
ALiBi                O(1)               O(1)          static slope
FoX                  O(1) per gate      cumulative    cumulative gating
GRAPE-A (rank-1)     O(d)               O(d)          content, multi-bias
GRAPE-A (R terms)    O(Rd)              O(Rd)         multi-component

Key and query transformations each incur O(d) flops, identical to ALiBi and FoX. Multiple nilpotent generators scale linearly with rank R. Only one learned u \in \mathbb{R}^d per head and one (d+1)-vector cache per key are stored, matching standard key-value caches. GRAPE-A thus generalizes prior mechanisms with full streaming support and O(d) overhead.

6. Stability, Equivariance, and Extensions

Each G(n) is unipotent with all eigenvalues equal to 1, so the spectral radius is 1, and the paired inverse-transpose scoring exactly cancels anisotropy. This guarantees that feature norms and volumes undergo no hidden contraction or expansion (the transformation contributes an additive bias only), in contrast to contextual affine schemes, which can introduce numerical drift. The translation-equivariant and extrapolative nature of GRAPE-A ensures exact handling of arbitrary relative positions j - i.

The path-integral extension (GRAPE-AP) allows endpoint-dependent edge potentials \psi_h(t, \ell) and causal additive biases b_h(t, j) = \sum_{\ell = j+1}^{t} \psi_h(t, \ell), generalizing FoX while retaining the strict relative law. Empirical results on FineWeb-Edu demonstrate GRAPE-A's stability in training, avoidance of the instabilities observed with pure RoPE, and consistent improvements in validation loss, particularly on long-context tasks.

Implementation is direct: \widetilde k_j^{\star} = k_j - j\omega u and \widetilde q_i = q_i + i\omega u, followed by a dot product and bias addition, requiring no complex matrix operations, special kernels, or departures from conventional Transformer design.
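A full GRAPE-A attention head then reduces to a few vectorized lines. The sketch below uses the (d+1)-dimensional homogeneous form from Section 2, so the additive bias emerges from a plain dot product (shapes and parameter values are illustrative, not the authors' configuration):

```python
import numpy as np

d, T = 8, 6
rng = np.random.default_rng(5)
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
u, omega = rng.normal(size=d), 0.02        # illustrative learned parameters
pos = np.arange(T)

# O(d) per token: lift to homogeneous coordinates and apply the shifts.
Qt = np.hstack([Q + pos[:, None] * omega * u, np.ones((T, 1))])   # [q + i*w*u; 1]
Kt = np.hstack([K, 1.0 - (pos * omega * (K @ u))[:, None]])       # [k; 1 - j*w*(u.k)]

logits = Qt @ Kt.T                          # q.k + 1 - (j-i)*omega*(u.k_j)
logits[np.triu_indices(T, k=1)] = -np.inf   # causal mask: no attending ahead
w = np.exp(logits - logits.max(axis=-1, keepdims=True))
w /= w.sum(axis=-1, keepdims=True)          # row-wise softmax
out = w @ V                                 # (T, d) attended output
```

No special kernels are needed: everything above is a standard masked-softmax attention with two extra vector shifts.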

7. Context and Implications

Additive GRAPE places static and adaptive logit-bias mechanisms on a firm group-theoretic foundation via unipotent subgroups of \mathrm{GL}(d+1). It recovers ALiBi and FoX exactly, supports content gating and multi-component biases, enforces strict relativity, remains computationally efficient, and achieves improved empirical stability and extrapolation performance. This suggests potential for further exploration of group actions in positional encoding, especially in domains requiring long-context or streaming architectures (Zhang et al., 8 Dec 2025).

References (1)
