Multiplicative GRAPE: Lie Group Position Encoding

Updated 9 December 2025
  • Multiplicative GRAPE is a Lie group-based positional encoding that leverages multiplicative rotations to provide norm-preserving, relative, and compositional mappings.
  • It employs a rank-2 skew generator and closed-form Rodrigues-type solutions to enforce exact relative attention laws with efficient O(d) complexity.
  • Empirical results demonstrate stable training and improved generalization in long-context models compared to traditional block-diagonal approaches.

Multiplicative GRAPE (GRAPE-M) is a positional encoding mechanism grounded in Lie group actions, specifically multiplicative rotations drawn from the special orthogonal group $\mathrm{SO}(d)$. It provides a rigorous, parameterized family of norm-preserving, relative, and compositional position mappings for attention-based neural architectures. GRAPE-M generalizes and subsumes Rotary Position Embedding (RoPE) by extending beyond canonical block-diagonal rotations to learned multi-subspace rotations and compact mixtures of non-commuting group actions, at tractable computational overhead. It is a central component of the GRAPE (Group RepresentAtional Position Encoding) framework for positional geometry in long-context models, admitting closed-form solutions and strict relative attention laws (Zhang et al., 8 Dec 2025).

1. Mathematical Construction: Rank-2 Skew Generator

At the core of Multiplicative GRAPE is a rank-2 skew-symmetric generator $L(a,b) = a b^\top - b a^\top \in \mathfrak{so}(d)$, where $a, b \in \mathbb{R}^d$. For fixed $a, b$:

  • Compute $\alpha = \|a\|^2$, $\beta = \|b\|^2$, $\gamma = a^\top b$
  • Define $\Delta = \alpha\beta - \gamma^2 \geq 0$ and $s = \sqrt{\Delta}$
  • $L$ acts nontrivially only on $U = \operatorname{span}\{a, b\}$ and satisfies $L^2 = -s^2 P_U$, where $P_U$ is the orthogonal projector onto $U$

The spectrum of $L$ is $\{\pm i s, 0, \ldots, 0\}$, ensuring that $\exp(n\omega L)$ is a rotation in $U$ for frequency $\omega > 0$ and token index $n \in \mathbb{Z}$ (or $n \in \mathbb{R}$ for continuous positions).
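The identity $L^2 = -s^2 P_U$ and the spectral claim are straightforward to verify numerically. A minimal numpy sketch (variable names are ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
a, b = rng.standard_normal(d), rng.standard_normal(d)

# Rank-2 skew-symmetric generator L = a b^T - b a^T
L = np.outer(a, b) - np.outer(b, a)

alpha, beta, gamma = a @ a, b @ b, a @ b
s = np.sqrt(alpha * beta - gamma**2)   # sqrt(Delta) >= 0 by Cauchy-Schwarz

# Orthogonal projector onto U = span{a, b} via QR of [a | b]
Q, _ = np.linalg.qr(np.stack([a, b], axis=1))
P_U = Q @ Q.T

# L^2 = -s^2 P_U, and the nonzero eigenvalues of L are +/- i s
assert np.allclose(L @ L, -s**2 * P_U)
eigs = np.linalg.eigvals(L)
assert np.isclose(np.max(np.abs(eigs.imag)), s)
```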

2. One-Parameter Subgroup, Relative Law, and Closed-Form Exponential

GRAPE-M defines the multiplicative positional map as $G(n) = \exp(n\omega L) \in \mathrm{SO}(d)$. The exponential map generates a one-parameter subgroup:

  • $G(n+m) = G(n)\,G(m)$
  • $G(0) = I$
  • $G(-n) = G(n)^\top = G(n)^{-1}$

Applied to attention, if $q_i$ and $k_j$ are query and key embeddings at positions $i$ and $j$, GRAPE-M enforces

$$(G(i)q_i)^\top (G(j)k_j) = q_i^\top G(j-i)\,k_j,$$

giving strictly relative positional dependence.

The matrix exponential admits a closed-form Rodrigues-type solution:

$$\exp(\eta L) = I + \frac{\sin(s\eta)}{s}\,L + \frac{1-\cos(s\eta)}{s^2}\,L^2.$$

This formula realizes a planar rotation by angle $s\eta$ inside $U$ and acts as the identity on $U^\perp$.
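Both the group law and the strictly relative attention law follow directly from the closed form, and both can be checked numerically. A hedged numpy sketch (the helper name `G` is our own):

```python
import numpy as np

def G(eta, L, s):
    """Closed-form exp(eta * L) via the Rodrigues-type formula (uses L^2 = -s^2 P_U)."""
    d = L.shape[0]
    return (np.eye(d)
            + (np.sin(s * eta) / s) * L
            + ((1 - np.cos(s * eta)) / s**2) * (L @ L))

rng = np.random.default_rng(1)
d = 8
a, b = rng.standard_normal(d), rng.standard_normal(d)
L = np.outer(a, b) - np.outer(b, a)
s = np.sqrt((a @ a) * (b @ b) - (a @ b) ** 2)

omega, i, j = 0.3, 5, 2

# One-parameter subgroup: G(n + m) = G(n) G(m)
assert np.allclose(G((i + j) * omega, L, s),
                   G(i * omega, L, s) @ G(j * omega, L, s))

# Strictly relative attention law: (G(i) q)^T (G(j) k) = q^T G(j - i) k
q, k = rng.standard_normal(d), rng.standard_normal(d)
lhs = (G(i * omega, L, s) @ q) @ (G(j * omega, L, s) @ k)
rhs = q @ G((j - i) * omega, L, s) @ k
assert np.isclose(lhs, rhs)
```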

3. Recovery of RoPE and the Canonical Case

RoPE arises as a special case with $b = Ja$, where $J$ is the canonical block-diagonal complex structure ($J^2 = -I$; 90° rotation in each coordinate pair). Choosing $a$ and $b$ with $\|a\| = \|b\| = 1$ and $a^\top b = 0$, so that $s = 1$, the generator satisfies $L(a, Ja)|_U = -J|_U$ and $L^2 = -P_U$. The resulting position map is

$$G(n) = I + \sin(n\omega)\,L + (1 - \cos(n\omega))\,L^2 = I - (1-\cos(n\omega))\,P_U - \sin(n\omega)\,J P_U.$$

For $d/2$ disjoint planes (canonical coordinate pairs) with per-plane frequencies $\theta_i$,

$$L_\mathrm{tot} = \sum_{i=1}^{d/2} \theta_i\, L(e_{2i-1}, e_{2i}),$$

and $G(n) = \prod_{i=1}^{d/2} \exp(n\theta_i L_i)$ is block-diagonal. The log-uniform choice $\theta_i = \omega_0\log(10000^{2i/d})$ exactly recovers the RoPE spectrum.
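The reduction to RoPE can be checked by assembling the block-diagonal map from per-plane Rodrigues rotations and comparing against the usual pairwise cos/sin implementation of RoPE. A small numpy sketch (the orientation of each $(a, b)$ pair is a convention we fix here):

```python
import numpy as np

def rodrigues(eta, a, b):
    """Closed-form exp(eta * L(a, b)) for unit, mutually orthogonal a, b (so s = 1)."""
    L = np.outer(a, b) - np.outer(b, a)
    return np.eye(len(a)) + np.sin(eta) * L + (1 - np.cos(eta)) * (L @ L)

d, n = 8, 7
theta = 10000.0 ** (-2.0 * np.arange(d // 2) / d)   # log-uniform RoPE-style spectrum

# GRAPE-M on canonical coordinate pairs: product of per-plane Rodrigues rotations
I = np.eye(d)
G = I.copy()
for i, t in enumerate(theta):
    # the (a, b) ordering fixes the rotation sense to match standard RoPE
    G = G @ rodrigues(n * t, I[2 * i + 1], I[2 * i])

# Standard RoPE applies the same block-diagonal rotation via pairwise cos/sin tables
x = np.random.default_rng(2).standard_normal(d)
c, s = np.cos(n * theta), np.sin(n * theta)
x2 = x.reshape(-1, 2)
rope = np.stack([c * x2[:, 0] - s * x2[:, 1],
                 s * x2[:, 0] + c * x2[:, 1]], axis=1).ravel()
assert np.allclose(G @ x, rope)
```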

4. Extensions: Learned Commuting Subspaces and Non-Commuting Mixtures

Learned Commuting Subspaces

Multiplicative GRAPE naturally extends to learned multi-plane actions:

  • Let $B \in \mathrm{SO}(d)$ be a learned orthogonal basis, and select planes $U_i$ spanned by the column pairs $[e_{2i-1}, e_{2i}]$
  • Set $L_i = B U_i J U_i^\top B^\top$; the planes are mutually orthogonal, so $[L_i, L_j] = 0$
  • Aggregate $L = \sum_{i=1}^{d/2} \theta_i L_i$, so that $G(n) = \exp(nL) = \prod_i \exp(n\theta_i L_i)$

This allows learning per-head or per-plane spectra while maintaining $O(d)$ computational and memory cost per head. Each token's features are rotated independently in $d/2$ planes.
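The commuting construction amounts to rotating coordinates in a learned orthogonal basis. The following numpy illustration (with a random $B$ standing in for a learned one) checks that the generators commute and that the relative law survives the change of basis:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 6

# Learned orthogonal basis B (here: QR of a random matrix as a stand-in)
B, _ = np.linalg.qr(rng.standard_normal((d, d)))
theta = rng.uniform(0.1, 1.0, d // 2)      # learned per-plane frequencies

# L_i generates a rotation in the i-th learned plane span{B[:,2i], B[:,2i+1]}
Ls = []
for i in range(d // 2):
    u, v = B[:, 2 * i], B[:, 2 * i + 1]
    Ls.append(np.outer(u, v) - np.outer(v, u))

# Mutually orthogonal planes => the generators commute
assert np.allclose(Ls[0] @ Ls[1], Ls[1] @ Ls[0])

def apply_G(x, n):
    """Apply G(n) = exp(n * sum_i theta_i L_i): rotate plane-by-plane in the B basis."""
    y = (B.T @ x).reshape(-1, 2)           # change to the learned basis
    c, s = np.cos(n * theta), np.sin(n * theta)
    y = np.stack([c * y[:, 0] - s * y[:, 1],
                  s * y[:, 0] + c * y[:, 1]], axis=1)
    return B @ y.ravel()                   # change back

# Relative law survives the learned basis: <G(2)q, G(5)k> = <q, G(3)k>
q, k = rng.standard_normal(d), rng.standard_normal(d)
assert np.isclose(apply_G(q, 2) @ apply_G(k, 5), q @ apply_G(k, 3))
```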

Compact Non-Commuting Mixtures

To capture cross-subspace structure, GRAPE-M provides compact mixtures of $m$ non-commuting rank-2 generators:

  • $L_r = \sum_{j=1}^m \omega_j\, L(a_j, b_j)$, where the span of all $\{a_j, b_j\}$ has dimension $r = 2m$
  • Project features into the $r$-dimensional subspace using an orthonormal basis $E \in \mathbb{R}^{d\times r}$
  • Compress $L_r$ to $L_r^U \in \mathfrak{so}(r)$ and compute the real Schur decomposition $L_r^U = T\,(\oplus_{t=1}^m \theta_t J)\,T^\top$
  • Rotate in each $2$-plane by $n\theta_t$

The operation requires $O(rd)$ cost per head and enables non-commuting, coupled feature mixing beyond block-diagonal structure.
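The compression step can be illustrated in numpy. As an assumption on our part, the small $\mathfrak{so}(r)$ exponential is computed here by eigendecomposition rather than the real Schur form the text describes; the two agree, Schur merely exposes the per-plane angles $\theta_t$ explicitly:

```python
import numpy as np

rng = np.random.default_rng(4)
d, m = 12, 2                               # two non-commuting rank-2 generators
r = 2 * m
vecs = rng.standard_normal((r, d))         # rows: a_1, b_1, a_2, b_2
omega = rng.uniform(0.2, 0.8, m)

L_r = np.zeros((d, d))
for j in range(m):
    a, b = vecs[2 * j], vecs[2 * j + 1]
    L_r += omega[j] * (np.outer(a, b) - np.outer(b, a))

# Orthonormal basis E of the r-dimensional active subspace span{a_j, b_j}
E, _ = np.linalg.qr(vecs.T)                # d x r, orthonormal columns
L_small = E.T @ L_r @ E                    # compressed generator in so(r)
assert L_small.shape == (r, r)

def expm_skew(A):
    """Matrix exponential of a small skew-symmetric matrix via eigendecomposition."""
    w, V = np.linalg.eig(A)                # purely imaginary spectrum
    return (V @ np.diag(np.exp(w)) @ np.linalg.inv(V)).real

n = 3.0
# exp(n L_r) acts as exp(n L_small) inside the subspace and as identity outside
G_full = E @ expm_skew(n * L_small) @ E.T + (np.eye(d) - E @ E.T)

# Sanity checks: norm preservation and the one-parameter group law
x = rng.standard_normal(d)
assert np.isclose(np.linalg.norm(G_full @ x), np.linalg.norm(x))
G2 = E @ expm_skew(2 * n * L_small) @ E.T + (np.eye(d) - E @ E.T)
assert np.allclose(G_full @ G_full, G2)
```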

5. Pseudocode Overview

Two core algorithms underpin efficient implementation:

| Algorithm | Description | Complexity |
| --- | --- | --- |
| Algorithm 1 | Commuting multi-subspace GRAPE-M: apply $d/2$ planar rotations via an orthogonal $B$ | $O(d)$ per head |
| Algorithm 2 | Fast non-commuting via Schur: compress to the $r$-dimensional subspace, work in $\mathfrak{so}(r)$, apply planar rotations in mixed subspaces | $O(rd)$ per head |

Algorithmic Steps:

  1. Commuting case: rotate each coordinate pair by $\cos\theta$ and $\sin\theta$, parameterized by position and frequency.
  2. Non-commuting case: project to a lower-dimensional subspace, apply multiple coupled planar rotations, and project back.

6. Initialization, Stability, and Computational Considerations

  • Each plane generator is gauge-fixed with $\|a\| = \|b\| = 1$ and $a^\top b = 0$; scale is absorbed into $\omega$.
  • Frequencies $\omega$ or $\{\theta_i\}$ can be fixed (e.g., log-uniform) or learned per head/plane.
  • The closed form uses $f_1(z) = \sin z / z$ and $f_2(z) = (1-\cos z)/z^2$, with Taylor expansions guarding $z \to 0$ for numerical stability.
  • Differentiation with respect to $\omega$, $a$, $b$ yields analytic, stable gradients.
  • Complexity: GRAPE-M achieves $O(d)$ time, $O(d)$ memory, and $O(d)$ parameters per head, versus $O(d^3)$ time and $O(d^2)$ memory for full-matrix exponentials.
  • Compared to block-diagonal RoPE, GRAPE-M enables richer feature coupling without increasing leading-order cost.
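The guarded evaluation of $f_1$ and $f_2$ can be sketched as follows; thresholds and function names are illustrative, not from the paper. Writing the closed form as $\exp(\eta L) = I + \eta f_1(s\eta) L + \eta^2 f_2(s\eta) L^2$ keeps it well-defined even as $s \to 0$:

```python
import numpy as np

def f1(z, eps=1e-4):
    """sin(z)/z with a second-order Taylor guard near z = 0."""
    z = np.asarray(z, dtype=float)
    small = np.abs(z) < eps
    zs = np.where(small, 1.0, z)           # dummy value avoids 0/0 in the masked branch
    return np.where(small, 1.0 - z**2 / 6.0, np.sin(zs) / zs)

def f2(z, eps=1e-4):
    """(1 - cos(z))/z^2 with a Taylor guard near z = 0."""
    z = np.asarray(z, dtype=float)
    small = np.abs(z) < eps
    zs = np.where(small, 1.0, z)
    return np.where(small, 0.5 - z**2 / 24.0, (1.0 - np.cos(zs)) / zs**2)

def exp_rank2(eta, a, b, eps=1e-4):
    """Stable closed form: exp(eta L) = I + eta f1(s eta) L + eta^2 f2(s eta) L^2."""
    L = np.outer(a, b) - np.outer(b, a)
    s = np.sqrt(max((a @ a) * (b @ b) - (a @ b) ** 2, 0.0))
    z = s * eta
    return np.eye(len(a)) + eta * f1(z, eps) * L + eta**2 * f2(z, eps) * (L @ L)

# Degenerate generator (b parallel to a, so s = 0) degrades gracefully to the identity
a = np.ones(4)
assert np.allclose(exp_rank2(0.7, a, 2 * a), np.eye(4))
```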

7. Empirical Performance and Model Integration

In experiments on language modeling (nanoGPT/LLaMA base, FineWeb-Edu 100B), Multiplicative GRAPE shows:

  • Stable training curves, in contrast with the oscillations observed for RoPE.
  • Validation loss on par with or improved over RoPE for both medium (355M) and large (770M) models.
  • On downstream zero-shot evaluation tasks (ARC, BoolQ, HellaSwag, OBQA, PIQA, WinoGrande, SciQ), both GRAPE-A (additive) and GRAPE-M outperform RoPE and the Forgetting Transformer (FoX) in average accuracy (see Tables A.1–A.2 of (Zhang et al., 8 Dec 2025)).

A plausible implication is that the additional expressivity and strict group-theoretic structure of Multiplicative GRAPE can enhance both training stability and generalization in long-context models, while preserving exact relative positional laws and $O(d)$ efficiency.
