Multiplicative GRAPE: Lie Group Position Encoding

Updated 9 December 2025
  • Multiplicative GRAPE is a Lie group-based positional encoding that leverages multiplicative rotations to provide norm-preserving, relative, and compositional mappings.
  • It employs a rank-2 skew generator and closed-form Rodrigues-type solutions to enforce exact relative attention laws with efficient O(d) complexity.
  • Empirical results demonstrate stable training and improved generalization in long-context models compared to traditional block-diagonal approaches.

Multiplicative GRAPE (GRAPE-M) is a positional encoding mechanism grounded in Lie group actions, specifically multiplicative rotations drawn from the special orthogonal group $\mathrm{SO}(d)$. It provides a rigorous, parameterized family of norm-preserving, relative, and compositional position mappings for attention-based neural architectures. GRAPE-M generalizes and subsumes Rotary Position Embedding (RoPE) by extending beyond canonical block-diagonal rotations to learned multi-subspace rotations and compact mixtures of non-commuting group actions, at tractable computational overhead. It is a central component of the GRAPE (Group RepresentAtional Position Encoding) framework for positional geometry in long-context models, admitting closed-form solutions and strict relative attention laws (Zhang et al., 8 Dec 2025).

1. Mathematical Construction: Rank-2 Skew Generator

At the core of Multiplicative GRAPE is a rank-2 skew-symmetric generator $L(a,b) = a b^\top - b a^\top \in \mathfrak{so}(d)$, where $a, b \in \mathbb{R}^d$. For fixed $a, b$:

  • Compute $\alpha = \|a\|^2$, $\beta = \|b\|^2$, $\gamma = a^\top b$
  • Define $\Delta = \alpha\beta - \gamma^2 \geq 0$ and $s = \sqrt{\Delta}$
  • $L$ acts nontrivially only on $U = \operatorname{span}\{a, b\}$ and satisfies $L^2 = -s^2 P_U$, where $P_U$ is the orthogonal projector onto $U$

The spectrum of $L$ is $\{\pm i s, 0, \ldots, 0\}$, ensuring that $\exp(n\omega L)$ is a rotation in $U$ for frequency $\omega > 0$ and token index $n \in \mathbb{Z}$ (or $n \in \mathbb{R}$ for continuous positions).
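The identity $L^2 = -s^2 P_U$ and the spectral claim are straightforward to verify numerically. A minimal numpy sketch (variable names are ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
a, b = rng.standard_normal(d), rng.standard_normal(d)

# Rank-2 skew-symmetric generator L = a b^T - b a^T
L = np.outer(a, b) - np.outer(b, a)

alpha, beta, gamma = a @ a, b @ b, a @ b
s = np.sqrt(alpha * beta - gamma**2)   # sqrt(Delta) >= 0 by Cauchy-Schwarz

# Orthogonal projector onto U = span{a, b} via QR of [a | b]
Q, _ = np.linalg.qr(np.stack([a, b], axis=1))
P_U = Q @ Q.T

# L^2 = -s^2 P_U, and the nonzero eigenvalues of L are +/- i s
assert np.allclose(L @ L, -s**2 * P_U)
eigs = np.linalg.eigvals(L)
assert np.isclose(np.max(np.abs(eigs.imag)), s)
```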

2. One-Parameter Subgroup, Relative Law, and Closed-Form Exponential

GRAPE-M defines the multiplicative positional map as $G(n) = \exp(n\omega L) \in \mathrm{SO}(d)$. The exponential map generates a one-parameter subgroup:

  • $G(n+m) = G(n)\,G(m)$
  • $G(0) = I$
  • $G(-n) = G(n)^\top = G(n)^{-1}$

Applied to attention, if $q_i$ and $k_j$ are query and key embeddings at positions $i$ and $j$, GRAPE-M enforces

$$(G(i)q_i)^\top (G(j)k_j) = q_i^\top G(j-i)\,k_j,$$

giving strictly relative positional dependence.

The matrix exponential admits a closed-form Rodrigues-type solution:

$$\exp(\eta L) = I + \frac{\sin(s\eta)}{s}\,L + \frac{1-\cos(s\eta)}{s^2}\,L^2.$$

This formula realizes a planar rotation by angle $s\eta$ inside $U$ and acts as the identity on $U^\perp$.
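Both the group law and the strictly relative attention law follow directly from the closed form, and both can be checked numerically. A hedged numpy sketch (the helper name `G` is our own):

```python
import numpy as np

def G(eta, L, s):
    """Closed-form exp(eta * L) via the Rodrigues-type formula (uses L^2 = -s^2 P_U)."""
    d = L.shape[0]
    return (np.eye(d)
            + (np.sin(s * eta) / s) * L
            + ((1 - np.cos(s * eta)) / s**2) * (L @ L))

rng = np.random.default_rng(1)
d = 8
a, b = rng.standard_normal(d), rng.standard_normal(d)
L = np.outer(a, b) - np.outer(b, a)
s = np.sqrt((a @ a) * (b @ b) - (a @ b) ** 2)

omega, i, j = 0.3, 5, 2

# One-parameter subgroup: G(n + m) = G(n) G(m)
assert np.allclose(G((i + j) * omega, L, s),
                   G(i * omega, L, s) @ G(j * omega, L, s))

# Strictly relative attention law: (G(i) q)^T (G(j) k) = q^T G(j - i) k
q, k = rng.standard_normal(d), rng.standard_normal(d)
lhs = (G(i * omega, L, s) @ q) @ (G(j * omega, L, s) @ k)
rhs = q @ G((j - i) * omega, L, s) @ k
assert np.isclose(lhs, rhs)
```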

3. Recovery of RoPE and the Canonical Case

RoPE arises as a special case with $b = Ja$, where $J$ is the canonical block-diagonal complex structure ($J^2 = -I$; 90° rotation in each coordinate pair). Choosing $a$ and $b$ with $\|a\| = \|b\| = 1$ and $a^\top b = 0$, so that $s = 1$, the generator satisfies $L(a, Ja)|_U = -J|_U$ and $L^2 = -P_U$. The resulting position map is

$$G(n) = I + \sin(n\omega)\,L + (1 - \cos(n\omega))\,L^2 = I - (1-\cos(n\omega))\,P_U - \sin(n\omega)\,J P_U.$$

For $d/2$ disjoint planes (canonical coordinate pairs) with per-plane frequencies $\theta_i$,

$$L_\mathrm{tot} = \sum_{i=1}^{d/2} \theta_i\, L(e_{2i-1}, e_{2i}),$$

and $G(n) = \prod_{i=1}^{d/2} \exp(n\theta_i L_i)$ is block-diagonal. The log-uniform choice $\theta_i = \omega_0\log(10000^{2i/d})$ exactly recovers the RoPE spectrum.
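The reduction to RoPE can be checked by assembling the block-diagonal map from per-plane Rodrigues rotations and comparing against the usual pairwise cos/sin implementation of RoPE. A small numpy sketch (the orientation of each $(a, b)$ pair is a convention we fix here):

```python
import numpy as np

def rodrigues(eta, a, b):
    """Closed-form exp(eta * L(a, b)) for unit, mutually orthogonal a, b (so s = 1)."""
    L = np.outer(a, b) - np.outer(b, a)
    return np.eye(len(a)) + np.sin(eta) * L + (1 - np.cos(eta)) * (L @ L)

d, n = 8, 7
theta = 10000.0 ** (-2.0 * np.arange(d // 2) / d)   # log-uniform RoPE-style spectrum

# GRAPE-M on canonical coordinate pairs: product of per-plane Rodrigues rotations
I = np.eye(d)
G = I.copy()
for i, t in enumerate(theta):
    # the (a, b) ordering fixes the rotation sense to match standard RoPE
    G = G @ rodrigues(n * t, I[2 * i + 1], I[2 * i])

# Standard RoPE applies the same block-diagonal rotation via pairwise cos/sin tables
x = np.random.default_rng(2).standard_normal(d)
c, s = np.cos(n * theta), np.sin(n * theta)
x2 = x.reshape(-1, 2)
rope = np.stack([c * x2[:, 0] - s * x2[:, 1],
                 s * x2[:, 0] + c * x2[:, 1]], axis=1).ravel()
assert np.allclose(G @ x, rope)
```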

4. Extensions: Learned Commuting Subspaces and Non-Commuting Mixtures

Learned Commuting Subspaces

Multiplicative GRAPE naturally extends to learned multi-plane actions:

  • Let $B \in \mathrm{SO}(d)$ be a learned orthogonal basis, and select planes $U_i$ spanned by the column pairs $[e_{2i-1}, e_{2i}]$
  • Set $L_i = B U_i J U_i^\top B^\top$; the planes are mutually orthogonal, so $[L_i, L_j] = 0$
  • Aggregate $L = \sum_{i=1}^{d/2} \theta_i L_i$, so that $G(n) = \exp(nL) = \prod_i \exp(n\theta_i L_i)$

This allows learning per-head or per-plane spectra while maintaining $O(d)$ computational and memory cost per head. Each token's features are rotated independently in $d/2$ planes.
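The commuting construction amounts to rotating coordinates in a learned orthogonal basis. The following numpy illustration (with a random $B$ standing in for a learned one) checks that the generators commute and that the relative law survives the change of basis:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 6

# Learned orthogonal basis B (here: QR of a random matrix as a stand-in)
B, _ = np.linalg.qr(rng.standard_normal((d, d)))
theta = rng.uniform(0.1, 1.0, d // 2)      # learned per-plane frequencies

# L_i generates a rotation in the i-th learned plane span{B[:,2i], B[:,2i+1]}
Ls = []
for i in range(d // 2):
    u, v = B[:, 2 * i], B[:, 2 * i + 1]
    Ls.append(np.outer(u, v) - np.outer(v, u))

# Mutually orthogonal planes => the generators commute
assert np.allclose(Ls[0] @ Ls[1], Ls[1] @ Ls[0])

def apply_G(x, n):
    """Apply G(n) = exp(n * sum_i theta_i L_i): rotate plane-by-plane in the B basis."""
    y = (B.T @ x).reshape(-1, 2)           # change to the learned basis
    c, s = np.cos(n * theta), np.sin(n * theta)
    y = np.stack([c * y[:, 0] - s * y[:, 1],
                  s * y[:, 0] + c * y[:, 1]], axis=1)
    return B @ y.ravel()                   # change back

# Relative law survives the learned basis: <G(2)q, G(5)k> = <q, G(3)k>
q, k = rng.standard_normal(d), rng.standard_normal(d)
assert np.isclose(apply_G(q, 2) @ apply_G(k, 5), q @ apply_G(k, 3))
```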

Compact Non-Commuting Mixtures

To capture cross-subspace structure, GRAPE-M provides compact mixtures of $m$ non-commuting rank-2 generators:

  • $L_r = \sum_{j=1}^m \omega_j\, L(a_j, b_j)$, where the span of all $\{a_j, b_j\}$ has dimension $r = 2m$
  • Project features into the $r$-dimensional subspace using an orthonormal basis $E \in \mathbb{R}^{d\times r}$
  • Compress $L_r$ to $L_r^U \in \mathfrak{so}(r)$ and compute the real Schur decomposition $L_r^U = T\,(\oplus_{t=1}^m \theta_t J)\,T^\top$
  • Rotate in each $2$-plane by $n\theta_t$

The operation requires $O(rd)$ cost per head and enables non-commuting, coupled feature mixing beyond block-diagonal structure.
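The compression step can be illustrated in numpy. As an assumption on our part, the small $\mathfrak{so}(r)$ exponential is computed here by eigendecomposition rather than the real Schur form the text describes; the two agree, Schur merely exposes the per-plane angles $\theta_t$ explicitly:

```python
import numpy as np

rng = np.random.default_rng(4)
d, m = 12, 2                               # two non-commuting rank-2 generators
r = 2 * m
vecs = rng.standard_normal((r, d))         # rows: a_1, b_1, a_2, b_2
omega = rng.uniform(0.2, 0.8, m)

L_r = np.zeros((d, d))
for j in range(m):
    a, b = vecs[2 * j], vecs[2 * j + 1]
    L_r += omega[j] * (np.outer(a, b) - np.outer(b, a))

# Orthonormal basis E of the r-dimensional active subspace span{a_j, b_j}
E, _ = np.linalg.qr(vecs.T)                # d x r, orthonormal columns
L_small = E.T @ L_r @ E                    # compressed generator in so(r)
assert L_small.shape == (r, r)

def expm_skew(A):
    """Matrix exponential of a small skew-symmetric matrix via eigendecomposition."""
    w, V = np.linalg.eig(A)                # purely imaginary spectrum
    return (V @ np.diag(np.exp(w)) @ np.linalg.inv(V)).real

n = 3.0
# exp(n L_r) acts as exp(n L_small) inside the subspace and as identity outside
G_full = E @ expm_skew(n * L_small) @ E.T + (np.eye(d) - E @ E.T)

# Sanity checks: norm preservation and the one-parameter group law
x = rng.standard_normal(d)
assert np.isclose(np.linalg.norm(G_full @ x), np.linalg.norm(x))
G2 = E @ expm_skew(2 * n * L_small) @ E.T + (np.eye(d) - E @ E.T)
assert np.allclose(G_full @ G_full, G2)
```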

5. Pseudocode Overview

Two core algorithms underpin efficient implementation:

| Algorithm | Description | Complexity |
| --- | --- | --- |
| Algorithm 1 | Commuting multi-subspace GRAPE-M: apply $d/2$ planar rotations via an orthogonal $B$ | $O(d)$ per head |
| Algorithm 2 | Fast non-commuting via Schur: compress to the $r$-dimensional subspace, work in $\mathfrak{so}(r)$, apply planar rotations in mixed subspaces | $O(rd)$ per head |

Algorithmic Steps:

  1. Commuting case: rotate each coordinate pair by $\cos\theta$ and $\sin\theta$, parameterized by position and frequency.
  2. Non-commuting case: project to a lower-dimensional subspace, apply multiple coupled planar rotations, and project back.

6. Initialization, Stability, and Computational Considerations

  • Each plane generator is gauge-fixed with $\|a\| = \|b\| = 1$ and $a^\top b = 0$; scale is absorbed into $\omega$.
  • Frequencies $\omega$ or $\{\theta_i\}$ can be fixed (e.g., log-uniform) or learned per head/plane.
  • The closed form uses $f_1(z) = \sin z / z$ and $f_2(z) = (1-\cos z)/z^2$, with Taylor expansions guarding $z \to 0$ for numerical stability.
  • Differentiation with respect to $\omega$, $a$, $b$ yields analytic, stable gradients.
  • Complexity: GRAPE-M achieves $O(d)$ time, $O(d)$ memory, and $O(d)$ parameters per head, versus $O(d^3)$ time and $O(d^2)$ memory for full-matrix exponentials.
  • Compared to block-diagonal RoPE, GRAPE-M enables richer feature coupling without increasing leading-order cost.
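The guarded evaluation of $f_1$ and $f_2$ can be sketched as follows; thresholds and function names are illustrative, not from the paper. Writing the closed form as $\exp(\eta L) = I + \eta f_1(s\eta) L + \eta^2 f_2(s\eta) L^2$ keeps it well-defined even as $s \to 0$:

```python
import numpy as np

def f1(z, eps=1e-4):
    """sin(z)/z with a second-order Taylor guard near z = 0."""
    z = np.asarray(z, dtype=float)
    small = np.abs(z) < eps
    zs = np.where(small, 1.0, z)           # dummy value avoids 0/0 in the masked branch
    return np.where(small, 1.0 - z**2 / 6.0, np.sin(zs) / zs)

def f2(z, eps=1e-4):
    """(1 - cos(z))/z^2 with a Taylor guard near z = 0."""
    z = np.asarray(z, dtype=float)
    small = np.abs(z) < eps
    zs = np.where(small, 1.0, z)
    return np.where(small, 0.5 - z**2 / 24.0, (1.0 - np.cos(zs)) / zs**2)

def exp_rank2(eta, a, b, eps=1e-4):
    """Stable closed form: exp(eta L) = I + eta f1(s eta) L + eta^2 f2(s eta) L^2."""
    L = np.outer(a, b) - np.outer(b, a)
    s = np.sqrt(max((a @ a) * (b @ b) - (a @ b) ** 2, 0.0))
    z = s * eta
    return np.eye(len(a)) + eta * f1(z, eps) * L + eta**2 * f2(z, eps) * (L @ L)

# Degenerate generator (b parallel to a, so s = 0) degrades gracefully to the identity
a = np.ones(4)
assert np.allclose(exp_rank2(0.7, a, 2 * a), np.eye(4))
```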

7. Empirical Performance and Model Integration

In experiments on language modeling (nanoGPT/LLaMA base, FineWeb-Edu 100B), Multiplicative GRAPE shows:

  • Stable training curves, in contrast with the oscillations observed for RoPE.
  • Validation loss on par with or improved over RoPE for both medium (355M) and large (770M) models.
  • On downstream zero-shot evaluation tasks (ARC, BoolQ, HellaSwag, OBQA, PIQA, WinoGrande, SciQ), both GRAPE-A (additive) and GRAPE-M outperform RoPE and the Forgetting Transformer (FoX) in average accuracy (see Tables A.1–A.2 of (Zhang et al., 8 Dec 2025)).

A plausible implication is that the additional expressivity and strict group-theoretic structure of Multiplicative GRAPE can enhance both training stability and generalization in long-context models, while preserving exact relative positional laws and $O(d)$ efficiency.
