Selective Rotary Position Embedding
- Selective RoPE is a positional encoding method that modulates rotary angles via input-dependent gating to overcome fixed-angle rigidity.
- It unifies fixed rotary encodings with selective gating mechanisms, reducing inefficiency and improving attention on long sequences.
- Empirical results show that Selective RoPE reduces language model perplexity and improves recall tasks by dynamically adjusting phase and decay.
Selective Rotary Position Embedding (Selective RoPE) is a positional encoding technique for transformer-based architectures that generalizes the standard Rotary Position Embedding (RoPE) by introducing input-dependent, learnable modulation of rotation angles. This mechanism enables transformers to flexibly control position encoding strength at each sequence location, yielding improved performance in both language modeling and synthetic recall tasks, particularly in settings requiring robust sequence length extrapolation or precise memory of prior tokens. Selective RoPE unifies the relative positional expressiveness of fixed rotary encodings with the selective gating mechanisms seen in state-space models and gated linear transformers, integrating both orthonormal phase rotation and input-adaptive decay for information retention and flexible positional representation (Movahedi et al., 21 Nov 2025).
1. Limitations of Standard Rotary Position Embedding
Traditional RoPE—introduced in the RoFormer architecture—applies a block-diagonal sequence of planar rotations to query and key vectors in transformer attention, parameterized by frequency vectors $\theta$ (or $\omega$). Formally, for position $t$ and frequency $\theta_i$, each pair of hidden features $(x_{2i}, x_{2i+1})$ is rotated:
$$\begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix} \mapsto R(\theta_i t) \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}, \qquad R(\alpha) = \begin{pmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{pmatrix},$$
where $R(\alpha)$ is a $2 \times 2$ rotation matrix. In attention, this produces relative-position-sensitive dot products:
$$\langle R(\theta_i m)\, q,\; R(\theta_i n)\, k \rangle = q^\top R\big(\theta_i (n - m)\big)\, k.$$
RoPE thus encodes positions via fixed angular increments determined purely by token index and frequency schedule. While this enables efficient and generalizable relative encoding with minimal parameter and memory overhead, it is fundamentally limited in two ways (Su et al., 2021):
- Fixed-angle rigidity: The schedule of angular increments is static, unable to adapt to varying task or input requirements.
- Lack of selectivity: The entire feature space receives uniform positional treatment, with no mechanism to attenuate, gate, or suppress position encoding in specific heads, dimensions, or tokens as task demands vary.
Recent work empirically demonstrates substantial "dimension inefficiency" in RoPE: in long-range retrieval heads of major LLMs, the highest-frequency Rotary dimensions are systematically assigned negligible weight or entirely ignored, implying that a portion of the feature space is wasted for long-sequence modeling (Chiang et al., 16 Feb 2025). This motivates more adaptive, selective rotation schemes.
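The relative-position property above is easy to verify numerically. Below is a minimal NumPy sketch of standard (fixed-angle) RoPE; the function name and frequency schedule follow the RoFormer convention, and the check confirms that the rotated dot product depends only on the offset $m - n$:

```python
import numpy as np

def rope_rotate(x, pos, theta):
    """Apply standard RoPE: rotate each feature pair (x_{2i}, x_{2i+1})
    by the fixed angle pos * theta_i."""
    pairs = x.reshape(-1, 2)                 # (d/2, 2) feature pairs
    angles = pos * theta                     # fixed angular increments
    cos, sin = np.cos(angles), np.sin(angles)
    x0, x1 = pairs[:, 0], pairs[:, 1]
    return np.stack([x0 * cos - x1 * sin,
                     x0 * sin + x1 * cos], axis=1).reshape(-1)

d = 8
theta = 10000.0 ** (-np.arange(0, d, 2) / d)   # RoFormer frequency schedule
rng = np.random.default_rng(0)
q, k = rng.normal(size=d), rng.normal(size=d)

# The attention score depends only on the relative offset m - n:
s1 = rope_rotate(q, 7, theta) @ rope_rotate(k, 4, theta)    # offset 3
s2 = rope_rotate(q, 10, theta) @ rope_rotate(k, 7, theta)   # offset 3
assert np.allclose(s1, s2)
```

Note that the angles here are a pure function of the token index — exactly the "fixed-angle rigidity" that Selective RoPE is designed to relax.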
2. Mathematical Definition of Selective RoPE
Selective RoPE replaces RoPE's fixed angular schedule with input-dependent, learnable gating that modulates the total rotation applied to each head, channel, or token position. The essential transformation consists of:
- Base schedule: As in RoPE, a per-pair frequency vector $\theta = (\theta_1, \dots, \theta_{d/2})$ defines baseline angular increments per embedding dimension.
- Input-dependent gating: A learned function ("phase gate") modulates the rotation by producing per-token angular increments $\Delta\theta_t = f_\phi(x_t)$, so the accumulated angle becomes $\theta_t = \sum_{s \le t} \Delta\theta_s$ rather than the fixed $t\,\theta$.
- State transition: In the complex basis, rotation and decay are combined via the diagonal operator
$$a_t = g_t\, e^{i\theta_t},$$
where $g_t \in (0,1)$ is a real-valued memory-decay/forget gate and $e^{i\theta_t}$ is the rotational factor.
- Integration in transformers:
- For attention computation: queries and keys are rotated using the accumulated, input-dependent angles $\theta_t$ rather than fixed increments $t\,\theta$.
- In linear attention/state-space models: the recurrence
$$S_t = \mathrm{diag}\!\big(g_t\, e^{i\theta_t}\big)\, S_{t-1} + k_t v_t^\top$$
governs accumulation, with both content attenuation and position encoding now input-adaptive.
This formalism admits both per-channel and per-token selectivity. The gate can be realized with a shallow network (e.g., small MLP or 1D convolution) over the query $q_t$, optionally followed by a sigmoid for smooth attenuation.
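In the complex basis, the quantities above reduce to a few array operations. The following NumPy sketch uses random stand-ins for the learned gates (in a real model, $\Delta\theta_t$ and $g_t$ would be functions of the input $x_t$) and illustrates that the transition factor's magnitude is controlled entirely by the real decay gate, while rotation is norm-preserving:

```python
import numpy as np

rng = np.random.default_rng(1)
T, d2 = 6, 4                     # sequence length, number of feature pairs

# Random stand-ins for learned, input-dependent quantities:
delta_theta = rng.normal(size=(T, d2)) * 0.1          # increments Δθ_t
g = 1.0 / (1.0 + np.exp(-rng.normal(size=(T, d2))))   # forget gate g_t in (0, 1)

theta = np.cumsum(delta_theta, axis=0)   # accumulated angles θ_t = Σ_{s≤t} Δθ_s
a = g * np.exp(1j * theta)               # diagonal transition factors a_t = g_t e^{iθ_t}

# |e^{iθ}| = 1, so all attenuation comes from the real gate g_t:
assert np.allclose(np.abs(a), g)
```

This separation — unit-modulus rotation for position, real gate for forgetting — is what lets Selective RoPE encode position without destroying stored content.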
3. Theoretical Underpinnings and Implicit Selectivity
Selective RoPE draws theoretical justification from the observation that standard softmax attention kernels (even without explicit position encoding) implement a random Fourier feature (RFF) expansion. Specifically:
$$\exp(q^\top k) = e^{(\|q\|^2 + \|k\|^2)/2}\; \mathbb{E}_{\omega}\!\left[ e^{i\omega^\top q}\, \overline{e^{i\omega^\top k}} \right],$$
with $\omega \sim \mathcal{N}(0, I_d)$. Thus, softmax transformers inherently perform input-dependent, frequency-specific rotations on pairwise query-key features. RoPE simply "hard-codes" these phase shifts.
Conversely, linear transformers and diagonal state-space models introduce input-dependent attenuation (decay) but often lack a systematic, learnable phase (rotation) component. Selective RoPE unites both: learned rotations for phase (position) and decay for forgetting, matching the DFT plus exponential windowing paradigm familiar in classical signal processing. Both ingredients are necessary to combat spectral leakage and enable robust, selective recall (Movahedi et al., 21 Nov 2025).
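The random Fourier feature identity behind this argument can be checked directly by Monte Carlo: since $\mathbb{E}_\omega[\cos(\omega^\top(q-k))] = e^{-\|q-k\|^2/2}$ for $\omega \sim \mathcal{N}(0, I_d)$, the softmax kernel factors as a norm term times an expectation over random phase rotations. A small numerical sanity check:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
q = rng.normal(size=d) * 0.3
k = rng.normal(size=d) * 0.3

# Softmax kernel via random Fourier features:
# exp(q·k) = e^{(|q|² + |k|²)/2} · E_ω[cos(ω·(q - k))],  ω ~ N(0, I_d)
omega = rng.normal(size=(200_000, d))
mc = np.mean(np.cos(omega @ (q - k)))
approx = np.exp((q @ q + k @ k) / 2) * mc

exact = np.exp(q @ k)
assert abs(approx - exact) / exact < 0.02
```

Each sampled $\omega$ contributes a frequency-specific phase rotation of $q$ and $k$ — the same primitive that Selective RoPE makes explicit and learnable.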
4. Implementation Methods and Variants
Selective RoPE is implemented by augmenting each rotary embedding application with a learned, input-dependent modulator. Typical components include:
Phase gate $g_\phi$: A small network mapping $q_t$ to scalar weights in $(0,1)$ (via a sigmoid), optionally stabilized by a bias term.
Cumulative update:
- Compute incremental $\Delta\theta_t = f_\phi(x_t)$ or similar.
- Accumulate $\theta_t = \sum_{s \le t} \Delta\theta_s$ over the sequence.
- Optionally, apply the phase gate: $\theta_t \leftarrow g_\phi(q_t) \odot \theta_t$.
- Rotary application: Replace the standard fixed angle $t\,\theta$ in RoPE with $\theta_t$ per position/dimension.
- Integration with forget gate: In Gated Linear Attention (GLA), combine the rotation with a learned decay, yielding the transition factor $a_t = g_t\, e^{i\theta_t}$.
Pseudocode for one variant:
```python
def selective_rope(q, k):
    phi = conv1d(W_phi * q)            # produce Δθ_t
    theta = cumsum(phi, dim=sequence)  # accumulate angles over the sequence
    g = sigmoid(W_g * q)               # phase gate
    theta = theta * g
    # Standard RoPE application with new θ
    q_rot, k_rot = apply_rope(q, k, theta)
    return q_rot, k_rot
```
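For the linear-attention variant, the recurrence from Section 2 can be sketched concretely. The step function below is illustrative (random stand-ins replace the learned gate networks, and `selective_gla_step` is a hypothetical name, not the paper's API); it shows how the complex diagonal transition rotates and attenuates the running state before adding the new key-value outer product:

```python
import numpy as np

def selective_gla_step(S, k_t, v_t, g_t, theta_t):
    """One step of the gated-linear-attention recurrence with a complex
    diagonal transition: S_t = diag(g_t e^{iθ_t}) S_{t-1} + k_t v_t^T."""
    a_t = g_t * np.exp(1j * theta_t)          # rotation + decay per channel
    return a_t[:, None] * S + np.outer(k_t, v_t)

rng = np.random.default_rng(2)
d_k, d_v, T = 4, 3, 5
S = np.zeros((d_k, d_v), dtype=complex)
for t in range(T):
    k_t = rng.normal(size=d_k).astype(complex)
    v_t = rng.normal(size=d_v)
    g_t = 1 / (1 + np.exp(-rng.normal(size=d_k)))   # forget gate in (0, 1)
    theta_t = rng.normal(size=d_k) * 0.1            # gated angle increment stand-in
    S = selective_gla_step(S, k_t, v_t, g_t, theta_t)

# Read out with a query; the real part gives the attention output.
q_t = rng.normal(size=d_k).astype(complex)
o_t = (q_t.conj() @ S).real
assert o_t.shape == (d_v,)
```

Because rotation is applied inside the recurrence, the relative angle between a query at time $t$ and a key stored at time $s$ accumulates automatically as $\sum_{u=s+1}^{t} \Delta\theta_u$.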
5. Empirical Results and Comparisons
Selective RoPE demonstrates empirical superiority over fixed-angle RoPE across a wide range of synthetic and real-world tasks:
- Recall-oriented synthetic tasks: On tests such as Multi-Query Associative Recall (MQAR), compress/fuzzy recall, in-context recall, and copying, Selective RoPE used in Gated Linear Attention (GLA) narrows or closes the gap to softmax-transformer performance on long sequences, especially beyond the training horizon. Copying and state-tracking on permutation groups (S₂, A₃) are reliably solved only when selective rotations are enabled (Movahedi et al., 21 Nov 2025).
- Language modeling: On 370M-parameter models pretrained on FineWeb (35B tokens), GLA with Selective RoPE reduces perplexity (e.g., 23.96 → 20.12) versus fixed RoPE. It also improves zero-shot downstream benchmark accuracy (e.g., +0.4–1.0 percentage points on LAMBADA, PIQA, HellaSWAG, ARC-Easy/Challenge).
- Length extrapolation: Fixed RoPE suffers from rapid quality degradation when evaluating beyond the maximal training length. By contrast, Selective RoPE dynamically learns how much rotation to apply at each position, thereby maintaining stable recall and modeling ability in long-context or extrapolated-length settings.
| Model | Perplexity (Fixed RoPE) | Perplexity (Selective RoPE) | Downstream Acc Δ |
|---|---|---|---|
| GLA (370M) | 23.96 | 20.12 | +0.4–1.0 pts |
- Ablations: Introducing phase-gate networks and additive bias terms in $g_\phi$ stabilizes training and further improves downstream performance, especially at high learning rates.
6. Broader Impact, Opportunities, and Limitations
Selective RoPE enables new control over positional encoding in transformers:
- Unified paradigm: It integrates the benefits of both fixed, rigid rotations (RoPE) and adaptive, selective forgetting (SSM/GLA), permitting norm-preserving, input-driven positional flexibility and context-dependent memory.
- Selectivity and efficiency: Through input-dependent gating, the transformer can modulate which positions or frequencies receive strong rotation, deactivate unnecessary rotary dimensions (as motivated by dimension inefficiency studies (Chiang et al., 16 Feb 2025)), and thus potentially reduce redundancy in parameter usage.
- Robustness to extrapolation: By learning how much rotation to apply, models equipped with Selective RoPE are more robust to untrained sequence lengths and various sequence-processing tasks.
- Open questions: Formal understanding of sequence-length generalization with learned phase gates remains incomplete. Extensions to richer, structured gating (e.g., low-rank, spatially-aware) and joint optimization with other adaptive transformer modules (e.g., MoE, adaptive-depth networks) present active research avenues.
Future research directions include formal characterizations of extrapolation under selective rotations, tighter coupling of decay and phase in memory models, and exploration of dynamic RoPE selection at the per-head or per-dimension level, following the observed dimension-inefficiency in fixed RoPE (Movahedi et al., 21 Nov 2025, Chiang et al., 16 Feb 2025).