Relative Position Encoding in NLP
- Relative position encoding is a technique that parameterizes attention scores based on token distance, enhancing model compositionality and shift equivariance.
- It employs methods such as Shaw-style bias, rotary encoding, and tri-linear interactions to efficiently capture dependency structures in long sequences.
- Empirical results demonstrate improved BLEU scores and better handling of code understanding, sentiment analysis, and long-context tasks.
Relative position encoding constitutes a family of positional modeling strategies for attention-based neural architectures that parameterize attention scores and representations directly or indirectly in terms of the signed distance between tokens, rather than their absolute sequence index. In contrast to absolute encodings, which tie model behavior to fixed positional indices, relative encodings align model capacity with dependency structures found in natural language by controlling inductive biases for locality, compositionality, extrapolation to novel sequence lengths, and invariance to input shifting. This paradigm underpins significant empirical and theoretical advances in large-scale neural machine translation, code understanding, sentiment analysis, and long-context natural language inference.
1. Mathematical Formulations and Core Mechanisms
The canonical relative position encoding scheme in Transformer-based architectures is due to Shaw et al. (Shaw et al., 2018): for a sequence $x = (x_1, \ldots, x_n)$, multi-head self-attention is modified such that each query–key pair $(i, j)$ is associated with a learned bias vector $a_{ij}^K$ (for attention scores) and $a_{ij}^V$ (for value aggregation), drawn from clipped relative-distance tables $w^K$ and $w^V$. The attention score is

$$e_{ij} = \frac{(x_i W^Q)\,(x_j W^K + a_{ij}^K)^\top}{\sqrt{d_z}},$$

and the output is

$$z_i = \sum_{j=1}^{n} \alpha_{ij}\,\bigl(x_j W^V + a_{ij}^V\bigr),$$

with $a_{ij}^K = w^K_{\mathrm{clip}(j-i,\,k)}$, $a_{ij}^V = w^V_{\mathrm{clip}(j-i,\,k)}$, and $\mathrm{clip}(x, k) = \max(-k, \min(k, x))$. Clipping to a maximum relative distance $k$ is necessary to control parameter growth and regularize model generalization for long contexts.
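The mechanism above can be sketched in a few lines of NumPy; the dimensions, seed, and random weights here are illustrative, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def clip(x, k):
    """clip(x, k) = max(-k, min(k, x))"""
    return max(-k, min(k, x))

def shaw_attention(x, Wq, Wk, Wv, wK, wV, k):
    """Single-head attention with Shaw-style relative biases.
    x: (n, d) inputs; wK, wV: (2k+1, d) tables indexed by clipped j - i."""
    n, d = x.shape
    q, key, val = x @ Wq, x @ Wk, x @ Wv
    # idx[i, j] looks up the table row for clip(j - i, k), shifted to [0, 2k]
    idx = np.array([[clip(j - i, k) + k for j in range(n)] for i in range(n)])
    aK, aV = wK[idx], wV[idx]                            # (n, n, d)
    # e_ij = q_i . (k_j + a_ij^K) / sqrt(d)
    scores = np.einsum('id,ijd->ij', q, key[None] + aK) / np.sqrt(d)
    alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)
    # z_i = sum_j alpha_ij (v_j + a_ij^V)
    return np.einsum('ij,ijd->id', alpha, val[None] + aV)

n, d, k = 5, 4, 2
x = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
wK = rng.standard_normal((2 * k + 1, d))
wV = rng.standard_normal((2 * k + 1, d))
z = shaw_attention(x, Wq, Wk, Wv, wK, wV, k)
```

Production implementations avoid materializing the $(n, n, d)$ bias tensors; this version keeps them explicit to mirror the equations.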
Extensions include tri-linear interaction schemes (Huang et al., 2020), which generalize absolute encodings by adding explicit query–position and key–position interaction terms:

$$e_{ij} = \frac{(x_i W^Q)(x_j W^K)^\top + (x_i W^Q)\,a_{ij}^\top + (x_j W^K)\,a_{ij}^\top}{\sqrt{d_z}}.$$
For rotary encoding (RoPE) (Su et al., 2021), relative position information is realized via block-diagonal rotation matrices (equivalently, complex-valued phase factors) applied to query and key vectors:

$$q_m = R_{\Theta, m}\, W^Q x_m, \qquad k_n = R_{\Theta, n}\, W^K x_n,$$

with $R_{\Theta, m}$ rotating each 2D subspace of the vector by an angle $m\theta_i$, $\theta_i = 10000^{-2i/d}$. The resulting dot product evaluates to $q_m^\top k_n = (W^Q x_m)^\top R_{\Theta, n-m}\, (W^K x_n)$, ensuring that attention depends only on the relative offset $n - m$.
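A minimal NumPy sketch of the rotation (using the standard base-10000 frequency schedule) makes the offset-only dependence easy to verify numerically:

```python
import numpy as np

def rope(v, pos, base=10000.0):
    """Rotate each 2D subspace of v by pos * theta_i (RoPE-style)."""
    d = v.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per 2D subspace
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    v1, v2 = v[..., 0::2], v[..., 1::2]
    out = np.empty_like(v)
    out[..., 0::2] = v1 * cos - v2 * sin
    out[..., 1::2] = v1 * sin + v2 * cos
    return out

rng = np.random.default_rng(1)
q, k = rng.standard_normal(8), rng.standard_normal(8)
# The score depends only on the offset n - m: both pairs below have offset 4.
s1 = rope(q, 3) @ rope(k, 7)
s2 = rope(q, 10) @ rope(k, 14)
```

Because rotations are orthogonal in each 2D subspace, `s1` and `s2` agree to floating-point precision.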
Advanced models incorporate multi-level structure:
- Bilevel Positional Encoding (He et al., 2024) splits absolute intra-segment and relative inter-segment positional codes, facilitating hierarchical decomposition for both local and global attention.
- 3D Rotary Position Encoding (Ma et al., 2024) generalizes RoPE using a Bloch-Sphere mechanism, encoding both within-chunk and chunk-to-chunk distances with multi-phase rotations for controllable long-term decay.
2. Parameterization, Implementation, and Efficiency
Relative encodings parameterize a set of vectors indexed by relative distance (typically clipped). For Shaw-style models (Shaw et al., 2018), the two tables $w^K, w^V \in \mathbb{R}^{(2k+1) \times d_z}$ are shared across heads, minimizing memory overhead ($O(n^2 d_z)$ rather than $O(h\,n^2 d_z)$ activation storage for sequence length $n$ and $h$ heads).
Efficient computation leverages decomposition of the attention score into content–content and content–position terms, allowing exploitation of fast batched matrix multiplies. In practice, the additional computational cost versus absolute encoding is minimal (Shaw et al. report roughly a 7% decrease in training steps per second) and requires no change to batch size or model size.
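The decomposition can be checked directly: contracting against the full $(n, n, d)$ bias tensor agrees with a content–content matmul plus a small content–position matmul followed by an index gather (a sketch with arbitrary dimensions, unnormalized scores):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, k = 6, 4, 3
q, key = rng.standard_normal((n, d)), rng.standard_normal((n, d))
wK = rng.standard_normal((2 * k + 1, d))
# idx[i, j] = clip(j - i, k) shifted into table range [0, 2k]
idx = np.clip(np.arange(n)[None, :] - np.arange(n)[:, None], -k, k) + k

# Naive: materialize the (n, n, d) bias tensor and contract.
naive = np.einsum('id,ijd->ij', q, key[None] + wK[idx])

# Decomposed: one (n, n) content-content matmul, plus a content-position
# matmul against the small (2k+1, d) table, gathered by clipped distance.
cc = q @ key.T                                   # content-content
cp_all = q @ wK.T                                # (n, 2k+1): q_i . w_r for all r
cp = np.take_along_axis(cp_all, idx, axis=1)     # pick r = clip(j - i, k)
decomposed = cc + cp
```

The decomposed path touches only an $(n, 2k+1)$ intermediate, which is what makes the batched-matmul implementation cheap.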
Linear attention and long-context settings require encodings that compose with feature-map kernels. LRPE (Qin et al., 2023) constructs a canonical family of unitary position maps $W_t$ whose decomposability property $W_t^{\ast} W_s = W_{s-t}$ lets the relative bias be folded into the query and key transformations before the feature map. Spectral or permutation-based instantiations keep the runtime linear in sequence length $n$ at fixed hidden dimension $d$.
Streaming and FlashAttention-compatibility requirements have led to function-based approaches (HyPE (Angelotti, 2023), PermuteFormer (Chen, 2021)) that encode bias via low-rank or group-theoretic structures (sinh maps, permutation powers), avoiding explicit $O(n^2)$ bias masks entirely.
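The group-theoretic idea can be illustrated with a toy NumPy check (this omits PermuteFormer's per-position feature scaling and shows only the relative-offset property of permutation powers):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 6
perm = rng.permutation(d)          # one fixed permutation of feature indices

def apply_perm_power(v, t, perm):
    """Apply the permutation t times, i.e. the matrix power P^t, to v."""
    out = v
    for _ in range(t):
        out = out[perm]
    return out

q, k = rng.standard_normal(d), rng.standard_normal(d)
# Permutation matrices are orthogonal, so <P^t q, P^s k> = <q, P^(s-t) k>:
# the score depends only on the offset s - t. Both pairs below have offset 3.
s1 = apply_perm_power(q, 2, perm) @ apply_perm_power(k, 5, perm)
s2 = apply_perm_power(q, 4, perm) @ apply_perm_power(k, 7, perm)
```

Because the permuted queries and keys can be computed position-by-position, this style of encoding streams naturally and never builds an $n \times n$ bias mask.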
3. Empirical Results and Benchmark Improvements
Relative position encodings yield robust improvements over absolute positional codes in high-resource translation (Shaw et al., 2018): e.g., +1.3 BLEU for WMT'14 En→De and +0.3 BLEU for En→Fr. In code-editing and code-mixed language modeling, models with dynamic and span-aware biases (PESTO, DTrans (Qi et al., 2022)) consistently outperform absolute and vanilla relative baselines, showing increases in exact-match and macro-F1 across tailored benchmarks.
Relative position prediction as a self-supervised objective (Brüel-Gabrielsson et al., 2022) yields denser and more topologically informative labels than MLM/CLM, improving GLUE task accuracy by up to 2.9 points (MNLI), and scales gracefully with longer sequences.
Bilevel and Bloch-sphere positional schemes enable Transformers to extrapolate to 28k tokens or long-document tasks with minimal drop in perplexity, outperforming linear-bias and vanilla RoPE baselines (He et al., 2024, Ma et al., 2024). Empirically, BiPE and 3D-RPE achieve significant gains in zero-shot and fine-tuned regimes for QA, summarization, and code completion benchmarks.
4. Theoretical Properties, Inductive Biases, and Extrapolation
The central theoretical advantage of relative position encoding is shift-equivariance and invariance to sequence length. By decoupling model capacity from absolute indices, the network can generalize local patterns and dependencies irrespective of their position, aligning with core linguistic and cognitive principles.
Tri-linear interaction (e.g., (Huang et al., 2020)) recovers absolute encoding as a special case, but with more robust inductive generalization beyond training length, owing to parameterization over relative offsets $j - i$ rather than fixed absolute indices $i$.
Hierarchical schemes (He et al., 2024) reduce the embedding-size requirement for automaton simulation and complex structural tasks: the width needed under BiPE grows sublinearly in the total number of states of the task automaton, versus linearly under absolute encoding.
Rotary and 3D-Rotary designs ensure graceful decay of attention over distance and preserve positional resolution under context-length interpolation. Spectral and group-theoretic encodings (LRPE, PermuteFormer) deliver mathematical guarantees on shift-invariance and efficiency (Qin et al., 2023, Chen, 2021).
Orthogonal polynomial encodings (PoPE (Aggarwal, 2024)) provide non-periodic, decorrelated positional vectors, avoiding aliasing and spurious bias of high-dimensional sinusoids, and enable rapid convergence and sharper long-range dependency modeling.
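The decorrelation claim can be illustrated with NumPy's Legendre utilities: positional vectors built from Legendre polynomials are orthogonal under the natural inner product on $[-1, 1]$ (mapping positions into $[-1, 1]$ is an assumption of this sketch, not necessarily PoPE's exact construction):

```python
import numpy as np
from numpy.polynomial import legendre

def pope_vector(p, d):
    """Positional vector for position p in [-1, 1]: (P_0(p), ..., P_{d-1}(p))."""
    return np.array([legendre.legval(p, [0] * i + [1]) for i in range(d)])

d = 5
# Gauss-Legendre quadrature integrates the pairwise products P_i * P_j exactly.
nodes, weights = legendre.leggauss(16)
V = np.stack([pope_vector(p, d) for p in nodes])        # (16, d)
gram = (V * weights[:, None]).T @ V                     # weighted Gram matrix
off_diag = gram - np.diag(np.diag(gram))                # ~0: decorrelated
```

Unlike sinusoids, these vectors are non-periodic, so distinct positions cannot alias onto the same encoding.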
5. Extensions to Relation-aware Attention and Graphs
Relation-aware self-attention (Shaw et al., 2018) generalizes relative encoding beyond linear distances: each token pair $(i, j)$ can carry arbitrary edge labels (e.g., dependency parse, coreference, knowledge-graph relations), with learned bias vectors for each relation type:

$$e_{ij} = \frac{(x_i W^Q)\,(x_j W^K + a_{ij}^K)^\top}{\sqrt{d_z}}, \qquad a_{ij}^K = r^K_{\ell(i,j)},$$

where $\ell(i,j)$ is the relation label of the edge from $i$ to $j$.
This construction enables scalable full-graph attention in NLP and vision, allowing parallel computation for edge-labeled structures with negligible overhead.
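A sketch of relation-aware score computation, with a hypothetical integer relation-label matrix `rel` standing in for, e.g., dependency-parse edges (names and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

def relation_aware_scores(x, Wq, Wk, rK, rel):
    """Attention scores with per-relation key biases.
    rel[i, j]: integer relation label for edge (i, j);
    rK: (num_relations, d) learned bias vectors, one per relation type."""
    n, d = x.shape
    q, key = x @ Wq, x @ Wk
    aK = rK[rel]                                        # (n, n, d) via gather
    return np.einsum('id,ijd->ij', q, key[None] + aK) / np.sqrt(d)

n, d, R = 4, 6, 3
x = rng.standard_normal((n, d))
Wq, Wk = rng.standard_normal((d, d)), rng.standard_normal((d, d))
rK = rng.standard_normal((R, d))
rel = rng.integers(0, R, size=(n, n))   # toy labels; real labels come from a parser
scores = relation_aware_scores(x, Wq, Wk, rK, rel)
```

Swapping the distance-indexed table for a relation-indexed one is the entire change relative to the Shaw-style formulation; everything else in the attention stack is untouched.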
Recent work incorporates dynamic and event-based resetting of position codes, span-based masking, and chunked multidimensional encodings, enabling adaptation to code-mixed languages, conversation turns, topic shifts, and document-level tasks (PESTO, BiPE, DTrans).
6. Limitations, Biases, and Practical Recommendations
Empirical studies indicate that relative position encodings can still incur performance drops on token classification if model or task biases tether entity prediction to position; see Ben Amor et al. (Amor et al., 2023) on systematic F1 degradation at later sequence positions. Learned relative-position embeddings may be less robust for long-range token classification than absolute encodings in these regimes.
Practical recommendations include:
- Sharing relative embedding tables across heads for memory control.
- Clipping distance indices to a small maximum distance $k$ for stability (distance resolution beyond a modest $k$ yields negligible gains).
- When modeling complex spans or events, combine dynamic position features (e.g., switching points) with standard relative encoding (Ali et al., 2021).
- For linear attention models, use only decomposable or group-structured relative encodings (Qin et al., 2023, Chen, 2021).
Extrapolation to longer contexts benefits from chunked or bilevel relative encoding, as standard schemes may decay too rapidly. Orthogonal polynomial designs (PoPE) offer decorrelation and recurrence-driven structure for both absolute and relative position modeling with faster convergence and improved BLEU or downstream accuracy (Aggarwal, 2024).
7. Future Directions
Continued progress centers on:
- Unified frameworks for multi-scale relative encoding (segment, chunk, document).
- Streaming and FlashAttention-2–compatible functional biases (HyPE (Angelotti, 2023)).
- Generalization to arbitrary graph structures and event-driven resets.
- Empirical benchmarking under extreme context extrapolation (contexts far beyond training length).
- Combining orthogonal polynomial encodings with multi-phase rotary and linear decomposable designs for scalable language and multimodal models.
Relative position encoding remains foundational for flexible, scalable, and robust sequence modeling in NLP and related domains. Its principled mathematical constructions and empirical generalization properties continue to drive advances in LLM understanding and capacity.