
Relative Position Encoding in NLP

Updated 31 January 2026
  • Relative position encoding is a technique that parameterizes attention scores based on token distance, enhancing model compositionality and shift equivariance.
  • It employs methods such as Shaw-style bias, rotary encoding, and tri-linear interactions to efficiently capture dependency structures in long sequences.
  • Empirical results demonstrate improved BLEU scores and better handling of code understanding, sentiment analysis, and long-context tasks.

Relative position encoding constitutes a family of positional modeling strategies for attention-based neural architectures that parameterize attention scores and representations directly or indirectly in terms of the signed distance between tokens, rather than their absolute sequence index. In contrast to absolute encodings, which tie model behavior to fixed positional indices, relative encodings align model capacity with dependency structures found in natural language by controlling inductive biases for locality, compositionality, extrapolation to novel sequence lengths, and invariance to input shifting. This paradigm underpins significant empirical and theoretical advances in large-scale neural machine translation, code understanding, sentiment analysis, and long-context natural language inference.

1. Mathematical Formulations and Core Mechanisms

The canonical relative position encoding scheme in Transformer-based architectures is due to Shaw et al. (2018): for a sequence $\{x_i\}_{i=1}^n$, multi-head self-attention is modified such that each query–key pair $(i,j)$ is associated with a learned bias vector $a^K_{ij}$ (for attention scores) and $a^V_{ij}$ (for value aggregation), drawn from clipped relative-distance tables $w^K_{-k},\dots,w^K_{k}$ and $w^V_{-k},\dots,w^V_{k}$. The attention score is

e_{ij} = \frac{(x_i W^Q)\left(x_j W^K + a^K_{ij}\right)^{T}}{\sqrt{d_z}}, \qquad a^K_{ij} = w^K_{\text{clip}(j-i,\,k)}

and the output is

y_i = \sum_{j=1}^{n} \alpha_{ij} \left( x_j W^V + a^V_{ij} \right)

with $\alpha_{ij} = \text{softmax}_j(e_{ij})$. Clipping is necessary to control parameter growth and regularize model generalization for long contexts.
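The score and output equations above can be followed term by term in code. The sketch below is a minimal single-head NumPy implementation with hypothetical shapes and no batching, intended only to make the bias-table indexing concrete:

```python
import numpy as np

def shaw_relative_attention(x, Wq, Wk, Wv, wK, wV, k):
    """Single-head self-attention with Shaw-style relative position biases.

    x:  (n, d) token representations
    wK: (2k+1, d_z) key-side bias table, index 0 <-> distance -k
    wV: (2k+1, d_z) value-side bias table
    """
    q, key, val = x @ Wq, x @ Wk, x @ Wv            # (n, d_z) each
    n, d_z = q.shape

    # clip(j - i, k): relative distances clipped to [-k, k], shifted to table indices
    idx = np.clip(np.arange(n)[None, :] - np.arange(n)[:, None], -k, k) + k
    aK, aV = wK[idx], wV[idx]                        # (n, n, d_z) gathered biases

    # e_ij = q_i . (k_j + a^K_ij) / sqrt(d_z)
    e = (q @ key.T + np.einsum("id,ijd->ij", q, aK)) / np.sqrt(d_z)
    alpha = np.exp(e - e.max(axis=-1, keepdims=True))
    alpha /= alpha.sum(axis=-1, keepdims=True)       # row-wise softmax

    # y_i = sum_j alpha_ij (v_j + a^V_ij)
    return alpha @ val + np.einsum("ij,ijd->id", alpha, aV)
```

With zero bias tables the computation reduces exactly to vanilla scaled dot-product attention, which is a convenient sanity check.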

Extensions include tri-linear interaction schemes (Huang et al., 2020), which generalize absolute encodings by adding explicit $q_i \cdot k_j$, $q_i \cdot a_{ij}$, and $k_j \cdot a_{ij}$ terms:

e_{ij} = \frac{q_i \cdot k_j + q_i \cdot a_{ij} + k_j \cdot a_{ij}}{\sqrt{d_z}}

For rotary encoding (RoPE) (Su et al., 2021), relative position information is realized via complex block-diagonal rotation matrices applied to query and key vectors:

\tilde{q}_i = R_{\Theta,i}\, q_i, \qquad \tilde{k}_j = R_{\Theta,j}\, k_j

with $R_{\Theta,m}$ rotating each 2D subspace of the vector by $m\theta_t$. The resulting dot product evaluates to $\tilde{q}_i^{T}\tilde{k}_j = q_i^{T} R_{\Theta,\,j-i}\, k_j$, ensuring that attention depends only on the relative offset $j-i$.
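The relative-offset property of the rotation can be checked numerically. A minimal sketch, assuming the standard RoPE frequency schedule $\theta_t = 10000^{-2t/d}$ and an even-dimensional, unbatched vector:

```python
import numpy as np

def rope(v, pos, base=10000.0):
    """Rotate each 2D subspace of v by pos * theta_t (RoPE).

    v: (d,) with d even; subspace t pairs dimensions (2t, 2t+1)
       and uses frequency theta_t = base**(-2t/d).
    """
    d = v.shape[0]
    theta = base ** (-np.arange(d // 2) * 2.0 / d)
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x, y = v[0::2], v[1::2]              # the two coordinates of each subspace
    out = np.empty_like(v)
    out[0::2] = x * cos - y * sin        # 2D rotation applied per subspace
    out[1::2] = x * sin + y * cos
    return out
```

Shifting both positions by the same amount leaves the query–key dot product unchanged, which is exactly the shift-invariance discussed above.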

Advanced models additionally incorporate multi-level structure, combining intra-segment and inter-segment positional components (e.g., the bilevel and chunked schemes discussed in later sections).

2. Parameterization, Implementation, and Efficiency

Relative encodings parameterize a set of vectors indexed by relative distance (typically clipped). For Shaw-style models (Shaw et al., 2018), the two tables $w^K$ and $w^V$ are shared across heads, minimizing memory overhead ($O(n^2 d_z)$ rather than $O(h n^2 d_z)$ for $h$ heads and sequence length $n$).

Efficient computation leverages decomposition of the attention score into content–content and content–position terms, allowing exploitation of fast batched matrix multiplies. In practical settings, the additional computational cost versus absolute encoding is minimal (e.g., a roughly 7% slowdown in training throughput reported by Shaw et al.) and incurs no batch-size or model-size increase.
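That decomposition can be made concrete. The hypothetical single-head sketch below computes the content–content term as one matmul and the content–position term as a single $(n \times (2k+1))$ matmul followed by a gather, rather than materializing an $(n, n, d_z)$ bias tensor:

```python
import numpy as np

def rel_scores_fast(q, key, wK, k):
    """Attention logits split into content-content and content-position terms.

    q, key: (n, d_z); wK: (2k+1, d_z) shared relative bias table.
    """
    n, d_z = q.shape
    content = q @ key.T                              # (n, n) content-content
    qa = q @ wK.T                                    # (n, 2k+1) q against all distances
    idx = np.clip(np.arange(n)[None, :] - np.arange(n)[:, None], -k, k) + k
    position = np.take_along_axis(qa, idx, axis=1)   # gather per clipped distance
    return (content + position) / np.sqrt(d_z)
```

The gathered result matches the naive formulation that expands the full per-pair bias tensor, but touches only $O(n(n + k d_z))$ memory.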

Linear attention and long-context settings require encodings that compose with feature-map kernels. LRPE (Qin et al., 2023) constructs a canonical family of unitary position transforms $M_i$ for queries and keys, whose decomposability ($M_i^{*} M_j$ depending only on $j-i$) enables folding the relative bias into the query and key transformations before the feature map. Spectral or permutation-based designs ensure linear-time attention in sequence length $n$.

Streaming and FlashAttention-compatibility requirements have led to function-based approaches (HyPE (Angelotti, 2023), PermuteFormer (Chen, 2021)) that encode bias via low-rank or group-theoretic structures (sinh maps, permutation powers), avoiding explicit $n \times n$ bias masks entirely.
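The group-theoretic idea can be illustrated with permutation powers: applying the $i$-th power of a fixed feature permutation to queries and the $j$-th power to keys makes the post-feature-map dot product depend only on $j-i$, since an elementwise feature map commutes with any permutation. This is a toy sketch in the spirit of PermuteFormer, not its actual parameterization:

```python
import numpy as np

def phi(v):
    """Elementwise positive feature map for linear attention (elu(v) + 1)."""
    return np.where(v > 0, v + 1.0, np.exp(v))

def permute_pos(v, perm, pos):
    """Apply a fixed feature permutation `pos` times (position-dependent)."""
    out = v
    for _ in range(pos):
        out = out[perm]
    return out
```

Because powers of a single permutation commute, $\phi(P^i q) \cdot \phi(P^j k) = \phi(q)^{T} P^{j-i} \phi(k)$, so shifting both positions by the same amount leaves the score unchanged.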

3. Empirical Results and Benchmark Improvements

Relative position encodings yield robust improvements over absolute positional codes in high-resource translation (Shaw et al., 2018): e.g., +1.3 BLEU for WMT'14 En→De and +0.3 BLEU for En→Fr. In code-editing and code-mixed language modeling, models with dynamic and span-aware biases (PESTO, DTrans (Qi et al., 2022)) consistently outperform absolute and vanilla relative baselines, showing increases in exact match and macro-F1 across tailored benchmarks.

Relative position prediction as a self-supervised objective (Brüel-Gabrielsson et al., 2022) yields denser and more topologically informative labels than MLM/CLM, improving GLUE task accuracy by up to 2.9 points (MNLI), and scales gracefully with longer sequences.

Bilevel and Bloch-sphere positional schemes enable Transformers to extrapolate to contexts of 8k tokens and beyond, and to long-document tasks, with minimal drop in perplexity, outperforming linear-bias and vanilla RoPE baselines (He et al., 2024, Ma et al., 2024). Empirically, BiPE and 3D-RPE achieve significant gains in zero-shot and fine-tuned regimes for QA, summarization, and code completion benchmarks.

4. Theoretical Properties, Inductive Biases, and Extrapolation

The central theoretical advantage of relative position encoding is shift-equivariance and invariance to sequence length. By decoupling model capacity from absolute indices, the network can generalize local patterns and dependencies irrespective of their position, aligning with core linguistic and cognitive principles.

Tri-linear interaction (Huang et al., 2020) recovers absolute encoding as a special case, but with robust inductive generalization beyond training length due to parameterization over relative offsets $j-i$ rather than fixed absolute indices.

Hierarchical schemes (He et al., 2024) reduce the embedding-size requirement for automaton simulation and complex structural tasks compared with absolute encoding, with the saving growing in the total number of states of the task automaton.

Rotary and 3D-Rotary designs ensure graceful decay of attention over distance and preserve positional resolution under context-length interpolation. Spectral and group-theoretic encodings (LRPE, PermuteFormer) deliver mathematical guarantees on shift-invariance and efficiency (Qin et al., 2023, Chen, 2021).

Orthogonal polynomial encodings (PoPE (Aggarwal, 2024)) provide non-periodic, decorrelated positional vectors, avoiding aliasing and spurious bias of high-dimensional sinusoids, and enable rapid convergence and sharper long-range dependency modeling.

5. Extensions to Relation-aware Attention and Graphs

Relation-aware self-attention (Shaw et al., 2018) generalizes relative encoding beyond linear distances: each token pair $(i,j)$ can carry arbitrary edge labels (e.g., dependency parse, coreference, knowledge-graph relations), with learned bias vectors for each relation type:

e_{ij} = \frac{(x_i W^Q)\left(x_j W^K + a^K_{r(i,j)}\right)^{T}}{\sqrt{d_z}}, \qquad \text{where } r(i,j) \text{ is the relation label of the pair } (i,j)

This construction enables scalable full-graph attention in NLP and vision, allowing parallel computation for edge-labeled structures with negligible overhead.
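Under the relation-aware formulation, the distance-indexed table is simply replaced by a relation-indexed one. A hypothetical NumPy sketch of the score computation:

```python
import numpy as np

def relation_aware_scores(x, Wq, Wk, rel_table, rel_ids):
    """Attention logits with a learned bias per edge label r(i, j).

    rel_table: (R, d_z) one bias vector per relation type
    rel_ids:   (n, n) integer relation label for each token pair
    """
    q, key = x @ Wq, x @ Wk
    d_z = q.shape[-1]
    aK = rel_table[rel_ids]                          # (n, n, d_z) per-edge biases
    return (q @ key.T + np.einsum("id,ijd->ij", q, aK)) / np.sqrt(d_z)
```

Distance-based relative encoding is the special case where `rel_ids[i, j]` is the clipped offset `j - i`; a zero row in the table recovers plain content-only attention for that relation.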

Recent work incorporates dynamic and event-based resetting of position codes, span-based masking, and chunked multidimensional encodings, enabling adaptation to code-mixed languages, conversation turns, topic shifts, and document-level tasks (PESTO, BiPE, DTrans).

6. Limitations, Biases, and Practical Recommendations

Empirical studies indicate that relative position encodings can still incur performance drops on token classification if model or task biases tether entity prediction to position; see Ben Amor et al. (2023) on systematic F1 degradation beyond position 10. Learned relative-position embeddings may be less robust for long-range token classification than absolute encodings in these regimes.

Practical recommendations include:

  • Sharing relative embedding tables across heads for memory control.
  • Clipping distance indices to a small maximum distance $k$ for stability (precision beyond a modest $k$ is negligible).
  • When modeling complex spans or events, combine dynamic position features (e.g., switching points) with standard relative encoding (Ali et al., 2021).
  • For linear attention models, use only decomposable or group-structured relative encodings (Qin et al., 2023, Chen, 2021).

Extrapolation to longer contexts benefits from chunked or bilevel relative encoding, as standard schemes may decay too rapidly. Orthogonal polynomial designs (PoPE) offer decorrelation and recurrence-driven structure for both absolute and relative position modeling with faster convergence and improved BLEU or downstream accuracy (Aggarwal, 2024).

7. Future Directions

Continued progress centers on:

  • Unified frameworks for multi-scale relative encoding (segment, chunk, document).
  • Streaming and FlashAttention-2–compatible functional biases (HyPE (Angelotti, 2023)).
  • Generalization to arbitrary graph structures and event-driven resets.
  • Empirical benchmarking under extreme context extrapolation (contexts far beyond training length).
  • Combining orthogonal polynomial encodings with multi-phase rotary and linear decomposable designs for scalable language and multimodal models.

Relative position encoding remains foundational for flexible, scalable, and robust sequence modeling in NLP and related domains. Its principled mathematical constructions and empirical generalization properties continue to drive advances in LLM understanding and capacity.
