
PaTH Attention: Position Encoding via Accumulating Householder Transformations

Published 22 May 2025 in cs.CL and cs.LG | (2505.16381v2)

Abstract: The attention mechanism is a core primitive in modern LLMs and AI more broadly. Since attention by itself is permutation-invariant, position encoding is essential for modeling structured domains such as language. Rotary position encoding (RoPE) has emerged as the de facto standard approach for position encoding and is part of many modern LLMs. However, in RoPE the key/query transformation between two elements in a sequence is only a function of their relative position and otherwise independent of the actual input. This limits the expressivity of RoPE-based transformers. This paper describes PaTH, a flexible data-dependent position encoding scheme based on accumulated products of Householder(like) transformations, where each transformation is data-dependent, i.e., a function of the input. We derive an efficient parallel algorithm for training through exploiting a compact representation of products of Householder matrices, and implement a FlashAttention-style blockwise algorithm. Across both targeted synthetic benchmarks and moderate-scale real-world language modeling experiments, we find that PaTH improves upon RoPE and other recent baselines. Finally, we show that we can convert pretrained RoPE transformers into PaTH with continued pretraining.

Summary

  • The paper introduces a novel data-dependent mechanism using accumulating Householder transformations to surpass RoPE's expressivity limitations.
  • The methodology enables effective state tracking and long-context extrapolation using efficient blockwise computation and low-rank iterative updates.
  • Empirical results demonstrate near-perfect performance on synthetic tasks and overall improvements in language and retrieval benchmarks.

PaTH Attention: Position Encoding via Accumulating Householder Transformations

Problem Context and Limitations of Existing Methods

Attention mechanisms are the computational foundation of contemporary transformer-based architectures. The inherent permutation-invariance of attention necessitates explicit encoding of positional information for effective sequence modeling. Rotary Position Embedding (RoPE) is widely adopted as the default position encoding in LLMs, owing to its efficiency and simplicity. However, RoPE applies a fixed, data-independent transformation to key and query vectors, solely determined by relative position. This restricts the capacity of RoPE-based attention to model complex, input-dependent positional dependencies and constrains the class of functions that transformer models can compute, as established in complexity-theoretic analyses.

Recent empirical studies have highlighted notable shortcomings of RoPE, specifically its inability to solve various synthetic state-tracking tasks, such as flip-flop language modeling and group word problems, which require a more expressive, data-adaptive positional mechanism. These observations underscore the necessity for fundamentally more flexible position encoding strategies.

PaTH Attention: Data-Dependent Householder-Based Encoding

PaTH introduces a new class of multiplicative, data-dependent position encodings leveraging accumulated Householder-like transformations, where each transition is an identity-plus-rank-one matrix parameterized by the current token's input. The main formulation involves attention logits of the form $q_i^\top H_{ij} k_j$, with $H_{ij}$ representing a cumulative product of token-wise Householder matrices traversing the sequence span between positions $j$ and $i$.

Householder-like matrices in PaTH are defined as

$H_t = I - B_t w_t w_t^\top,$

where $B_t$ is a scalar gate (e.g., $2 \cdot \mathrm{sigmoid}(u^\top x_t + b)$) and $w_t$ is a data-dependent vector derived from the input features. This generalizes RoPE, whose static, block-diagonal rotation matrices arise as a special case, to a far more expressive scheme, allowing PaTH to capture intricate input-conditioned transition dynamics. The data-dependent nature and matrix-product structure enable PaTH-augmented transformers to transcend the expressivity limitations of classic RoPE-based variants.
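To make the formulation concrete, here is a minimal brute-force NumPy reference for the logits $q_i^\top H_{ij} k_j$. It is a sketch, not the paper's kernel: the function name, the left-to-right factor order, and the exact span convention $(j, i]$ are assumptions.

```python
import numpy as np

def path_logits(q, k, w, beta):
    """Naive O(L^2 d^2) reference for (causal) PaTH attention logits.

    q, k: (L, d) queries/keys; w: (L, d) direction vectors;
    beta: (L,) scalar gates. Returns A with A[i, j] = q_i^T H_ij k_j,
    where H_ij accumulates the Householder-like factors over the span
    (j, i] (span convention assumed for illustration).
    """
    L, d = q.shape
    A = np.zeros((L, L))
    for i in range(L):
        for j in range(i + 1):
            H = np.eye(d)
            # accumulate H_t = I - beta_t w_t w_t^T over the span (j, i]
            for t in range(j + 1, i + 1):
                H_t = np.eye(d) - beta[t] * np.outer(w[t], w[t])
                H = H_t @ H
            A[i, j] = q[i] @ H @ k[j]
    return A
```

With all gates at zero every factor is the identity and the logits reduce to ordinary dot-product attention, which makes the RoPE-like special cases easy to sanity-check.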

A rigorous theoretical result is established: a single-layer, two-head PaTH transformer can solve a canonical $\mathsf{NC}^1$-complete problem, iterated permutation identity testing over $S_5$, which is not computable by RoPE-based transformers under common complexity assumptions. PaTH thereby expands the computational power of transformer models beyond the $\mathsf{TC}^0$ barrier.
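To make the target problem concrete, here is a short sketch of iterated permutation identity testing, the word problem over $S_5$ (function names are illustrative): given a sequence of permutations of five elements, decide whether their composition is the identity.

```python
def compose(p, q):
    """Compose two permutations given as tuples: apply q first, then p."""
    return tuple(p[q[i]] for i in range(len(q)))

def is_identity_word(word):
    """Return True iff the composition of the permutations in `word`
    (each a permutation of {0,...,4}) is the identity permutation."""
    acc = tuple(range(5))
    for p in word:
        acc = compose(p, acc)
    return acc == tuple(range(5))
```

Solving this requires tracking the running composition, a canonical state-tracking demand that fixed relative-position schemes like RoPE cannot meet within the stated complexity-theoretic limits.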

Efficient Algorithmic Implementation

The cumulative product structure of data-dependent Householder matrices raises practical concerns regarding computational efficiency, especially for long sequences. PaTH addresses this by employing compact UT-transform-based representations for products of Householder matrices, enabling reuse across subintervals via masked versions and facilitating efficient, blockwise computation.
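The idea behind such compact representations can be illustrated with the classic compact-WY identity, a close relative of the UT transform: a product of identity-plus-rank-one factors collapses to $I - W^\top T W$ for a small triangular matrix $T$, so the whole product is stored and applied via thin matrices. A sketch, assuming left-to-right factor order:

```python
import numpy as np

def compact_wy(W, beta):
    """Compact-WY form of the product H_0 H_1 ... H_{k-1}, where
    H_t = I - beta[t] * outer(W[t], W[t]).

    Returns the (k, k) upper-triangular T with
        H_0 H_1 ... H_{k-1} = I - W.T @ T @ W.
    """
    k, _ = W.shape
    T = np.zeros((k, k))
    for j in range(k):
        T[j, j] = beta[j]
        if j > 0:
            # new column couples factor j to all earlier factors
            T[:j, j] = -beta[j] * (T[:j, :j] @ (W[:j] @ W[j]))
    return T
```

Because $T$ is only $k \times k$ per block, applying the accumulated transformation to a block of keys or queries costs matrix-matrix products over thin factors rather than repeated dense $d \times d$ multiplications, which is what makes blockwise reuse cheap.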

A FlashAttention-style block-parallel algorithm is devised, integrating the cumulative PaTH transformations with online softmax statistics and minimal DRAM traffic. Through blockwise boundary adjustments of queries and keys, cumulative Householder products are affinely propagated, preserving both hardware efficiency and the dynamic data-dependent structure of the transformation. The computational complexity of the entire procedure matches the $O(L^2 d)$ scaling of standard attention for typical block sizes.
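The online-softmax accumulation underlying such blockwise kernels can be sketched for a single query. PaTH's boundary adjustments of queries and keys are omitted here; this shows only the numerically stable streaming softmax that the block-parallel algorithm builds on.

```python
import numpy as np

def blockwise_softmax_attention(q, K, V, block=64):
    """Attention output for one query, processing key/value blocks with
    online softmax statistics (running max, running normalizer)."""
    m = -np.inf                 # running max of logits seen so far
    l = 0.0                     # running sum of exp(logit - m)
    o = np.zeros(V.shape[1])    # running unnormalized output
    for s in range(0, len(K), block):
        logits = K[s:s + block] @ q
        m_new = max(m, logits.max())
        scale = np.exp(m - m_new)        # rescale old statistics
        p = np.exp(logits - m_new)
        l = l * scale + p.sum()
        o = o * scale + p @ V[s:s + block]
        m = m_new
    return o / l
```

Each block is consumed once and only $O(d)$ running state is kept, which is the property that keeps DRAM traffic minimal regardless of sequence length.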

For inference, a low-rank iterative update scheme incrementally propagates the PaTH transformation through the cached keys, maintaining compatibility with existing fast decoding pipelines (FlashDecoding, PagedAttention) and facilitating distributed context-parallelism (e.g., Ring Attention).
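A plausible form of this update (the function signature and the exact span convention are assumptions, not the paper's scheme): each new token's Householder factor is applied to all cached, already-transformed keys as one vectorized rank-one update, costing $O(Ld)$ per decoding step.

```python
import numpy as np

def decode_step(K_cache, V_cache, q_new, k_new, v_new, w_new, beta_new):
    """One hypothetical PaTH decoding step.

    Applies the new token's factor I - beta * outer(w, w) to every cached
    key via a single rank-one update, appends the new key/value (which sees
    the identity transform under the assumed span convention), then runs
    ordinary softmax attention for the new query.
    """
    # k_j <- k_j - beta * (w^T k_j) w, vectorized over the whole cache
    K_cache = K_cache - beta_new * np.outer(K_cache @ w_new, w_new)
    K = np.vstack([K_cache, k_new[None]])
    V = np.vstack([V_cache, v_new[None]])
    logits = K @ q_new
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return p @ V, K, V
```

Since the cache mutation is a plain rank-one update over contiguous key storage, it composes naturally with paged key/value layouts and ring-style context parallelism, as the summary notes.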

Empirical Evaluation

Synthetic Benchmarks

  • Flip-Flop Language Modeling (FFLM): PaTH transformers achieve near-perfect (≤0.0001% error) performance on this diagnostic state-tracking task with a single layer and two heads. In contrast, RoPE, Stick-Breaking Attention, and FoX suffer substantial degradation, especially under out-of-distribution regimes.
  • $\mathsf{NC}^1$-Complete Word Problems: PaTH requires only two layers to achieve >90% accuracy on the $A_5$ permutation identity test, halving the depth requirement relative to other methods.
  • Multi-Query Repeated Associative Recall (MQRAR-N): PaTH maintains robust performance for $N$-back associative recall at $N < 4$, demonstrating preserved ordered memory across extended input tokens, outperforming both RoPE and advanced logit-biasing methods.

Language Modeling and Long-Context Generalization

On a 760M-parameter transformer pretrained on 50B tokens, PaTH uniformly improves over RoPE and FoX on standard language modeling metrics (WikiText, LAMBADA, PIQA, HellaSwag, ARC). In particular, PaTH-FoX—the combination of PaTH and Forgetting Transformer gates—consistently reaches the best or second-best perplexity and zero-shot accuracies.

Length Extrapolation

PaTH-based models generalize substantially beyond their training context window. On long-context evaluation (up to 64K tokens) with corpora such as PG-19, CodeParrot, and NarrativeQA, the performance of RoPE-based transformers collapses soon after the training window is exceeded; PaTH and PaTH-FoX maintain stable perplexity far longer, with especially large gains in domains requiring strong state tracking (e.g., code modeling).

Long-Form and Retrieval Benchmarks

Across RULER, BABILONG, PhoneBook, and LongBench-E benchmarks—covering both retrieval and state-tracking in long-context settings—PaTH-FoX models consistently achieve the highest or most robust scores, markedly outperforming both RoPE and FoX. PaTH's advantage is especially salient on variable tracking and complex logic-based reasoning tasks, indicating a substantial improvement in models' ability to maintain and update dynamic, structured memory over extended contexts.

Compatibility and Theoretical Integration

PaTH's construction is inherently modular: it is compatible with other softmax attention generalizations (FoX, SBA, Selective Attention) and can be used in combination for further performance gains, particularly in settings demanding both forgetting and adaptive memory (e.g., PaTH-FoX). From a theoretical perspective, PaTH offers a principled multiplicative position encoding with provably greater expressivity; its identity-plus-rank-one parametrization closely mirrors the transition dynamics of highly expressive linear RNNs, while fully retaining the associative recall capability of attention.

Implications and Future Directions

PaTH establishes a new paradigm for position encoding in transformers: data-dependent, multiplicative encoding with efficient hardware realization and proven expressivity advances. This raises the possibility that increasingly complex, task-adaptive positional encodings, grounded in dynamic matrix products, may become the next standard for transformer-based sequence models, superseding static schemes like RoPE.

Open avenues include generalizing low-rank update mechanisms to value vectors, exploring higher-order or non-linear transition compositions, designing more hardware-friendly refinement of the key-value caches, and potentially integrating PaTH-style dynamic transitions into state-space models or alternative non-attention-based architectures. Further research is warranted to characterize formal limitations, derive more efficient hardware kernels, and quantify gains in very-large-scale, real-world LLM deployments.

Conclusion

PaTH Attention introduces a formally principled, efficiently computable data-dependent position encoding mechanism, utilizing accumulating Householder-like transformations. Experimental and theoretical results demonstrate robust improvement in both expressivity and practical benchmark performance, particularly for state tracking, length extrapolation, and long-context retrieval tasks, substantially surpassing static encoding baselines. PaTH represents a significant advance in position encoding for modern sequence modeling and provides a strong foundation for further innovation in attention mechanisms and transformer architectures (2505.16381).
