Wavelet-Induced Rotary Encodings (WIRE)
- WIRE is a positional and structural encoding framework that generalizes RoPE using multi-scale continuous wavelet transforms and Laplacian spectral representations.
- For long-context transformers, it employs unique scale and shift parameters to capture both global and local features, ensuring robust extrapolation.
- On graph data, WIRE integrates Laplacian eigenvector coordinates to inject permutation equivariance and principled distance bias into attention mechanisms.
Wavelet-Induced Rotary Encodings (WIRE) are a positional and structural encoding framework for attention-based models, generalizing rotary position embedding (RoPE) to both long-context sequential modeling and arbitrary graph-structured data. WIRE leverages multi-scale and spectral wavelet representations to achieve robust extrapolation, inherent permutation equivariance, and principled adaptation to non-stationary inputs. The method has been advanced in two principal directions: WIRE for long-context LLMs via multi-scale continuous wavelet transforms, and WIRE for graphs via Laplacian spectral coordinates, unifying geometric and topological inductive biases within attention mechanisms (Oka et al., 4 Feb 2025, Reid et al., 26 Sep 2025).
1. Rotary Position Embedding as a Restricted Wavelet Transform
RoPE, the dominant positional mechanism in many LLMs and Vision Transformers (ViTs), encodes positional information by applying blockwise 2D rotation matrices to each query and key subvector, parameterized by fixed frequencies. RoPE can be formally re-expressed as a Haar-like continuous wavelet transform of the input signal,

$$W_\psi[x](a, b) = \frac{1}{\sqrt{a}} \int x(t)\, \psi\!\left(\frac{t - b}{a}\right) dt,$$

evaluated at a fixed dyadic scale $a$, with the shift $b$ given by the token position. For RoPE, $\psi$ is chosen as a pair of phase-shifted Haar-like wavelets (Eq. 9 in (Oka et al., 4 Feb 2025)). The resulting positional encoding matrix is block-diagonal, composed of 2D rotations by fixed angles, and is mathematically identical to RoPE when properly parameterized (Eq. 11). This construction reveals RoPE as a single-scale, non-adaptive, per-head-dimension Haar wavelet transform, which limits its extrapolation and its sensitivity to local/global structure.
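The defining consequence of this rotation structure is that attention scores depend only on the relative offset $m - n$. A minimal numpy sketch (function names `rot` and `rope` are illustrative, not from the papers) verifies this identity for RoPE-style dyadic frequencies:

```python
import numpy as np

def rot(theta):
    """Standard 2D rotation matrix R(theta)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def rope(x, pos, freqs):
    """Apply RoPE: rotate each 2D block of x by pos * freq."""
    out = np.empty_like(x)
    for i, f in enumerate(freqs):
        out[2 * i:2 * i + 2] = rot(pos * f) @ x[2 * i:2 * i + 2]
    return out

rng = np.random.default_rng(0)
d = 8
freqs = 10000.0 ** (-np.arange(d // 2) / (d // 2))  # RoPE-style frequency ladder
q, k = rng.standard_normal(d), rng.standard_normal(d)

m, n = 17, 5
lhs = rope(q, m, freqs) @ rope(k, n, freqs)  # score from absolute positions
rhs = rope(q, m - n, freqs) @ k              # same score from the offset alone
assert np.allclose(lhs, rhs)
```

Because $R(m\theta)^\top R(n\theta) = R((n-m)\theta)$, the two scores agree exactly; WIRE preserves this rotary structure while generalizing the underlying wavelet.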
2. Multi-Scale Wavelet Construction in WIRE for Long-Context Transformers
WIRE generalizes beyond RoPE by performing a full wavelet transform along the (relative) token-position axis, introducing both scale and shift parameters. The base wavelet, typically the Ricker (Mexican hat) or a Gaussian-derived wavelet, is

$$\psi(t) = (1 - t^2)\, e^{-t^2/2} \quad \text{(Ricker)}.$$

Each head dimension $i$ is assigned a unique scale $s_i$ and shift $b_i$:

$$\psi_i(t) = \frac{1}{\sqrt{s_i}}\, \psi\!\left(\frac{t - b_i}{s_i}\right).$$

The attention mechanism augments the standard dot product with these wavelet features evaluated at the relative position $m - n$. This multi-scale structure enables the model to encode both coarse (global) and fine (local) positional differences without limiting the receptive field, in contrast to windowed or chunked attention approaches (Oka et al., 4 Feb 2025).
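A minimal sketch of the multi-scale feature construction, assuming dyadic scales and zero shifts (the exact aggregation of these features into attention scores follows the paper and is not reproduced here; `wavelet_bias` is a hypothetical helper name):

```python
import numpy as np

def ricker(t):
    """Ricker (Mexican hat) wavelet, up to normalization."""
    return (1.0 - t**2) * np.exp(-t**2 / 2.0)

def wavelet_bias(L, scales, shifts):
    """Wavelet responses at every relative position m - n, one per head dim."""
    rel = np.arange(-(L - 1), L)  # the 2L - 1 distinct offsets
    # shape (2L-1, n_dims): psi_i((r - b_i) / s_i) / sqrt(s_i)
    t = (rel[:, None] - shifts[None, :]) / scales[None, :]
    return ricker(t) / np.sqrt(scales)[None, :]

L, n_scales = 16, 8
scales = 2.0 ** np.arange(n_scales)  # dyadic scales 1, 2, ..., 128
shifts = np.zeros(n_scales)
feats = wavelet_bias(L, scales, shifts)
assert feats.shape == (2 * L - 1, n_scales)
```

Small scales respond sharply to nearby tokens while large scales vary slowly across the whole context, which is the coarse/fine decomposition the section describes.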
3. WIRE on Graph-Structured Data: Spectral Graph Wavelets
WIRE further generalizes to arbitrary graph-structured inputs. Each node $v$ receives a vector of spectral (wavelet) coordinates $\phi(v) = (\phi_1(v), \dots, \phi_k(v))$, derived from the $k$ smallest nontrivial eigenvectors of the graph Laplacian $L = D - A$. For each 2-dimensional query/key block $j$, a learnable frequency vector $\theta_j \in \mathbb{R}^k$ defines the rotation angle $\alpha_j(v) = \langle \theta_j, \phi(v) \rangle$. The rotary encoding is then

$$\tilde{q}_v^{(j)} = R\big(\alpha_j(v)\big)\, q_v^{(j)},$$

where $R(\alpha)$ is the standard 2D rotation matrix. This formulation injects graph topology directly into every attention head, encoding locality, global structure, and spectral smoothness (Reid et al., 26 Sep 2025).
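The construction above can be sketched end to end on a toy graph. This is a dense-numpy illustration under the stated definitions (function names are illustrative); the rotations are norm-preserving, exactly as in RoPE:

```python
import numpy as np

def spectral_coords(A, k):
    """k smallest nontrivial Laplacian eigenvectors as per-node coordinates."""
    L = np.diag(A.sum(1)) - A
    w, V = np.linalg.eigh(L)        # ascending eigenvalues
    return V[:, 1:k + 1]            # skip the trivial constant eigenvector

def wire_rotate(x, phi, theta):
    """Rotate each 2D block of node features x by the angle <theta_j, phi(v)>."""
    n, d = x.shape
    out = np.empty_like(x)
    for j in range(d // 2):
        a = phi @ theta[j]          # per-node angle for block j
        c, s = np.cos(a), np.sin(a)
        xr, xi = x[:, 2 * j], x[:, 2 * j + 1]
        out[:, 2 * j] = c * xr - s * xi
        out[:, 2 * j + 1] = s * xr + c * xi
    return out

rng = np.random.default_rng(1)
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], float)  # 4-cycle
phi = spectral_coords(A, k=2)
theta = rng.standard_normal((2, 2))  # one frequency vector per 2D block
q = rng.standard_normal((4, 4))
q_rot = wire_rotate(q, phi, theta)
# Rotations preserve per-node norms, so attention magnitudes stay bounded.
assert np.allclose(np.linalg.norm(q_rot, axis=1), np.linalg.norm(q, axis=1))
```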
4. Algorithmic Implementation and Computational Efficiency
For long-context transformers, WIRE precomputes scale and shift arrays and evaluates a tensor of wavelet responses per forward pass; materialized naively over all query–key pairs this costs $O(L^2 d)$ for sequence length $L$ and head dimension $d$. With memory optimizations (Appendix A3 in (Oka et al., 4 Feb 2025)), storage can be reduced to $O(Ld)$ via scatter/lookup over the $2L - 1$ distinct relative positions.
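The lookup optimization rests on the fact that the responses depend only on the offset $m - n$, so the full $L \times L$ table is redundant. A small sketch of the idea (variable names are illustrative):

```python
import numpy as np

def ricker(t):
    return (1.0 - t**2) * np.exp(-t**2 / 2.0)

L, s = 6, 4.0
pos = np.arange(L)

# Naive: full (L, L) table of responses psi((m - n) / s) -- O(L^2) storage.
full = ricker((pos[:, None] - pos[None, :]) / s)

# Optimized: store only the 2L - 1 distinct offsets, gather on demand -- O(L).
offsets = np.arange(-(L - 1), L)
table = ricker(offsets / s)
gathered = table[(pos[:, None] - pos[None, :]) + (L - 1)]

assert np.allclose(full, gathered)
```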
In graph settings, one first diagonalizes the Laplacian to obtain the eigenvectors $\phi_1, \dots, \phi_k$; each node stores a $k$-dimensional coordinate vector, for $O(nk)$ memory (a dense eigendecomposition costs $O(n^3)$, with sparse iterative solvers far cheaper in practice). The per-layer rotary encoding is a matrix–vector product computing the angles (cost $O(nkd)$), followed by $O(nd)$ work for the blockwise rotations. No adjacency or position matrix is expanded. For standard or Performer-style linear attention, WIRE is fully compatible due to rotational invariance.
5. Theoretical Properties and Structural Guarantees
WIRE is permutation equivariant in the graph domain, inherited from the transformation properties of Laplacian eigenvectors under node relabeling. On uniform grids (1D/2D), it recovers the standard RoPE scheme as a special case. Attention scores computed using WIRE carry an asymptotic distance bias related to the effective resistance distance on the graph: under random Gaussian initialization $\theta_j \sim \mathcal{N}(0, \sigma^2 I)$, the expected attenuation of query–key dot products is proportional to

$$\exp\!\left(-\tfrac{\sigma^2}{2}\, \|\phi(u) - \phi(v)\|^2\right),$$

which tracks the effective resistance distance, encoding a principled distance bias without explicit attention masks.
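The Gaussian attenuation identity behind this bias, $\mathbb{E}_{\theta \sim \mathcal{N}(0,\sigma^2 I)}[\cos(\theta^\top \Delta\phi)] = e^{-\sigma^2 \|\Delta\phi\|^2 / 2}$, can be checked by Monte Carlo. This verifies only the identity, not the papers' full resistance-distance bound:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, k = 0.5, 4
dphi = rng.standard_normal(k)  # spectral coordinate gap phi(u) - phi(v)

# Monte Carlo over Gaussian frequency vectors theta ~ N(0, sigma^2 I).
theta = sigma * rng.standard_normal((200_000, k))
mc = np.cos(theta @ dphi).mean()  # estimated per-block attenuation
closed = np.exp(-sigma**2 * np.dot(dphi, dphi) / 2)
assert abs(mc - closed) < 1e-2
```

Nodes that are far apart in spectral coordinates therefore see their dot products shrunk multiplicatively, without any mask being applied.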
On sequences, multi-scale continuous wavelets internalize both smooth (global) and abrupt (local) positional changes, enabling superior extrapolation in both short and long contexts compared to RoPE and ALiBi. Wavelet type impacts performance: Ricker and Gaussian wavelets are most effective; discrete orthogonal wavelets are less robust unless chosen for many vanishing moments.
6. Empirical Results and Practical Guidelines
Summary of Empirical Findings
| Setting | Baseline(s) | WIRE Configuration | Performance (Example) |
|---|---|---|---|
| LLMs (Wikitext-103, Llama-2-7B, CodeParrot) | NoPE, RoPE, ALiBi, Transformer-XL | Ricker wavelet, 8 scales | Outperforms baselines in all perplexity extrapolations |
| Point clouds (ModelNet40, ShapeNet) | NoPE, Cartesian RoPE | 10 Laplacian eigenvectors | Equal or better than standard position encoding |
| Graph regression / classification | APE, NoPE, Performer, GPS | Up to $10$ spectral dims | At least $10\%$ reduction in RMSE; accuracy consistently improved |
On sequence tasks of increasing length, WIRE exhibits monotonically better extrapolation than RoPE (e.g., perplexity 8.60 vs. 8.90 for Llama-2-7B at 32k tokens), and on synthetic graph benchmarks it halves RMSE relative to absolute positional encoding (Oka et al., 4 Feb 2025, Reid et al., 26 Sep 2025).
Implementation Recommendations
- For transformers, select continuous wavelets with at least one vanishing moment (Ricker or Gaussian-derived); set the number of scales as powers of two (e.g., $8$ scales in the reported experiments).
- For graphs, precompute the $k$ smallest nontrivial Laplacian eigenvectors (e.g., $k = 10$ in the reported experiments); allocate learnable frequency vectors $\theta$ per head/layer as resources permit.
- In both modalities, WIRE replaces RoPE or standard positional encodings in all attention layers without increasing asymptotic complexity.
- Use vectorized GPU operations for broadcasting, and sparse eigensolvers for eigenvector computation on large graphs.
- Monitor memory and employ scatter/lookup optimizations for large sequence lengths $L$ or graph sizes $n$.
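For the eigenvector precomputation on large graphs, a sparse iterative solver avoids the dense $O(n^3)$ decomposition. A minimal sketch using SciPy's ARPACK wrapper (`sparse_spectral_coords` is an illustrative helper, not an API from the papers):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def sparse_spectral_coords(A, k):
    """k smallest nontrivial Laplacian eigenvectors via a sparse solver."""
    deg = np.asarray(A.sum(axis=1)).ravel()
    L = sp.diags(deg) - A
    # which="SM" requests the smallest-magnitude eigenvalues; for very large
    # graphs, shift-invert mode (sigma near 0) converges much faster.
    w, V = eigsh(L, k=k + 1, which="SM")
    order = np.argsort(w)
    return V[:, order[1:]]  # drop the trivial lambda = 0 constant mode

# Path graph on 50 nodes as a small smoke test.
n = 50
A = sp.diags([np.ones(n - 1), np.ones(n - 1)], [-1, 1], format="csr")
phi = sparse_spectral_coords(A, k=4)
assert phi.shape == (n, 4)
```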
7. Significance, Extensions, and Open Directions
WIRE unifies temporal, spatial, and topological priors within attention frameworks, extending positional representation from uniform lines and grids to arbitrary graphs. This yields improved generalization when extrapolating beyond training context windows and a stronger inductive bias for structured inputs. The method preserves attention connectivity, maintains compatibility with $O(n)$ linear attention regimes, and introduces only marginal parameter and runtime overhead. Possible adaptations include integration with NTK-aware scaling, LongRoPE, or hybrid GNN-attention architectures.
A plausible implication is that further gains may be realized by combining WIRE with model-specific extrapolation techniques or learning dynamic scale/shift parameters. Open questions concern optimal selection of wavelet families and scale regimes for arbitrary input topologies, and efficient scalable eigen-computations for very-large graphs.
WIRE's empirical and theoretical results demonstrate robust advantages for both LLMs and graph-based models, with principled improvements in extrapolation and structural awareness (Oka et al., 4 Feb 2025, Reid et al., 26 Sep 2025).