
Wavelet-Induced Rotary Encodings (WIRE)

Updated 20 January 2026
  • WIRE is a positional and structural encoding framework that generalizes RoPE using multi-scale continuous wavelet transforms and Laplacian spectral representations.
  • For long-context transformers, it assigns each head dimension its own wavelet scale and shift, capturing both global and local features and enabling robust extrapolation.
  • On graph data, WIRE integrates Laplacian eigenvector coordinates to inject permutation equivariance and principled distance bias into attention mechanisms.

Wavelet-Induced Rotary Encodings (WIRE) are a positional and structural encoding framework for attention-based models, generalizing rotary position embedding (RoPE) to both long-context sequential modeling and arbitrary graph-structured data. WIRE leverages multi-scale and spectral wavelet representations to achieve robust extrapolation, inherent permutation equivariance, and principled adaptation to non-stationary inputs. The method has been advanced in two principal directions: WIRE for long-context LLMs via multi-scale continuous wavelet transforms, and WIRE for graphs via Laplacian spectral coordinates, unifying geometric and topological inductive biases within attention mechanisms (Oka et al., 4 Feb 2025, Reid et al., 26 Sep 2025).

1. Rotary Position Embedding as a Restricted Wavelet Transform

RoPE, the dominant mechanism in many LLMs and Vision Transformers (ViTs), encodes positional information by applying blockwise 2D rotation matrices along each query and key subvector, parameterized by fixed frequencies. RoPE can be formally re-expressed as a Haar-like continuous wavelet transform of the input signal $x(t)$ with scale $a = 1$, where positional encoding corresponds to evaluating the transform at a fixed dyadic window and specified shifts:

$$W(a, b) = \sum_{t=0}^{T-1} \psi_{a,b}(t)\, x(t), \qquad \psi_{a,b}(t) = \frac{1}{\sqrt{a}}\, \psi\!\left(\frac{t - b}{a}\right)$$

For RoPE, $\psi$ is chosen as a pair of phase-shifted Haar-like wavelets (Eq. 9 in (Oka et al., 4 Feb 2025)). The resulting positional encoding matrix is block-diagonal, corresponding to 2D rotations by fixed angles, and is mathematically identical to RoPE when properly parameterized (Eq. 11). This construction reveals RoPE as a single-scale, non-adaptive Haar-like wavelet transform applied per head dimension, which limits both its extrapolation and its sensitivity to local versus global structure.
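The relative-position property underlying this equivalence is easy to verify numerically. The sketch below implements standard fixed-frequency RoPE in NumPy (an illustrative minimal version, not the papers' code) and checks that rotated query-key dot products depend only on the offset $m - n$:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply standard RoPE to a vector x at integer position `pos`.

    x is split into 2D blocks; block j is rotated by angle pos * theta_j,
    with theta_j = base**(-2j/d), the fixed-frequency parameterization
    described in the text.
    """
    d = x.shape[-1]
    assert d % 2 == 0
    theta = base ** (-np.arange(d // 2) * 2.0 / d)  # fixed frequencies
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[0::2], x[1::2]                       # 2D block components
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin                 # 2x2 rotation per block
    out[1::2] = x1 * sin + x2 * cos
    return out

# Relative-position property: <rope(q, m), rope(k, n)> depends only on m - n.
rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
s1 = rope(q, 5) @ rope(k, 2)   # offset 3
s2 = rope(q, 8) @ rope(k, 5)   # offset 3 again
print(np.isclose(s1, s2))      # True
```

Because each block is an orthogonal rotation, the score is invariant under a shared positional shift, which is exactly the single-scale Haar-like behavior WIRE generalizes.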

2. Multi-Scale Wavelet Construction in WIRE for Long-Context Transformers

WIRE generalizes beyond RoPE by performing a full wavelet transform along the (relative) token-position axis, introducing both scale and shift parameters. The base wavelet, typically the Ricker or Gaussian, is defined as

$$\psi(t) = (1 - t^2)\exp\!\left(-\frac{t^2}{2}\right)$$

Each head dimension $j$ is assigned a unique scale $a_j \in \{2^0, 2^1, \ldots, 2^{s-1}\}$ and shift $b_j \in \{0, \ldots, \frac{d}{s} - 1\}$:

$$p_{m,n} = \left[\,\psi\!\left(\frac{m - n - b_0}{a_0}\right),\; \psi\!\left(\frac{m - n - b_1}{a_1}\right),\; \ldots,\; \psi\!\left(\frac{m - n - b_{d-1}}{a_{d-1}}\right)\right] \in \mathbb{R}^d$$

The attention mechanism augments the standard dot product with the wavelet features:

$$e_{m,n} = \frac{q_m k_n^\top + q_m p_{m,n}^\top}{\sqrt{d}}$$

This multi-scale structure enables the model to encode both coarse (global) and fine (local) positional differences without limiting the receptive field, in contrast to windowed or chunked attention approaches (Oka et al., 4 Feb 2025).
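As a concrete sketch of these formulas, the snippet below builds $p_{m,n}$ with a Ricker wavelet and forms the augmented score $e_{m,n}$. The particular pairing of scales and shifts across dimensions (scales cycling with $j \bmod s$, shifts given by $\lfloor j/s \rfloor$) is an illustrative assumption consistent with the stated ranges, not necessarily the papers' exact layout:

```python
import numpy as np

def ricker(t):
    # Ricker ("Mexican hat") mother wavelet from the text.
    return (1.0 - t**2) * np.exp(-t**2 / 2.0)

def wavelet_features(m, n, d, s=8):
    """p_{m,n}: one wavelet response per head dimension j.

    Assumed layout: a_j cycles through {2^0, ..., 2^{s-1}} and
    b_j runs through {0, ..., d/s - 1}.
    """
    j = np.arange(d)
    a = 2.0 ** (j % s)   # scales a_j
    b = j // s           # shifts b_j
    return ricker((m - n - b) / a)

def wire_score(q, k, m, n, s=8):
    """e_{m,n} = (q_m k_n^T + q_m p_{m,n}^T) / sqrt(d)."""
    d = q.shape[-1]
    p = wavelet_features(m, n, d, s)
    return (q @ k + q @ p) / np.sqrt(d)

# Toy usage: a single query/key pair at positions m=3, n=1.
q = np.ones(64)
k = np.ones(64)
e = wire_score(q, k, m=3, n=1)
```

Note that $p_{m,n}$ depends only on the relative offset $m - n$, so the features can be shared across all pairs with the same offset.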

3. WIRE on Graph-Structured Data: Spectral Graph Wavelets

WIRE further generalizes to arbitrary graph-structured inputs. Here, each node $i$ receives a vector of spectral (wavelet) coordinates $r_i \in \mathbb{R}^m$, derived from the $m$ smallest nontrivial eigenvectors $u_k$ of the graph Laplacian $L = D - A$:

$$r_i = (u_1[i],\; u_2[i],\; \ldots,\; u_m[i])$$

For each 2-dimensional query/key block $n$, a learnable frequency vector $w_n \in \mathbb{R}^m$ defines the rotation angle $\phi_{i,n} = w_n^\top r_i$. The rotary encoding is then

$$\operatorname{RoPE}_{\text{WIRE}}(r_i)\, z_i = \bigoplus_{n=1}^{d/2} R(w_n^\top r_i)\; [z_i]_{2n-2:2n-1}$$

where $R(\phi)$ is the standard $2 \times 2$ rotation matrix. This formulation injects graph topology directly into every attention head, encoding locality, global structure, and spectral smoothness (Reid et al., 26 Sep 2025).
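A minimal NumPy sketch of the graph variant, assuming a dense eigendecomposition of the Laplacian (the function names here are hypothetical, and a real model would learn $W$ per head and layer):

```python
import numpy as np

def spectral_coords(A, m):
    """r_i from the m smallest nontrivial Laplacian eigenvectors."""
    L = np.diag(A.sum(axis=1)) - A        # combinatorial Laplacian L = D - A
    lam, U = np.linalg.eigh(L)            # ascending eigenvalues
    return U[:, 1:m + 1]                  # drop the trivial constant eigenvector

def wire_rotate(z, r, W):
    """Blockwise rotation of node features z (N, d) by phi_{i,n} = w_n . r_i.

    W has shape (d/2, m): one frequency vector per 2D block.
    """
    phi = r @ W.T                         # (N, d/2) rotation angles
    cos, sin = np.cos(phi), np.sin(phi)
    z1, z2 = z[:, 0::2], z[:, 1::2]
    out = np.empty_like(z)
    out[:, 0::2] = z1 * cos - z2 * sin    # apply R(phi) to each 2D block
    out[:, 1::2] = z1 * sin + z2 * cos
    return out

# Toy usage: a 4-node path graph, m = 2 spectral dims, d = 4 features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
r = spectral_coords(A, m=2)               # (4, 2)
rng = np.random.default_rng(0)
W = rng.standard_normal((2, 2))           # d/2 = 2 blocks
z = rng.standard_normal((4, 4))
z_rot = wire_rotate(z, r, W)
```

Since each block transform is an orthogonal rotation, the encoding preserves feature norms, which is what keeps it compatible with softmax and linear attention alike.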

4. Algorithmic Implementation and Computational Efficiency

For long-context transformers, WIRE precomputes the scale $a_j$ and shift $b_j$ arrays, and computes a $d \times L \times L$ tensor of wavelet responses $P$ per forward pass, with $O(dL^2)$ complexity. With memory optimizations (Appendix A3 in (Oka et al., 4 Feb 2025)), the storage can be reduced to $O(dL)$ via scatter/lookup for relative positions.
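The $O(dL)$ reduction follows because $p_{m,n}$ depends only on the relative offset $m - n$, so a table over the $2L - 1$ possible offsets suffices. A sketch of that lookup (the layout is assumed for illustration, not taken from Appendix A3):

```python
import numpy as np

def ricker(t):
    return (1.0 - t**2) * np.exp(-t**2 / 2.0)

def build_offset_table(L, d, s=8):
    """Wavelet responses for every relative offset m - n in [-(L-1), L-1].

    Storing this (2L-1, d) table instead of the full (L, L, d) tensor
    gives the O(dL) footprint; scales/shifts use the same assumed
    cycling layout as in the earlier sketch.
    """
    j = np.arange(d)
    a = 2.0 ** (j % s)                          # scales a_j
    b = j // s                                  # shifts b_j
    offsets = np.arange(-(L - 1), L)            # all 2L-1 offsets
    return ricker((offsets[:, None] - b) / a)   # (2L-1, d)

L, d = 128, 64
table = build_offset_table(L, d)
m, n = 37, 12
p_mn = table[(m - n) + (L - 1)]                 # lookup instead of recompute
```

At attention time, each query-key pair indexes the table by its offset, so no $L \times L$ tensor is ever materialized.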

In graph settings, one first diagonalizes the Laplacian to obtain the eigenvectors $u_k$; each node stores $r_i$, at total cost $O(Nm)$. The per-layer rotary encoding is a matrix-vector product $W r_i$ (cost $O(dm)$), followed by $O(d)$ work for the blockwise rotations. No $N \times N$ adjacency or position matrix is ever materialized. WIRE is fully compatible with both standard and Performer-style linear attention, owing to its rotational invariance.

5. Theoretical Properties and Structural Guarantees

WIRE is permutation equivariant in the graph domain, owing to how Laplacian eigenvectors transform under node relabeling. On uniform grids (1D/2D), it recovers the standard RoPE scheme as a special case. Attention scores computed using WIRE have an asymptotic dependence on the effective resistance distance on the graph:

$$R(i, j) = (e_i - e_j)^\top L^\dagger (e_i - e_j) = \sum_{k=1}^{N-1} \frac{(u_k[i] - u_k[j])^2}{\lambda_k}$$

Under random Gaussian initialization of $w_n$, the expected attenuation of query-key dot products is proportional to $R(i, j)$, encoding a principled distance bias without explicit attention masks.
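The resistance distance in this bound can be computed directly from the Laplacian pseudoinverse. A small sanity check on a 3-node path graph, where unit-resistance edges in series should add:

```python
import numpy as np

def effective_resistance(A, i, j):
    """R(i, j) = (e_i - e_j)^T L^+ (e_i - e_j), matching the formula above."""
    L = np.diag(A.sum(axis=1)) - A
    Lp = np.linalg.pinv(L)          # Moore-Penrose pseudoinverse of L
    e = np.zeros(len(A))
    e[i], e[j] = 1.0, -1.0
    return e @ Lp @ e

# Path graph 0-1-2 with unit edges: two resistors in series.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
print(effective_resistance(A, 0, 2))  # 2.0
```

The spectral form $\sum_k (u_k[i] - u_k[j])^2 / \lambda_k$ gives the same value, which is why the $r_i$ coordinates already carry this distance information into the attention scores.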

On sequences, multi-scale continuous wavelets capture both smooth (global) and abrupt (local) positional changes, enabling superior extrapolation in both short and long contexts compared to RoPE and ALiBi. Wavelet type impacts performance: Ricker and Gaussian wavelets are most effective, while discrete orthogonal wavelets are less robust unless chosen to have many vanishing moments.

6. Empirical Results and Practical Guidelines

Summary of Empirical Findings

| Setting | Baseline(s) | WIRE Configuration | Performance (Example) |
|---|---|---|---|
| LLMs (Wikitext-103, Llama-2-7B, CodeParrot) | NoPE, RoPE, ALiBi, Trans-XL | Ricker wavelet, 8 scales | Outperforms baselines in all perplexity extrapolations |
| Point clouds (ModelNet40, ShapeNet) | NoPE, Cartesian RoPE | 10 Laplacian eigenvectors | Equal to or better than standard position encoding |
| Graph regression / classification | APE, NoPE, Performer, GPS | $m = 3$–$10$ spectral dims | 10–20% reduction in RMSE; accuracy consistently improved |

On sequence tasks of increasing length, WIRE exhibits monotonically better extrapolation compared to RoPE (e.g., perplexity 8.60 vs. 8.90 for Llama-2-7B at 32k tokens), and on synthetic graph benchmarks it halves RMSE relative to absolute encoding (Oka et al., 4 Feb 2025, Reid et al., 26 Sep 2025).

Implementation Recommendations

  • For transformers, select continuous wavelets with one vanishing moment (Ricker or Gaussian); set the number of scales $s \approx 8$ (powers of two).
  • For graphs, precompute $m$ Laplacian eigenvectors ($3 \leq m \leq 32$); allocate a learnable $W$ per head/layer as resources permit.
  • In both modalities, WIRE replaces RoPE or standard positional encodings in all attention layers without increasing asymptotic complexity.
  • Use vectorized GPU operations for broadcasting $a_j, b_j$, or sparse eigensolvers for eigenvector computation on large graphs.
  • Monitor memory and employ scatter/lookup optimizations for large $L$ or $N$.

7. Significance, Extensions, and Open Directions

WIRE unifies temporal, spatial, and topological priors within attention frameworks, extending positional representation from uniform lines and grids to arbitrary graphs. This results in improved generalization when extrapolating beyond training context windows and enhanced inductive bias for structured inputs. The method preserves attention connectivity, maintains compatibility with $O(N)$ linear attention regimes, and introduces only marginal parameter and runtime overhead. Adaptations include integration with NTK-aware scaling, LongRoPE, or hybrid GNN-attention architectures.

A plausible implication is that further gains may be realized by combining WIRE with model-specific extrapolation techniques or learning dynamic scale/shift parameters. Open questions concern optimal selection of wavelet families and scale regimes for arbitrary input topologies, and efficient scalable eigen-computations for very-large graphs.

WIRE's empirical and theoretical results demonstrate robust advantages for both LLMs and graph-based models, with principled improvements in extrapolation and structural awareness (Oka et al., 4 Feb 2025, Reid et al., 26 Sep 2025).
