Wavelet-Induced Rotary Encodings (WIRE)
- WIRE is a positional and structural encoding framework that generalizes RoPE using multi-scale continuous wavelet transforms and Laplacian spectral representations.
- For long-context transformers, it employs unique scale and shift parameters to capture both global and local features, ensuring robust extrapolation.
- On graph data, WIRE integrates Laplacian eigenvector coordinates to inject permutation equivariance and principled distance bias into attention mechanisms.
Wavelet-Induced Rotary Encodings (WIRE) are a positional and structural encoding framework for attention-based models, generalizing rotary position embedding (RoPE) to both long-context sequential modeling and arbitrary graph-structured data. WIRE leverages multi-scale and spectral wavelet representations to achieve robust extrapolation, inherent permutation equivariance, and principled adaptation to non-stationary inputs. The method has been advanced in two principal directions: WIRE for long-context LLMs via multi-scale continuous wavelet transforms, and WIRE for graphs via Laplacian spectral coordinates, unifying geometric and topological inductive biases within attention mechanisms (Oka et al., 4 Feb 2025, Reid et al., 26 Sep 2025).
1. Rotary Position Embedding as a Restricted Wavelet Transform
RoPE, the dominant positional mechanism in many LLMs and Vision Transformers (ViTs), encodes positional information by applying blockwise 2D rotation matrices to each query and key subvector, parameterized by fixed frequencies. RoPE can be formally re-expressed as a Haar-like continuous wavelet transform of the input signal,

$$W_\psi[x](a, b) = \frac{1}{\sqrt{a}} \int x(t)\, \psi\!\left(\frac{t - b}{a}\right) dt,$$

evaluated at a fixed dyadic scale $a$, with the shift $b$ given by the token position. For RoPE, $\psi$ is chosen as a pair of phase-shifted Haar-like wavelets (Eq. 9 in (Oka et al., 4 Feb 2025)). The resulting positional encoding matrix is block-diagonal, composed of 2D rotations by fixed angles, and is mathematically identical to RoPE when properly parameterized (Eq. 11). This construction reveals RoPE as a single-scale, non-adaptive, per-head-dimension Haar wavelet transform, which limits its extrapolation and its sensitivity to local/global structure.
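The defining consequence of this rotation structure is that attention scores depend only on the relative offset $m - n$. A minimal numpy sketch (function names `rot` and `rope` are illustrative, not from the papers) verifies this identity for RoPE-style dyadic frequencies:

```python
import numpy as np

def rot(theta):
    """Standard 2D rotation matrix R(theta)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def rope(x, pos, freqs):
    """Apply RoPE: rotate each 2D block of x by pos * freq."""
    out = np.empty_like(x)
    for i, f in enumerate(freqs):
        out[2 * i:2 * i + 2] = rot(pos * f) @ x[2 * i:2 * i + 2]
    return out

rng = np.random.default_rng(0)
d = 8
freqs = 10000.0 ** (-np.arange(d // 2) / (d // 2))  # RoPE-style frequency ladder
q, k = rng.standard_normal(d), rng.standard_normal(d)

m, n = 17, 5
lhs = rope(q, m, freqs) @ rope(k, n, freqs)  # score from absolute positions
rhs = rope(q, m - n, freqs) @ k              # same score from the offset alone
assert np.allclose(lhs, rhs)
```

Because $R(m\theta)^\top R(n\theta) = R((n-m)\theta)$, the two scores agree exactly; WIRE preserves this rotary structure while generalizing the underlying wavelet.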
2. Multi-Scale Wavelet Construction in WIRE for Long-Context Transformers
WIRE generalizes beyond RoPE by performing a full wavelet transform along the (relative) token-position axis, introducing both scale and shift parameters. The base wavelet, typically the Ricker (Mexican hat) or a Gaussian-derived wavelet, is

$$\psi(t) = (1 - t^2)\, e^{-t^2/2} \quad \text{(Ricker)}.$$

Each head dimension $i$ is assigned a unique scale $s_i$ and shift $b_i$:

$$\psi_i(t) = \frac{1}{\sqrt{s_i}}\, \psi\!\left(\frac{t - b_i}{s_i}\right).$$

The attention mechanism augments the standard dot product with these wavelet features evaluated at the relative position $m - n$. This multi-scale structure enables the model to encode both coarse (global) and fine (local) positional differences without limiting the receptive field, in contrast to windowed or chunked attention approaches (Oka et al., 4 Feb 2025).
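A minimal sketch of the multi-scale feature construction, assuming dyadic scales and zero shifts (the exact aggregation of these features into attention scores follows the paper and is not reproduced here; `wavelet_bias` is a hypothetical helper name):

```python
import numpy as np

def ricker(t):
    """Ricker (Mexican hat) wavelet, up to normalization."""
    return (1.0 - t**2) * np.exp(-t**2 / 2.0)

def wavelet_bias(L, scales, shifts):
    """Wavelet responses at every relative position m - n, one per head dim."""
    rel = np.arange(-(L - 1), L)  # the 2L - 1 distinct offsets
    # shape (2L-1, n_dims): psi_i((r - b_i) / s_i) / sqrt(s_i)
    t = (rel[:, None] - shifts[None, :]) / scales[None, :]
    return ricker(t) / np.sqrt(scales)[None, :]

L, n_scales = 16, 8
scales = 2.0 ** np.arange(n_scales)  # dyadic scales 1, 2, ..., 128
shifts = np.zeros(n_scales)
feats = wavelet_bias(L, scales, shifts)
assert feats.shape == (2 * L - 1, n_scales)
```

Small scales respond sharply to nearby tokens while large scales vary slowly across the whole context, which is the coarse/fine decomposition the section describes.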
3. WIRE on Graph-Structured Data: Spectral Graph Wavelets
WIRE further generalizes to arbitrary graph-structured inputs. Each node $v$ receives a vector of spectral (wavelet) coordinates $\phi(v) = (\phi_1(v), \dots, \phi_k(v))$, derived from the $k$ smallest nontrivial eigenvectors of the graph Laplacian $L = D - A$. For each 2-dimensional query/key block $j$, a learnable frequency vector $\theta_j \in \mathbb{R}^k$ defines the rotation angle $\alpha_j(v) = \langle \theta_j, \phi(v) \rangle$. The rotary encoding is then

$$\tilde{q}_v^{(j)} = R\big(\alpha_j(v)\big)\, q_v^{(j)},$$

where $R(\alpha)$ is the standard 2D rotation matrix. This formulation injects graph topology directly into every attention head, encoding locality, global structure, and spectral smoothness (Reid et al., 26 Sep 2025).
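The construction above can be sketched end to end on a toy graph. This is a dense-numpy illustration under the stated definitions (function names are illustrative); the rotations are norm-preserving, exactly as in RoPE:

```python
import numpy as np

def spectral_coords(A, k):
    """k smallest nontrivial Laplacian eigenvectors as per-node coordinates."""
    L = np.diag(A.sum(1)) - A
    w, V = np.linalg.eigh(L)        # ascending eigenvalues
    return V[:, 1:k + 1]            # skip the trivial constant eigenvector

def wire_rotate(x, phi, theta):
    """Rotate each 2D block of node features x by the angle <theta_j, phi(v)>."""
    n, d = x.shape
    out = np.empty_like(x)
    for j in range(d // 2):
        a = phi @ theta[j]          # per-node angle for block j
        c, s = np.cos(a), np.sin(a)
        xr, xi = x[:, 2 * j], x[:, 2 * j + 1]
        out[:, 2 * j] = c * xr - s * xi
        out[:, 2 * j + 1] = s * xr + c * xi
    return out

rng = np.random.default_rng(1)
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], float)  # 4-cycle
phi = spectral_coords(A, k=2)
theta = rng.standard_normal((2, 2))  # one frequency vector per 2D block
q = rng.standard_normal((4, 4))
q_rot = wire_rotate(q, phi, theta)
# Rotations preserve per-node norms, so attention magnitudes stay bounded.
assert np.allclose(np.linalg.norm(q_rot, axis=1), np.linalg.norm(q, axis=1))
```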
4. Algorithmic Implementation and Computational Efficiency
For long-context transformers, WIRE precomputes scale and shift arrays and evaluates a tensor of wavelet responses per forward pass; materialized naively over all query–key pairs this costs $O(L^2 d)$ for sequence length $L$ and head dimension $d$. With memory optimizations (Appendix A3 in (Oka et al., 4 Feb 2025)), storage can be reduced to $O(Ld)$ via scatter/lookup over the $2L - 1$ distinct relative positions.
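The lookup optimization rests on the fact that the responses depend only on the offset $m - n$, so the full $L \times L$ table is redundant. A small sketch of the idea (variable names are illustrative):

```python
import numpy as np

def ricker(t):
    return (1.0 - t**2) * np.exp(-t**2 / 2.0)

L, s = 6, 4.0
pos = np.arange(L)

# Naive: full (L, L) table of responses psi((m - n) / s) -- O(L^2) storage.
full = ricker((pos[:, None] - pos[None, :]) / s)

# Optimized: store only the 2L - 1 distinct offsets, gather on demand -- O(L).
offsets = np.arange(-(L - 1), L)
table = ricker(offsets / s)
gathered = table[(pos[:, None] - pos[None, :]) + (L - 1)]

assert np.allclose(full, gathered)
```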
In graph settings, one first diagonalizes the Laplacian to obtain the eigenvectors $\phi_1, \dots, \phi_k$; each node stores a $k$-dimensional coordinate vector, for $O(nk)$ memory (a dense eigendecomposition costs $O(n^3)$, with sparse iterative solvers far cheaper in practice). The per-layer rotary encoding is a matrix–vector product computing the angles (cost $O(nkd)$), followed by $O(nd)$ work for the blockwise rotations. No adjacency or position matrix is expanded. For standard or Performer-style linear attention, WIRE is fully compatible due to rotational invariance.
5. Theoretical Properties and Structural Guarantees
WIRE is permutation equivariant in the graph domain, inherited from the transformation properties of Laplacian eigenvectors under node relabeling. On uniform grids (1D/2D), it recovers the standard RoPE scheme as a special case. Attention scores computed using WIRE carry an asymptotic distance bias related to the effective resistance distance on the graph: under random Gaussian initialization $\theta_j \sim \mathcal{N}(0, \sigma^2 I)$, the expected attenuation of query–key dot products is proportional to

$$\exp\!\left(-\tfrac{\sigma^2}{2}\, \|\phi(u) - \phi(v)\|^2\right),$$

which tracks the effective resistance distance, encoding a principled distance bias without explicit attention masks.
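The Gaussian attenuation identity behind this bias, $\mathbb{E}_{\theta \sim \mathcal{N}(0,\sigma^2 I)}[\cos(\theta^\top \Delta\phi)] = e^{-\sigma^2 \|\Delta\phi\|^2 / 2}$, can be checked by Monte Carlo. This verifies only the identity, not the papers' full resistance-distance bound:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, k = 0.5, 4
dphi = rng.standard_normal(k)  # spectral coordinate gap phi(u) - phi(v)

# Monte Carlo over Gaussian frequency vectors theta ~ N(0, sigma^2 I).
theta = sigma * rng.standard_normal((200_000, k))
mc = np.cos(theta @ dphi).mean()  # estimated per-block attenuation
closed = np.exp(-sigma**2 * np.dot(dphi, dphi) / 2)
assert abs(mc - closed) < 1e-2
```

Nodes that are far apart in spectral coordinates therefore see their dot products shrunk multiplicatively, without any mask being applied.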
On sequences, multi-scale continuous wavelets internalize both smooth (global) and abrupt (local) positional changes, enabling superior extrapolation in both short and long contexts compared to RoPE and ALiBi. Wavelet type impacts performance: Ricker and Gaussian wavelets are most effective; discrete orthogonal wavelets are less robust unless chosen for many vanishing moments.
6. Empirical Results and Practical Guidelines
Summary of Empirical Findings
| Setting | Baseline(s) | WIRE Configuration | Performance (Example) |
|---|---|---|---|
| LLMs (Wikitext-103, Llama-2-7B, CodeParrot) | NoPE, RoPE, ALiBi, Transformer-XL | Ricker wavelet, 8 scales | Outperforms baselines in all perplexity extrapolations |
| Point clouds (ModelNet40, ShapeNet) | NoPE, Cartesian RoPE | 10 Laplacian eigenvectors | Equal or better than standard position encoding |
| Graph regression / classification | APE, NoPE, Performer, GPS | Up to $10$ spectral dims | At least $10\%$ reduction in RMSE; accuracy consistently improved |
On sequence tasks of increasing length, WIRE exhibits monotonically better extrapolation than RoPE (e.g., perplexity 8.60 vs. 8.90 for Llama-2-7B at 32k tokens), and on synthetic graph benchmarks it halves RMSE relative to absolute positional encoding (Oka et al., 4 Feb 2025, Reid et al., 26 Sep 2025).
Implementation Recommendations
- For transformers, select continuous wavelets with at least one vanishing moment (Ricker or Gaussian-derived); set the number of scales as powers of two (e.g., $8$ scales in the reported experiments).
- For graphs, precompute the $k$ smallest nontrivial Laplacian eigenvectors (e.g., $k = 10$ in the reported experiments); allocate learnable frequency vectors $\theta$ per head/layer as resources permit.
- In both modalities, WIRE replaces RoPE or standard positional encodings in all attention layers without increasing asymptotic complexity.
- Use vectorized GPU operations for broadcasting, and sparse eigensolvers for eigenvector computation on large graphs.
- Monitor memory and employ scatter/lookup optimizations for large sequence lengths $L$ or graph sizes $n$.
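For the eigenvector precomputation on large graphs, a sparse iterative solver avoids the dense $O(n^3)$ decomposition. A minimal sketch using SciPy's ARPACK wrapper (`sparse_spectral_coords` is an illustrative helper, not an API from the papers):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def sparse_spectral_coords(A, k):
    """k smallest nontrivial Laplacian eigenvectors via a sparse solver."""
    deg = np.asarray(A.sum(axis=1)).ravel()
    L = sp.diags(deg) - A
    # which="SM" requests the smallest-magnitude eigenvalues; for very large
    # graphs, shift-invert mode (sigma near 0) converges much faster.
    w, V = eigsh(L, k=k + 1, which="SM")
    order = np.argsort(w)
    return V[:, order[1:]]  # drop the trivial lambda = 0 constant mode

# Path graph on 50 nodes as a small smoke test.
n = 50
A = sp.diags([np.ones(n - 1), np.ones(n - 1)], [-1, 1], format="csr")
phi = sparse_spectral_coords(A, k=4)
assert phi.shape == (n, 4)
```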
7. Significance, Extensions, and Open Directions
WIRE unifies temporal, spatial, and topological priors within attention frameworks, extending positional representation from uniform lines and grids to arbitrary graphs. This yields improved generalization when extrapolating beyond training context windows and a stronger inductive bias for structured inputs. The method preserves attention connectivity, maintains compatibility with $O(n)$ linear attention regimes, and introduces only marginal parameter and runtime overhead. Possible adaptations include integration with NTK-aware scaling, LongRoPE, or hybrid GNN-attention architectures.
A plausible implication is that further gains may be realized by combining WIRE with model-specific extrapolation techniques or learning dynamic scale/shift parameters. Open questions concern optimal selection of wavelet families and scale regimes for arbitrary input topologies, and efficient scalable eigen-computations for very-large graphs.
WIRE's empirical and theoretical results demonstrate robust advantages for both LLMs and graph-based models, with principled improvements in extrapolation and structural awareness (Oka et al., 4 Feb 2025, Reid et al., 26 Sep 2025).