WaveLSFormer: Wavelet-Enhanced Transformers
- The paper introduces WaveLSFormer, which embeds both learnable and fixed wavelet transforms into Transformer blocks to jointly model temporal and frequency information.
- Empirical results show that in equity trading, WaveLSFormer achieved a mean ROI of 0.607 and a Sharpe ratio of 2.157, markedly outperforming standard Transformer and LSTM baselines.
- The architecture employs low-guided high-frequency injection and wavelet-space attention for efficient multi-scale fusion, offering robust long-sequence performance with minimal computational overhead.
WaveLSFormer refers to a family of wavelet-enhanced Transformer architectures in which multi-scale, time-frequency representations are directly integrated into the attention or decision-making mechanisms of neural sequence models. Two prominently cited implementations are (a) WaveLSFormer for long-short equity trading and risk-adjusted return optimization (Li et al., 19 Jan 2026), and (b) Wavelet Space Attention Transformer (WavSpA, sometimes also called WaveLSFormer) for boosting long-sequence learning (Zhuang et al., 2022). These models incorporate learnable or fixed wavelet transforms into the Transformer block structure, enabling joint learning over both temporal and frequency domains.
1. Architectural Foundations
WaveLSFormer structures augment the standard Transformer framework by embedding wavelet transforms as pre-attention or intra-attention modules. This design enables explicit multi-scale representation learning, which is particularly well-suited for domains with non-stationary or compositionally structured input, such as financial time series or long-range sequence tasks.
In the (Li et al., 19 Jan 2026) variant for finance, the model is built around an end-to-end learnable FIR filter bank, parameterized as real-valued kernels $g, h \in \mathbb{R}^{K}$. These kernels convolve with the raw log-return input $x$ to generate low- and high-frequency components, $X_L = x * g$ and $X_H = x * h$. The frequency response is controlled via spectral-domain regularization (see Section 2).
In the WavSpA block in (Zhuang et al., 2022), the forward Discrete Wavelet Transform (DWT) is applied along the token axis, $W_{j,k} = \sum_{t} x_t \, \psi_{j,k}(t)$, where $j$ is scale, $k$ is position, and $\psi_{j,k}$ is the wavelet basis. Attention is computed in the transformed space, and the representation is mapped back to the original space via inverse DWT.
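As a concrete illustration, a single-level DWT along the token axis can be written with the simplest orthonormal basis, the Haar wavelet (a minimal sketch; the papers also use longer filters such as Daubechies-2):

```python
# Single-level Haar DWT along the token axis, plus its exact inverse.
import math

def haar_dwt(x):
    """Split a length-2n sequence into approximation (low) and detail (high) coefficients."""
    s = 1 / math.sqrt(2)
    n = len(x) // 2
    approx = [(x[2*i] + x[2*i+1]) * s for i in range(n)]
    detail = [(x[2*i] - x[2*i+1]) * s for i in range(n)]
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse transform: perfect reconstruction from (approx, detail)."""
    s = 1 / math.sqrt(2)
    x = []
    for a, d in zip(approx, detail):
        x.extend([(a + d) * s, (a - d) * s])
    return x

tokens = [1.0, 3.0, 2.0, 2.0]
lo, hi = haar_dwt(tokens)
# The low band captures local averages, the high band local differences,
# and the inverse recovers the original tokens exactly.
assert all(abs(a - b) < 1e-9 for a, b in zip(haar_idwt(lo, hi), tokens))
```

Deeper decompositions simply reapply the same split to the approximation band, yielding the multi-scale coefficient pyramid that WavSpA attends over.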
2. Wavelet-Based Feature Extraction and Regularization
Learnable Wavelet Front-End (Li et al., 19 Jan 2026)
The model's FIR filters are optimized not only via task supervision but also through a suite of spectral-separation regularizers. These are applied using the discrete Fourier transform (DFT) of the kernel coefficients, $\hat{G}(\omega)$ and $\hat{H}(\omega)$, enabling gradient-based tuning:
- Band separation: $\mathcal{L}_{\text{band}} = \sum_{\omega} |\hat{G}(\omega)|^2 \, |\hat{H}(\omega)|^2$, penalizing overlap between the low- and high-pass responses.
- Orthogonality and energy balancing: $\mathcal{L}_{\text{orth}} = \langle g, h \rangle^2$ and $\mathcal{L}_{\text{energy}} = \left(\|g\|^2 - \|h\|^2\right)^2$, keeping the filters near-orthogonal with comparable energy.
The aggregate wavelet loss is the weighted sum $\mathcal{L}_{\text{wav}} = \lambda_{\text{band}}\mathcal{L}_{\text{band}} + \lambda_{\text{orth}}\mathcal{L}_{\text{orth}} + \lambda_{\text{energy}}\mathcal{L}_{\text{energy}}$.
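A minimal sketch of spectral regularizers of this kind follows; the exact penalty forms and weights used in the paper may differ, and the overlap/orthogonality/energy terms here (`wavelet_reg`) are illustrative choices:

```python
# Illustrative spectral-separation penalties on a low-/high-pass FIR filter pair.
import cmath
import math

def dft_mag2(h, n=64):
    """Squared magnitude of the DFT of filter h on n frequency bins."""
    return [abs(sum(hk * cmath.exp(-2j * math.pi * w * k / n)
                    for k, hk in enumerate(h))) ** 2 for w in range(n)]

def wavelet_reg(g, h):
    """Band-separation, orthogonality, and energy-balance penalties (illustrative forms)."""
    G2, H2 = dft_mag2(g), dft_mag2(h)
    band = sum(a * b for a, b in zip(G2, H2))       # penalize spectral overlap
    orth = sum(a * b for a, b in zip(g, h)) ** 2    # penalize <g, h> != 0
    energy = (sum(a * a for a in g) - sum(b * b for b in h)) ** 2
    return band, orth, energy

s = 1 / math.sqrt(2)
g, h = [s, s], [s, -s]          # Haar pair: orthogonal, equal energy
band, orth, energy = wavelet_reg(g, h)
assert orth < 1e-12 and energy < 1e-12  # an ideal pair zeroes these two terms
```

Because every term is a smooth function of the kernel coefficients, all three penalties are differentiable and can be minimized jointly with the task loss.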
Fixed and Adaptive Wavelet Bases (Zhuang et al., 2022)
WaveLSFormer supports both fixed orthonormal wavelet bases (e.g., Daubechies-2, Symlet-2) and several adaptive parameterizations:
- Direct: Learn low-pass coefficients $g$; derive the high-pass filter via the QMF relation $h_k = (-1)^k\, g_{K-1-k}$
- Orthonormal: Parameterize via Givens-rotation / up-shift matrices; filter is always orthonormal
- Lifting: Employ "split→update→predict" steps with learned update/predict scalars
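For the direct parameterization, the QMF relation $h_k = (-1)^k g_{K-1-k}$ guarantees that the derived high-pass filter is orthogonal to the learned low-pass filter. A quick check with the standard Daubechies-2 low-pass coefficients:

```python
import math

def qmf(g):
    """Derive the high-pass filter from low-pass g via the quadrature-mirror relation."""
    K = len(g)
    return [(-1) ** k * g[K - 1 - k] for k in range(K)]

# Standard Daubechies-2 low-pass coefficients.
r3 = math.sqrt(3)
g = [(1 + r3) / (4 * math.sqrt(2)), (3 + r3) / (4 * math.sqrt(2)),
     (3 - r3) / (4 * math.sqrt(2)), (1 - r3) / (4 * math.sqrt(2))]

h = qmf(g)
# The pair is orthogonal, and the low-pass filter has unit energy.
assert abs(sum(a * b for a, b in zip(g, h))) < 1e-12
assert abs(sum(a * a for a in g) - 1.0) < 1e-12
```

Because the high-pass is a deterministic function of the low-pass, gradients flow through both branches while only one set of coefficients is stored.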
3. Multi-Scale Fusion and Attention Mechanisms
Low-Guided High-Frequency Injection (LGHI) (Li et al., 19 Jan 2026)
Multi-scale fusion is performed via the LGHI module, which fuses low-frequency features $X_L$ with high-frequency features $X_H$ as $Z = X_L + \tanh(\alpha)\, X_H$. The scalar $\alpha$ is initialized such that $\tanh(\alpha)$ is near zero, making the injection initially negligible and stabilizing early training. The backward gradient through $X_H$ is modulated by $\tanh(\alpha)$.
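The gating behavior can be sketched as follows; this is a hedged toy version assuming a `tanh` gate on a learnable scalar, with the paper's exact gating function and feature shapes left aside:

```python
import math

def lghi_fuse(x_low, x_high, alpha=0.0):
    """Low-guided high-frequency injection: add detail features scaled by a gate.
    With alpha initialized at 0, tanh(alpha) = 0 and the fusion starts as the
    identity on the low-frequency path, stabilizing early training."""
    gate = math.tanh(alpha)
    return [lo + gate * hi for lo, hi in zip(x_low, x_high)]

x_low, x_high = [1.0, 2.0], [0.5, -0.5]
assert lghi_fuse(x_low, x_high, alpha=0.0) == x_low   # injection is off at init
```

As training increases $\alpha$, the gate opens smoothly and high-frequency detail is injected in proportion to $\tanh(\alpha)$; the same factor scales the gradient flowing back through the high-frequency branch.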
Wavelet-Space Attention (Zhuang et al., 2022)
In the WavSpA block, multi-head scaled-dot-product attention is applied in the wavelet coefficient domain: $\mathrm{Attn}(Q_w, K_w, V_w) = \mathrm{softmax}\!\left(Q_w K_w^\top / \sqrt{d}\right) V_w$, where $Q_w$, $K_w$, $V_w$ are projections of the wavelet coefficients. The updated coefficients are returned to the token domain via inverse DWT.
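The transform–attend–invert pipeline can be sketched end-to-end in miniature. This toy uses a Haar DWT and single-head scalar attention (dimension $d = 1$, no learned projections), which is far simpler than the real block but follows the same structure:

```python
import math

def haar(x):
    s = 1 / math.sqrt(2)
    n = len(x) // 2
    return ([(x[2*i] + x[2*i+1]) * s for i in range(n)],
            [(x[2*i] - x[2*i+1]) * s for i in range(n)])

def ihaar(lo, hi):
    s = 1 / math.sqrt(2)
    out = []
    for a, d in zip(lo, hi):
        out.extend([(a + d) * s, (a - d) * s])
    return out

def attend(seq):
    """Scalar softmax self-attention: each coefficient attends to all others."""
    out = []
    for q in seq:
        scores = [q * k for k in seq]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append(sum(wi / z * v for wi, v in zip(w, seq)))
    return out

def wavspa_block(x):
    lo, hi = haar(x)                    # forward DWT along the token axis
    mixed = attend(lo + hi)             # attention over all scales jointly
    n = len(lo)
    return ihaar(mixed[:n], mixed[n:])  # inverse DWT back to token space

y = wavspa_block([1.0, 3.0, 2.0, 2.0])
assert len(y) == 4                      # output stays in token space
```

The key design point survives even in this toy: attention scores are computed between coefficients that carry both position and scale information, so cross-scale interactions are first-class rather than implicit.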
4. Training Protocols and Optimization Objectives
WaveLSFormer models leverage domain-specific and risk-aware training objectives, in addition to standard supervised learning.
For long-short equity trading (Li et al., 19 Jan 2026), the training objective integrates:
- Supervised soft-label cross-entropy on probabilistic targets $y \in [0,1]^C$, with $\sum_c y_c = 1$.
- Overfitting penalty based on deviations of batch ROI from allowable thresholds.
- Sharpe-ratio regularization, promoting high risk-adjusted returns: $\mathcal{L}_{\text{Sharpe}} = -\lambda_S \, \widehat{SR}$, with $\widehat{SR} = \bar{r} / (\sigma_r + \epsilon)$ computed from the batch returns $r$.
- Wavelet regularization loss as described above.
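The Sharpe-ratio term above can be sketched as a loss on a batch of returns; this is an illustrative form (constant and annualization factors omitted, `lam` and `eps` are assumed hyperparameters):

```python
import math

def sharpe_loss(returns, lam=1.0, eps=1e-8):
    """Negative Sharpe ratio as a regularizer: minimizing it pushes mean return
    up and return volatility down (illustrative form)."""
    n = len(returns)
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / n
    sharpe = mean / (math.sqrt(var) + eps)
    return -lam * sharpe

steady = [0.010, 0.012, 0.011, 0.009]   # same mean return, low volatility
choppy = [0.050, -0.040, 0.060, -0.028] # same mean return, high volatility
assert sharpe_loss(steady) < sharpe_loss(choppy)   # steady stream is preferred
```

Because the loss prefers the low-volatility return stream even when mean ROI is identical, it steers the model toward risk-adjusted rather than raw profitability.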
A two-phase training schedule is used: spectral loss is optimized first (epochs 1–30), followed by full loss (epochs 31–80), and models are selected by maximum validation ROI.
For WavSpA (Zhuang et al., 2022), the learning objective is standard cross-entropy or MSE on task outputs, with trainable wavelet parameters updated via backpropagation.
5. Computational Complexity and Efficiency
Wavelet transform modules introduce negligible computational overhead compared to the $O(L^2)$ scaling of full attention; the fast wavelet transform and its inverse each cost $O(L)$ per decomposition level.
- WavSpA: Total complexity is $O(L^2 d + Ld)$; for practical sequence lengths $L$, the overhead due to wavelets is inconsequential.
- Overhead: Direct adaptive parameterization adds approximately 12% to training time; lifting-based transforms can reduce runtime (to 64%/67% of baseline for training/inference at decomposition level $\ell$), since the smaller per-band attention matrices dominate the savings.
6. Empirical Performance and Applications
Equity Trading (Li et al., 19 Jan 2026)
On five years of hourly U.S. equity data across six industry universes:
- WaveLSFormer (learnable wavelets + LGHI + Sharpe loss): mean ROI $0.607$, Sharpe ratio $2.157$ (averaged over 10 seeds and all sectors)
- Baselines:
- Plain Transformer: ROI $0.225$, Sharpe $1.024$
- Transformer + fixed wavelet DWT: ROI $0.346$, Sharpe $1.439$
- LSTM + fixed wavelet: ROI $0.317$, Sharpe $1.879$
- This demonstrates significant advantages in both absolute and risk-adjusted profitability.
Long-Sequence Modeling (Zhuang et al., 2022)
On Long Range Arena (LRA) benchmarks:
- WavSpA with a fixed Daubechies-2 wavelet improves mean test accuracy over the baseline Transformer across LRA tasks.
- AdaWavSpA (direct adaptive parameterization) improves further. On LEGO chain-of-reasoning, AdaWavSpA also generalizes more robustly to longer sequences: the plain Transformer's accuracy degrades sharply from 14-variable to 20-variable chains, whereas AdaWavSpA's degrades far less.
Runtime overhead remains modest even for large $L$: fixed orthonormal wavelets add little latency, and direct adaptive filters only modestly more.
7. Relation to Broader Research and Methodological Variants
Wavelet-based Transformers relate closely to other approaches that integrate alternative bases into attention mechanisms, such as Fourier-transform methods (AFNO, GFNet), which gain global frequency resolution at the cost of precise time localization. WavSpA and related modules exploit wavelet transforms' capacity to encode position and scale simultaneously, enhancing the model's ability to capture transients, multi-scale patterns, and nonstationary signals.
Adaptive wavelet parameterizations provide explicit learnable multi-resolution structure, contrasting with fixed-filter approaches, and enabling the representation to adapt to the statistical structure present in financial, text, or reasoning sequences. Notably, the LGHI mechanism in (Li et al., 19 Jan 2026) is a domain-specific development not present in (Zhuang et al., 2022), introduced to mitigate training instability and enhance multi-scale fusion in noisy, regime-switching environments such as high-frequency trading.
A plausible implication is that as sequence tasks become more compositionally structured or exhibit strong multi-scale dependencies, wavelet-augmented attention blocks may supersede purely time or purely frequency-based alternatives in both accuracy and sample-efficiency.