WaveLSFormer: Wavelet-Enhanced Transformers

Updated 26 January 2026
  • The paper introduces WaveLSFormer, which embeds both learnable and fixed wavelet transforms into Transformer blocks to jointly model temporal and frequency information.
  • Empirical results show that in equity trading, WaveLSFormer achieved a mean ROI of 0.607 and a Sharpe ratio of 2.157, markedly outperforming standard Transformer and LSTM baselines.
  • The architecture employs low-guided high-frequency injection and wavelet-space attention for efficient multi-scale fusion, offering robust long-sequence performance with minimal computational overhead.

WaveLSFormer refers to a family of wavelet-enhanced Transformer architectures in which multi-scale, time-frequency representations are directly integrated into the attention or decision-making mechanisms of neural sequence models. Two prominently cited implementations are (a) WaveLSFormer for long-short equity trading and risk-adjusted return optimization (Li et al., 19 Jan 2026), and (b) Wavelet Space Attention Transformer (WavSpA, sometimes also called WaveLSFormer) for boosting long-sequence learning (Zhuang et al., 2022). These models incorporate learnable or fixed wavelet transforms into the Transformer block structure, enabling joint learning over both temporal and frequency domains.

1. Architectural Foundations

WaveLSFormer structures augment the standard Transformer framework by embedding wavelet transforms as pre-attention or intra-attention modules. This design enables explicit multi-scale representation learning, which is particularly well-suited for domains with non-stationary or compositionally structured input, such as financial time series or long-range sequence tasks.

In the (Li et al., 19 Jan 2026) variant for finance, the model is built around an end-to-end learnable FIR filter bank, parameterized as real-valued kernels $\theta^{\mathrm{low}}, \theta^{\mathrm{high}} \in \mathbb{R}^L$. These kernels convolve with the raw log-return input $x[n]$ to generate low- and high-frequency components:

$$y^{\mathrm{low}}[n] = \sum_{k=0}^{L-1} \theta^{\mathrm{low}}_k\, x[n-k], \qquad y^{\mathrm{high}}[n] = \sum_{k=0}^{L-1} \theta^{\mathrm{high}}_k\, x[n-k]$$

The frequency response is controlled via spectral-domain regularization (see Section 2).
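As a minimal sketch (not the authors' released implementation), the learnable two-band filter bank can be realized as a pair of causal 1-D convolutions over the log-return series; the kernel length, initialization, and tensor shapes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LearnableFilterBank(nn.Module):
    """Illustrative learnable two-band FIR filter bank (kernel length L)."""
    def __init__(self, kernel_len: int = 8):
        super().__init__()
        # theta_low, theta_high: real-valued FIR kernels, learned end to end
        self.theta_low = nn.Parameter(torch.randn(kernel_len) * 0.1)
        self.theta_high = nn.Parameter(torch.randn(kernel_len) * 0.1)

    def forward(self, x: torch.Tensor):
        # x: (batch, time) log-return series
        L = self.theta_low.numel()
        # left-pad so y[n] depends only on x[n-k], k >= 0 (causal convolution)
        xp = nn.functional.pad(x.unsqueeze(1), (L - 1, 0))
        # conv1d cross-correlates, so flip the kernels to obtain a true convolution
        y_low = nn.functional.conv1d(xp, self.theta_low.flip(0).view(1, 1, L))
        y_high = nn.functional.conv1d(xp, self.theta_high.flip(0).view(1, 1, L))
        return y_low.squeeze(1), y_high.squeeze(1)

fb = LearnableFilterBank(kernel_len=8)
low, high = fb(torch.randn(4, 256))   # two band-limited views of the input
```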

In the WavSpA block in (Zhuang et al., 2022), the forward Discrete Wavelet Transform (DWT) is applied along the token axis:

$$W(i, j) = \sum_{k=1}^{n} x(t_k)\, \psi^*_{i, j}(t_k)$$

where $i$ is the scale, $j$ is the position, and $\psi_{i,j}$ is the wavelet basis. Attention is computed in the transformed space, and the representation is mapped back to the original space via the inverse DWT.
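For intuition, a single-level Haar analysis/synthesis pair along the token axis is sketched below; the paper's multi-level, general-wavelet transform would replace these fixed Haar filters, and the even-length assumption is for brevity.

```python
import torch

def haar_dwt(x: torch.Tensor):
    """Single-level Haar DWT along the token axis. x: (batch, tokens, dim), tokens even."""
    even, odd = x[:, 0::2, :], x[:, 1::2, :]
    approx = (even + odd) / 2 ** 0.5   # low-frequency (scaling) coefficients
    detail = (even - odd) / 2 ** 0.5   # high-frequency (wavelet) coefficients
    return approx, detail

def haar_idwt(approx: torch.Tensor, detail: torch.Tensor):
    """Inverse transform: exactly reconstructs the original token sequence."""
    even = (approx + detail) / 2 ** 0.5
    odd = (approx - detail) / 2 ** 0.5
    x = torch.stack((even, odd), dim=2)           # interleave even/odd positions
    return x.reshape(x.shape[0], -1, x.shape[-1])

x = torch.randn(2, 128, 64)
a, d = haar_dwt(x)
assert torch.allclose(haar_idwt(a, d), x, atol=1e-6)  # perfect reconstruction
```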

2. Wavelet-Based Feature Extraction and Regularization

The model's FIR filters are optimized not only via task supervision but also via a suite of spectral-separation regularizers. These are applied to the discrete Fourier transform (DFT) of the kernel coefficients, enabling gradient-based tuning (a code sketch follows the aggregate loss below):

  • Band separation:

$$\mathcal{L}_{\mathrm{low}} = \sum_\omega \left(\frac{\omega}{\pi}\right)^p |G_{\mathrm{low}}(\omega)|^2, \qquad \mathcal{L}_{\mathrm{high}} = \sum_\omega \left(1 - \frac{\omega}{\pi}\right)^p |G_{\mathrm{high}}(\omega)|^2$$

  • Orthogonality and energy balancing:

$$\mathcal{L}_{\mathrm{overlap}} = \|G_{\mathrm{low}}\|^2 \,\|G_{\mathrm{high}}\|^2, \qquad \mathcal{L}_{\mathrm{parseval}} = \left( \|G_{\mathrm{low}}\|^2 + \|G_{\mathrm{high}}\|^2 - 2 \right)^2$$

and

$$\mathcal{L}_{\mathrm{ratio}} = \max(\rho - \rho_{\max}, 0) + \max(\rho_{\min} - \rho, 0)$$

which penalizes the ratio $\rho$ whenever it leaves the interval $[\rho_{\min}, \rho_{\max}]$.

The aggregate wavelet loss is:

$$\mathcal{L}_{\mathrm{wavelet}} = \lambda_{\mathrm{spec}}(\mathcal{L}_{\mathrm{low}} + \mathcal{L}_{\mathrm{high}}) + \mathcal{L}_{\mathrm{overlap}} + \mathcal{L}_{\mathrm{parseval}} + \mathcal{L}_{\mathrm{ratio}}$$
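A hedged sketch of the band-separation and energy-balancing penalties: each kernel's frequency response is obtained with torch.fft.rfft and the losses track the formulas above. The FFT length, the exponent p, and the use of the summed squared magnitude as the band energy are assumptions about unstated details.

```python
import torch

def wavelet_spectral_losses(theta_low, theta_high, p: float = 2.0, n_fft: int = 64):
    """Spectral-separation penalties on two FIR kernels (illustrative)."""
    G_low = torch.fft.rfft(theta_low, n=n_fft)     # frequency response on [0, pi]
    G_high = torch.fft.rfft(theta_high, n=n_fft)
    omega = torch.linspace(0.0, 1.0, G_low.numel())  # omega / pi in [0, 1]

    # push low-pass energy toward omega = 0 and high-pass energy toward omega = pi
    L_low = (omega ** p * G_low.abs() ** 2).sum()
    L_high = ((1.0 - omega) ** p * G_high.abs() ** 2).sum()

    e_low = (G_low.abs() ** 2).sum()
    e_high = (G_high.abs() ** 2).sum()
    L_overlap = e_low * e_high                     # overlap / orthogonality term as above
    L_parseval = (e_low + e_high - 2.0) ** 2       # keep total band energy near 2

    return L_low + L_high, L_overlap, L_parseval

l_band, l_overlap, l_parseval = wavelet_spectral_losses(torch.randn(8), torch.randn(8))
```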

WaveLSFormer supports both fixed orthonormal wavelet bases (e.g., Daubechies-2, Symlet-2) and several adaptive parameterizations:

  • Direct: Learn low-pass coefficients $\phi \in \mathbb{R}^n$; derive the high-pass filter via the QMF relation (see the sketch after this list)
  • Orthonormal: Parameterize via Givens-rotation / up-shift matrices; filter is always orthonormal
  • Lifting: Employ "split→update→predict" steps with learned update/predict scalars
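For the Direct parameterization, a common way to derive the high-pass filter from learned low-pass coefficients is the standard alternating-flip QMF relation; the exact sign and indexing convention used in the paper is not stated, so the sketch below should be read as an assumption.

```python
import torch

def qmf_highpass(phi: torch.Tensor) -> torch.Tensor:
    """Derive the high-pass filter from low-pass coefficients phi via the
    standard quadrature-mirror (alternating-flip) relation:
        g[k] = (-1)^k * phi[n - 1 - k]
    """
    n = phi.numel()
    signs = torch.tensor([(-1.0) ** k for k in range(n)])
    return signs * phi.flip(0)

# e.g. starting from the Haar scaling filter
phi = torch.tensor([1.0, 1.0]) / 2 ** 0.5
print(qmf_highpass(phi))   # tensor([ 0.7071, -0.7071])
```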

3. Multi-Scale Fusion and Attention Mechanisms

Multi-scale fusion is performed via the LGHI (low-guided high-frequency injection) module, which fuses low-frequency features $L \in \mathbb{R}^{T \times d}$ with high-frequency features $H$:

$$\begin{aligned} A(L) &= \mathrm{softmax}\!\left(\frac{(L W_Q)(L W_K)^\top}{\sqrt{d_k}}\right) \\ Z(L,H) &= A(L)\,(H W_V)\, W_O \\ Y &= L + \beta\, Z(L,H), \quad \beta = \sigma(\gamma) \end{aligned}$$

The scalar $\gamma$ is initialized such that $\beta$ is near zero, making the injection initially negligible and stabilizing early training. The backward gradient through $Z$ is modulated by $\beta$.
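Read directly off the equations above, a single-head LGHI block might look as follows; the negative initialization of γ (so that β = σ(γ) is near zero at the start of training) and all dimensions are illustrative choices, not the authors' exact settings.

```python
import torch
import torch.nn as nn

class LGHI(nn.Module):
    """Low-guided high-frequency injection (single head, illustrative)."""
    def __init__(self, d: int, d_k: int = 64):
        super().__init__()
        self.W_Q = nn.Linear(d, d_k, bias=False)
        self.W_K = nn.Linear(d, d_k, bias=False)
        self.W_V = nn.Linear(d, d_k, bias=False)
        self.W_O = nn.Linear(d_k, d, bias=False)
        # gamma starts strongly negative so beta = sigmoid(gamma) is ~0 at init
        self.gamma = nn.Parameter(torch.tensor(-5.0))
        self.d_k = d_k

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        # low, high: (batch, T, d); queries and keys come from the low band only
        attn = torch.softmax(
            self.W_Q(low) @ self.W_K(low).transpose(-2, -1) / self.d_k ** 0.5, dim=-1
        )
        z = self.W_O(attn @ self.W_V(high))       # high-frequency values, low-guided
        beta = torch.sigmoid(self.gamma)          # gated injection strength
        return low + beta * z
```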

In the WavSpA block, multi-head scaled dot-product attention is applied in the wavelet coefficient domain:

$$\begin{aligned} Q &= W W_q, \quad K = W W_k, \quad V = W W_v \\ A &= \mathrm{softmax}\!\left(Q K^\top / \sqrt{m}\right) \\ \hat{W} &= A V \end{aligned}$$

The updated coefficients $\hat{W}$ are returned to the token domain via the inverse DWT.
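A toy wavelet-space attention block combining a single-level Haar transform (re-stated compactly so the block stands alone) with standard multi-head attention is sketched below; applying one shared attention over the concatenated approximation and detail coefficients is a simplifying assumption made only to keep the example short.

```python
import torch
import torch.nn as nn

def haar_dwt(x):   # x: (B, T, d), T even -> approx, detail: (B, T//2, d)
    e, o = x[:, 0::2], x[:, 1::2]
    return (e + o) / 2 ** 0.5, (e - o) / 2 ** 0.5

def haar_idwt(a, d):
    e, o = (a + d) / 2 ** 0.5, (a - d) / 2 ** 0.5
    x = torch.stack((e, o), dim=2)
    return x.reshape(x.shape[0], -1, x.shape[-1])

class WaveletSpaceAttention(nn.Module):
    """Toy WavSpA-style block: DWT -> attention on coefficients -> inverse DWT."""
    def __init__(self, d: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, det = haar_dwt(x)                        # forward DWT along the token axis
        w = torch.cat([a, det], dim=1)              # stack coefficient bands as "tokens"
        w_hat, _ = self.attn(w, w, w)               # attention in wavelet space
        a_hat, det_hat = w_hat.chunk(2, dim=1)
        return haar_idwt(a_hat, det_hat)            # map back to the token domain

blk = WaveletSpaceAttention(d=64)
y = blk(torch.randn(2, 128, 64))                    # y: (2, 128, 64)
```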

4. Training Protocols and Optimization Objectives

WaveLSFormer models leverage domain-specific and risk-aware training objectives, in addition to standard supervised learning.

For long-short equity trading (Li et al., 19 Jan 2026), the training objective integrates:

  • Supervised soft-label cross-entropy on probabilistic targets $y_t = \sigma(k\, \hat{\ell}_t)$, with $\hat{\ell}_t = \log(1 + r_{j^*, t+1})$.
  • Overfitting penalty based on deviations of batch ROI from allowable thresholds.
  • Sharpe-ratio regularization, promoting high risk-adjusted returns:

$$\mathcal{L}_{\mathrm{sharpe}} = \exp\!\left(-\alpha \min\!\left(\frac{3}{\sqrt{K}}, \hat{S}\right)\right)$$

with

$$\hat{S} = \frac{\mathbb{E}[R_p]}{\sqrt{\mathrm{Var}(R_p) + \varepsilon}}$$
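A near-direct transcription of the Sharpe regularizer into code follows (a sketch: the batch of portfolio returns and the values of α, K, and ε are placeholders).

```python
import torch

def sharpe_loss(portfolio_returns: torch.Tensor, alpha: float = 1.0,
                K: int = 252, eps: float = 1e-8) -> torch.Tensor:
    """L_sharpe = exp(-alpha * min(3 / sqrt(K), S_hat)) over a batch of portfolio returns."""
    s_hat = portfolio_returns.mean() / torch.sqrt(portfolio_returns.var() + eps)
    capped = torch.clamp(s_hat, max=3.0 / K ** 0.5)   # min(3 / sqrt(K), S_hat)
    return torch.exp(-alpha * capped)

loss = sharpe_loss(torch.randn(64) * 0.01 + 0.001)
```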

A two-phase training schedule is used: the spectral loss is optimized first (epochs 1–30), followed by the full objective (epochs 31–80), and models are selected by maximum validation ROI; a sketch of this schedule follows.
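Assuming the phase boundaries and loss composition are exactly as stated, the schedule could be organized as below; the model, data loaders, and the wavelet_loss/task_loss/sharpe_loss/evaluate_roi helpers are hypothetical placeholders, not the authors' interfaces.

```python
def train_two_phase(model, optimizer, train_loader, val_loader,
                    n_epochs=80, spectral_epochs=30):
    """Two-phase schedule: spectral (wavelet) loss only, then the full objective.
    All model.* loss methods and evaluate_roi are hypothetical placeholders."""
    best_roi, best_state = float("-inf"), None
    for epoch in range(1, n_epochs + 1):
        for batch in train_loader:
            loss = model.wavelet_loss(batch)               # L_wavelet (Section 2)
            if epoch > spectral_epochs:                     # epochs 31-80: full loss
                loss = loss + model.task_loss(batch) + model.sharpe_loss(batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        roi = evaluate_roi(model, val_loader)               # placeholder validation metric
        if roi > best_roi:                                  # select by max validation ROI
            best_roi = roi
            best_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
    return best_state, best_roi
```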

For WavSpA (Zhuang et al., 2022), the learning objective is standard cross-entropy or MSE on task outputs, with trainable wavelet parameters updated via backpropagation.

5. Computational Complexity and Efficiency

Wavelet transform modules introduce negligible computational overhead compared to the $O(H n^2 m)$ scaling of full attention. The fast wavelet transform and its inverse have $O(n d)$ cost:

  • WavSpA: Total complexity is $O(n d + H n^2 m)$; for practical $n \gg d$, the overhead due to wavelets is inconsequential.
  • Overhead: Direct adaptive parameterization adds approximately 12% to training time; lifting-based transforms can reduce runtime (to 64%/67% of baseline for training/inference at level $L = 3$), since the smaller sub-band attention matrices then dominate the cost.

6. Empirical Performance and Applications

On five years of hourly U.S. equity data across six industry universes:

  • WaveLSFormer (learnable wavelets + LGHI + Sharpe loss): mean ROI $0.607 \pm 0.045$, Sharpe ratio $2.157 \pm 0.166$ (averaged over 10 seeds and all sectors)
  • Baselines:
    • Plain Transformer: ROI $0.225$, Sharpe $1.024$
    • Transformer + fixed wavelet DWT: ROI $0.346$, Sharpe $1.439$
    • LSTM + fixed wavelet: ROI $0.317$, Sharpe $1.879$

These results demonstrate significant advantages in both absolute and risk-adjusted profitability.

On Long Range Arena (LRA) benchmarks:

  • Baseline Transformer: mean test accuracy $54.39\%$
  • WavSpA with fixed D-2 wavelet: $62.90\%$
  • AdaWavSpA (direct adaptive): $70.59\%$

On the LEGO chain-of-reasoning task, AdaWavSpA generalizes more robustly to longer sequences:

  • Transformer: accuracy drops from $96\%$ (14 variables) to $60\%$ (20 variables)
  • AdaWavSpA: $98\%$ to $80\%$

Runtime overhead remains modest even for large $n$; fixed orthonormal wavelets add $2$–$3\%$ latency, and direct adaptive filters add $\approx 12\%$.

7. Relation to Broader Research and Methodological Variants

Wavelet-based Transformers relate closely to other approaches that integrate alternative bases into attention mechanisms, such as Fourier transform-based methods (AFNO, GFNet), which offer frequency localization at the expense of precise time localization. WavSpA and related modules exploit the capacity of wavelet transforms to encode position and scale simultaneously, enhancing the model's ability to capture transients, multi-scale patterns, and nonstationary signals.

Adaptive wavelet parameterizations provide explicit learnable multi-resolution structure, in contrast with fixed-filter approaches, enabling the representation to adapt to the statistical structure of financial, text, or reasoning sequences. Notably, the LGHI mechanism in (Li et al., 19 Jan 2026) is a domain-specific development not present in (Zhuang et al., 2022), introduced to mitigate training instability and enhance multi-scale fusion in noisy, regime-switching environments such as high-frequency trading.

A plausible implication is that as sequence tasks become more compositionally structured or exhibit strong multi-scale dependencies, wavelet-augmented attention blocks may supersede purely time or purely frequency-based alternatives in both accuracy and sample-efficiency.
