WaveLSFormer: Wavelet-Enhanced Transformers

Updated 26 January 2026
  • The paper introduces WaveLSFormer, which embeds both learnable and fixed wavelet transforms into Transformer blocks to jointly model temporal and frequency information.
  • Empirical results show that in equity trading, WaveLSFormer achieved a mean ROI of 0.607 and a Sharpe ratio of 2.157, markedly outperforming standard Transformer and LSTM baselines.
  • The architecture employs low-guided high-frequency injection and wavelet-space attention for efficient multi-scale fusion, offering robust long-sequence performance with minimal computational overhead.

WaveLSFormer refers to a family of wavelet-enhanced Transformer architectures in which multi-scale, time-frequency representations are directly integrated into the attention or decision-making mechanisms of neural sequence models. Two prominently cited implementations are (a) WaveLSFormer for long-short equity trading and risk-adjusted return optimization (Li et al., 19 Jan 2026), and (b) Wavelet Space Attention Transformer (WavSpA, sometimes also called WaveLSFormer) for boosting long-sequence learning (Zhuang et al., 2022). These models incorporate learnable or fixed wavelet transforms into the Transformer block structure, enabling joint learning over both temporal and frequency domains.

1. Architectural Foundations

WaveLSFormer structures augment the standard Transformer framework by embedding wavelet transforms as pre-attention or intra-attention modules. This design enables explicit multi-scale representation learning, which is particularly well-suited for domains with non-stationary or compositionally structured input, such as financial time series or long-range sequence tasks.

In the (Li et al., 19 Jan 2026) variant for finance, the model is built around an end-to-end learnable FIR filter bank, parameterized as real-valued kernels $\theta^{\mathrm{low}}, \theta^{\mathrm{high}} \in \mathbb{R}^L$. These kernels convolve with the raw log-return input $x[n]$ to generate low- and high-frequency components:

$$y^{\mathrm{low}}[n] = \sum_{k=0}^{L-1} \theta^{\mathrm{low}}_k\, x[n-k], \qquad y^{\mathrm{high}}[n] = \sum_{k=0}^{L-1} \theta^{\mathrm{high}}_k\, x[n-k]$$

The frequency response is controlled via spectral-domain regularization (see Section 2).
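As a minimal sketch (not the authors' released implementation), the learnable two-band filter bank can be realized as a pair of causal 1-D convolutions over the log-return series; the kernel length, initialization, and tensor shapes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LearnableFilterBank(nn.Module):
    """Illustrative learnable two-band FIR filter bank (kernel length L)."""
    def __init__(self, kernel_len: int = 8):
        super().__init__()
        # theta_low, theta_high: real-valued FIR kernels, learned end to end
        self.theta_low = nn.Parameter(torch.randn(kernel_len) * 0.1)
        self.theta_high = nn.Parameter(torch.randn(kernel_len) * 0.1)

    def forward(self, x: torch.Tensor):
        # x: (batch, time) log-return series
        L = self.theta_low.numel()
        # left-pad so y[n] depends only on x[n-k], k >= 0 (causal convolution)
        xp = nn.functional.pad(x.unsqueeze(1), (L - 1, 0))
        # conv1d cross-correlates, so flip the kernels to obtain a true convolution
        y_low = nn.functional.conv1d(xp, self.theta_low.flip(0).view(1, 1, L))
        y_high = nn.functional.conv1d(xp, self.theta_high.flip(0).view(1, 1, L))
        return y_low.squeeze(1), y_high.squeeze(1)

fb = LearnableFilterBank(kernel_len=8)
low, high = fb(torch.randn(4, 256))   # two band-limited views of the input
```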

In the WavSpA block in (Zhuang et al., 2022), the forward Discrete Wavelet Transform (DWT) is applied along the token axis:

$$W(i, j) = \sum_{k=1}^{n} x(t_k)\, \psi^*_{i, j}(t_k)$$

where $i$ is the scale, $j$ is the position, and $\psi_{i,j}$ is the wavelet basis. Attention is computed in the transformed space, and the representation is mapped back to the original space via the inverse DWT.
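For intuition, a single-level Haar analysis/synthesis pair along the token axis is sketched below; the paper's multi-level, general-wavelet transform would replace these fixed Haar filters, and the even-length assumption is for brevity.

```python
import torch

def haar_dwt(x: torch.Tensor):
    """Single-level Haar DWT along the token axis. x: (batch, tokens, dim), tokens even."""
    even, odd = x[:, 0::2, :], x[:, 1::2, :]
    approx = (even + odd) / 2 ** 0.5   # low-frequency (scaling) coefficients
    detail = (even - odd) / 2 ** 0.5   # high-frequency (wavelet) coefficients
    return approx, detail

def haar_idwt(approx: torch.Tensor, detail: torch.Tensor):
    """Inverse transform: exactly reconstructs the original token sequence."""
    even = (approx + detail) / 2 ** 0.5
    odd = (approx - detail) / 2 ** 0.5
    x = torch.stack((even, odd), dim=2)           # interleave even/odd positions
    return x.reshape(x.shape[0], -1, x.shape[-1])

x = torch.randn(2, 128, 64)
a, d = haar_dwt(x)
assert torch.allclose(haar_idwt(a, d), x, atol=1e-6)  # perfect reconstruction
```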

2. Wavelet-Based Feature Extraction and Regularization

The model's FIR filters are optimized not only via task supervision but also via a suite of spectral-separation regularizers. These are applied to the discrete Fourier transform (DFT) of the kernel coefficients, enabling gradient-based tuning (a code sketch follows the aggregate loss below):

  • Band separation:

$$\mathcal{L}_{\mathrm{low}} = \sum_\omega \left(\frac{\omega}{\pi}\right)^p |G_{\mathrm{low}}(\omega)|^2, \qquad \mathcal{L}_{\mathrm{high}} = \sum_\omega \left(1 - \frac{\omega}{\pi}\right)^p |G_{\mathrm{high}}(\omega)|^2$$

  • Orthogonality and energy balancing:

$$\mathcal{L}_{\mathrm{overlap}} = \|G_{\mathrm{low}}\|^2 \,\|G_{\mathrm{high}}\|^2, \qquad \mathcal{L}_{\mathrm{parseval}} = \left( \|G_{\mathrm{low}}\|^2 + \|G_{\mathrm{high}}\|^2 - 2 \right)^2$$

and

$$\mathcal{L}_{\mathrm{ratio}} = \max(\rho - \rho_{\max}, 0) + \max(\rho_{\min} - \rho, 0)$$

which penalizes the ratio $\rho$ whenever it leaves the interval $[\rho_{\min}, \rho_{\max}]$.

The aggregate wavelet loss is:

$$\mathcal{L}_{\mathrm{wavelet}} = \lambda_{\mathrm{spec}}(\mathcal{L}_{\mathrm{low}} + \mathcal{L}_{\mathrm{high}}) + \mathcal{L}_{\mathrm{overlap}} + \mathcal{L}_{\mathrm{parseval}} + \mathcal{L}_{\mathrm{ratio}}$$
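A hedged sketch of the band-separation and energy-balancing penalties: each kernel's frequency response is obtained with torch.fft.rfft and the losses track the formulas above. The FFT length, the exponent p, and the use of the summed squared magnitude as the band energy are assumptions about unstated details.

```python
import torch

def wavelet_spectral_losses(theta_low, theta_high, p: float = 2.0, n_fft: int = 64):
    """Spectral-separation penalties on two FIR kernels (illustrative)."""
    G_low = torch.fft.rfft(theta_low, n=n_fft)     # frequency response on [0, pi]
    G_high = torch.fft.rfft(theta_high, n=n_fft)
    omega = torch.linspace(0.0, 1.0, G_low.numel())  # omega / pi in [0, 1]

    # push low-pass energy toward omega = 0 and high-pass energy toward omega = pi
    L_low = (omega ** p * G_low.abs() ** 2).sum()
    L_high = ((1.0 - omega) ** p * G_high.abs() ** 2).sum()

    e_low = (G_low.abs() ** 2).sum()
    e_high = (G_high.abs() ** 2).sum()
    L_overlap = e_low * e_high                     # overlap / orthogonality term as above
    L_parseval = (e_low + e_high - 2.0) ** 2       # keep total band energy near 2

    return L_low + L_high, L_overlap, L_parseval

l_band, l_overlap, l_parseval = wavelet_spectral_losses(torch.randn(8), torch.randn(8))
```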

WaveLSFormer supports both fixed orthonormal wavelet bases (e.g., Daubechies-2, Symlet-2) and several adaptive parameterizations:

  • Direct: Learn low-pass coefficients $\phi \in \mathbb{R}^n$; derive the high-pass filter via the QMF relation (see the sketch after this list)
  • Orthonormal: Parameterize via Givens-rotation / up-shift matrices; filter is always orthonormal
  • Lifting: Employ "split→update→predict" steps with learned update/predict scalars
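For the Direct parameterization, a common way to derive the high-pass filter from learned low-pass coefficients is the standard alternating-flip QMF relation; the exact sign and indexing convention used in the paper is not stated, so the sketch below should be read as an assumption.

```python
import torch

def qmf_highpass(phi: torch.Tensor) -> torch.Tensor:
    """Derive the high-pass filter from low-pass coefficients phi via the
    standard quadrature-mirror (alternating-flip) relation:
        g[k] = (-1)^k * phi[n - 1 - k]
    """
    n = phi.numel()
    signs = torch.tensor([(-1.0) ** k for k in range(n)])
    return signs * phi.flip(0)

# e.g. starting from the Haar scaling filter
phi = torch.tensor([1.0, 1.0]) / 2 ** 0.5
print(qmf_highpass(phi))   # tensor([ 0.7071, -0.7071])
```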

3. Multi-Scale Fusion and Attention Mechanisms

Multi-scale fusion is performed via the LGHI (low-guided high-frequency injection) module, which fuses low-frequency features $L \in \mathbb{R}^{T \times d}$ with high-frequency features $H$:

$$\begin{aligned} A(L) &= \mathrm{softmax}\!\left(\frac{(L W_Q)(L W_K)^\top}{\sqrt{d_k}}\right) \\ Z(L,H) &= A(L)\,(H W_V)\, W_O \\ Y &= L + \beta\, Z(L,H), \quad \beta = \sigma(\gamma) \end{aligned}$$

The scalar $\gamma$ is initialized such that $\beta$ is near zero, making the injection initially negligible and stabilizing early training. The backward gradient through $Z$ is modulated by $\beta$.
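Read directly off the equations above, a single-head LGHI block might look as follows; the negative initialization of γ (so that β = σ(γ) is near zero at the start of training) and all dimensions are illustrative choices, not the authors' exact settings.

```python
import torch
import torch.nn as nn

class LGHI(nn.Module):
    """Low-guided high-frequency injection (single head, illustrative)."""
    def __init__(self, d: int, d_k: int = 64):
        super().__init__()
        self.W_Q = nn.Linear(d, d_k, bias=False)
        self.W_K = nn.Linear(d, d_k, bias=False)
        self.W_V = nn.Linear(d, d_k, bias=False)
        self.W_O = nn.Linear(d_k, d, bias=False)
        # gamma starts strongly negative so beta = sigmoid(gamma) is ~0 at init
        self.gamma = nn.Parameter(torch.tensor(-5.0))
        self.d_k = d_k

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        # low, high: (batch, T, d); queries and keys come from the low band only
        attn = torch.softmax(
            self.W_Q(low) @ self.W_K(low).transpose(-2, -1) / self.d_k ** 0.5, dim=-1
        )
        z = self.W_O(attn @ self.W_V(high))       # high-frequency values, low-guided
        beta = torch.sigmoid(self.gamma)          # gated injection strength
        return low + beta * z
```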

In the WavSpA block, multi-head scaled dot-product attention is applied in the wavelet coefficient domain:

$$\begin{aligned} Q &= W W_q, \quad K = W W_k, \quad V = W W_v \\ A &= \mathrm{softmax}\!\left(Q K^\top / \sqrt{m}\right) \\ \hat{W} &= A V \end{aligned}$$

The updated coefficients $\hat{W}$ are returned to the token domain via the inverse DWT.
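A toy wavelet-space attention block combining a single-level Haar transform (re-stated compactly so the block stands alone) with standard multi-head attention is sketched below; applying one shared attention over the concatenated approximation and detail coefficients is a simplifying assumption made only to keep the example short.

```python
import torch
import torch.nn as nn

def haar_dwt(x):   # x: (B, T, d), T even -> approx, detail: (B, T//2, d)
    e, o = x[:, 0::2], x[:, 1::2]
    return (e + o) / 2 ** 0.5, (e - o) / 2 ** 0.5

def haar_idwt(a, d):
    e, o = (a + d) / 2 ** 0.5, (a - d) / 2 ** 0.5
    x = torch.stack((e, o), dim=2)
    return x.reshape(x.shape[0], -1, x.shape[-1])

class WaveletSpaceAttention(nn.Module):
    """Toy WavSpA-style block: DWT -> attention on coefficients -> inverse DWT."""
    def __init__(self, d: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, det = haar_dwt(x)                        # forward DWT along the token axis
        w = torch.cat([a, det], dim=1)              # stack coefficient bands as "tokens"
        w_hat, _ = self.attn(w, w, w)               # attention in wavelet space
        a_hat, det_hat = w_hat.chunk(2, dim=1)
        return haar_idwt(a_hat, det_hat)            # map back to the token domain

blk = WaveletSpaceAttention(d=64)
y = blk(torch.randn(2, 128, 64))                    # y: (2, 128, 64)
```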

4. Training Protocols and Optimization Objectives

WaveLSFormer models leverage domain-specific and risk-aware training objectives, in addition to standard supervised learning.

For long-short equity trading (Li et al., 19 Jan 2026), the training objective integrates:

  • Supervised soft-label cross-entropy on probabilistic targets $y_t = \sigma(k\, \hat{\ell}_t)$, with $\hat{\ell}_t = \log(1 + r_{j^*, t+1})$.
  • Overfitting penalty based on deviations of batch ROI from allowable thresholds.
  • Sharpe-ratio regularization, promoting high risk-adjusted returns:

$$\mathcal{L}_{\mathrm{sharpe}} = \exp\!\left(-\alpha \min\!\left(\frac{3}{\sqrt{K}}, \hat{S}\right)\right)$$

with

$$\hat{S} = \frac{\mathbb{E}[R_p]}{\sqrt{\mathrm{Var}(R_p) + \varepsilon}}$$
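A near-direct transcription of the Sharpe regularizer into code follows (a sketch: the batch of portfolio returns and the values of α, K, and ε are placeholders).

```python
import torch

def sharpe_loss(portfolio_returns: torch.Tensor, alpha: float = 1.0,
                K: int = 252, eps: float = 1e-8) -> torch.Tensor:
    """L_sharpe = exp(-alpha * min(3 / sqrt(K), S_hat)) over a batch of portfolio returns."""
    s_hat = portfolio_returns.mean() / torch.sqrt(portfolio_returns.var() + eps)
    capped = torch.clamp(s_hat, max=3.0 / K ** 0.5)   # min(3 / sqrt(K), S_hat)
    return torch.exp(-alpha * capped)

loss = sharpe_loss(torch.randn(64) * 0.01 + 0.001)
```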

A two-phase training schedule is used: the spectral loss is optimized first (epochs 1–30), followed by the full objective (epochs 31–80), and models are selected by maximum validation ROI; a sketch of this schedule follows.
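Assuming the phase boundaries and loss composition are exactly as stated, the schedule could be organized as below; the model, data loaders, and the wavelet_loss/task_loss/sharpe_loss/evaluate_roi helpers are hypothetical placeholders, not the authors' interfaces.

```python
def train_two_phase(model, optimizer, train_loader, val_loader,
                    n_epochs=80, spectral_epochs=30):
    """Two-phase schedule: spectral (wavelet) loss only, then the full objective.
    All model.* loss methods and evaluate_roi are hypothetical placeholders."""
    best_roi, best_state = float("-inf"), None
    for epoch in range(1, n_epochs + 1):
        for batch in train_loader:
            loss = model.wavelet_loss(batch)               # L_wavelet (Section 2)
            if epoch > spectral_epochs:                     # epochs 31-80: full loss
                loss = loss + model.task_loss(batch) + model.sharpe_loss(batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        roi = evaluate_roi(model, val_loader)               # placeholder validation metric
        if roi > best_roi:                                  # select by max validation ROI
            best_roi = roi
            best_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
    return best_state, best_roi
```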

For WavSpA (Zhuang et al., 2022), the learning objective is standard cross-entropy or MSE on task outputs, with trainable wavelet parameters updated via backpropagation.

5. Computational Complexity and Efficiency

Wavelet transform modules introduce negligible computational overhead compared to the $O(H n^2 m)$ scaling of full attention. The fast wavelet transform and its inverse have $O(n d)$ cost:

  • WavSpA: Total complexity is $O(n d + H n^2 m)$; for practical $n \gg d$, the overhead due to wavelets is inconsequential.
  • Overhead: Direct adaptive parameterization adds approximately 12% to training time; lifting-based transforms can reduce runtime (to 64%/67% of baseline for training/inference at level $L = 3$), since the smaller sub-band attention matrices then dominate the cost.

6. Empirical Performance and Applications

On five years of hourly U.S. equity data across six industry universes:

  • WaveLSFormer (learnable wavelets + LGHI + Sharpe loss): mean ROI $0.607 \pm 0.045$, Sharpe ratio $2.157 \pm 0.166$ (averaged over 10 seeds and all sectors)
  • Baselines:
    • Plain Transformer: ROI $0.225$, Sharpe $1.024$
    • Transformer + fixed wavelet DWT: ROI $0.346$, Sharpe $1.439$
    • LSTM + fixed wavelet: ROI $0.317$, Sharpe $1.879$

These results demonstrate significant advantages in both absolute and risk-adjusted profitability.

On Long Range Arena (LRA) benchmarks:

  • Baseline Transformer: mean test accuracy $54.39\%$
  • WavSpA with fixed D-2 wavelet: $62.90\%$
  • AdaWavSpA (direct adaptive): $70.59\%$

On the LEGO chain-of-reasoning task, AdaWavSpA generalizes more robustly to longer sequences:

  • Transformer: accuracy drops from $96\%$ (14 variables) to $60\%$ (20 variables)
  • AdaWavSpA: $98\%$ to $80\%$

Runtime overhead remains modest even for large $n$; fixed orthonormal wavelets add $2$–$3\%$ latency, and direct adaptive filters add $\approx 12\%$.

7. Relation to Broader Research and Methodological Variants

Wavelet-based Transformers relate closely to other approaches that integrate alternative bases into attention mechanisms, such as Fourier transform-based methods (AFNO, GFNet), which offer frequency localization at the expense of precise time localization. WavSpA and related modules exploit the capacity of wavelet transforms to encode position and scale simultaneously, enhancing the model's ability to capture transients, multi-scale patterns, and nonstationary signals.

Adaptive wavelet parameterizations provide explicit learnable multi-resolution structure, in contrast with fixed-filter approaches, enabling the representation to adapt to the statistical structure of financial, text, or reasoning sequences. Notably, the LGHI mechanism in (Li et al., 19 Jan 2026) is a domain-specific development not present in (Zhuang et al., 2022), introduced to mitigate training instability and enhance multi-scale fusion in noisy, regime-switching environments such as high-frequency trading.

A plausible implication is that as sequence tasks become more compositionally structured or exhibit strong multi-scale dependencies, wavelet-augmented attention blocks may supersede purely time or purely frequency-based alternatives in both accuracy and sample-efficiency.
