
AutoHFormer: Efficient Time-Series Transformer

Updated 12 February 2026
  • The paper presents a hierarchical autoregressive modeling framework that decomposes forecasting into segment-level blocks to refine predictions and reduce error accumulation.
  • AutoHFormer introduces dynamic windowed masked attention with an exponential decay mechanism to enforce strict temporal causality and improve computational efficiency.
  • Empirical evaluations demonstrate that AutoHFormer achieves significant speedups, lower memory usage, and improved accuracy compared to full attention models on benchmark datasets.

AutoHFormer is a Transformer-based architecture for long-horizon time-series forecasting that simultaneously enforces strict temporal causality, achieves sub-quadratic complexity, and captures multi-scale temporal patterns. It is characterized by its hierarchical autoregressive generative mechanism, dynamic windowed masked attention with exponential decay, and a hybrid temporal encoding scheme. These design elements collectively address the fundamental requirements for reliable, scalable, and precise time-series prediction (Zhang et al., 19 Jun 2025).

1. Hierarchical Autoregressive Modeling Framework

AutoHFormer adopts a hierarchical generation scheme that decomposes the forecasting problem into segment-level blocks, with each block processed via an initial summary followed by step-wise autoregressive refinement. Given an input history $X_{1:L}\in\mathbb{R}^{L\times V}$ and total forecast length $T_{\text{total}}=K\cdot H$ (where $K$ is the number of segments and $H$ is the segment length), the model factorizes the predictive distribution as

p(Y^X1:L)=h=1Kp(Y^hX1:L,Y^1:h1)=h=1Kt=1Hp(y^(h1)H+tX1:L,Y^1:h1,y^(h1)H+1:(h1)H+t1)p(\hat{Y} \mid X_{1:L}) = \prod_{h=1}^K p(\hat{Y}_h \mid X_{1:L},\hat{Y}_{1:h-1}) = \prod_{h=1}^K \prod_{t=1}^H p(\hat{y}_{(h-1)H+t} \mid X_{1:L},\hat{Y}_{1:h-1},\hat{y}_{(h-1)H+1:(h-1)H+t-1})

Segment-level initialization is executed through a block-wise Transformer: $\hat{Y}_h^{\text{init}} = \mathcal{F}_\theta(C_h) \in \mathbb{R}^{H \times d}$, where $C_h = \text{Concat}(X_{1:L}, \hat{Y}_{1:h-1}) \in \mathbb{R}^{(L + (h-1)H) \times d}$. Within each segment $h$, step-wise refinement then generates each forecast step recursively from the running context $C_h^t = \text{Concat}(C_h, \hat{y}_{(h-1)H+1}, \ldots, \hat{y}_{(h-1)H+t-1})$:

$$o_t = \text{WindowedAttention}(C_h^t, C_h^t, C_h^t;\,A_t)$$

$$f_t = \text{FFN}(\text{LayerNorm}(o_t + C_h^t[-1]))$$

$$\hat{y}_{(h-1)H+t} = W_o f_t[-1], \quad W_o\in\mathbb{R}^{d\times V}$$

This hybrid approach reduces error accumulation typical of sequential autoregression by correcting each step with windowed context and parameter sharing across segment steps.
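The two-stage generation loop above can be sketched in a few lines of numpy. This is a minimal illustration of the control flow only: `init_fn` and `refine_fn` stand in for the paper's Transformer blocks, and the toy persistence models below are assumptions for the demo, not the actual networks.

```python
import numpy as np

def hierarchical_forecast(x_hist, K, H, init_fn, refine_fn):
    """Sketch of AutoHFormer-style hierarchical generation.

    x_hist    : (L, V) input history
    init_fn   : maps a context (T, V) -> (H, V) segment-level draft
    refine_fn : maps (context (T, V), draft step (V,)) -> refined step (V,)
    Both functions are placeholders for the paper's Transformer modules.
    """
    preds = []  # all refined steps generated so far, each (1, V)
    for h in range(K):
        # Segment-level initialization on history + previously generated segments
        ctx = np.vstack([x_hist] + preds) if preds else x_hist
        draft = init_fn(ctx)  # (H, V) coarse block-wise draft
        seg = []
        for t in range(H):
            # Step-wise refinement: condition on context plus refined steps so far
            step_ctx = np.vstack([ctx] + seg) if seg else ctx
            step = refine_fn(step_ctx, draft[t])
            seg.append(step[None, :])
        preds.extend(seg)
    return np.vstack(preds)  # (K*H, V)

# Toy stand-ins (persistence-style; purely illustrative)
V = 2
init_fn = lambda ctx: np.tile(ctx[-1], (3, 1))
refine_fn = lambda ctx, d: 0.5 * (ctx[-1] + d)

y = hierarchical_forecast(np.ones((8, V)), K=2, H=3,
                          init_fn=init_fn, refine_fn=refine_fn)
print(y.shape)  # (6, 2)
```

Note how each segment's refinement sees only the history, earlier segments, and earlier refined steps of the current segment, which is exactly the factorization above.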

2. Dynamic Windowed Masked Attention Mechanism

AutoHFormer replaces full self-attention with Dynamic Windowed Masked Attention (DWMA), wherein at each time step $t$ (of $C_h^t$) attention is restricted to a causal window

$$\mathcal{W}_t = \{t' \mid \max(1,\, t - W/2) \leq t' \leq t\}$$

so that no anti-causal (future) information is leaked. Within this window, attention between $t$ and $t'$ is modulated by an exponential decay kernel with a learnable rate $\gamma > 0$:

$$\tau(t,t') = \exp(-\gamma\,|t-t'|)$$

The attention scores thus become

$$A_{t,t'} = \operatorname{softmax}_{t'\in\mathcal{W}_t}\left(\frac{Q_t(K_{t'} + R_{t,t'})^\top \cdot \tau(t,t')}{\sqrt{d_k}}\right)$$

where $Q_t, K_{t'}\in\mathbb{R}^{d_k}$ and $R_{t,t'}\in\mathbb{R}^{d_k}$ is a relative-position encoding. Since each of the $L$ positions attends to at most $W$ others, time and space complexity drop to $\mathcal{O}(L W d)$, significantly below the $\mathcal{O}(L^2 d)$ of vanilla self-attention.
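A dense numpy sketch of DWMA follows, assuming single-head attention and omitting the relative-position term $R_{t,t'}$ for brevity (an efficient implementation would materialize only the $W$-wide band rather than the full $L\times L$ matrix):

```python
import numpy as np

def dwma(Q, K, V, window, gamma):
    """Dynamic windowed masked attention, dense numpy sketch.

    Each query t attends only to keys t' with max(0, t - window//2) <= t' <= t,
    and scores are scaled by the decay kernel exp(-gamma * |t - t'|).
    """
    L, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)            # (L, L) raw scaled dot-products
    t = np.arange(L)
    dist = t[:, None] - t[None, :]           # dist[t, t'] = t - t'
    scores = scores * np.exp(-gamma * np.abs(dist))   # exponential decay kernel
    # Causal window mask: keep only 0 <= t - t' <= window//2
    mask = (dist >= 0) & (dist <= window // 2)
    scores = np.where(mask, scores, -np.inf)
    # Row-wise softmax (each row has at least the diagonal entry unmasked)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
out = dwma(x, x, x, window=4, gamma=0.5)
print(out.shape)  # (6, 4)
```

Because the mask excludes all $t' > t$, perturbing a future key or value cannot change the output at any earlier position.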

3. Adaptive Temporal Encoding

The model incorporates a hybrid positional encoding to simultaneously model short-term transients and long-term temporal dynamics. For a position pair $(t, t')$, a fixed sinusoidal term is precomputed:

$$\text{PE}_{(t,t'),2i} = \sin\left(\frac{t-t'}{10000^{2i/d}}\right), \qquad \text{PE}_{(t,t'),2i+1} = \cos\left(\frac{t-t'}{10000^{2i/d}}\right)$$

for $i=0,\ldots,d/2-1$. These are stored in a lookup table and provide continuous, shift-invariant encodings across lags. The learnable decay parameter $\gamma$ is trained jointly with the rest of the model, allowing the effective receptive field to adapt per dataset by tuning the influence of distal context: $\tau(t, t') = \exp(-\gamma|t-t'|)$.
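Since the sinusoidal term depends only on the lag $t - t'$, the lookup table needs just one row per lag. A small numpy sketch of that precomputation (the `max_lag` bound is an assumption for illustration; in practice it would be the window size $W$):

```python
import numpy as np

def relative_pe_table(max_lag, d):
    """Precompute sinusoidal relative-position encodings, one row per lag.

    Row [lag] encodes the offset t - t' = lag, for lag in [0, max_lag].
    Even columns hold sin terms, odd columns the matching cos terms.
    """
    lags = np.arange(max_lag + 1)[:, None]       # (max_lag+1, 1)
    i = np.arange(d // 2)[None, :]               # (1, d/2)
    freq = 1.0 / (10000 ** (2 * i / d))          # per-dimension frequencies
    pe = np.empty((max_lag + 1, d))
    pe[:, 0::2] = np.sin(lags * freq)
    pe[:, 1::2] = np.cos(lags * freq)
    return pe

table = relative_pe_table(max_lag=32, d=8)
print(table.shape)  # (33, 8)
```

At lag 0 the encoding is $(\sin 0, \cos 0, \ldots) = (0, 1, 0, 1, \ldots)$, and nearby lags vary smoothly, which is what makes the table shift-invariant and continuous across offsets.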

4. Enforcement of Strict Temporal Causality

Temporal causality is strictly enforced by two complementary mechanisms:

  • A hard attention mask $M_h$ zeroes all entries where $t'$ exceeds $t$, in both the segment-level and intra-segment passes, eliminating anti-causal dependencies.
  • The causal window $\mathcal{W}_t$ includes only the current and preceding time indices by construction.

Consequently, every forecast $\hat{y}_t$ is a measurable function of only $X_{<t}$ and $\hat{y}_{<t}$, preserving $p(\hat{y}_t \mid X_{<t}, \hat{y}_{<t})$.
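The combined effect of the two mechanisms is a single boolean mask with no entries above the diagonal. A tiny sketch (0-indexed positions, illustrative only):

```python
import numpy as np

def causal_window_mask(L, W):
    """Boolean mask M[t, t'] = True iff t' lies in the causal window of t.

    Combines the hard causal mask (t' <= t) with the window bound
    (t - t' <= W // 2), mirroring the two mechanisms described above.
    """
    t = np.arange(L)
    d = t[:, None] - t[None, :]   # d[t, t'] = t - t'
    return (d >= 0) & (d <= W // 2)

M = causal_window_mask(6, 4)
# Strict causality: nothing above the diagonal may be attended to.
print(np.triu(M, k=1).any())  # False
```

Any attention score at a `False` position is set to $-\infty$ before the softmax, so the corresponding weight is exactly zero.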

5. Computational and Memory Efficiency

AutoHFormer achieves sub-quadratic runtime and space scaling owing to its windowed attention. The key complexities for various models are summarized as follows:

| Model | Time Complexity | Space Complexity |
|---|---|---|
| Full Attention | $\mathcal{O}(L^2 d)$ | $\mathcal{O}(L^2 d)$ |
| AutoHFormer | $\mathcal{O}(L W d)$ | $\mathcal{O}(L W d)$ |
| RNN-based | $\mathcal{O}(L d^2)$ | $\mathcal{O}(d)$ |

With $L=1024$, $W=32$, $d=64$, AutoHFormer achieves up to $32\times$ speedups over full self-attention. On the PEMS08 dataset with $L=336$, $H=48$, it is $10.76\times$ faster per epoch (4.58 s vs. 49.3 s) and uses $6.06\times$ less GPU memory (2.99 GB vs. 18.13 GB) than PatchTST, while maintaining or improving predictive accuracy (Zhang et al., 19 Jun 2025).
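The quoted $32\times$ figure follows directly from the per-layer attention cost dropping from $L^2 d$ to $LWd$, i.e. by a factor of $L/W$:

```python
# Sanity check of the speedup ratio quoted above: full attention costs
# L*L*d operations per layer, windowed attention costs L*W*d, so the
# theoretical ratio is L / W.
L, W, d = 1024, 32, 64
full_cost = L * L * d
windowed_cost = L * W * d
print(full_cost // windowed_cost)  # 32
```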

6. Empirical Performance and Benchmark Results

Comprehensive evaluations were performed on eight public benchmarks (ETTh1/2, ETTm1/2, PEMS04/08, Weather, and Electricity) across four horizons $T \in \{96, 192, 336, 720\}$. Under autoregressive evaluation, AutoHFormer ranked first in 68 of 80 cases. On PEMS08, it reported MSE/MAE of 0.066/0.161 (versus PatchTST's 0.074/0.177), an $11\%$ and $9\%$ relative improvement. The hierarchical training loss is defined by

$$\mathcal{L} = \sum_{h=1}^K \gamma^{h-1} \sum_{t=1}^H \lambda_t \,\|\hat{y}_{(h-1)H+t} - y_{(h-1)H+t}\|^2$$

with segment discount $\gamma\in(0,1]$, improving long-term prediction stability.
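The loss above is straightforward to implement. A numpy sketch, assuming uniform step weights $\lambda_t = 1$ by default (the paper's actual $\lambda_t$ schedule is not specified here):

```python
import numpy as np

def hierarchical_loss(y_hat, y, K, H, gamma, lam=None):
    """Segment-discounted hierarchical squared-error loss.

    y_hat, y : (K*H, V) predictions and targets
    gamma    : segment discount in (0, 1]; later segments are down-weighted
    lam      : optional per-step weights lambda_t of length H (default: ones)
    """
    lam = np.ones(H) if lam is None else np.asarray(lam)
    total = 0.0
    for h in range(K):
        seg_err = y_hat[h*H:(h+1)*H] - y[h*H:(h+1)*H]   # (H, V) segment error
        sq = np.sum(seg_err ** 2, axis=1)               # ||.||^2 per step
        total += gamma ** h * np.sum(lam * sq)
    return total

y_true = np.ones((4, 1))
y_pred = np.zeros((4, 1))
# K=2 segments of H=2 steps, unit errors: 2 + 0.5 * 2 = 3.0
print(hierarchical_loss(y_pred, y_true, K=2, H=2, gamma=0.5))  # 3.0
```

With $\gamma = 1$ and $\lambda_t = 1$ this reduces to the ordinary sum of squared errors; $\gamma < 1$ shifts gradient weight toward early segments, which anchor all later refinements.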

7. Limitations and Open Issues

Several limitations and avenues for future work exist:

  • Fixed Window Bound: the window size $W$ trades context coverage against efficiency; a small $W$ may omit long-range dependencies.
  • Decay Parameter Sensitivity: careful tuning of $\gamma$ is required. Since $\tau(t,t') = \exp(-\gamma|t-t'|)$, small values flatten the decay and blur the distinction between near and distant context, while large values concentrate attention on the immediate past, possibly ignoring useful historical patterns.
  • Streaming and Non-stationarity: the architecture assumes a fixed look-back $L$ and stationarity. Extending to streaming or non-stationary environments is an open research question.
  • Hierarchical Depth and Interpretability: Additional hierarchical layers may enhance expressivity, but their role in multi-scale pattern discovery merits further theoretical analysis.

In summary, AutoHFormer introduces a principled framework for strictly causal, efficient, and accurate long-horizon time-series forecasting by integrating hierarchical autoregressive modeling, dynamic windowed attention with decay, and adaptive temporal encodings (Zhang et al., 19 Jun 2025).
