Autoformer: Forecasting & Vision Transformer NAS

Updated 18 January 2026
  • The Autoformer model integrates progressive series decomposition and auto-correlation attention to explicitly separate trend and seasonal components, enhancing forecast accuracy.
  • It efficiently captures long-range temporal dependencies using FFT-based lag selection and rolling aggregation, reducing computational cost compared to standard attention mechanisms.
  • Autoformer has been extended to spatio-temporal graph modeling and vision neural architecture search, demonstrating improved interpretability and robust performance across diverse applications.

Autoformer refers to a class of neural network architectures centered on time series forecasting and, independently, a neural architecture search framework for vision transformers. The foundational time series Autoformer model is distinguished by its integration of progressive series decomposition—explicitly separating trend and seasonal components—with a novel auto-correlation attention mechanism. These design choices confer interpretability, improved robustness to noise, and computational efficiency for forecasting long-range temporal dependencies. Autoformer’s architectural concept has catalyzed a range of extensions, notably in spatio-temporal graph modeling and multiscale traffic prediction.

1. Series Decomposition and Architecture

Autoformer’s encoder–decoder backbone fundamentally differs from conventional Transformers by embedding series decomposition as a native architectural block. At each layer, the input time series $\mathbf{x}_i \in \mathbb{R}^P$ is partitioned via moving-average filtering:

$$\mathbf{x}_i^{\mathrm{trend}} = \mathrm{MA}_k(\mathbf{x}_i), \qquad \mathbf{x}_i^{\mathrm{seasonal}} = \mathbf{x}_i - \mathbf{x}_i^{\mathrm{trend}}$$

where $\mathrm{MA}_k$ denotes a 1D average-pooling operation with kernel size $k$ (Forootani et al., 26 May 2025; Wu et al., 2021). This decomposition isolates the slowly varying trend from the more rapid seasonal fluctuations.
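The decomposition block can be sketched in NumPy (a minimal illustration assuming replication padding so the trend has the same length as the input; the exact padding scheme varies across implementations):

```python
import numpy as np

def series_decompose(x: np.ndarray, k: int = 25):
    """Split a 1D series into trend and seasonal parts via a moving average."""
    pad_left = k // 2
    pad_right = k - 1 - pad_left
    # Replicate boundary values so the pooled output keeps the input length.
    xp = np.pad(x, (pad_left, pad_right), mode="edge")
    trend = np.convolve(xp, np.ones(k) / k, mode="valid")  # MA_k(x)
    seasonal = x - trend                                   # residual fluctuations
    return trend, seasonal

t = np.arange(96, dtype=float)
x = 0.05 * t + np.sin(2 * np.pi * t / 24)   # linear trend + daily seasonality
trend, seasonal = series_decompose(x, k=25)
```

In the interior of the series the centered window recovers the linear component exactly, while a full period of the sinusoid averages to zero, so the two parts separate cleanly.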

Architectural variants include:

  • Minimal: Single decomposition per layer, small kernel ($k=3$), encoding only the seasonal part, summing the trend at output.
  • Standard: Deeper encoder, balanced trend/seasonal loss and tuned initialization.
  • Full: Sequence-to-sequence encoder–decoder, larger kernel ($k=25$), decoder initialized to zero, cross-attention to the encoder’s seasonal output, trend projected in parallel.

Recombining the seasonal and trend outputs delivers the final forecast:

$$\hat{\mathbf{y}}_i = W_o\,\mathrm{AvgPool}(\mathbf{h}_i) + W_t\,\mathbf{x}_i^{\mathrm{trend}}$$

for Minimal and Standard, while Full uses the decoded seasonal component plus a linearly projected trend (Forootani et al., 26 May 2025).
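At a shape level, this recombination can be sketched in NumPy (the dimensions and random weights below are purely illustrative, not trained parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
P, d, H = 16, 32, 8            # patch length, hidden width, horizon (illustrative)

h = rng.normal(size=(P, d))    # hidden states of the seasonal branch
x_trend = rng.normal(size=P)   # trend component from the decomposition block

W_o = 0.1 * rng.normal(size=(H, d))   # projects pooled seasonal features
W_t = 0.1 * rng.normal(size=(H, P))   # projects the trend component

pooled = h.mean(axis=0)                 # AvgPool over the time axis
y_hat = W_o @ pooled + W_t @ x_trend    # forecast: seasonal + trend parts
```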

2. Auto-Correlation Attention Mechanism

Traditional attention mechanisms require $O(L^2)$ pairwise computations for a sequence of length $L$. Autoformer introduces Auto-Correlation, which leverages the Fourier domain to model time-delay dependencies and aggregate patterns at multiple lags:

$$R_{XX}(\tau) = \frac{1}{L} \sum_{t=1}^{L} X_t X_{t-\tau}$$

computed efficiently via the Wiener–Khinchin theorem:

$$S_{XX}(f) = \mathcal{F}(X)\,\overline{\mathcal{F}(X)}, \qquad R_{XX}(\tau) = \mathcal{F}^{-1}(S_{XX}(f))$$

Cross-correlation between projected queries and keys identifies the top $k$ lags. Values are then aggregated by rolling: the entire value series is shifted by each selected lag and weighted by its softmax-normalized score:

$$\mathrm{AutoCorr}(\mathcal{Q}, \mathcal{K}, \mathcal{V}) = \sum_{i=1}^{k} \widehat{R}_i\,\mathrm{Roll}(\mathcal{V}, \tau_i)$$

yielding $O(L \log L)$ time and space per head (Wu et al., 2021; Forootani et al., 26 May 2025). This mechanism preserves periodic structure, reduces over-smoothing, and increases modeling capacity for multi-horizon forecasting.
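The mechanism can be sketched for a single head in NumPy (an illustrative simplification: one 1D series per role, circular shifts, and a softmax over raw correlation scores, whereas the published model operates on multi-head projected tensors):

```python
import numpy as np

def autocorr_attention(q, k, v, top_k=3):
    """Single-head Auto-Correlation sketch: FFT lag scoring + roll aggregation."""
    L = len(q)
    # Wiener-Khinchin: cross-correlation = inverse FFT of the cross-spectrum.
    fq = np.fft.rfft(q)
    fk = np.fft.rfft(k)
    corr = np.fft.irfft(fq * np.conj(fk), n=L) / L   # R(tau), tau = 0..L-1

    lags = np.argsort(corr)[-top_k:]                 # top-k most correlated lags
    scores = corr[lags]
    w = np.exp(scores - scores.max())
    w /= w.sum()                                     # softmax over selected lags

    # Shift the whole value series by each lag, weight, and sum.
    return sum(wi * np.roll(v, -int(tau)) for wi, tau in zip(w, lags))

t = np.arange(48)
q = np.sin(2 * np.pi * t / 12)       # period-12 signal
out = autocorr_attention(q, q, q)    # selected lags are multiples of 12
```

For a strictly periodic input, every selected lag is a multiple of the period, so the aggregated output reproduces the input, which is exactly the periodicity-preserving behavior the mechanism is designed for.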

3. Computational Complexity and Variants

The complexity per variant is as follows:

  • Minimal/Standard: $O(P^2 d + P d^2)$
  • Full: $O(P^2 d + H^2 d + HPd + P d^2 + H d^2)$ for sequence-to-sequence forecasting with cross-attention

The decomposition block runs in $O(P)$ per layer, and cross-attention scales as $O(HPd)$ (Forootani et al., 26 May 2025). Increasing the patch length $P$ or forecast horizon $H$ raises resource demands, with Full supporting longer horizons at quadratic time cost.

4. Empirical Results in Synthetic and Real-World Domains

Autoformer exhibits robust performance on synthetic signals (sinusoidal, polynomial, modulated, exponential) over multiple patch lengths and horizons:

  • Clean regime: average RMSE < 0.045, MAE < 0.027
  • Noisy regime: average RMSE in [0.046, 0.076], MAE in [0.038, 0.059]
  • Outperforms Informer across all variants under noise (Forootani et al., 26 May 2025)

On six multivariate real-world datasets (energy, traffic, economics, weather, disease), the original Autoformer achieves a mean 38% relative MSE reduction over previous Transformer methods, demonstrating superior stability for long-range forecasting (output length up to 720 time steps) (Wu et al., 2021).

Autoformer’s repeated trend-seasonal decomposition acts as a built-in low-pass filter, enhancing noise robustness. The architecture achieves its best accuracy at moderate patch lengths ($12 \leq P \leq 16$), mitigates overfitting at longer $P$, and degrades gracefully at large $H$ in Full mode.

5. Extensions: Spatio-Temporal, Multiscale, and Explainable Autoformer

Autoformer has been embedded in graph neural networks for spatio-temporal wind and traffic forecasting:

  • Spatio-Temporal Autoformer (ST-Autoformer) within a GNN update achieves lowest MSE/MAE for 10 min and 1 h horizon forecasts relative to persistence, LSTM, MLP, Informer, LogSparse Transformer, and FFTransformer (Bentsen et al., 2022).
  • For longer horizons where trend components dominate, architectures with explicit trend modeling (FFTransformer) surpass Autoformer, suggesting the trend pathway is a bottleneck for very-long-term non-periodic tasks.

The Explainable Graph Pyramid Autoformer (X-GPA) augments Autoformer with patch-based attention pyramids and multi-scale autocorrelation FFT blocks, coupled with spatial graph attention:

  • Pyramid autocorrelation compresses long sequences to multiscale pseudo-timestamps, followed by FFT-based lag selection and roll aggregation at each scale (Zhong et al., 2022).
  • Spatial-temporal fusion provides both time-lag and node importance scores, yielding transparent ante-hoc explanations for traffic forecasts (e.g., congestion propagation, periodicity between weekdays and weekends).
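The sequence-compression step of the pyramid can be sketched as average pooling at several strides (an illustrative reading of the construction; `pyramid_levels` and the scale choices are hypothetical, and each level would then feed the FFT-based lag selection described above):

```python
import numpy as np

def pyramid_levels(x: np.ndarray, scales=(1, 4, 16)):
    """Compress a 1D series into multiscale 'pseudo-timestamp' sequences."""
    levels = []
    for s in scales:
        n = len(x) // s
        # Non-overlapping average pooling with stride s.
        levels.append(x[: n * s].reshape(n, s).mean(axis=1))
    return levels

x = np.arange(64.0)
levels = pyramid_levels(x)   # lengths 64, 16, 4
```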

6. AutoFormer for Vision Transformer Architecture Search

Independently of the forecasting line of work, the AutoFormer framework refers to a transformer neural architecture search (NAS) system for visual recognition tasks (Chen et al., 2021):

  • Employs “weight entanglement” in a one-shot supernet, storing weights for the largest block in each layer; subnets inherit weights by slicing.
  • The search space is defined over embedding dimension, $Q$/$K$/$V$ dimension, MLP ratio, number of heads, and depth.
  • Evolutionary search discovers subnets whose inherited accuracy matches retrained performance.
  • AutoFormer-tiny/small/base models achieve 74.7%/81.7%/82.4% ImageNet top-1 accuracy with 5.7M/22.9M/53.7M parameters, surpassing contemporary methods (DeiT, ViT) at equivalent resource budgets.
  • Transfer learning and knowledge distillation further boost accuracy, and fine-tuning offers negligible gain over inherited weights.
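The weight-entanglement idea, in which every subnet slices its weights out of the largest block, can be illustrated with a toy linear layer (a conceptual sketch; `EntangledLinear` is hypothetical and not taken from the AutoFormer codebase):

```python
import numpy as np

class EntangledLinear:
    """Stores one weight matrix at the maximum size; every candidate
    dimension shares (is 'entangled' with) a slice of the same weights."""

    def __init__(self, max_in: int, max_out: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W = 0.02 * rng.normal(size=(max_out, max_in))
        self.b = np.zeros(max_out)

    def forward(self, x: np.ndarray, d_in: int, d_out: int) -> np.ndarray:
        # A sampled subnet simply slices the supernet weights; no copies.
        return self.W[:d_out, :d_in] @ x[:d_in] + self.b[:d_out]

layer = EntangledLinear(max_in=64, max_out=64)
x = np.ones(64)
y_small = layer.forward(x, d_in=32, d_out=16)   # a sampled subnet
y_full = layer.forward(x, d_in=64, d_out=64)    # the largest subnet
```

Because every subnet updates the same underlying slice during supernet training, inherited weights are already well trained, which is why evolutionary search can evaluate candidates without retraining.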

7. Limitations and Design Trade-Offs

Autoformer’s trend-seasonal decomposition confers interpretability and noise robustness but may marginally sacrifice trend modeling for ultralong horizons. Minimal and Standard variants deliver near-equivalent short-horizon performance at lower computational cost; Full mode is reserved for highly non-stationary or long-horizon tasks. In vision, the AutoFormer NAS framework’s entanglement mechanism regularizes deep model optimization and enables efficient subnetwork selection, though the search space currently omits convolutional operations.

A plausible implication is that further augmentations (e.g., explicit operator-theoretic latent state modeling or multistream trend pathways) can extend Autoformer’s regime of stability and interpretability in complex, noisy temporal domains.
