TimesFM Decoder-Only Backbone

Updated 9 February 2026
  • TimesFM Decoder-Only Backbone is a transformer-based architecture that leverages patch-based tokenization, causal self-attention, and autoregressive methods for unified forecasting.
  • It converts time-series data into non-overlapping patch embeddings enhanced by sinusoidal positional encoding, ensuring efficient context aggregation and computational feasibility.
  • Large-scale pretraining on diverse corpora enables zero-shot inference that rivals supervised models on key forecasting benchmarks.

A TimesFM decoder-only backbone is a transformer-based neural architecture for time-series forecasting that dispenses with the conventional encoder-decoder structure, relying solely on a deep, causal, multi-layer decoder. By leveraging large-scale pretraining on diverse time-series corpora, the backbone enables zero-shot inference that matches or exceeds specialized supervised models on multiple forecasting benchmarks. Its distinguishing features are patch-based tokenization, causal self-attention without cross-attention, and a unified, autoregressive training and inference procedure, combined with architectural regularization strategies informed by recent theoretical and empirical analyses of decoder-only sequence models.

1. Architectural Foundations

TimesFM transforms a univariate or multivariate time series y_{1:L} into a sequence of non-overlapping, fixed-length patches ỹ_j ∈ ℝ^p, where each patch is embedded via a residual MLP:

x_j = W_2\,\sigma(W_1 \tilde y_j + b_1) + b_2 + W_\text{skip} \tilde y_j + b_\text{skip}, \quad x_j \in \mathbb{R}^d

Each patch embedding x_j receives a sinusoidal positional encoding PE_j (as in Vaswani et al., 2017), yielding input tokens t_j = x_j + PE_j. The resulting sequence (t_1, ..., t_N), with N = ⌊L/p⌋, is processed by a transformer stack comprising n_ℓ decoder blocks (typically n_ℓ = 20, d = 1280, h = 16 heads).
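
A minimal NumPy sketch of this patching-and-embedding step (the weights here are illustrative random matrices, and ReLU stands in for the unspecified activation σ):

```python
import numpy as np

def sinusoidal_pe(n_tokens, d):
    """Standard sinusoidal positional encoding (Vaswani et al., 2017)."""
    pos = np.arange(n_tokens)[:, None]
    i = np.arange(d)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

def patch_embed(y, p, W1, b1, W2, b2, W_skip, b_skip):
    """Split series y into N = floor(L/p) non-overlapping patches and
    embed each with a residual MLP, then add positional encodings."""
    N = len(y) // p
    patches = y[: N * p].reshape(N, p)                    # (N, p) patches
    hidden = np.maximum(0.0, patches @ W1.T + b1)         # ReLU stand-in for sigma
    x = hidden @ W2.T + b2 + patches @ W_skip.T + b_skip  # residual skip path
    return x + sinusoidal_pe(N, x.shape[1])               # tokens t_j = x_j + PE_j
```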

Each block consists of masked multi-head self-attention, Add & LayerNorm, position-wise feed-forward (FFN), followed by another Add & LayerNorm stage:

Q^{(\ell)} = H^{(\ell-1)} W_Q^{(\ell)}, \quad K^{(\ell)} = H^{(\ell-1)} W_K^{(\ell)}, \quad V^{(\ell)} = H^{(\ell-1)} W_V^{(\ell)}

\mathrm{head}_i = \mathrm{softmax}\!\left( Q_i K_i^\top / \sqrt{d_k} \right) V_i

\mathrm{SA}(H^{(\ell-1)}) = [\,\mathrm{head}_1; \ldots; \mathrm{head}_h\,] W_O^{(\ell)}

with a strict causal mask M applied so that each token attends only to current and past tokens. The FFN has the form FFN(x) = W_2' ReLU(W_1' x + b_1') + b_2'. LayerNorm applies per-feature normalization.

This backbone eschews the use of cross-attention mechanisms and functions exclusively in the decoder regime, implementing classic autoregressive modeling over patches.
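
The masked self-attention at the heart of each block can be sketched as follows (a toy NumPy version; treating per-head projections as slices of full-width weight matrices is a simplifying assumption):

```python
import numpy as np

def causal_self_attention(H, Wq, Wk, Wv, Wo, n_heads):
    """Masked multi-head self-attention over patch tokens: token j
    attends only to tokens 1..j (strict causal mask)."""
    N, d = H.shape
    dk = d // n_heads
    Q, K, V = H @ Wq, H @ Wk, H @ Wv                     # (N, d) each
    mask = np.triu(np.ones((N, N)), k=1) * -1e9          # block future positions
    heads = []
    for i in range(n_heads):
        sl = slice(i * dk, (i + 1) * dk)
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(dk) + mask
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)               # row-wise softmax
        heads.append(w @ V[:, sl])
    return np.concatenate(heads, axis=-1) @ Wo           # [head_1; ...; head_h] W_O
```

The causal mask guarantees that perturbing a later token cannot change the output at an earlier position.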

2. Training Objective and Autoregressive Decoding

The TimesFM backbone is pretrained via a fully autoregressive, decoder-only approach:

  • For each output token o_j from the transformer, an OutputResidualBlock (MLP) maps to a forecasted data patch:

\hat{y}_{pj+1:pj+h} = W_2'' \mathrm{ReLU}(W_1'' o_j + b_1'') + b_2''

  • The point prediction loss is patchwise mean squared error (MSE):

\ell_j = \frac{1}{h} \sum_{i=1}^{h} \left( y_{pj+i} - \hat{y}_{pj+i} \right)^2,

aggregated as

\mathcal{L}_\text{train} = \frac{1}{N} \sum_{j=1}^{N} \ell_j

  • During training, a random offset r ∈ {0, ..., p−1} is used to mask the first r points, forcing exposure to all possible context lengths up to a maximum L_max.
  • During inference, the input context is chunked, zero-padded, and autoregressively extended by predicting output patches until the desired forecast horizon H is reached.
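
The patchwise loss and the autoregressive decoding loop above can be sketched as follows (the `model` callable is a hypothetical stand-in for the trained backbone, mapping a context array to the next output patch):

```python
import numpy as np

def patchwise_mse(y_true, y_pred, h):
    """Per-patch MSE l_j, averaged over all output patches (the training loss)."""
    err = (y_true - y_pred).reshape(-1, h)
    return np.mean(np.mean(err ** 2, axis=1))

def autoregressive_forecast(model, context, horizon):
    """Decoder-only inference: repeatedly feed the growing context back in
    and append the newest predicted output patch until `horizon` points
    are produced."""
    preds, ctx = [], list(context)
    while len(preds) < horizon:
        patch = model(np.asarray(ctx))   # predict the next output patch
        preds.extend(patch)
        ctx.extend(patch)                # predicted patch joins the context
    return np.asarray(preds[:horizon])
```

A naive persistence model (repeat the last observed value) makes a convenient smoke test for the loop.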

This patchwise-autoregressive framework confers flexibility in handling variable history and forecast lengths, and supports zero-shot transfer across time-series domains.

3. Attention Mechanisms and Regularization

Attention in the TimesFM backbone is strictly causal multi-head self-attention applied over patch tokens, with no encoder-decoder cross attention. Key regularization measures include:

  • Masking of missing data: Any fully masked patch is removed from attention via the padding mask to prevent information leakage and optimize memory use.
  • Feed-forward and normalization: The per-layer FFN and LayerNorm ensure activation stability and prevent exploding/vanishing gradients.
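
The padding-mask treatment of fully missing patches might look like this in outline (an additive-mask convention is assumed here):

```python
import numpy as np

def patch_padding_mask(obs_mask, p):
    """Mark patches whose points are all missing; such patches are removed
    from attention via the padding mask. `obs_mask` is 1 where a point is
    observed, 0 where it is missing."""
    N = len(obs_mask) // p
    patch_valid = obs_mask[: N * p].reshape(N, p).any(axis=1)  # any observed point?
    # additive mask: 0 for valid patches, large-negative for fully missing ones
    return np.where(patch_valid, 0.0, -1e9)
```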

Analysis of decoder-only architectures (Fu et al., 2023) reveals the Attention Degeneration Problem, where sensitivity of the decoder to source inputs decays as generation progresses and the causal context window grows. This can erode the backbone’s ability to condition on distant inputs—critical for long-range forecasting or sequence-to-sequence transfer.

A practical recommendation, validated empirically, is to interleave partial source attention at every kth layer by holding a fixed source projection P_ℓ ∈ ℝ^{|s|×d} and computing a partial attention ATT(Q_ℓ, P_ℓ, P_ℓ). Adoption of this PALM-style partial cross-attention can stabilize source sensitivity and reduce hallucination in generation tasks, though the base TimesFM implementation (Das et al., 2023) uses strictly decoder-only blocks.
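
A sketch of this partial attention, under the assumption that the fixed source projection P serves as both keys and values:

```python
import numpy as np

def partial_source_attention(H, P, Wq):
    """PALM-style partial attention ATT(Q, P, P): queries come from the
    current layer's hidden states H, while keys and values reuse a fixed
    source projection P (|s| x d), re-exposing the source at interleaved
    layers to counter sensitivity decay."""
    d = H.shape[1]
    scores = (H @ Wq) @ P.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)       # row-wise softmax over source slots
    return w @ P                              # P also serves as the value matrix
```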

4. Implementation Parameters and Computational Complexity

The backbone is instantiated at multiple model sizes (17M, 70M, 200M parameters). The primary configuration is:

  • Depth: 20 Transformer layers
  • Width: d = 1280
  • Attention heads: h = 16 per layer
  • Attention cost: O(N²d) per layer (N = L/p; with typical L_max = 512 and p = 32, N = 16)
  • Pretraining: ~100B time-points (Google Trends, Wikipedia traffic, synthetic ARMA+trend)
  • Training schedule: 1.5M steps, global batch size 4096, cosine LR decay from 5×10⁻⁴, Dropout 0.2, 16-core TPU v5e, ~2 days
  • Inference: O((H/h)N²) per horizon
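
A quick back-of-envelope check of the stated attention cost:

```python
# With L_max = 512 and patch length p = 32, only N = 16 tokens enter
# attention, so the O(N^2 d) per-layer cost stays small despite d = 1280.
L_max, p, d, n_layers = 512, 32, 1280, 20
N = L_max // p                  # 16 patch tokens
per_layer = N * N * d           # ~O(N^2 d) score/value multiplies per layer
total = per_layer * n_layers    # ~6.55M units across all 20 layers
print(N, total)
```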

The compact, patch-based tokenization and aggressive aggregation of context into patches ensure high computational efficiency and feasible memory use even at larger temporal contexts.

5. Empirical Performance and Benchmarking

On the Monash Archive (18 out-of-sample datasets), TimesFM in zero-shot mode achieves a scaled MAE within statistical significance of the leading supervised neural model (N-BEATS), surpassing DeepAR, WaveNet, and the GPT-3–based llmtime. On the Darts collection (8 univariate benchmarks), it matches seasonal ARIMA and llmtime performance. On the ETT Long Horizon benchmarks, TimesFM zero-shot performance equals or exceeds PatchTST, FEDFormer, and Autoformer, all without fine-tuning (Das et al., 2023). These results indicate robust generalization and competitive accuracy across diverse granularities and data regimes.

6. Relation to Decoder-Only Models and Theoretical Insights

The decoder-only backbone follows the evolving paradigm—initiated in LLMs—of unifying input and output sequences into a single autoregressive stream operated by a deep transformer. Theoretical work (Fu et al., 2023) shows this regime can be formulated as a regularized encoder-decoder (RED), and details sensitivity decay as a core limitation. The PALM architecture augments vanilla decoder-only Transformers with a fixed-length partial attention mechanism at every layer, empirically shown to restore source sensitivity and arrest increases in hallucination or early-stop errors. For sequence-to-sequence tasks where rich, persistent conditioning on the source is required, integrating partial attention (as an architectural regularization) into TimesFM-style backbones is recommended.

Practical implementation choices—such as separate positional encodings for source and target, auxiliary source auto-encoding loss, layerwise coordination, and the judicious use of role embeddings—further stabilize decoder-only learning and can be integrated with the TimesFM backbone to match encoder-decoder baselines in controlled ablation studies (Fu et al., 2023).

7. Future Directions and Ongoing Research

Extending the TimesFM backbone involves exploiting richer attention regularization (e.g., PALM partial attention), scaling pretraining corpora to further domains, and integrating learnable input and output chunking adapted to variable time-series structure. Addressing the attention degeneration phenomenon is ongoing, with architectural motifs like fixed-query memory, role-aware embeddings, and hybrid encoder-decoder variants under investigation. Efficient inference for ultra-long horizons and robust handling of missing or irregular data streams remain open areas. The decoder-only backbone constitutes a central motif in the ongoing shift toward foundation models for time series and structured prediction, supported by both scaling success and emerging theoretical clarity (Das et al., 2023, Fu et al., 2023).
