
TSMixer Architecture for Time Series Forecasting

Updated 12 February 2026
  • TSMixer architecture is a neural framework for time series forecasting that alternates MLP-based temporal and feature mixing to capture complex dependencies.
  • It achieves robust performance and computational efficiency through reversible normalization, residual connections, and streamlined mixing layers.
  • Variants like TLN and TSKANMixer extend the design to enable linear interpretability, adaptive handling of irregular data, and universal function approximation.

The TSMixer architecture is a family of neural network models for time series forecasting, distinguished by the strategic use of multi-layer perceptrons (MLPs) to alternately mix information along temporal and feature axes. It emerged in response to findings that simple linear models often rival transformer-based or recurrent architectures on standard multivariate forecasting benchmarks. TSMixer achieves robust performance, computational efficiency, and architectural simplicity, while enabling the modeling of complex temporal and cross-variate dependencies. Variants include fully linear adaptations, hybrid and hierarchical models, adaptations for irregular time series, and extensions with universal function approximators such as Kolmogorov–Arnold Network (KAN) layers.

1. Core Structure and Data Flow

TSMixer accepts input tensors $X \in \mathbb{R}^{B \times S \times F}$, where $B$ is the batch size, $S$ the sequence length, and $F$ the number of features or channels. The canonical processing pipeline includes:

  • Reversible Instance Normalization (RIN): Input series are normalized along the temporal dimension per feature, allowing for invertibility post-prediction (Genet et al., 2024).
  • Stacked Mixer Blocks: The main module comprises $L$ identical blocks, each alternating:
    • Temporal Mixing: Linear layer (typically a dense transformation) along the time axis for each feature, e.g., $Y_T = W_t X_T + b_t$ with $W_t \in \mathbb{R}^{S \times S}$. The tensor is permuted such that time is the last dimension.
    • Feature Mixing: Performed via one or more fully connected layers along the feature axis, optionally with nonlinearity (e.g., ReLU). In the canonical version, this is $Z = g(Y W_f + b_f) W_f' + b_f'$, with $g$ typically ReLU (Chen et al., 2023).
    • Residual and Layer Normalization: The result $Z$ is added to $Y$, then normalized, yielding $X^{(\ell)} = \mathrm{LayerNorm}(Y + Z)$.
  • Forecast Head: A dense aggregation layer projects the $[B, S, F]$ output to the required target shape, depending on the forecasting horizon and number of outputs.

The typical data flow is summarized in the following pseudocode:

def forward_TSMixer(X):  # X: [B, S, F]
    X_norm = RIN(X)
    for l in range(L):  # in practice each block has its own weights
        # Temporal mixing
        X_T = permute(X_norm, (0, 2, 1))  # [B, F, S]
        Y_T = dense_time(X_T)             # [B, F, S]
        Y = permute(Y_T, (0, 2, 1))       # [B, S, F]

        # Feature mixing
        Z = dense_feat1(Y)                # [B, S, D]
        Z = g(Z)                          # [B, S, D], g=ReLU or identity
        Z = dense_feat2(Z)                # [B, S, F]

        # Residual and normalization
        X_norm = LayerNorm(Y + Z)         # [B, S, F]
    return head(X_norm)
(Genet et al., 2024, Chen et al., 2023)
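The RIN step above can be sketched concretely. The following is a minimal NumPy illustration (not the authors' implementation): each series is normalized along the temporal axis per feature, and the saved statistics invert the transform after forecasting.

```python
import numpy as np

def rin_normalize(X, eps=1e-5):
    """Normalize each series along the time axis (axis 1), per feature."""
    mean = X.mean(axis=1, keepdims=True)   # [B, 1, F]
    std = X.std(axis=1, keepdims=True)     # [B, 1, F]
    return (X - mean) / (std + eps), (mean, std)

def rin_denormalize(Y, stats):
    """Invert the normalization using the saved per-instance statistics."""
    mean, std = stats
    return Y * (std + 1e-5) + mean

X = np.random.randn(2, 16, 3) * 5.0 + 10.0   # toy input, [B, S, F]
X_norm, stats = rin_normalize(X)
X_back = rin_denormalize(X_norm, stats)
assert np.allclose(X, X_back)  # the transform is invertible
```

In the full pipeline, the denormalization is applied to the forecast head's output rather than to the input, which is what makes the normalization "reversible" across the prediction step.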

2. Mixing Mechanisms: Temporal and Feature Axes

Each Mixer Block contains two principal operations:

  • Temporal Mixing
    • Operates per-feature: for each variable, all its timesteps are transformed via a shared MLP or dense linear map. Formally, $W_t \in \mathbb{R}^{S \times S}$ acts on the sequence axis, capturing lagged correlations and temporal patterns.
  • Feature Mixing
    • Operates per-timestep: for each timepoint, all features are mixed via a two-layer MLP with an intermediate expansion, paralleling the mixer block designs in MLP-Mixer for vision.
  • Nonlinearity
    • Applied within the feature-mixing MLP (typically ReLU); setting it to identity yields the strictly linear variants discussed below.
  • Residual Connections and Normalization
    • Each block adds the feature-mixing output back to the temporal-mixing output and applies layer normalization, stabilizing optimization in deep stacks.

This alternation enables TSMixer to interleave deep temporal representations with cross-variate dependencies, a mechanism that substitutes the role of self-attention or convolution without imposing quadratic scaling costs.
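The two mixing operations can be sketched in NumPy. This is a shape-level illustration with random, untrained weights (the dimensions `B, S, F, D` are toy choices), showing how temporal mixing acts on the time axis and feature mixing on the channel axis:

```python
import numpy as np

rng = np.random.default_rng(0)
B, S, F, D = 2, 8, 4, 16   # batch, sequence length, features, hidden width

# Random weights for illustration only; a trained model learns these.
W_t = rng.normal(0, 0.1, (S, S))    # temporal mixing: S x S map on the time axis
W_f1 = rng.normal(0, 0.1, (F, D))   # feature mixing: expand F -> D
W_f2 = rng.normal(0, 0.1, (D, F))   # feature mixing: project D -> F

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def mixer_block(X):                        # X: [B, S, F]
    # Temporal mixing: the same dense map over time, applied per feature.
    Y = np.einsum('bsf,st->btf', X, W_t)   # [B, S, F]
    # Feature mixing: two-layer MLP with ReLU, applied per timestep.
    Z = np.maximum(Y @ W_f1, 0.0) @ W_f2   # [B, S, F]
    # Residual connection followed by layer normalization.
    return layer_norm(Y + Z)

out = mixer_block(rng.normal(size=(B, S, F)))
assert out.shape == (B, S, F)
```

The `einsum` makes the axis roles explicit: `W_t` contracts the time index while leaving the feature index untouched, and the feature MLP does the reverse.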

3. Architectural Variants and Key Design Choices

A range of important design and implementation choices are explored across TSMixer papers:

  • Nonlinearity Removal: Eliminating activation functions (e.g., ReLU) yields “TSMixer-no-relu” or the TLN model. When composed only of linear layers, TSMixer becomes strictly linear in its inputs, allowing for interpretability and inference speedups by collapsing the entire model into a single equivalent linear operator (Genet et al., 2024).
  • Kernel Initializations: Standard Xavier or Kaiming initializations are used for dense layers, with small hidden sizes (feature-mixing width $D \sim 5$ in linear variants) (Genet et al., 2024).
  • Patching and Embedding: Input patching—splitting long series into fixed-size segments—is sometimes used in TSMixer and its extensions (e.g., PatchTSMixer) to facilitate efficient long-range modeling and resolution adaptation (Ekambaram et al., 2023, Ekambaram et al., 2024).
  • Auxiliary Features and Conditional Blocks: The TSMixer-Ext pipeline incorporates static and future time-varying features by aligning inputs, concatenating embeddings, and applying conditional mixer blocks, extending multivariate context (Chen et al., 2023).
  • Cross-Channel and Online Reconciliation: Heads for hierarchical and cross-channel reconciliation can be appended to the backbone, explicitly enforcing constraints on forecast sums or local channel correlations (Ekambaram et al., 2023).

4. Extensions, Adaptations, and Linear Equivalence

TSMixer has been adapted for special requirements and extended beyond its original scope:

  • Temporal Linear Network (TLN): A strictly linear variant, TLN, is constructed by setting activation functions to identity and optionally adding dilated convolutions in the mixing stages. TLN satisfies $f(x + cy) = f(x) + c f(y)$ for all $x, y$ and $c$, allowing its entire mapping to be represented as a single affine operator $Y = W_{\text{eq}} X + b_{\text{eq}}$, where $W_{\text{eq}}$ can be computed by running the model on a basis of one-hot inputs plus the inference-time bias (Genet et al., 2024).

| Model   | Nonlinearity | Linear Collapsibility | Temporal Order | Specialized Inits | Dilated Conv |
|---------|--------------|-----------------------|----------------|-------------------|--------------|
| TSMixer | ReLU         | No                    | Preserved      | Yes               | No           |
| TLN     | None         | Yes                   | Preserved      | Yes               | Optional     |

This facilitates interpretable forecasting and efficient inference, since after training the entire network can be replaced by a single matrix-vector multiplication.
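The one-hot-probing collapse described above can be demonstrated on a toy strictly linear model (the two-layer stand-in below and all sizes are hypothetical, not the TLN itself): the bias is the model's response to the zero input, and each column of $W_{\text{eq}}$ is the response to a basis vector minus that bias.

```python
import numpy as np

rng = np.random.default_rng(1)
S, F, H = 6, 3, 2     # toy input length, features, forecast horizon

# A stand-in strictly linear model: two stacked affine maps.
A = rng.normal(size=(H * F, 10))
Bm = rng.normal(size=(10, S * F))
b1 = rng.normal(size=10)
b2 = rng.normal(size=H * F)

def linear_model(x_flat):            # x_flat: [S*F] -> [H*F]
    return A @ (Bm @ x_flat + b1) + b2

# Collapse into Y = W_eq X + b_eq by probing:
b_eq = linear_model(np.zeros(S * F))            # response to zero input
W_eq = np.stack([linear_model(e) - b_eq         # column i = response to e_i
                 for e in np.eye(S * F)], axis=1)

x = rng.normal(size=S * F)
assert np.allclose(linear_model(x), W_eq @ x + b_eq)
```

After the collapse, inference is a single matrix–vector product, and inspecting the rows of `W_eq` directly shows how each input time–feature pair contributes to each forecast coordinate.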

  • Tiny Time Mixers (TTMs): Pretrained, resource-efficient variants based on TSMixer incorporate adaptive patching, diverse resolution handling, and an exogenous mixer for fine-tuning multivariate and exogenous channel relationships (Ekambaram et al., 2024).
  • Irregular Data and Hierarchical Modeling: Extensions such as IMTS-Mixer transform irregularly sampled data to fixed-size matrices and apply sequence/channel mixing analogous to TSMixer (Klötergens et al., 17 Feb 2025).
  • Hybrid and Nonlinear Schemes: TSKANMixer replaces some or all MLP sublayers with Kolmogorov–Arnold Network (KAN) spline-based operators, providing universal function approximation and improved expressivity at the cost of computational speed (Hong et al., 25 Feb 2025).

5. Computational Complexity and Training Considerations

TSMixer’s complexity is largely determined by the number of blocks $L$, sequence length $S$, and feature dimension $F$:

  • Parameter Growth: $O(L(S^2 + F D))$ for the basic model — quadratic in $S$ (via the temporal-mixing map) and linear in $F$ (Chen et al., 2023).
  • Efficiency: The MLP-only backbone provides significantly reduced computational and memory requirements compared to attention-based methods, yielding 2–4× faster inference and an order of magnitude fewer parameters on large datasets (e.g., M5) (Chen et al., 2023, Ekambaram et al., 2023).
  • Training Regimes: Losses include Mean Squared Error for continuous targets or likelihood-based objectives (e.g., Negative Binomial) for count data. Optimization is typically performed via Adam, with normalization and early stopping as regularization (Chen et al., 2023).
  • Inference: For strictly linear instantiations (TLN/no-ReLU), post-training inference can be collapsed into a single matrix multiplication, accelerating deployment and facilitating direct analysis of the contribution of each input time-feature pair (Genet et al., 2024).
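The $O(L(S^2 + F D))$ growth can be checked with a back-of-envelope count. The helper below tallies only the mixing weight matrices (biases and normalization parameters omitted), and the example sizes are illustrative choices, not values from the papers:

```python
def tsmixer_params(L, S, F, D):
    """Approximate mixing-weight count for the basic TSMixer backbone."""
    temporal = S * S            # one S x S temporal map per block
    feature = F * D + D * F     # two-layer feature MLP per block
    return L * (temporal + feature)

# Example: 4 blocks, lookback 512, 7 channels, hidden width 64.
print(tsmixer_params(L=4, S=512, F=7, D=64))  # -> 1052160
```

The temporal term dominates for long lookbacks, which is one motivation for the patching strategies used in extensions like PatchTSMixer.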

6. Empirical Performance and Analyses

TSMixer and its descendants have established new baselines on a variety of standard benchmarks:

  • Long-Term Forecasting: On ETT, Electricity, Traffic, and similar datasets, TSMixer matches or outperforms transformer-based and univariate linear baselines, with up to 62% lower MSE than existing multivariate transformer variants (Chen et al., 2023).
  • Covariate Efficacy: On datasets with informative auxiliary and cross-series features (e.g., M5 retail), the multivariate TSMixer outperforms state-of-the-art univariate and transformer models by 25–30% (WRMSSE), closing the gap with deeply designed architectures such as DeepAR and TFT while using an order of magnitude fewer parameters (Chen et al., 2023).
  • Ablation Studies: On academic datasets with weak cross-covariate signals, the temporal mixing component alone suffices for strong performance, supporting the hypothesis that temporal modeling dominates on such benchmarks (Chen et al., 2023).
  • Extensions: Adding KAN layers (TSKANMixer) further improves out-of-sample accuracy by 10–30%, particularly for highly nonlinear target series, albeit at increased computational cost (Hong et al., 25 Feb 2025).

7. Theoretical Properties and Interpretability

The architectural simplicity of TSMixer enables key theoretical and practical advantages:

  • Analytic Linearity (TLN): For no-activation TSMixer variants, the full network’s action can be written as $Y = W_{\text{eq}} X + b_{\text{eq}}$, giving complete interpretability and analyzability. This property enables direct diagnostics (e.g., via weight heatmaps) of how each input coordinate drives the forecast (Genet et al., 2024).
  • Temporal Structure Preservation: In contrast to permutation-invariant transformer models, TSMixer explicitly maintains time order throughout its layers, preserving the unique structure of time series data (Genet et al., 2024).
  • Universality: With KAN augmentations, theoretical universality is inherited from the Kolmogorov–Arnold theorem, allowing modeling of arbitrary continuous functions over input histories (Hong et al., 25 Feb 2025).
  • Modular Adaptation: Modular design allows for adaptation to hierarchical, irregular, pretraining, and multiresolution forecasting tasks.

TSMixer has redefined the role of MLP-based architectures in modern time series forecasting by combining efficient deep feedforward mixing principles with rigorous temporal modeling and extensibility—challenging transformer dominance, advancing scalability, and enabling interpretability across varied application domains (Chen et al., 2023, Ekambaram et al., 2023, Genet et al., 2024, Ekambaram et al., 2024, Hong et al., 25 Feb 2025).
