TSM2: Dual Mixer Time Series Model

Updated 21 February 2026
  • TSM2 is a neural architecture for time series forecasting that couples selective state space models with dual mixing mechanisms for both temporal and channel dimensions.
  • It leverages efficient S6 blocks and dense, learnable skip connections to achieve linear time and space complexity, reducing latency and memory usage compared to Transformers.
  • Empirical evaluations demonstrate TSM2’s state-of-the-art performance in long-range forecasting and its versatility in multivariate classification and spatiotemporal applications.

Time Series MambaMixer (TSM2) denotes a class of neural architectures for time series modeling and forecasting, built by coupling selective state space models (SSMs) with specifically designed dual mixing mechanisms operating across both temporal (token) and channel (feature) axes. Unlike Transformer-based models, TSM2 achieves strictly linear time and space complexity, integrating data-dependent SSM kernels and hierarchical skip connections across layers. The design is substantiated by empirical results that consistently place TSM2 at or above the state-of-the-art for long-range time series forecasting and, under various extensions and naming conventions, in multivariate classification and spatiotemporal domains (Behrouz et al., 2024, Chen et al., 17 Aug 2025, Ma et al., 2024, Ahamed et al., 2024).

1. Core Architecture and Dual Mixer Design

TSM2 is an SSM-based backbone purpose-built for univariate and multivariate time-series forecasting, attention- and MLP-free, and constructed by stacking $L$ identical "TSM2 blocks." Each block implements two core modules in sequence:

  • Selective Time Mixer: A unidirectional, input-dependent SSM (S6 block) processing along the time axis, enforcing future-blind (causal) modeling.
  • Selective Variate Mixer: A bidirectional S6 block operating along the feature/variate (channel) axis, allowing both forward and backward dependency modeling across features, unrestricted by causality.

Additionally, every block features learnable, weighted-averaging skip connections fusing all preceding block outputs at each step. This dense, parameterized skip connectivity enables deep feature reuse and improved gradient flow.

Input to TSM2 consists of historical values $x \in \mathbb{R}^{M \times T}$, optionally concatenated with static features $S$ and future time-varying features $Z$ after linear projection and channel mixing. The backbone iteratively processes representations via the dual mixers, with the output head performing 2D normalization and a linear projection to the desired forecasting horizon $H$.
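
The input assembly can be sketched as follows. All sizes, the broadcasting of static and future features into the token sequence, and the random matrices standing in for learned projections are illustrative assumptions, not the paper's exact preprocessing; the sketch only shows how the three streams can be projected and concatenated into a length-$3T$ token sequence:

```python
import numpy as np

rng = np.random.default_rng(0)
M, T, H, D = 7, 96, 24, 16   # variates, lookback, horizon, width (illustrative)

x = rng.normal(size=(T, M))   # historical window, time-major view of x
S = rng.normal(size=(M, 5))   # static features per variate (optional)
Z = rng.normal(size=(H, M))   # known future time-varying features (optional)

def linproj(a, d_out, rng):
    """Random linear projection standing in for a learned layer."""
    W = rng.normal(size=(a.shape[-1], d_out)) / np.sqrt(a.shape[-1])
    return a @ W

x_tok = linproj(x, D, rng)                                  # (T, D)
s_tok = np.tile(linproj(S.reshape(1, -1), D, rng), (T, 1))  # (T, D), broadcast
z_tok = np.tile(linproj(Z.reshape(1, -1), D, rng), (T, 1))  # (T, D), broadcast

# Concatenate the three streams along the token axis, giving the
# length-3T input sequence referenced in the complexity analysis.
u = np.concatenate([x_tok, s_tok, z_tok], axis=0)           # (3T, D)
assert u.shape == (3 * T, D)
```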

This dual mixing structure extends simple SSMs and counters longstanding limitations of previous architectures, which either ignored cross-dimension communication or imposed fixed, data-independent SSM kernels (Behrouz et al., 2024).

2. State Space Model Parameterization and Efficient Scan

The selective SSM kernel in each mixer is parameterized by input-dependent functions:

$$\bar{B}_t = \mathrm{Linear}_B(u_t), \quad \bar{C}_t = \mathrm{Linear}_C(u_t), \quad \Delta_t = \mathrm{Softplus}(\mathrm{Linear}_\Delta(u_t))$$

with the classic discretized recurrence

$$h_t = \bar{A}_t h_{t-1} + \bar{B}_t u_t, \quad y_t = \bar{C}_t h_t,$$

where $\bar{A}_t$, $\bar{B}_t$, and $\bar{C}_t$ are dynamically adapted at each step. The efficient associative scan algorithm of S6 is leveraged for linear runtime with $\mathcal{O}(\log T)$ parallel scan depth on hardware (Behrouz et al., 2024). This approach enables input-content modulation, supporting highly data-dependent, temporally and channel-selective kernels.
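
A minimal sequential reference for this recurrence is sketched below. The real S6 kernel uses a parallel associative scan over per-channel states; the single-channel NumPy loop here is a simplification whose parameter shapes (diagonal $A$, scalar step-size projection) are assumptions for illustration:

```python
import numpy as np

def selective_ssm_1d(u, a_log, w_B, w_C, w_dt):
    """Single-channel selective SSM, sequential reference form.

    u: (T,) input sequence; a_log: (N,) parameterizing diagonal A = -exp(a_log);
    w_B, w_C: (N,) and w_dt: scalar, making B_t, C_t, Delta_t input-dependent.
    """
    A = -np.exp(a_log)                      # stable (negative) diagonal A
    h = np.zeros_like(A)
    y = np.empty_like(u)
    for t, u_t in enumerate(u):
        B_t = w_B * u_t                     # B_t = Linear_B(u_t)
        C_t = w_C * u_t                     # C_t = Linear_C(u_t)
        dt = np.log1p(np.exp(w_dt * u_t))   # Delta_t = Softplus(Linear_dt(u_t))
        A_bar = np.exp(dt * A)              # discretize: A_bar_t = exp(dt * A)
        B_bar = dt * B_t                    # Euler-style discretization of B
        h = A_bar * h + B_bar * u_t         # h_t = A_bar_t h_{t-1} + B_bar_t u_t
        y[t] = C_t @ h                      # y_t = C_t h_t
    return y
```

Because the scan runs strictly forward, the mixer is causal: perturbing the last input leaves all earlier outputs unchanged, which is the future-blind property the Selective Time Mixer relies on.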

Both time and channel mixers follow this paradigm, outputting representations gated via element-wise nonlinearities, merged with MLP outputs, and integrated with skip-weighted aggregates.
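
One plausible reading of this merge step is sketched below; the SiLU-gated product mirrors Mamba-style blocks, while treating the final combination as a simple sum with the skip-weighted aggregate is an assumption made for clarity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def merge_branches(y_ssm, y_mlp, skip_agg):
    """Gate the SSM branch element-wise with a SiLU-activated MLP branch,
    then fuse with the skip-weighted aggregate of earlier block outputs.
    All three inputs share the same (tokens, width) shape."""
    gated = y_ssm * (y_mlp * sigmoid(y_mlp))   # element-wise SiLU gating
    return gated + skip_agg
```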

3. Dual Selection Mechanism and Connectivity

Inside each TSM2 block, the twin selection process comprises:

  • Selective Time Mixing: Applies a unidirectional S6 with convolutional embedding and MLP gating on time-unfolded representations, producing $Y_{\text{Time}}$.
  • Selective Variate Mixing: Applies a bidirectional S6 across features/variates, producing $Y_{\text{Variate}}$ by combining forward and reversed scans, incorporating an MLP branch and convolutional embedding on the transposed inputs.

Input to each mixer is a weighted sum of all previous block outputs, governed by unique learnable weights. This skip-weighting ensures all TSM2 blocks—at every depth—may re-utilize both input and all intermediate abstract features, fundamentally differing from strictly sequential blocks in standard deep learning models (Behrouz et al., 2024).
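
The skip-weighted aggregation can be sketched as follows. Softmax normalization of the learnable scores is one plausible choice; the paper specifies only a learnable weighted average, so the exact normalization is an assumption:

```python
import numpy as np

def skip_weighted_input(block_outputs, logits):
    """Fuse the input embedding and all earlier block outputs into the
    input of the next mixer.

    block_outputs: list of (T, D) arrays from preceding blocks;
    logits: learnable per-output scores, one per entry in block_outputs.
    """
    w = np.exp(logits - logits.max())   # numerically stable softmax
    w = w / w.sum()
    return sum(wi * o for wi, o in zip(w, block_outputs))
```

With equal scores this reduces to a plain average of all preceding outputs; training can then shift weight toward whichever depths carry the most useful features.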

4. Computational Complexity and Scaling

TSM2 inherits and extends the computational efficiency of SSMs. Let $B$ = batch size, $L$ = input (concatenated) sequence length ($3T$), $N$ = number of variates, $E$ = expansion dimension (typically $\sim 2D$):

| Component | Time Complexity | Space Complexity |
| --- | --- | --- |
| Selective Time Mixer (S6, uni) | $\mathcal{O}(BLE + EN)$ | |
| Selective Variate Mixer (S6, bi) | $\mathcal{O}(BNE + EL)$ | |
| Per block total | $\mathcal{O}(EB(L + N))$ | $\mathcal{O}(LND)$ |
| Transformer (for reference) | $\mathcal{O}(BL^2 D)$ | $\mathcal{O}(LD^2 + L^2 D)$ |

Crucially, TSM2 possesses strictly linear scaling in both sequence length and number of variates, in contrast to the quadratic complexity of Transformers and MLP-Mixer token-mixing, without sacrificing data dependency (Behrouz et al., 2024). Empirically, this yields 2–4× reductions in inference latency and activation memory versus comparable Transformer models.
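
A back-of-envelope comparison makes the scaling difference concrete. The functions below simply instantiate the table's asymptotic costs (constant factors and the hypothetical dimension values are illustrative, not measured):

```python
def flops_linear(B, L, N, E):
    """Per-block cost of the dual mixers: O(E * B * (L + N))."""
    return E * B * (L + N)

def flops_attention(B, L, D):
    """Self-attention token mixing: O(B * L^2 * D)."""
    return B * L * L * D

# Doubling the sequence length roughly doubles the linear cost
# but quadruples the attention cost.
B, N, E, D = 32, 7, 128, 64
for L in (512, 1024, 2048):
    print(f"L={L}: dual-mixer ~{flops_linear(B, L, N, E):.2e}, "
          f"attention ~{flops_attention(B, L, D):.2e}")
```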

5. Empirical Performance and Ablations

TSM2 demonstrates benchmark-leading performance across a range of standard long-term forecasting datasets (ETTh1/2, ETTm1/2, Electricity, Exchange, Traffic, Weather), consistently outperforming Transformer-based, MLP-Mixer, and previous SSM models.

Selected MSE scores (lower is better):

| Dataset | Horizon | TSM2 | Best Competing Model | Second-best |
| --- | --- | --- | --- | --- |
| ETTh1 | 96 | 0.375 | SAMFormer (0.381) | FEDFormer (0.376) |
| Traffic | 720 | 0.449 | SAMFormer (0.456) | Transformer (0.468) |

TSM2 achieves the best result in 26 of 32 horizon-dataset pairs, and second best in 5 (Behrouz et al., 2024).

Ablation studies show that both (a) replacing the Selective Channel Mixer with an MLP (TSM2-MLP) and (b) removing the Time S6 (Mamba+LinearTime) result in substantial accuracy drops, verifying the necessity of dual S6 mixing and skip-weighting for optimal prediction. For instance, TSM2 achieves 0.375 MSE on ETTh1 (H=96), versus 0.386 (TSM2-MLP) and 0.388 (Mamba+LinearTime).

In settings with auxiliary static or temporal features, such as the M5-competition WRMSSE benchmark, TSM2 surpasses TSMixer by 7–8% on out-of-sample evaluation.

6. Extensions and Variants

TSM2 serves as a foundational design for subsequent research on spatiotemporal modeling and scalable classification. STM2/STM3 (Chen et al., 17 Aug 2025) applies Multiscale Mamba modules with hierarchical graph-based aggregation, mixture-of-experts with node-embedding-based smooth routing, and scale-disentangled contrastive learning for spatiotemporal forecasting with theoretical guarantees.

In the classification regime, the TSCMamba architecture (Ahamed et al., 2024) (sometimes called TSM2 in the literature) extends the TSM2 backbone with multi-view spectral and temporal feature fusion, shift-equivariant CWT-based spectral embeddings, and switch-gated view selection. Its "tango scanning" SSM scheme pairs forward and reverse Mamba passes to obtain inversion-invariant representations, enhancing generalization for time series classification tasks.

Variants such as DC-Mamber (Fan et al., 6 Jul 2025) and TSMamba (Ma et al., 2024) combine Mamba-type SSMs with channel-mixing attention or hybrid channel-mixing and channel-independent streams but typically retain higher complexity or limited cross-channel interaction unless task-specific heads are introduced.

7. Practical Implementation, Training, and Applications

TSM2 is trained end-to-end using objectives suitable for forecasting (e.g., MSE or WRMSSE), with cross-entropy for classification in extensions. Model capacities, block counts, and convolutional parameters are selected via grid search per dataset; normalization follows 2D batch or layer normalization across time and variate axes.
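
In its simplest form, the 2D normalization can be read as joint standardization over the time and variate axes; the exact formulation (and the omission of learnable affine parameters here) is an assumption for illustration:

```python
import numpy as np

def norm_2d(x, eps=1e-5):
    """Standardize jointly over the last two (time, variate) axes.

    x: (..., T, M) batch of time-series windows. Per-feature affine
    scale/shift terms, present in standard normalization layers, are
    omitted in this sketch.
    """
    mu = x.mean(axis=(-2, -1), keepdims=True)
    var = x.var(axis=(-2, -1), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)
```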

TSM2 adapts to auxiliary static and future features through parallel linear projections integrated in preprocessing. Input sequence lengths, forecast horizons, and batch sizes are adapted per task (e.g., $L_\text{in} = 512$, $H \in \{96, 192, 336, 720\}$, batch size $32$–$128$), and optimization relies on AdamW with weight decay and a learning rate near $10^{-3}$.

TSM2's architecture is universally applicable to univariate and multivariate forecasting, with empirical validation spanning financial, weather, energy demand, and traffic datasets. The absence of attention or MLPs in the backbone, combined with the dual selective mechanism, positions TSM2 as both a reference architecture and a compelling backbone for pretraining, transfer learning, and downstream time series tasks (Behrouz et al., 2024, Ma et al., 2024, Chen et al., 17 Aug 2025, Ahamed et al., 2024).


References

  • "MambaMixer: Efficient Selective State Space Models with Dual Token and Channel Selection" (Behrouz et al., 2024)
  • "STM3: Mixture of Multiscale Mamba for Long-Term Spatio-Temporal Time-Series Prediction" (Chen et al., 17 Aug 2025)
  • "A Mamba Foundation Model for Time Series Forecasting" (Ma et al., 2024)
  • "TSCMamba: Mamba Meets Multi-View Learning for Time Series Classification" (Ahamed et al., 2024)
  • "DC-Mamber: A Dual Channel Prediction Model based on Mamba and Linear Transformer for Multivariate Time Series Forecasting" (Fan et al., 6 Jul 2025)
