Multi-Task Temporal Fusion Transformer (TFT-MTL)

Updated 13 February 2026
  • Multi-Task TFT (TFT-MTL) is a deep learning framework that jointly forecasts sales, inventory turnover, and stockout probability using a unified encoder-decoder architecture.
  • It integrates static and dynamic features through embedding layers, GRNs, and temporal attention, offering interpretability via variable selection and attention attribution.
  • Empirical results show significant improvements over single-task models, demonstrating enhanced forecasting accuracy and operational decision support in e-commerce supply chains.

The Multi-Task Temporal Fusion Transformer (TFT-MTL) is a deep learning framework for jointly forecasting sales, inventory turnover, and stockout probability in large-scale e-commerce supply chains. Designed for the Amazon e-commerce ecosystem, TFT-MTL integrates multivariate, heterogeneous inputs and leverages a unified encoder-decoder architecture with explicit mechanisms for variable selection, temporal attention, and interpretable multi-task prediction. Empirical results demonstrate that TFT-MTL outperforms standard benchmarks for time series forecasting, substantiating the value of multi-task temporal modeling in operational decision support (Hu et al., 29 Nov 2025).

1. Data Inputs, Embeddings, and Preprocessing

TFT-MTL operates on a mix of static and time-varying features, each aligned to daily timestamps. Static covariates $X^s$ such as product_id (ASIN), category, and brand are mapped via embedding layers and projected using a feed-forward transformation:

$$h^s = \mathrm{ReLU}\bigl(W^s X^s + b^s\bigr), \quad h^s \in \mathbb{R}^{d_s}$$

Dynamic temporal features $x_t$ include:

  • Targets: daily_sales, inventory_level
  • Pricing: price, discount_rate
  • Marketing: ad_spend, page_views
  • Events: is_holiday, day_of_week
  • Logistics: lead_time

Continuous variables are standardized, and day_of_week is one-hot encoded. The feature vector $x_t$ is transformed by a Gated Residual Network (GRN):

$$h^{d}_t = \mathrm{GRN}(x_t) = \mathrm{LayerNorm}\bigl(x_t + \mathrm{GLU}(W^x x_t)\bigr)$$

where $\mathrm{GLU}$ is a gated linear unit. This configuration enables nonlinear transformations and dynamic selection of relevant covariates across time.
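The GRN transformation above can be sketched in a few lines of NumPy. This is a minimal illustration of the formula, not the paper's implementation: the feature dimension, weight shapes, and the choice of a half-split GLU are illustrative assumptions.

```python
import numpy as np

def glu(z):
    # Gated linear unit: split the last axis in half, gate one half with the other
    a, b = np.split(z, 2, axis=-1)
    return a * (1.0 / (1.0 + np.exp(-b)))

def layer_norm(x, eps=1e-5):
    # Normalize each feature vector to zero mean and unit variance
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def grn(x, W_x):
    # h_t = LayerNorm(x_t + GLU(W^x x_t)), as in the formula above
    return layer_norm(x + glu(x @ W_x.T))

rng = np.random.default_rng(0)
d = 8                              # feature dimension (illustrative)
x_t = rng.normal(size=(30, d))     # 30 daily timesteps
W_x = rng.normal(size=(2 * d, d))  # GLU halves its input, so W^x emits 2*d
h_d = grn(x_t, W_x)                # shape (30, d)
```

Note that the residual connection requires the GLU output to match the input dimension, which is why $W^x$ here produces $2d$ pre-gate channels.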

2. Model Structure: Shared Encoder and Task-Specific Decoders

2.1 Shared Temporal Encoder

The encoder fuses both short- and long-range patterns by combining several architectural elements:

  • Variable Selection Network (VSN): At each time $t$, a softmax-based gating layer assigns a relevance score $\alpha_t$ to each input:

$$\alpha_t = \mathrm{Softmax}\bigl(W^v h^d_t + b^v\bigr)$$

The input embedding $h^d_t$ is re-weighted elementwise:

$$\widetilde{h}^d_t = \alpha_t \odot h^d_t$$

  • Static Covariate Fusion: Static embeddings $h^s$ are concatenated with every $\widetilde{h}^d_t$ for context-aware modeling.
  • Positional Encoding: Sinusoidal encodings are appended to $\widetilde{h}^d_t$ to encode temporal order.
  • Multi-Head Temporal Attention: The model applies self-attention to sequences of $\widetilde{h}^d_t$ over a look-back horizon $T_\mathrm{hist}$:

$$\mathrm{Attention}_i(Q,K,V) = \mathrm{Softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

The attended representations are further processed by GRNs for controlled residual integration.
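The two central encoder operations — softmax variable gating and scaled dot-product self-attention — can be sketched as follows. This is a single-head, unbatched illustration under assumed shapes, omitting static fusion, positional encodings, and the residual GRNs described above.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def variable_select(h_d, W_v, b_v):
    # alpha_t = Softmax(W^v h_t + b^v); re-weight the embedding elementwise
    alpha = softmax(h_d @ W_v.T + b_v)
    return alpha * h_d, alpha

def temporal_attention(Q, K, V):
    # Scaled dot-product self-attention over the look-back window
    d_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))
    return A @ V, A

rng = np.random.default_rng(1)
T_hist, d = 30, 8
h_d = rng.normal(size=(T_hist, d))               # GRN-transformed inputs
W_v, b_v = rng.normal(size=(d, d)), np.zeros(d)
h_tilde, alpha = variable_select(h_d, W_v, b_v)  # gated inputs
out, A = temporal_attention(h_tilde, h_tilde, h_tilde)
```

Each row of $A$ sums to one, so the attention weights are directly interpretable as a distribution over past timesteps — the property exploited in Section 5.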

2.2 Task-Specific Decoder Heads

The shared encoder output $h_t$ is input to three parallel heads:

  1. Sales Volume Head:

$$\hat{y}_t^{(s)} = W^{(s)} h_t + b^{(s)}$$

  2. Inventory Turnover Head:

$$\hat{y}_t^{(i)} = W^{(i)} h_t + b^{(i)}$$

  3. Stockout Probability Head:

$$\hat{p}_t^{(p)} = \sigma\bigl(W^{(p)} h_t + b^{(p)}\bigr)$$

Each head can be augmented with an additional GRN for expressiveness, supporting both regression (sales, inventory) and probabilistic (stockout) targets.
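The three heads reduce to parallel linear maps over the shared representation, with a sigmoid on the stockout head. A minimal sketch, with hypothetical parameter names and an assumed scalar output per head:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode(h_t, params):
    # Three parallel heads over the shared encoder output h_t
    y_sales = h_t @ params["W_s"] + params["b_s"]           # regression
    y_inv   = h_t @ params["W_i"] + params["b_i"]           # regression
    p_stock = sigmoid(h_t @ params["W_p"] + params["b_p"])  # probability in (0, 1)
    return y_sales, y_inv, p_stock

rng = np.random.default_rng(2)
d = 8
params = {k: rng.normal(size=(d,)) for k in ("W_s", "W_i", "W_p")}
params.update({k: 0.0 for k in ("b_s", "b_i", "b_p")})
h_t = rng.normal(size=(14, d))   # encoder outputs over a 14-step horizon
y_s, y_i, p = decode(h_t, params)
```

Because the heads share no parameters with one another, gradients from each task flow back only through the shared encoder, which is where the multi-task coupling happens.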

3. Multi-Task Learning Objectives and Training

3.1 Joint Loss Formulation

TFT-MTL is trained by minimizing a weighted sum of task losses:

$$\mathcal{L} = \alpha_s\,\mathcal{L}_{\rm sales} + \alpha_i\,\mathcal{L}_{\rm inventory} + \alpha_p\,\mathcal{L}_{\rm stockout}$$

  • Sales loss (MSE):

$$\mathcal{L}_{\rm sales} = \frac{1}{T_{\rm pred}} \sum_{t=1}^{T_{\rm pred}} \bigl(\hat{y}_t^{(s)} - y_t^{(s)}\bigr)^2$$

  • Inventory loss (MSE):

$$\mathcal{L}_{\rm inventory} = \frac{1}{T_{\rm pred}} \sum_{t=1}^{T_{\rm pred}} \bigl(\hat{y}_t^{(i)} - y_t^{(i)}\bigr)^2$$

  • Stockout loss (Binary Cross-Entropy):

$$\mathcal{L}_{\rm stockout} = -\frac{1}{T_{\rm pred}} \sum_{t=1}^{T_{\rm pred}} \bigl[z_t \log \hat{p}_t^{(p)} + (1 - z_t)\log\bigl(1 - \hat{p}_t^{(p)}\bigr)\bigr]$$

Task weights $(\alpha_s, \alpha_i, \alpha_p)$ are determined either via cross-validation or dynamic re-weighting based on gradient norms; in experiments, $\alpha_p$ is often set lower to reflect the scale of the BCE loss.
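The joint objective is straightforward to express in code. The weight values below are illustrative placeholders (the source only says $\alpha_p$ is often set lower), and the dictionary keys are hypothetical:

```python
import numpy as np

def mse(y_hat, y):
    return float(np.mean((y_hat - y) ** 2))

def bce(p_hat, z, eps=1e-7):
    # Clip to avoid log(0) on saturated predictions
    p = np.clip(p_hat, eps, 1.0 - eps)
    return float(-np.mean(z * np.log(p) + (1.0 - z) * np.log(1.0 - p)))

def joint_loss(preds, targets, alpha_s=1.0, alpha_i=1.0, alpha_p=0.5):
    # L = alpha_s * L_sales + alpha_i * L_inventory + alpha_p * L_stockout
    return (alpha_s * mse(preds["sales"], targets["sales"])
            + alpha_i * mse(preds["inv"], targets["inv"])
            + alpha_p * bce(preds["stockout"], targets["stockout"]))
```

With gradient-norm re-weighting, the `alpha_*` values would be recomputed each step rather than fixed; the fixed-weight form above is the cross-validation variant.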

3.2 Training Regimen

TFT-MTL is trained with AdamW (learning rate $5 \times 10^{-4}$, weight decay $10^{-5}$), batch size 64, forecast horizon $T_\mathrm{pred} = 14$, and look-back $T_\mathrm{hist} = 30$. Dropout (0.1) is applied in GRNs and attention layers. Learning rate annealing is triggered by plateaus in validation loss. Early stopping is employed based on validation loss, typically converging between epochs 120 and 140 of 150.
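The source specifies early stopping on validation loss but not its exact rule. One common patience-based formulation, sketched here purely as an assumption, halts once the validation loss has gone `patience` epochs without improving on its earlier best:

```python
def should_stop(val_losses, patience=10):
    # Stop when the last `patience` epochs show no improvement over the
    # best validation loss seen before that window. `patience` is assumed.
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before
```

Learning rate annealing on plateaus typically uses the same signal: when this condition nears triggering, the learning rate is reduced first, and training stops only if the loss still fails to improve.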

4. Empirical Performance and Comparative Results

Performance is evaluated on a held-out six-month test set using RMSE and MAPE:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{t=1}^{N}\bigl(y_t - \hat{y}_t\bigr)^2}$$

$$\mathrm{MAPE} = \frac{100\%}{N}\sum_{t=1}^{N}\left|\frac{y_t - \hat{y}_t}{y_t}\right|$$
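Both metrics are one-liners; a direct NumPy transcription of the formulas above (note MAPE is undefined where $y_t = 0$, which these sketches do not guard against):

```python
import numpy as np

def rmse(y, y_hat):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def mape(y, y_hat):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(100.0 * np.mean(np.abs((y - y_hat) / y)))

# e.g. rmse([100, 200], [110, 180]) -> ~15.81, mape([100, 200], [110, 180]) -> 10.0
```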

| Metric | TFT-MTL | Single-Task TFT | LSTM |
|---|---|---|---|
| Sales RMSE | 42.57 | 45.36 | 54.12 |
| Sales MAPE (%) | 8.68 | 9.94 | — |
| Inventory RMSE | 39.86 | 42.57 | — |
| Inventory MAPE (%) | 8.43 | 9.63 | — |
| Multi-Task Efficiency | 0.894 | 0.861 | 0.781 |

TFT-MTL achieves a 6.2% reduction in Sales RMSE, 12.7% reduction in Sales MAPE, 6.4% reduction in Inventory RMSE, and 12.4% reduction in Inventory MAPE relative to single-task TFT. The Multi-Task Efficiency Score (MTES) improvement over baselines confirms the advantage of joint modeling for multi-dimensional supply chain forecasting (Hu et al., 29 Nov 2025).

5. Interpretability and Decision Support

TFT-MTL inherits the interpretability features of the Temporal Fusion Transformer, with two principal mechanisms:

  • Variable Importance: The VSN gating weights $\alpha_{t,j}$ quantify the relevance of each feature $j$ at time $t$, enabling visualization as heatmaps to highlight salient periods or factors (e.g., discount_rate spikes during promotion events).
  • Attention Attribution: Attention matrices $A_{t+h,\tau}$ score the influence of each historical timestep $\tau$ on predictions at $t + h$:

$$A_{t+h,\tau} = \mathrm{Softmax}\bigl(q_{t+h}^\top k_\tau / \sqrt{d_k}\bigr)$$

These interpretability elements deliver actionable insights for demand planning and inventory scheduling by quantifying both feature-level and temporal dependencies.
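Extracting the attribution scores for a single forecast step amounts to one softmax over the history; a minimal sketch under assumed shapes:

```python
import numpy as np

def attention_attribution(q_future, K_hist):
    # A_{t+h, tau} = Softmax(q_{t+h}^T k_tau / sqrt(d_k)) over history steps
    d_k = K_hist.shape[-1]
    scores = K_hist @ q_future / np.sqrt(d_k)
    scores -= scores.max()       # numerical stability
    w = np.exp(scores)
    return w / w.sum()           # one nonnegative weight per past timestep

rng = np.random.default_rng(3)
T_hist, d_k = 30, 8
weights = attention_attribution(rng.normal(size=d_k),
                                rng.normal(size=(T_hist, d_k)))
```

Because the weights form a distribution over the look-back window, a planner can read them directly as "which past days drove this forecast" — e.g., a large weight on a prior holiday for a prediction 7 days ahead.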

6. Application Context and Significance

The TFT-MTL framework is deployed for joint sales and inventory forecasting in Amazon’s e-commerce supply chain. By integrating a broad spectrum of covariates and modeling multivariate temporal dependencies, it acts as both a high-accuracy predictive engine and an interpretable decision-support system for inventory optimization, replenishment, and demand management (Hu et al., 29 Nov 2025). The observed empirical improvements over conventional single-task and sequential models suggest that multi-task temporal modeling is effective for complex, operationally critical forecasting scenarios in modern supply chains.
