Multi-Task Temporal Fusion Transformer (TFT-MTL)
- Multi-Task TFT (TFT-MTL) is a deep learning framework that jointly forecasts sales, inventory turnover, and stockout probability using a unified encoder-decoder architecture.
- It integrates static and dynamic features through embedding layers, gated residual networks (GRNs), and temporal attention, offering interpretability via variable selection and attention attribution.
- Empirical results show significant improvements over single-task models, demonstrating enhanced forecasting accuracy and operational decision support in e-commerce supply chains.
The Multi-Task Temporal Fusion Transformer (TFT-MTL) is a deep learning framework for jointly forecasting sales, inventory turnover, and stockout probability in large-scale e-commerce supply chains. Designed for the Amazon e-commerce ecosystem, TFT-MTL integrates multivariate, heterogeneous inputs and leverages a unified encoder-decoder architecture with explicit mechanisms for variable selection, temporal attention, and interpretable multi-task prediction. Empirical results demonstrate that TFT-MTL outperforms standard benchmarks for time series forecasting, substantiating the value of multi-task temporal modeling in operational decision support (Hu et al., 29 Nov 2025).
1. Data Inputs, Embeddings, and Preprocessing
TFT-MTL operates on a mix of static and time-varying features, each aligned to daily timestamps. Static covariates $\mathbf{s}$ such as product_id (ASIN), category, and brand are mapped via embedding layers and projected using a feed-forward transformation:

$$\mathbf{e}_s = \mathrm{FFN}(\mathrm{Embed}(\mathbf{s}))$$

Dynamic temporal features include:
- Targets: daily_sales, inventory_level
- Pricing: price, discount_rate
- Marketing: ad_spend, page_views
- Events: is_holiday, day_of_week
- Logistics: lead_time
Continuous variables are standardized, and day_of_week is one-hot encoded. The feature vector $\mathbf{x}_t$ is transformed by a Gated Residual Network (GRN):

$$\mathrm{GRN}(\mathbf{x}_t) = \mathrm{LayerNorm}\big(\mathbf{x}_t + \mathrm{GLU}(\boldsymbol{\eta}_t)\big), \qquad \boldsymbol{\eta}_t = W_1\,\mathrm{ELU}(W_2 \mathbf{x}_t + b_2) + b_1,$$

where $\mathrm{GLU}(\boldsymbol{\eta}) = \sigma(W_3 \boldsymbol{\eta} + b_3) \odot (W_4 \boldsymbol{\eta} + b_4)$ is a gated linear unit. This configuration enables nonlinear transformations and dynamic selection of relevant covariates across time.
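As a concrete illustration, the GRN transformation can be sketched in NumPy (square weight matrices and a LayerNorm without learned scale/shift are simplifications for exposition, not the paper's implementation):

```python
import numpy as np

def glu(x, W_a, b_a, W_b, b_b):
    """Gated linear unit: elementwise sigmoid gate times a linear transform."""
    gate = 1.0 / (1.0 + np.exp(-(x @ W_a + b_a)))
    return gate * (x @ W_b + b_b)

def grn(x, params):
    """Gated Residual Network: ELU feed-forward, GLU gating, residual + LayerNorm."""
    W1, b1, W2, b2, W_a, b_a, W_b, b_b = params
    eta = x @ W2 + b2
    eta = np.where(eta > 0, eta, np.exp(eta) - 1.0)   # ELU nonlinearity
    eta = eta @ W1 + b1
    out = x + glu(eta, W_a, b_a, W_b, b_b)            # gated residual connection
    mu = out.mean(-1, keepdims=True)
    sigma = out.std(-1, keepdims=True)
    return (out - mu) / (sigma + 1e-6)                # LayerNorm (no learned affine)

# Usage with illustrative random weights, feature dimension d = 8:
rng = np.random.default_rng(0)
d = 8
params = tuple(p for _ in range(4) for p in (rng.normal(size=(d, d)), np.zeros(d)))
x = rng.normal(size=(5, d))   # 5 timesteps of d-dimensional features
h = grn(x, params)            # shape (5, d), normalized per timestep
```

Because of the final LayerNorm, each output row has (approximately) zero mean and unit variance, which keeps the gated residual path numerically stable across stacked GRN blocks.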
2. Model Structure: Shared Encoder and Task-Specific Decoders
2.1 Shared Temporal Encoder
The encoder fuses both short- and long-range patterns by combining several architectural elements:
- Variable Selection Network (VSN): At each time $t$, a softmax-based gating layer assigns a relevance score $\alpha_{t,i}$ to each input variable $i$:

  $$\boldsymbol{\alpha}_t = \mathrm{softmax}(W_v \boldsymbol{\Xi}_t + b_v),$$

  where $\boldsymbol{\Xi}_t$ is the flattened concatenation of the per-variable embeddings $\boldsymbol{\xi}_{t,i}$. The input embeddings are re-weighted: $\tilde{\boldsymbol{\xi}}_t = \sum_i \alpha_{t,i}\, \boldsymbol{\xi}_{t,i}$.
- Static Covariate Fusion: Static embeddings $\mathbf{e}_s$ are concatenated with every temporal representation $\tilde{\boldsymbol{\xi}}_t$ for context-aware modeling.
- Positional Encoding: Sinusoidal encodings are appended to $\tilde{\boldsymbol{\xi}}_t$ to encode temporal order.
- Multi-Head Temporal Attention: The model applies self-attention to sequences of $\tilde{\boldsymbol{\xi}}_t$ over a look-back horizon $L$:

  $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$
The attended representations are further processed by GRNs for controlled residual integration.
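The variable-selection gating and temporal self-attention described above can be sketched as follows (single-head and NumPy-based for clarity; the actual model uses multi-head attention and GRN-based gating rather than a single linear layer):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def variable_selection(xi, W_v, b_v):
    """xi: (T, n_vars, d) per-variable embeddings.
    Returns gating weights (T, n_vars) and the re-weighted sum (T, d)."""
    flat = xi.reshape(xi.shape[0], -1)           # flatten variables per timestep
    alpha = softmax(flat @ W_v + b_v)            # relevance scores, rows sum to 1
    fused = (alpha[..., None] * xi).sum(axis=1)  # weighted combination of variables
    return alpha, fused

def self_attention(H, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over a sequence H: (T, d)."""
    Q, K, V = H @ W_q, H @ W_k, H @ W_v
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (T, T) attention weights
    return A @ V, A

# Usage with illustrative shapes: T=6 timesteps, 4 variables, d=8.
rng = np.random.default_rng(1)
T, n_vars, d = 6, 4, 8
xi = rng.normal(size=(T, n_vars, d))
alpha, fused = variable_selection(xi, rng.normal(size=(n_vars * d, n_vars)), np.zeros(n_vars))
ctx, A = self_attention(fused, *(rng.normal(size=(d, d)) for _ in range(3)))
```

The returned `alpha` and `A` matrices are exactly the quantities later used for interpretability: each row is a probability distribution over input variables and over historical timesteps, respectively.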
2.2 Task-Specific Decoder Heads
The shared encoder output $\mathbf{h}_{t+h}$ is fed to three parallel heads:
- Sales Volume Head: $\hat{y}^{\mathrm{sales}}_{t+h} = W_s \mathbf{h}_{t+h} + b_s$
- Inventory Turnover Head: $\hat{y}^{\mathrm{inv}}_{t+h} = W_i \mathbf{h}_{t+h} + b_i$
- Stockout Probability Head: $\hat{p}^{\mathrm{stock}}_{t+h} = \sigma(W_p \mathbf{h}_{t+h} + b_p)$
Each head can be augmented with an additional GRN for expressiveness, supporting both regression (sales, inventory) and probabilistic (stockout) targets.
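A minimal sketch of the three heads, assuming simple linear projections of the shared encoder state (the optional GRN augmentation mentioned above is omitted):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode_heads(h, params):
    """h: (H, d) shared-encoder states across the forecast horizon.
    Returns sales and inventory point forecasts plus stockout probabilities."""
    (W_s, b_s), (W_i, b_i), (W_p, b_p) = params
    sales = h @ W_s + b_s                # regression head (real-valued)
    inventory = h @ W_i + b_i            # regression head (real-valued)
    stockout = sigmoid(h @ W_p + b_p)    # probabilistic head, output in (0, 1)
    return sales, inventory, stockout

# Usage with illustrative weights: horizon H=7, hidden dimension d=8.
rng = np.random.default_rng(2)
H, d = 7, 8
h = rng.normal(size=(H, d))
params = tuple((rng.normal(size=d), 0.0) for _ in range(3))
sales, inventory, stockout = decode_heads(h, params)
```

The sigmoid on the stockout head is what lets that target be trained with binary cross-entropy while the other two heads use squared error, as described in the next section.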
3. Multi-Task Learning Objectives and Training
3.1 Joint Loss Formulation
TFT-MTL is trained by minimizing a weighted sum of task losses:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{\mathrm{sales}} + \lambda_2 \mathcal{L}_{\mathrm{inv}} + \lambda_3 \mathcal{L}_{\mathrm{stock}}$$

- Sales loss (MSE): $\mathcal{L}_{\mathrm{sales}} = \frac{1}{N}\sum_{n} \big(y^{\mathrm{sales}}_n - \hat{y}^{\mathrm{sales}}_n\big)^2$
- Inventory loss (MSE): $\mathcal{L}_{\mathrm{inv}} = \frac{1}{N}\sum_{n} \big(y^{\mathrm{inv}}_n - \hat{y}^{\mathrm{inv}}_n\big)^2$
- Stockout loss (Binary Cross-Entropy): $\mathcal{L}_{\mathrm{stock}} = -\frac{1}{N}\sum_{n} \big[y^{\mathrm{stock}}_n \log \hat{p}^{\mathrm{stock}}_n + (1 - y^{\mathrm{stock}}_n) \log\big(1 - \hat{p}^{\mathrm{stock}}_n\big)\big]$
Task weights $\lambda_1, \lambda_2, \lambda_3$ are determined either via cross-validation or dynamic re-weighting based on gradient norms; in experiments, $\lambda_3$ is often set lower to reflect the scale of the BCE loss.
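The joint objective can be written directly; the default weights `lams` below are illustrative placeholders, not the tuned values from the paper:

```python
import numpy as np

def joint_loss(y_s, p_s, y_i, p_i, y_o, p_o, lams=(1.0, 1.0, 0.5)):
    """Weighted sum of two MSE terms (sales, inventory) and one BCE term (stockout).

    y_* are targets, p_* are predictions; lams = (lambda_1, lambda_2, lambda_3)
    would be set by cross-validation or gradient-norm re-weighting in practice."""
    l_sales = np.mean((y_s - p_s) ** 2)
    l_inv = np.mean((y_i - p_i) ** 2)
    eps = 1e-12  # guard against log(0) at saturated probabilities
    l_stock = -np.mean(y_o * np.log(p_o + eps) + (1 - y_o) * np.log(1 - p_o + eps))
    l1, l2, l3 = lams
    return l1 * l_sales + l2 * l_inv + l3 * l_stock

# Usage: perfect predictions drive the total loss to (numerically) zero.
y = np.array([1.0, 2.0, 3.0])
y_out = np.array([0.0, 1.0, 1.0])
total = joint_loss(y, y, y, y, y_out, y_out)
```

Because all three terms are averaged over the same batch, re-scaling a single `lams` entry directly trades off that task's gradient contribution against the other two, which is the mechanism the weight-selection schemes above exploit.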
3.2 Training Regimen
TFT-MTL is trained with AdamW (tuned learning rate and weight decay), batch size 64, a forecast horizon $H$, and a look-back window $L$. Dropout (0.1) is applied in GRNs and attention layers. Learning rate annealing is triggered by plateaus in validation loss, and early stopping uses the same criterion, with training typically converging between epochs 120 and 140 of a 150-epoch budget.
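A generic sketch of plateau-based annealing with early stopping, as described above; the patience values and halving factor are assumptions for illustration, not the authors' settings:

```python
class PlateauScheduler:
    """Halve the learning rate when validation loss plateaus; signal early
    stopping after `stop_patience` epochs without improvement."""

    def __init__(self, lr, anneal_patience=5, stop_patience=15, factor=0.5):
        self.lr, self.factor = lr, factor
        self.anneal_patience, self.stop_patience = anneal_patience, stop_patience
        self.best, self.bad_epochs = float("inf"), 0

    def step(self, val_loss):
        """Call once per epoch; returns (current_lr, should_stop)."""
        if val_loss < self.best - 1e-6:          # meaningful improvement
            self.best, self.bad_epochs = val_loss, 0
        else:                                    # plateau: count and maybe anneal
            self.bad_epochs += 1
            if self.bad_epochs % self.anneal_patience == 0:
                self.lr *= self.factor
        return self.lr, self.bad_epochs >= self.stop_patience

# Usage: improving losses keep the lr fixed; a long plateau anneals then stops.
sched = PlateauScheduler(1e-3)
for loss in (1.0, 0.9, 0.8):
    lr, stop = sched.step(loss)
for _ in range(15):
    lr, stop = sched.step(0.8)
```

In a real training loop the returned `lr` would be written into the AdamW optimizer's parameter groups at the start of each epoch.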
4. Empirical Performance and Comparative Results
Performance is evaluated on a held-out six-month test set using RMSE and MAPE:
| Metric | TFT-MTL | Single-Task TFT | LSTM |
|---|---|---|---|
| Sales RMSE | 42.57 | 45.36 | 54.12 |
| Sales MAPE (%) | 8.68 | 9.94 | — |
| Inventory RMSE | 39.86 | 42.57 | — |
| Inventory MAPE (%) | 8.43 | 9.63 | — |
| Multi-Task Efficiency Score (MTES) | 0.894 | 0.861 | 0.781 |
TFT-MTL achieves a 6.2% reduction in Sales RMSE, 12.7% reduction in Sales MAPE, 6.4% reduction in Inventory RMSE, and 12.4% reduction in Inventory MAPE relative to single-task TFT. The Multi-Task Efficiency Score (MTES) improvement over baselines confirms the advantage of joint modeling for multi-dimensional supply chain forecasting (Hu et al., 29 Nov 2025).
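A quick arithmetic check confirms that the reported percentage reductions follow from the table values (to within rounding):

```python
def pct_reduction(baseline, model):
    """Relative improvement of TFT-MTL over the single-task TFT baseline."""
    return (baseline - model) / baseline * 100.0

# (single-task TFT, TFT-MTL, reported reduction %) from the comparison table.
checks = {
    "sales_rmse": (45.36, 42.57, 6.2),
    "sales_mape": (9.94, 8.68, 12.7),
    "inv_rmse":   (42.57, 39.86, 6.4),
    "inv_mape":   (9.63, 8.43, 12.4),
}
for name, (base, mtl, reported) in checks.items():
    got = pct_reduction(base, mtl)
    # Allow a small tolerance for one-decimal rounding of the reported figures.
    assert abs(got - reported) < 0.15, (name, got, reported)
```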
5. Interpretability and Decision Support
TFT-MTL inherits the interpretability features of the Temporal Fusion Transformer, with two principal mechanisms:
- Variable Importance: The VSN gating weights $\alpha_{t,i}$ quantify the relevance of each feature $i$ at time $t$, enabling visualization as heatmaps to highlight salient periods or factors (e.g., discount_rate spikes during promotion events).
- Attention Attribution: Attention matrices score the influence of each historical timestep $\tau$ on predictions at $t+h$:

  $$a_{t+h,\tau} = \mathrm{softmax}_\tau\!\left(\frac{\mathbf{q}_{t+h}^\top \mathbf{k}_\tau}{\sqrt{d_k}}\right)$$
These interpretability elements deliver actionable insights for demand planning and inventory scheduling by quantifying both feature-level and temporal dependencies.
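Given the VSN weights and attention matrix produced by the encoder, feature- and timestep-level importance can be aggregated as follows (simple time-averaging is one of several reasonable summarization choices, not a prescription from the paper):

```python
import numpy as np

def feature_importance(alpha):
    """alpha: (T, n_vars) VSN gating weights, each row summing to 1.
    Averaging over time gives a global feature ranking; the raw matrix
    can be rendered directly as a time-by-feature heatmap."""
    return alpha.mean(axis=0)

def temporal_influence(A):
    """A: (T, T) attention matrix; row t scores the influence of each past
    step on the prediction at t. Column means give overall timestep weight."""
    return A.mean(axis=0)

# Usage with a toy 2x2 attention matrix (rows already sum to 1).
A = np.array([[0.5, 0.5],
              [0.2, 0.8]])
imp = feature_importance(A)     # here reused as a 2-variable gating example
infl = temporal_influence(A)
```

Because each input row is a probability distribution, both aggregates remain normalized, so they can be compared across products or reporting periods without re-scaling.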
6. Application Context and Significance
The TFT-MTL framework is deployed for joint sales and inventory forecasting in Amazon’s e-commerce supply chain. By integrating a broad spectrum of covariates and modeling multivariate temporal dependencies, it acts as both a high-accuracy predictive engine and an interpretable decision-support system for inventory optimization, replenishment, and demand management (Hu et al., 29 Nov 2025). The observed empirical improvements over conventional single-task and sequential models suggest that multi-task temporal modeling is effective for complex, operationally critical forecasting scenarios in modern supply chains.