Multi-Task Temporal Fusion Transformer (TFT-MTL)
- Multi-Task TFT (TFT-MTL) is a deep learning framework that jointly forecasts sales, inventory turnover, and stockout probability using a unified encoder-decoder architecture.
- It integrates static and dynamic features through embedding layers, gated residual networks (GRNs), and temporal attention, offering interpretability via variable selection and attention attribution.
- Empirical results show significant improvements over single-task models, demonstrating enhanced forecasting accuracy and operational decision support in e-commerce supply chains.
The Multi-Task Temporal Fusion Transformer (TFT-MTL) is a deep learning framework for jointly forecasting sales, inventory turnover, and stockout probability in large-scale e-commerce supply chains. Designed for the Amazon e-commerce ecosystem, TFT-MTL integrates multivariate, heterogeneous inputs and leverages a unified encoder-decoder architecture with explicit mechanisms for variable selection, temporal attention, and interpretable multi-task prediction. Empirical results demonstrate that TFT-MTL outperforms standard benchmarks for time series forecasting, substantiating the value of multi-task temporal modeling in operational decision support (Hu et al., 29 Nov 2025).
1. Data Inputs, Embeddings, and Preprocessing
TFT-MTL operates on a mix of static and time-varying features, each aligned to daily timestamps. Static covariates $\mathbf{s}$ such as product_id (ASIN), category, and brand are mapped via embedding layers and projected using a feed-forward transformation:

$$\mathbf{e}_s = \mathrm{FFN}(\mathrm{Embed}(\mathbf{s}))$$

Dynamic temporal features include:
- Targets: daily_sales, inventory_level
- Pricing: price, discount_rate
- Marketing: ad_spend, page_views
- Events: is_holiday, day_of_week
- Logistics: lead_time
Continuous variables are standardized, and day_of_week is one-hot encoded. The feature vector $\mathbf{x}_t$ is transformed by a Gated Residual Network (GRN):

$$\mathrm{GRN}(\mathbf{x}_t) = \mathrm{LayerNorm}\big(\mathbf{x}_t + \mathrm{GLU}(\boldsymbol{\eta}_t)\big), \qquad \boldsymbol{\eta}_t = W_1\,\mathrm{ELU}(W_2 \mathbf{x}_t + b_2) + b_1,$$

where $\mathrm{GLU}(\boldsymbol{\eta}) = \sigma(W_3 \boldsymbol{\eta} + b_3) \odot (W_4 \boldsymbol{\eta} + b_4)$ is a gated linear unit. This configuration enables nonlinear transformations and dynamic selection of relevant covariates across time.
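As a concrete illustration, the GRN transformation can be sketched in NumPy (square weight matrices and a LayerNorm without learned scale/shift are simplifications for exposition, not the paper's implementation):

```python
import numpy as np

def glu(x, W_a, b_a, W_b, b_b):
    """Gated linear unit: elementwise sigmoid gate times a linear transform."""
    gate = 1.0 / (1.0 + np.exp(-(x @ W_a + b_a)))
    return gate * (x @ W_b + b_b)

def grn(x, params):
    """Gated Residual Network: ELU feed-forward, GLU gating, residual + LayerNorm."""
    W1, b1, W2, b2, W_a, b_a, W_b, b_b = params
    eta = x @ W2 + b2
    eta = np.where(eta > 0, eta, np.exp(eta) - 1.0)   # ELU nonlinearity
    eta = eta @ W1 + b1
    out = x + glu(eta, W_a, b_a, W_b, b_b)            # gated residual connection
    mu = out.mean(-1, keepdims=True)
    sigma = out.std(-1, keepdims=True)
    return (out - mu) / (sigma + 1e-6)                # LayerNorm (no learned affine)

# Usage with illustrative random weights, feature dimension d = 8:
rng = np.random.default_rng(0)
d = 8
params = tuple(p for _ in range(4) for p in (rng.normal(size=(d, d)), np.zeros(d)))
x = rng.normal(size=(5, d))   # 5 timesteps of d-dimensional features
h = grn(x, params)            # shape (5, d), normalized per timestep
```

Because of the final LayerNorm, each output row has (approximately) zero mean and unit variance, which keeps the gated residual path numerically stable across stacked GRN blocks.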
2. Model Structure: Shared Encoder and Task-Specific Decoders
2.1 Shared Temporal Encoder
The encoder fuses both short- and long-range patterns by combining several architectural elements:
- Variable Selection Network (VSN): At each time $t$, a softmax-based gating layer assigns a relevance score $\alpha_{t,i}$ to each input variable $i$:

  $$\boldsymbol{\alpha}_t = \mathrm{softmax}(W_v \boldsymbol{\Xi}_t + b_v),$$

  where $\boldsymbol{\Xi}_t$ is the flattened concatenation of the per-variable embeddings $\boldsymbol{\xi}_{t,i}$. The input embeddings are re-weighted: $\tilde{\boldsymbol{\xi}}_t = \sum_i \alpha_{t,i}\, \boldsymbol{\xi}_{t,i}$.
- Static Covariate Fusion: Static embeddings $\mathbf{e}_s$ are concatenated with every temporal representation $\tilde{\boldsymbol{\xi}}_t$ for context-aware modeling.
- Positional Encoding: Sinusoidal encodings are appended to $\tilde{\boldsymbol{\xi}}_t$ to encode temporal order.
- Multi-Head Temporal Attention: The model applies self-attention to sequences of $\tilde{\boldsymbol{\xi}}_t$ over a look-back horizon $L$:

  $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$
The attended representations are further processed by GRNs for controlled residual integration.
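The variable-selection gating and temporal self-attention described above can be sketched as follows (single-head and NumPy-based for clarity; the actual model uses multi-head attention and GRN-based gating rather than a single linear layer):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def variable_selection(xi, W_v, b_v):
    """xi: (T, n_vars, d) per-variable embeddings.
    Returns gating weights (T, n_vars) and the re-weighted sum (T, d)."""
    flat = xi.reshape(xi.shape[0], -1)           # flatten variables per timestep
    alpha = softmax(flat @ W_v + b_v)            # relevance scores, rows sum to 1
    fused = (alpha[..., None] * xi).sum(axis=1)  # weighted combination of variables
    return alpha, fused

def self_attention(H, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over a sequence H: (T, d)."""
    Q, K, V = H @ W_q, H @ W_k, H @ W_v
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (T, T) attention weights
    return A @ V, A

# Usage with illustrative shapes: T=6 timesteps, 4 variables, d=8.
rng = np.random.default_rng(1)
T, n_vars, d = 6, 4, 8
xi = rng.normal(size=(T, n_vars, d))
alpha, fused = variable_selection(xi, rng.normal(size=(n_vars * d, n_vars)), np.zeros(n_vars))
ctx, A = self_attention(fused, *(rng.normal(size=(d, d)) for _ in range(3)))
```

The returned `alpha` and `A` matrices are exactly the quantities later used for interpretability: each row is a probability distribution over input variables and over historical timesteps, respectively.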
2.2 Task-Specific Decoder Heads
The shared encoder output $\mathbf{h}_{t+h}$ is fed to three parallel heads:
- Sales Volume Head: $\hat{y}^{\mathrm{sales}}_{t+h} = W_s \mathbf{h}_{t+h} + b_s$
- Inventory Turnover Head: $\hat{y}^{\mathrm{inv}}_{t+h} = W_i \mathbf{h}_{t+h} + b_i$
- Stockout Probability Head: $\hat{p}^{\mathrm{stock}}_{t+h} = \sigma(W_p \mathbf{h}_{t+h} + b_p)$
Each head can be augmented with an additional GRN for expressiveness, supporting both regression (sales, inventory) and probabilistic (stockout) targets.
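A minimal sketch of the three heads, assuming simple linear projections of the shared encoder state (the optional GRN augmentation mentioned above is omitted):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode_heads(h, params):
    """h: (H, d) shared-encoder states across the forecast horizon.
    Returns sales and inventory point forecasts plus stockout probabilities."""
    (W_s, b_s), (W_i, b_i), (W_p, b_p) = params
    sales = h @ W_s + b_s                # regression head (real-valued)
    inventory = h @ W_i + b_i            # regression head (real-valued)
    stockout = sigmoid(h @ W_p + b_p)    # probabilistic head, output in (0, 1)
    return sales, inventory, stockout

# Usage with illustrative weights: horizon H=7, hidden dimension d=8.
rng = np.random.default_rng(2)
H, d = 7, 8
h = rng.normal(size=(H, d))
params = tuple((rng.normal(size=d), 0.0) for _ in range(3))
sales, inventory, stockout = decode_heads(h, params)
```

The sigmoid on the stockout head is what lets that target be trained with binary cross-entropy while the other two heads use squared error, as described in the next section.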
3. Multi-Task Learning Objectives and Training
3.1 Joint Loss Formulation
TFT-MTL is trained by minimizing a weighted sum of task losses:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{\mathrm{sales}} + \lambda_2 \mathcal{L}_{\mathrm{inv}} + \lambda_3 \mathcal{L}_{\mathrm{stock}}$$

- Sales loss (MSE): $\mathcal{L}_{\mathrm{sales}} = \frac{1}{N}\sum_{n} \big(y^{\mathrm{sales}}_n - \hat{y}^{\mathrm{sales}}_n\big)^2$
- Inventory loss (MSE): $\mathcal{L}_{\mathrm{inv}} = \frac{1}{N}\sum_{n} \big(y^{\mathrm{inv}}_n - \hat{y}^{\mathrm{inv}}_n\big)^2$
- Stockout loss (Binary Cross-Entropy): $\mathcal{L}_{\mathrm{stock}} = -\frac{1}{N}\sum_{n} \big[y^{\mathrm{stock}}_n \log \hat{p}^{\mathrm{stock}}_n + (1 - y^{\mathrm{stock}}_n) \log\big(1 - \hat{p}^{\mathrm{stock}}_n\big)\big]$
Task weights $\lambda_1, \lambda_2, \lambda_3$ are determined either via cross-validation or dynamic re-weighting based on gradient norms; in experiments, $\lambda_3$ is often set lower to reflect the scale of the BCE loss.
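The joint objective can be written directly; the default weights `lams` below are illustrative placeholders, not the tuned values from the paper:

```python
import numpy as np

def joint_loss(y_s, p_s, y_i, p_i, y_o, p_o, lams=(1.0, 1.0, 0.5)):
    """Weighted sum of two MSE terms (sales, inventory) and one BCE term (stockout).

    y_* are targets, p_* are predictions; lams = (lambda_1, lambda_2, lambda_3)
    would be set by cross-validation or gradient-norm re-weighting in practice."""
    l_sales = np.mean((y_s - p_s) ** 2)
    l_inv = np.mean((y_i - p_i) ** 2)
    eps = 1e-12  # guard against log(0) at saturated probabilities
    l_stock = -np.mean(y_o * np.log(p_o + eps) + (1 - y_o) * np.log(1 - p_o + eps))
    l1, l2, l3 = lams
    return l1 * l_sales + l2 * l_inv + l3 * l_stock

# Usage: perfect predictions drive the total loss to (numerically) zero.
y = np.array([1.0, 2.0, 3.0])
y_out = np.array([0.0, 1.0, 1.0])
total = joint_loss(y, y, y, y, y_out, y_out)
```

Because all three terms are averaged over the same batch, re-scaling a single `lams` entry directly trades off that task's gradient contribution against the other two, which is the mechanism the weight-selection schemes above exploit.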
3.2 Training Regimen
TFT-MTL is trained with AdamW (tuned learning rate and weight decay), batch size 64, a forecast horizon $H$, and a look-back window $L$. Dropout (0.1) is applied in GRNs and attention layers. Learning rate annealing is triggered by plateaus in validation loss, and early stopping uses the same criterion, with training typically converging between epochs 120 and 140 of a 150-epoch budget.
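A generic sketch of plateau-based annealing with early stopping, as described above; the patience values and halving factor are assumptions for illustration, not the authors' settings:

```python
class PlateauScheduler:
    """Halve the learning rate when validation loss plateaus; signal early
    stopping after `stop_patience` epochs without improvement."""

    def __init__(self, lr, anneal_patience=5, stop_patience=15, factor=0.5):
        self.lr, self.factor = lr, factor
        self.anneal_patience, self.stop_patience = anneal_patience, stop_patience
        self.best, self.bad_epochs = float("inf"), 0

    def step(self, val_loss):
        """Call once per epoch; returns (current_lr, should_stop)."""
        if val_loss < self.best - 1e-6:          # meaningful improvement
            self.best, self.bad_epochs = val_loss, 0
        else:                                    # plateau: count and maybe anneal
            self.bad_epochs += 1
            if self.bad_epochs % self.anneal_patience == 0:
                self.lr *= self.factor
        return self.lr, self.bad_epochs >= self.stop_patience

# Usage: improving losses keep the lr fixed; a long plateau anneals then stops.
sched = PlateauScheduler(1e-3)
for loss in (1.0, 0.9, 0.8):
    lr, stop = sched.step(loss)
for _ in range(15):
    lr, stop = sched.step(0.8)
```

In a real training loop the returned `lr` would be written into the AdamW optimizer's parameter groups at the start of each epoch.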
4. Empirical Performance and Comparative Results
Performance is evaluated on a held-out six-month test set using RMSE and MAPE:
| Metric | TFT-MTL | Single-Task TFT | LSTM |
|---|---|---|---|
| Sales RMSE | 42.57 | 45.36 | 54.12 |
| Sales MAPE (%) | 8.68 | 9.94 | — |
| Inventory RMSE | 39.86 | 42.57 | — |
| Inventory MAPE (%) | 8.43 | 9.63 | — |
| Multi-Task Efficiency Score (MTES) | 0.894 | 0.861 | 0.781 |
TFT-MTL achieves a 6.2% reduction in Sales RMSE, 12.7% reduction in Sales MAPE, 6.4% reduction in Inventory RMSE, and 12.4% reduction in Inventory MAPE relative to single-task TFT. The Multi-Task Efficiency Score (MTES) improvement over baselines confirms the advantage of joint modeling for multi-dimensional supply chain forecasting (Hu et al., 29 Nov 2025).
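A quick arithmetic check confirms that the reported percentage reductions follow from the table values (to within rounding):

```python
def pct_reduction(baseline, model):
    """Relative improvement of TFT-MTL over the single-task TFT baseline."""
    return (baseline - model) / baseline * 100.0

# (single-task TFT, TFT-MTL, reported reduction %) from the comparison table.
checks = {
    "sales_rmse": (45.36, 42.57, 6.2),
    "sales_mape": (9.94, 8.68, 12.7),
    "inv_rmse":   (42.57, 39.86, 6.4),
    "inv_mape":   (9.63, 8.43, 12.4),
}
for name, (base, mtl, reported) in checks.items():
    got = pct_reduction(base, mtl)
    # Allow a small tolerance for one-decimal rounding of the reported figures.
    assert abs(got - reported) < 0.15, (name, got, reported)
```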
5. Interpretability and Decision Support
TFT-MTL inherits the interpretability features of the Temporal Fusion Transformer, with two principal mechanisms:
- Variable Importance: The VSN gating weights $\alpha_{t,i}$ quantify the relevance of each feature $i$ at time $t$, enabling visualization as heatmaps to highlight salient periods or factors (e.g., discount_rate spikes during promotion events).
- Attention Attribution: Attention matrices score the influence of each historical timestep $\tau$ on predictions at $t+h$:

  $$a_{t+h,\tau} = \mathrm{softmax}_\tau\!\left(\frac{\mathbf{q}_{t+h}^\top \mathbf{k}_\tau}{\sqrt{d_k}}\right)$$
These interpretability elements deliver actionable insights for demand planning and inventory scheduling by quantifying both feature-level and temporal dependencies.
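Given the VSN weights and attention matrix produced by the encoder, feature- and timestep-level importance can be aggregated as follows (simple time-averaging is one of several reasonable summarization choices, not a prescription from the paper):

```python
import numpy as np

def feature_importance(alpha):
    """alpha: (T, n_vars) VSN gating weights, each row summing to 1.
    Averaging over time gives a global feature ranking; the raw matrix
    can be rendered directly as a time-by-feature heatmap."""
    return alpha.mean(axis=0)

def temporal_influence(A):
    """A: (T, T) attention matrix; row t scores the influence of each past
    step on the prediction at t. Column means give overall timestep weight."""
    return A.mean(axis=0)

# Usage with a toy 2x2 attention matrix (rows already sum to 1).
A = np.array([[0.5, 0.5],
              [0.2, 0.8]])
imp = feature_importance(A)     # here reused as a 2-variable gating example
infl = temporal_influence(A)
```

Because each input row is a probability distribution, both aggregates remain normalized, so they can be compared across products or reporting periods without re-scaling.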
6. Application Context and Significance
The TFT-MTL framework is deployed for joint sales and inventory forecasting in Amazon’s e-commerce supply chain. By integrating a broad spectrum of covariates and modeling multivariate temporal dependencies, it acts as both a high-accuracy predictive engine and an interpretable decision-support system for inventory optimization, replenishment, and demand management (Hu et al., 29 Nov 2025). The observed empirical improvements over conventional single-task and sequential models suggest that multi-task temporal modeling is effective for complex, operationally critical forecasting scenarios in modern supply chains.