GiftEval Benchmark for Forecasting
- GiftEval is a comprehensive benchmark that standardizes evaluation of zero-shot, multi-horizon time series forecasting across 23 real-world datasets spanning seven domains.
- It employs strict protocols—including clear train/test splits and sliding window evaluations—to ensure fair, reproducible comparisons using metrics like MAPE, CRPS, and MASE.
- The benchmark facilitates fair comparisons between statistical, deep, and foundation models, driving innovation in robust forecasting techniques and error diagnostics.
GiftEval is a large-scale, open benchmark designed to standardize and accelerate the evaluation of general-purpose time series forecasting models, with particular emphasis on zero-shot, multi-horizon, and heterogeneous-domain forecasting. Developed by Aksu et al. (2024), GiftEval targets both statistical and deep/foundation models and explicitly supports zero-shot generalization—evaluation protocols where models must forecast out-of-distribution series without any exposure to benchmark datasets during training or pretraining (Aksu et al., 2024). Its diversity, strict controls on data leakage, and breadth of included domains make it one of the most challenging and broadly adopted public evaluation suites in time series model research.
1. Benchmark Composition and Structure
GiftEval comprises 23 real-world datasets spanning seven domains: Nature, Web/CloudOps, Sales, Energy, Transport, Healthcare, and Economics/Finance. The datasets include approximately 144,000 univariate and multivariate time series, collectively covering more than 175 million observations and ten sampling frequencies, from 5-second to annual intervals (Aksu et al., 2024). Series vary in length from 50 to 14,000, reflecting realistic deployment heterogeneity.
Each dataset is divided into up to three forecasting tasks, defined by prediction horizon:
- Short horizon: Typically 8–60 time steps, tailored by frequency (e.g., 48 for hourly, 30 for daily, 12 for monthly).
- Medium horizon: Where available, 480–600 steps (e.g., 480 for hourly).
- Long horizon: 720–900 steps (e.g., 720 for hourly, 900 for 10-second data).
This yields 97 distinct (dataset, frequency, horizon) configurations, each treated as a separate zero-shot task. To guarantee a strict zero-leakage setting, any dataset used in pretraining is explicitly excluded from the evaluation split. The benchmark also offers a separate pretraining set (4.5 M series, 230 B points), curated to prevent contamination between model development and evaluation (Aksu et al., 2024).
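To illustrate how these configurations multiply into distinct zero-shot tasks, the per-frequency horizon rules can be enumerated programmatically. The sketch below is illustrative: the dataset names and the exact horizon table are assumptions, not the benchmark's actual manifest.

```python
# Illustrative horizon table: frequency -> (short, medium, long) prediction
# lengths, following the examples in the text; None = horizon not defined.
HORIZONS = {
    "hourly": (48, 480, 720),
    "daily": (30, None, None),
    "monthly": (12, None, None),
    "10s": (60, 600, 900),
}

def enumerate_tasks(datasets):
    """Yield one zero-shot task per (dataset, frequency, horizon) triple."""
    for name, freq in datasets:
        for label, horizon in zip(("short", "medium", "long"), HORIZONS[freq]):
            if horizon is not None:
                yield (name, freq, label, horizon)

# Hypothetical dataset list: an hourly series contributes three tasks,
# a monthly one only a short-horizon task.
tasks = list(enumerate_tasks([("electricity", "hourly"), ("m4_monthly", "monthly")]))
```

Each resulting tuple, e.g. `("electricity", "hourly", "short", 48)`, is treated as a separate evaluation setting.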
A table summarizing domain composition:
| Domain | Datasets | Series | Observations |
|---|---|---|---|
| Econ/Fin | 1 (M4) | ~100,000 | 25.3M |
| Energy | 3 | 2,036 | 74.1M |
| Web/CloudOps | 6 | 3,524 | 16.6M |
| Sales | 2 | 3,717 | 0.67M |
| Transport | 3 | 1,341 | 38.0M |
| Nature | 6 | 32,618 | 3.15M |
| Healthcare | 3 | 1,036 | 0.13M |
2. Evaluation Protocols and Metrics
The benchmark enforces strict protocols to ensure comparability and fairness:
- Data splits: Each series is partitioned, reserving the final 10% for test, with the preceding window as validation for hyperparameter tuning.
- Windowing: Sliding, non-overlapping windows according to each task’s prediction length; every window yields a forecast-error pair.
Primary evaluation metrics are:
- Median Absolute Percentage Error (MAPE): aggregated as the median over all forecast windows and normalized against the Seasonal Naive baseline.
- CRPS (Continuous Ranked Probability Score): approximated by the mean quantile (pinball) loss over a grid of quantile levels,

$$\mathrm{CRPS} \approx \frac{2}{|Q|} \sum_{q \in Q} \frac{1}{H} \sum_{t=1}^{H} \Lambda_q\!\big(\hat{y}_t^{(q)}, y_t\big), \qquad \Lambda_q(\hat{z}, y) = \big(q - \mathbb{1}\{y < \hat{z}\}\big)\,(y - \hat{z}),$$

with forecast quantiles $\hat{y}_t^{(q)}$ and quantile levels $Q = \{0.1, 0.2, \ldots, 0.9\}$. All scores are reported as ratios to the Seasonal Naive forecaster's metric values.
- MASE (Mean Absolute Scaled Error): as employed in downstream studies (Auer et al., 29 May 2025; Oreshkin et al., 2 Jan 2026), MASE scales the absolute forecast error by that of the in-sample seasonal naive:

$$\mathrm{MASE} = \frac{\frac{1}{H} \sum_{t=1}^{H} \lvert y_t - \hat{y}_t \rvert}{\frac{1}{T-m} \sum_{t=m+1}^{T} \lvert y_t - y_{t-m} \rvert},$$

where $m$ is the seasonal period, $H$ the forecast horizon, and $T$ the training-series length.
- Scaled CRPS ("sCRPS", Editor's term): Ratio version of CRPS used for geometric averaging and model comparison, normalized by (Oreshkin et al., 2 Jan 2026).
- Aggregation & ranking: Each model receives a rank averaged over the 97 settings, circumventing domination by any single dataset.
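The MASE and quantile-loss CRPS definitions above can be sketched as minimal reference implementations following the standard formulas; the benchmark's actual code may differ in aggregation details and in the normalization against Seasonal Naive:

```python
import numpy as np

def mase(y_true, y_pred, y_train, m):
    """Mean Absolute Scaled Error: forecast MAE scaled by the
    in-sample seasonal-naive MAE with seasonal period m."""
    scale = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(y_true - y_pred)) / scale

def crps_quantile(y_true, q_preds, quantiles):
    """Approximate CRPS as twice the mean pinball loss over quantile
    levels. q_preds has shape (len(quantiles), horizon)."""
    losses = []
    for q, pred in zip(quantiles, q_preds):
        diff = y_true - pred
        losses.append(np.mean(np.maximum(q * diff, (q - 1) * diff)))
    return 2 * np.mean(losses)

# Toy check: on this period-2 series the seasonal-naive forecast's
# out-of-sample MAE equals the in-sample scale, so MASE = 1.
y_train = np.array([10., 12., 11., 13., 12., 14.])
y_true = np.array([13., 15.])
naive = y_train[-2:]                      # seasonal naive, m = 2
print(mase(y_true, naive, y_train, m=2))  # → 1.0
```

A perfect quantile forecast (every quantile prediction equal to the realized values) gives zero pinball loss and hence `crps_quantile(...) == 0`.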
3. Baseline Models and Leaderboard
GiftEval evaluates a broad suite of statistical, deep learning, and foundation models under uniform, zero-shot protocols:
- Statistical (5 models): Naive, Seasonal Naive, AutoETS, AutoARIMA, AutoTheta; all are fitted per series using the statsforecast framework.
- Deep Learning (8 models): DeepAR, TFT, TiDE, N-BEATS, PatchTST, DLinear, Crossformer, iTransformer; probabilistic heads configured via GluonTS.
- Foundation (4 models): TimesFM, Chronos-T5, Moirai (the only multivariate-capable foundation model), VisionTS.
Model context lengths, patch sizes, and multivariate supports are selected to match each task’s characteristics. For each configuration, models are compared by normalized CRPS and MAPE ranks. PatchTST and Moirai-large achieve top overall performance, with PatchTST generally excelling at high-frequency and longer-horizon tasks, while foundation models (notably Moirai) dominate on low-frequency and domain-specific long-term prediction (Aksu et al., 2024).
Tables of model leaders across horizons and domains:

| Horizon | Best | 2nd Best |
|---|---|---|
| Short | PatchTST | Chronos-base |
| Medium | PatchTST | Chronos-base |
| Long | PatchTST | Chronos-base |

| Domain | Best | 2nd Best |
|---|---|---|
| Energy | Moirai-large | Chronos-base |
| Web/CloudOps | PatchTST | TFT |
| Econ/Fin | Moirai-large | Moirai-base |
4. Zero-Shot Generalization and Simulator-Based Protocols
GiftEval explicitly supports zero-shot evaluations, preventing any exposure of model weights to evaluation data. Recent work (Oreshkin et al., 2 Jan 2026) demonstrates that models trained exclusively on synthetic series generated by SARIMA-based simulators can achieve strong zero-shot generalization on GiftEval. In particular, SarSim0—a fast, on-the-fly SARIMA generator equipped with stability-region sampling, multi-seasonality superposition, and heavy-tailed noise models—enables foundation and neural models (e.g., N-BEATS, PatchTST, Chronos) to outperform even the statistical AutoARIMA processes responsible for generating the synthetic training data, a phenomenon characterized as "student-beats-teacher" (Oreshkin et al., 2 Jan 2026). This suggests that structural diversity and scale in simulation can substitute for real-world data in pretraining, at least for univariate cases.
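A toy generator in the spirit of the SARIMA-based simulation recipe described above (stability-region sampling, multi-seasonality superposition, heavy-tailed noise) might look as follows. This is not the SarSim0 implementation; all parameter ranges and period choices are illustrative assumptions:

```python
import numpy as np

def simulate_series(n=512, seed=0):
    """Generate one synthetic series: a stable AR(1) backbone plus two
    superposed seasonalities and Student-t (heavy-tailed) innovations."""
    rng = np.random.default_rng(seed)
    phi = rng.uniform(-0.95, 0.95)            # AR(1) coefficient inside |phi| < 1
    periods = rng.choice([7, 12, 24, 168], size=2, replace=False)
    amps = rng.uniform(0.5, 2.0, size=2)
    noise = rng.standard_t(df=3, size=n)      # heavy-tailed innovations
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = phi * y[t - 1] + noise[t]      # autoregressive recursion
    t = np.arange(n)
    for p, a in zip(periods, amps):
        y = y + a * np.sin(2 * np.pi * t / p)  # multi-seasonality superposition
    return y

series = simulate_series()
```

Sampling `phi` strictly inside the unit interval keeps every generated process stationary, mirroring the stability-region sampling idea; a full SARIMA generator would also sample MA and seasonal AR/MA polynomials under analogous constraints.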
Aggregated (sCRPS, MASE) results from simulator-trained models:
| Model | sCRPS | MASE | Strict ZS? |
|---|---|---|---|
| PatchTST-SarSim0 | 0.573 | 0.837 | ✔ |
| N-BEATS-SarSim0 | 0.602 | 0.849 | ✔ |
| AutoARIMA (baseline) | 0.912 | 1.074 | ✘ |
5. Methodological Insights and Error Diagnostics
GiftEval’s heterogeneous nature exposes model vulnerabilities in both quantitative metrics and qualitative behaviors. The benchmark reveals that:
- Foundation models tend to excel in domains exhibiting strong trend or periodicity (low entropy)—energy, economics, low-frequency long-term forecasting; they degrade on high-frequency, high-entropy data (transport, WebOps).
- Deep learning models (PatchTST, TFT) are superior at high-frequency, volatile, or noisy regimes, particularly for short and medium horizons.
- Models that rely purely on patch-based or transformer architectures can suffer from quantile collapse and error accumulation over long horizons, motivating hybrid architectures (e.g., sLSTM-based TiRex (Auer et al., 29 May 2025)).
- State-tracking and contiguous patch masking in recurrent models are critical for maintaining coherent uncertainty propagation, as shown by TiRex’s ablations on GiftEval: disabling patch masking or state tracking degrades long-horizon CRPS by up to 14% (Auer et al., 29 May 2025).
Representative error analysis from TiRex and other baselines identified typical failure modes:
- Quantile collapse: foundation models can underestimate uncertainty, resulting in too-narrow prediction bands on long-range forecasts.
- Spike miss: models not trained with rare-event augmentations may smooth away or entirely miss sharp transients typical in electrical load data.
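Quantile collapse can be screened for with a simple coverage diagnostic: compare the empirical coverage of a prediction band with its nominal level, and flag persistent under-coverage. The data and tolerance below are illustrative choices, not values from the cited papers:

```python
import numpy as np

def band_coverage(y_true, lo, hi):
    """Fraction of true values falling inside the [lo, hi] band."""
    return np.mean((y_true >= lo) & (y_true <= hi))

# Toy example: two of four points fall outside the band.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
lo = np.array([1.5, 1.8, 3.2, 3.5])
hi = np.array([2.0, 2.3, 3.4, 4.5])

cov = band_coverage(y_true, lo, hi)   # 0.5 empirical coverage
nominal = 0.8                         # e.g. a 10%-90% quantile band
collapsed = cov < nominal - 0.1       # flag substantial under-coverage
```

On long-horizon tasks, tracking this gap per lead time shows whether bands narrow unrealistically as the horizon grows.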
6. Practical Usage, Reproducibility, and Leaderboard
GiftEval provides a standard codebase supporting data download (Arrow format), train/test split replication, and wrapper interfaces for all benchmarked models. Preprocessing scripts, sliding window evaluators, and leaderboard generation tools are integrated for reproducibility (Aksu et al., 2024). Strict guidelines ensure fair comparison, including context-length matching and patch-size heuristics by task frequency. Practitioners are encouraged to verify strict zero-leakage when adapting their own models and to adopt the aggregation methodology to avoid dataset-dominated rankings.
Official resources, including data and code, are available at https://github.com/SalesforceAIResearch/gift-eval (Aksu et al., 2024).
7. Emerging Themes and Future Directions
The breadth of GiftEval enables in-depth study of scaling laws, model architecture specialization, and synthetic-to-real generalization. Empirical evidence suggests that anticipated scaling benefits (larger foundation models outperforming smaller ones) only partially apply; performance gains are not uniform across domains or frequencies. Recommendations from benchmark analyses include:
- Pretraining on higher-frequency, noisier series to improve performance on volatile targets.
- Incorporation of long-tail and high-entropy series in pretraining datasets.
- Development of hybrid models integrating patch-based, recurrent, and simulation-informed components to robustly handle diverse real-world stationarities and event structures (Aksu et al., 2024, Auer et al., 29 May 2025, Oreshkin et al., 2 Jan 2026).
GiftEval’s rapidly evolving leaderboard and ongoing synthetic protocol studies position it as a pivotal reference for evaluating and advancing the next generation of general-purpose and zero-shot time series forecasting models.