What-If TSF: Scenario-Guided Forecasting
- What-If TSF is a conditional time series forecasting paradigm that integrates numerical history with structured text scenarios for counterfactual predictions.
- Its benchmark curates 5,352 samples across politics, society, energy, and economy to test models with multimodal, scenario-guided inputs.
- The methodology employs cross-attention fusion and causal interventions like activation transplantation to significantly enhance directional forecast accuracy.
What-If TSF (WIT) designates the task and benchmarking paradigm in which time series forecasting models produce scenario-guided, conditional forecasts, that is, predictions that account for explicit hypothetical interventions or contextual guidance such as textual “what-if” scenarios. Motivated by operational needs in finance, policy, energy, and infrastructure, WIT goes beyond historical extrapolation to evaluate whether models can condition on plausible or counterfactual context and simulate its consequences accurately. Recent research formalizes WIT both as a multimodal benchmarking suite and as a causal intervention methodology within deep foundation models, opening new directions for semantic scenario analysis, interpretability, and strategic decision making (Jang et al., 13 Jan 2026, Sanyal et al., 6 Sep 2025, Perez-Diaz et al., 22 Oct 2025, Xu et al., 2023, Wexler et al., 2019, Zheng et al., 2021).
1. Formal Definition and Motivating Problem Statement
Standard time series forecasting models operate on pure numerical history:
$$\hat{x}_{T+1:T+h} = f(x_{1:T}),$$
where $x_{1:T}$ is the past and $f$ produces predictions for a fixed horizon $h$. In practice, human experts and decision-makers require forecasts that answer counterfactual queries, e.g., “What happens if a policy is enacted tomorrow?” or “How does demand respond to an economic shock?” WIT recasts the prediction function as jointly dependent on historical data and a contextual scenario $c$:
$$\hat{x}_{T+1:T+h} = f(x_{1:T}, c).$$
Here, the scenario $c$ may be multimodal. It is often structured as a domain description $S$, historical context $H$, and a future outlook or intervention $F$, which can be plausible ($F^{\mathrm{pl}}$) or counterfactual ($F^{\mathrm{cf}}$) (Jang et al., 13 Jan 2026). This conditional setup tests whether models recognize external guidance and alter their predictions accordingly, rather than simply extrapolating past observations.
2. Benchmark Construction and Textual Scenario Integration
The “What If TSF” benchmark (Jang et al., 13 Jan 2026) establishes rigorous evaluation by assembling 5,352 samples spanning politics, society, energy, and economy domains. Each sample comprises:
- Numerical history ($x_{1:T}$) of appropriate length per domain
- A structured scenario consisting of:
  - Domain/variable description ($S$)
  - Historical context ($H$), summarizing major events
  - Future scenario ($F$), crafted to reflect plausible or counterfactual developments
All scenarios are curated by (i) raw event/news collection, (ii) LLM-based summarization and de-identification, and (iii) expert verification to ensure coherence and prevent information leakage. Counterfactuals are generated by minimal edits and cross-checked algorithmically and by domain experts. The resulting format supports both short-term and long-term forecasting, with explicit directional targets.
| Domain | History length | Short-term (ST) horizon | Long-term (LT) horizon | Counterfactual horizon |
|---|---|---|---|---|
| Politics | 8 | 1 | 4 | 1 |
| Society | 8 | 1 | 4 | 1 |
| Energy | 30 | 5 | 20 | 5 |
| Economy | 90 | 5 | 30 | 5 |
This scenario architecture is explicitly multimodal and tests the model's capacity for context-dependent forecasting.
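As a minimal sketch of how one benchmark sample could be represented in code, the following dataclass bundles the numerical history with the three text fields and checks the history length against the table above. The class name `WhatIfSample`, the field names, the dictionary keys, and the strict length check are illustrative assumptions, not the benchmark's published schema.

```python
from dataclasses import dataclass
from typing import Dict, List

# Lengths and horizons copied from the table above.
# The key names ("history", "short", ...) are hypothetical.
HORIZONS: Dict[str, Dict[str, int]] = {
    "politics": {"history": 8, "short": 1, "long": 4, "counterfactual": 1},
    "society": {"history": 8, "short": 1, "long": 4, "counterfactual": 1},
    "energy": {"history": 30, "short": 5, "long": 20, "counterfactual": 5},
    "economy": {"history": 90, "short": 5, "long": 30, "counterfactual": 5},
}


@dataclass
class WhatIfSample:
    """One scenario-guided forecasting sample (hypothetical layout)."""

    domain: str
    history: List[float]       # numerical history
    description: str           # domain/variable description
    historical_context: str    # summary of major past events
    future_scenario: str       # plausible or counterfactual outlook
    is_counterfactual: bool = False

    def validate(self) -> None:
        """Raise ValueError if the history length does not match the domain."""
        expected = HORIZONS[self.domain]["history"]
        if len(self.history) != expected:
            raise ValueError(
                f"{self.domain}: expected {expected} history points, "
                f"got {len(self.history)}"
            )

    def horizon(self, task: str) -> int:
        """Return the forecast horizon for 'short', 'long', or 'counterfactual'."""
        return HORIZONS[self.domain][task]


sample = WhatIfSample(
    domain="energy",
    history=[1.0] * 30,
    description="Daily electricity demand (placeholder text).",
    historical_context="Placeholder summary of recent events.",
    future_scenario="Placeholder what-if statement.",
)
sample.validate()
print(sample.horizon("long"))
```

Keeping the three text fields separate, rather than storing one concatenated prompt, makes it straightforward to drop individual components in ablation runs.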
3. Model Classes and Conditional Forecasting Mechanisms
The WIT protocol admits a range of model classes. LLM-based forecasters are prompted as zero-shot predictors, with the numerical history and scenario text concatenated in the prompt (Jang et al., 13 Jan 2026). Time series foundation models (TSFMs; e.g., Chronos-Bolt-Base) and classical methods (ARIMA, ETS, Holt-Winters) serve as unimodal baselines. Leading models fuse the two modalities via cross-attention.
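A generic scaled dot-product cross-attention layer, $\mathrm{softmax}(QK^\top/\sqrt{d_k})\,V$, illustrates how such fusion can work; it is not necessarily the exact formulation used by the benchmarked models. The sketch below assumes that time series embeddings act as queries and scenario-text embeddings supply keys and values, and that the result is added back through a residual connection. The function name, weight shapes, and random inputs are all illustrative.

```python
import numpy as np


def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention_fuse(ts_emb, text_emb, w_q, w_k, w_v):
    """Fuse text into time series embeddings with single-head cross-attention.

    ts_emb:   (T, d_ts)   time series token embeddings (queries)
    text_emb: (L, d_txt)  scenario text token embeddings (keys and values)
    w_q: (d_ts, d_k), w_k: (d_txt, d_k), w_v: (d_txt, d_ts)
    """
    q = ts_emb @ w_q                              # (T, d_k)
    k = text_emb @ w_k                            # (L, d_k)
    v = text_emb @ w_v                            # (L, d_ts)
    scores = q @ k.T / np.sqrt(q.shape[-1])       # (T, L)
    weights = softmax(scores, axis=-1)            # each row sums to 1
    fused = ts_emb + weights @ v                  # residual connection
    return fused, weights


rng = np.random.default_rng(0)
ts_emb = rng.standard_normal((8, 16))     # 8 time steps
text_emb = rng.standard_normal((5, 32))   # 5 text tokens
w_q = rng.standard_normal((16, 16)) * 0.1
w_k = rng.standard_normal((32, 16)) * 0.1
w_v = rng.standard_normal((32, 16)) * 0.1

fused, weights = cross_attention_fuse(ts_emb, text_emb, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

The attention weights have one row per time step and one column per text token, so they can also be inspected to see which parts of the scenario text each step attends to.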
Ablations systematically evaluate the incremental value of the textual description, historical context, and future scenario. Empirically, textual future scenarios ($F$) produce the largest uplift in directional accuracy (+25–30pp), whereas historical context alone yields minimal gains (Jang et al., 13 Jan 2026). LLMs (e.g., GPT-4o, Gemma-3-27B, Qwen2.5) consistently outperform state-space models and TSFMs in the scenario-guided setting, particularly on explicit counterfactual tasks.
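The ablation design can be sketched as a prompt builder that includes or omits each text component. The prompt wording, the function name `build_prompt`, and the ablation list are hypothetical; the benchmark's actual prompt template is not given here. The labels S, H, and F mirror the article's "S+H+F" shorthand for description, historical context, and future scenario.

```python
from typing import List, Sequence, Tuple


def build_prompt(
    history: Sequence[float],
    horizon: int,
    description: str = "",
    historical_context: str = "",
    future_scenario: str = "",
    use: Tuple[str, ...] = ("S", "H", "F"),
) -> str:
    """Assemble a zero-shot prompt from the selected scenario components."""
    parts: List[str] = []
    if "S" in use and description:
        parts.append(f"Description: {description}")
    if "H" in use and historical_context:
        parts.append(f"Historical context: {historical_context}")
    if "F" in use and future_scenario:
        parts.append(f"Future scenario: {future_scenario}")
    parts.append("History: " + ", ".join(f"{v:g}" for v in history))
    parts.append(
        f"Forecast the next {horizon} value(s) and label the direction "
        "as rise, fall, or unchanged."
    )
    return "\n".join(parts)


# Hypothetical ablation grid: numeric only, then progressively more text.
ABLATIONS = [(), ("S",), ("S", "H"), ("S", "F"), ("S", "H", "F")]

for use in ABLATIONS:
    prompt = build_prompt(
        history=[41.0, 42.5, 43.0],
        horizon=1,
        description="Approval rating of a governing party (placeholder).",
        historical_context="Placeholder summary of past events.",
        future_scenario="Placeholder what-if statement.",
        use=use,
    )
    print("+".join(use) or "numeric only", "->", len(prompt.splitlines()), "lines")
```

Comparing a model's directional accuracy across these prompt variants isolates how much each text component contributes.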
4. Causal Intervention and Semantic “What-If” Algorithms
Going beyond scenario input conditioning, recent work investigates direct causal manipulation within foundation models (Sanyal et al., 6 Sep 2025). The “activation transplantation” methodology formalizes semantic intervention at the level of hidden states. Given a pretrained TSFM (Chronos or Toto), the intervention algorithm proceeds:
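- Compute layer-$\ell$ activation moments for a style event $s$: the mean $\mu_s^{(\ell)}$ and standard deviation $\sigma_s^{(\ell)}$ of its activations $A_s^{(\ell)}$
- Extract the same moments from the target context $t$: $\mu_t^{(\ell)}$ and $\sigma_t^{(\ell)}$ of its activations $A_t^{(\ell)}$
- Standardize and transplant: $\tilde{A}_t^{(\ell)} = \sigma_s^{(\ell)} \, \dfrac{A_t^{(\ell)} - \mu_t^{(\ell)}}{\sigma_t^{(\ell)}} + \mu_s^{(\ell)}$
The modified activation sequence $\tilde{A}_t^{(\ell)}$ is then injected at layer $\ell$, determining the subsequent forecast trajectory. This technique deterministically induces regime shifts (e.g., from calm to crash), and the Euclidean norm of the transplanted latent vector modulates event severity, with a correlation observed between the norm and the forecast decline (Sanyal et al., 6 Sep 2025).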
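The standardize-and-transplant step can be illustrated with plain arrays: the target activations are normalized with their own mean and standard deviation, then rescaled and shifted to the moments of the style event. This is a sketch under stated assumptions, not the authors' implementation. It assumes activations of shape (tokens, hidden), moments taken over the token axis, a small `eps` for numerical stability, and a hypothetical `strength` knob that blends between the original and transplanted activations.

```python
import numpy as np


def moments(acts: np.ndarray, eps: float = 1e-6):
    """Per-hidden-unit mean and standard deviation over the token axis."""
    mu = acts.mean(axis=0, keepdims=True)
    sigma = acts.std(axis=0, keepdims=True) + eps
    return mu, sigma


def transplant(target_acts: np.ndarray, style_acts: np.ndarray,
               strength: float = 1.0) -> np.ndarray:
    """Give target activations the first two moments of the style activations.

    strength=0 returns the target unchanged; strength=1 applies the full
    transplant (this blending knob is an illustrative assumption).
    """
    mu_s, sd_s = moments(style_acts)
    mu_t, sd_t = moments(target_acts)
    standardized = (target_acts - mu_t) / sd_t    # zero mean, unit scale
    transplanted = standardized * sd_s + mu_s     # style moments applied
    return (1.0 - strength) * target_acts + strength * transplanted


rng = np.random.default_rng(0)
calm = rng.normal(loc=0.0, scale=0.5, size=(64, 32))    # stand-in "calm" context
crash = rng.normal(loc=-2.0, scale=3.0, size=(48, 32))  # stand-in "crash" event

modified = transplant(calm, crash)
print(np.allclose(modified.mean(axis=0), crash.mean(axis=0)))
```

In a PyTorch model, a function like this could be applied inside a forward hook (for example one registered with `torch.nn.Module.register_forward_hook`) on the chosen layer; how Chronos or Toto expose their layers is not covered here.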
5. Evaluation Metrics, Experimental Outcomes, and Scenario Sensitivity
WIT experiments employ quantitative metrics: Mean Squared Error (MSE), 3-way directional accuracy (“rise,” “fall,” “unchanged”), and quantile loss for probabilistic assessment. Key findings include:
- Unimodal TSFMs (Chronos): Politics ST, MSE ≈ 18, accuracy ≈ 53%
- LLMs with full scenario context (S+H+F): Politics ST, MSE = 13.5, accuracy = 91.9%, counterfactual accuracy up to 98.3% (Jang et al., 13 Jan 2026)
- Scenario ablation: future scenario context ($F$) is essential, while historical context ($H$) alone offers limited uplift
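The three metric families named above can be sketched in a few lines. The direction rule here is an assumption: each value is compared with the last observed value, with a tolerance `tol` defining "unchanged". The benchmark's exact labeling rule is not specified in this text. The quantile loss is the standard pinball loss.

```python
import numpy as np


def mse(y_true, y_pred) -> float:
    """Mean squared error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))


def direction(last_value: float, future_value: float, tol: float = 0.0) -> str:
    """Three-way label relative to the last observed value (assumed rule)."""
    delta = future_value - last_value
    if delta > tol:
        return "rise"
    if delta < -tol:
        return "fall"
    return "unchanged"


def directional_accuracy(last_values, y_true, y_pred, tol: float = 0.0) -> float:
    """Share of samples whose predicted direction matches the true direction."""
    hits = [
        direction(last, true, tol) == direction(last, pred, tol)
        for last, true, pred in zip(last_values, y_true, y_pred)
    ]
    return sum(hits) / len(hits)


def quantile_loss(y_true, y_pred_q, q: float) -> float:
    """Pinball loss for a forecast of the q-th quantile."""
    diff = np.asarray(y_true, float) - np.asarray(y_pred_q, float)
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))


last = [10.0, 10.0, 10.0]
truth = [12.0, 8.0, 10.0]
pred = [11.0, 9.0, 12.0]
print(mse(truth, pred))
print(directional_accuracy(last, truth, pred))
print(quantile_loss(truth, pred, q=0.9))
```

Directional accuracy can be high while MSE is poor, and vice versa, which is why the two are reported side by side.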
Activation transplantation validates conditional control within TSFMs:
- Inducing crash semantics produces forecast drops (ΔForecast₉₀ = −0.12)
- Injecting calm restores stability (+18% forecast swing, variance reduction = 20%)
- Severity control follows a nearly linear dose-response relationship, and uncertainty scales with intervention strength (Sanyal et al., 6 Sep 2025)
6. Mechanisms for Path-Dependent Event Probability and Forecast Form
Time series foundation models support diverse forecast forms, but path-dependent event probabilities (e.g., threshold crossings) are identifiable only from joint trajectory ensembles, not from marginal distributions: per-step marginals fix the distribution at each horizon step but say nothing about dependence across steps. Sklar’s theorem underpins this non-identifiability, since infinitely many joint laws share the same marginals yet diverge on path-level events (Perez-Diaz et al., 22 Oct 2025). Valid scenario simulation and risk analysis therefore require forecast forms that preserve temporal dependence, and evaluation must align with operational requirements using joint metrics (Energy Score, Variogram Score, Integrated Brier Score).
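A small simulation makes the non-identifiability concrete. The two ensembles below are constructed examples, not taken from the cited work: both have standard normal marginals at every step, but one has independent steps and the other repeats a single draw across all steps. The horizon length, threshold, and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_paths, n_steps, threshold = 200_000, 5, 1.0

# Ensemble A: steps are independent standard normals.
independent = rng.standard_normal((n_paths, n_steps))

# Ensemble B: one standard normal draw repeated at every step
# (perfectly dependent steps, identical marginals).
shared = rng.standard_normal((n_paths, 1))
comonotone = np.repeat(shared, n_steps, axis=1)


def crossing_probability(paths: np.ndarray, level: float) -> float:
    """Probability that a path exceeds `level` at any step."""
    return float(np.mean(paths.max(axis=1) > level))


# Per-step marginals agree (mean near 0, standard deviation near 1) ...
print(independent.mean(axis=0).round(2), independent.std(axis=0).round(2))
print(comonotone.mean(axis=0).round(2), comonotone.std(axis=0).round(2))

# ... but the path-level event probability does not.
p_independent = crossing_probability(independent, threshold)
p_comonotone = crossing_probability(comonotone, threshold)
print(p_independent, p_comonotone)
```

Analytically, the crossing probability is $1 - \Phi(1)^5$ (about 0.58) for the independent ensemble and $1 - \Phi(1)$ (about 0.16) for the fully dependent one, so a forecast that reports only per-step quantiles cannot distinguish the two cases.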
7. Interpretability, Limitations, and Future Extensions
WIT marks a transition from post-hoc attribution (e.g., saliency maps) to proactive, semantic scenario simulation, directly probing model causal mechanisms (Sanyal et al., 6 Sep 2025). Limitations include:
- Focus on directional/trend labels, not fully probabilistic or quantile forecasts (Jang et al., 13 Jan 2026)
- High curation cost for high-fidelity scenario text limits scalability
- LLMs may violate semantic range (e.g., producing out-of-bounds values)
- TraffNet and related models require real-time or accurate estimators of causal features (e.g., origin-destination (OD) demands) for traffic assignment (Xu et al., 2023)
Proposed extensions target broader modalities (images, structured metadata), scalable scenario generation pipelines (end-to-end LLMs), and enhanced integration of causal intervention techniques. The domain is converging on semantic, context-aware, risk-sensitive forecasting frameworks capable of emulating expert human scenario reasoning.
In summary, “What-If TSF” defines the task, benchmark, and methodology for scenario-guided, conditional prediction in time series models, operationalized through curated multimodal benchmarks, causal intervention mechanisms, and evaluation protocols. Together these set a standard for semantic, proactive forecasting in high-stakes domains.