
What-If TSF: Scenario-Guided Forecasting

Updated 20 January 2026
  • What-If TSF is a conditional time series forecasting paradigm that integrates numerical history with structured text scenarios for counterfactual predictions.
  • The benchmark comprises 5,352 samples across the politics, society, energy, and economy domains and tests models on multimodal, scenario-guided inputs.
  • Models fuse series and text through cross-attention, and future-scenario text yields the largest gains in directional accuracy; related work applies causal interventions such as activation transplantation to steer forecasts inside foundation models.

What-If TSF (WIT) designates both a task and a benchmarking paradigm in which time series forecasting models produce scenario-guided, conditional forecasts: predictions that account for explicit hypothetical interventions or contextual guidance, such as textual “what-if” scenarios. Motivated by operational needs in finance, policy, energy, and infrastructure, WIT goes beyond historical extrapolation to evaluate whether models can condition on plausible or counterfactual context and simulate its consequences accurately. Recent research formalizes WIT both as a multimodal benchmarking suite and as a causal intervention methodology within deep foundation models, opening new directions for semantic scenario analysis, interpretability, and strategic decision making (Jang et al., 13 Jan 2026, Sanyal et al., 6 Sep 2025, Perez-Diaz et al., 22 Oct 2025, Xu et al., 2023, Wexler et al., 2019, Zheng et al., 2021).

1. Formal Definition and Motivating Problem Statement

Standard time series forecasting models operate on pure numerical history: $\hat y_{t+1:t+h} = f(x_{1:t})$, where $(x_1, \ldots, x_t)$ is the past and $f$ produces predictions for a fixed horizon $h$. In practice, human experts and decision-makers require forecasts answering counterfactual queries, e.g., “What happens if a policy is enacted tomorrow?” or “How does demand respond to an economic shock?” WIT recasts the prediction function as jointly dependent on historical data and a contextual scenario $s$: $\hat y_{t+1:t+h} = f(x_{1:t}, s)$. Here, the scenario $s$ may be multimodal. It is often structured as a domain description $S$, historical context $H$, and a future outlook or intervention $F$, which can be plausible ($F^{\mathrm{pl}}$) or counterfactual ($F^{\mathrm{cf}}$) (Jang et al., 13 Jan 2026). This conditional setup rigorously tests whether models recognize external guidance and can alter their predictions accordingly, rather than simply extrapolating past observations.

2. Benchmark Construction and Textual Scenario Integration

The “What If TSF” benchmark (Jang et al., 13 Jan 2026) establishes rigorous evaluation by assembling 5,352 samples spanning politics, society, energy, and economy domains. Each sample comprises:

  • Numerical history ($x_{1:t}$) of appropriate length per domain
  • A structured scenario consisting of:
    • Domain/variable description ($S$)
    • Historical context ($H$), summarizing major events
    • Future scenario ($F$), crafted to reflect plausible or counterfactual developments

All scenarios are curated by (i) raw event/news collection, (ii) LLM-based summarization and de-identification, and (iii) expert verification to ensure coherence and prevent information leakage. Counterfactuals are generated by minimal edits and cross-checked algorithmically and by domain experts. The resulting format supports both short-term and long-term forecasting, with explicit directional targets.

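| Domain   | History length | Short-term (ST) horizon | Long-term (LT) horizon | Counterfactual horizon |
|----------|----------------|-------------------------|------------------------|------------------------|
| Politics | 8              | 1                       | 4                      | 1                      |
| Society  | 8              | 1                       | 4                      | 1                      |
| Energy   | 30             | 5                       | 20                     | 5                      |
| Economy  | 90             | 5                       | 30                     | 5                      |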

This scenario architecture is explicitly multimodal and tests the model's capacity for context-dependent forecasting.
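As an illustration of the sample format and the horizon table above, here is a minimal Python sketch. The class name `WhatIfSample`, its field names, and the rule that counterfactual samples always use the counterfactual horizon are assumptions made for this example; the benchmark's actual data schema is not given in the source.

```python
from dataclasses import dataclass
from typing import List

# (history length, short-term, long-term, counterfactual horizon),
# copied from the table above.
HORIZONS = {
    "politics": (8, 1, 4, 1),
    "society": (8, 1, 4, 1),
    "energy": (30, 5, 20, 5),
    "economy": (90, 5, 30, 5),
}


@dataclass
class WhatIfSample:
    """Hypothetical container for one sample; field names are assumptions."""
    domain: str
    history: List[float]        # numerical history x_{1:t}
    description: str            # S: domain/variable description
    historical_context: str     # H: summary of major past events
    future_scenario: str        # F: plausible or counterfactual outlook
    is_counterfactual: bool = False

    def horizon(self, task: str = "short") -> int:
        """Return the forecast horizon for this sample and task."""
        hist_len, short, long_, counterfactual = HORIZONS[self.domain]
        if len(self.history) != hist_len:
            raise ValueError(
                f"{self.domain} expects {hist_len} history points, "
                f"got {len(self.history)}"
            )
        if self.is_counterfactual:
            # Assumption: counterfactual samples use the dedicated column.
            return counterfactual
        return short if task == "short" else long_


sample = WhatIfSample(
    domain="energy",
    history=[1.0] * 30,
    description="Placeholder variable description",
    historical_context="Placeholder historical context",
    future_scenario="Placeholder future scenario",
)
print(sample.horizon("short"), sample.horizon("long"))
```

Keeping $S$, $H$, and $F$ as separate fields makes it straightforward to drop one of them when running ablations.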

3. Model Classes and Conditional Forecasting Mechanisms

The WIT protocol admits a range of model classes. LLM-based forecasters are prompted as zero-shot predictors on the concatenation of the numerical history and the scenario text (Jang et al., 13 Jan 2026). Time series foundation models (TSFMs; e.g., Chronos-Bolt-Base) and classical methods (ARIMA, ETS, Holt-Winters) serve as unimodal baselines. Leading models fuse modalities via cross-attention. The series and the scenario text are first encoded separately:

$$h_{\mathrm{ts}} = \mathrm{Enc}_{\mathrm{series}}(x_{1:t}), \qquad h_{\mathrm{text}} = \mathrm{Enc}_{\mathrm{text}}(S, H, F)$$

The fused representation $h_{\mathrm{fused}}$ is then decoded into the forecast:

$$h_{\mathrm{fused}} = \mathrm{Attention}(h_{\mathrm{ts}}, h_{\mathrm{text}}), \qquad \hat y_{t+1:t+h} = \mathrm{Dec}(h_{\mathrm{fused}})$$
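To make the fusion step concrete, the following NumPy sketch implements single-head scaled dot-product cross-attention in which series tokens act as queries and text tokens as keys and values. The random projection matrices, the residual connection, and all dimensions are assumptions for illustration; the source does not specify the attention variant used by any particular model.

```python
import numpy as np


def softmax(a, axis=-1):
    """Numerically stable softmax."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(h_ts, h_text, w_q, w_k, w_v):
    """Series tokens (queries) attend over text tokens (keys/values).

    h_ts:   (T, d) encoded series tokens
    h_text: (L, d) encoded scenario-text tokens
    Returns the fused (T, d) representation and the (T, L) attention weights.
    """
    q = h_ts @ w_q
    k = h_text @ w_k
    v = h_text @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    # Residual connection (an assumption): keep the series signal and add
    # the text-conditioned update on top of it.
    return h_ts + weights @ v, weights


rng = np.random.default_rng(0)
T, L, d = 8, 5, 16                      # illustrative sizes
h_ts = rng.normal(size=(T, d))          # stand-in for Enc_series(x_{1:t})
h_text = rng.normal(size=(L, d))        # stand-in for Enc_text(S, H, F)
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

fused, weights = cross_attention(h_ts, h_text, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Each row of `weights` shows how strongly one time step draws on each scenario token, which gives a simple handle for inspecting whether the forecast is actually conditioning on the text.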

Ablations systematically evaluate the incremental value of the textual description, the historical context, and the future scenario. Empirically, textual future scenarios ($F$) produce the largest uplift in directional accuracy (+25 to 30 percentage points), whereas historical context alone yields minimal gains (Jang et al., 13 Jan 2026). LLMs (e.g., GPT-4o, Gemma-3-27B, Qwen2.5) consistently outperform state-space models and TSFMs in the scenario-guided setting, particularly on explicit counterfactual tasks.
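The zero-shot prompting and the ablation over $S$, $H$, and $F$ can be sketched as a small prompt builder. The function name `build_prompt`, the prompt wording, and the two-decimal number formatting are hypothetical choices for this example and are not taken from the benchmark.

```python
from typing import Optional, Sequence


def build_prompt(
    history: Sequence[float],
    horizon: int,
    description: Optional[str] = None,
    historical_context: Optional[str] = None,
    future_scenario: Optional[str] = None,
) -> str:
    """Concatenate the numeric history with whichever text parts are given.

    Passing None for a component drops it, which mimics an ablation.
    """
    lines = []
    if description:
        lines.append(f"Variable description (S): {description}")
    if historical_context:
        lines.append(f"Historical context (H): {historical_context}")
    if future_scenario:
        lines.append(f"Future scenario (F): {future_scenario}")
    lines.append("Observed values: " + ", ".join(f"{v:.2f}" for v in history))
    lines.append(
        f"Forecast the next {horizon} value(s) and label the direction "
        "as rise, fall, or unchanged."
    )
    return "\n".join(lines)


full_prompt = build_prompt(
    [1.0, 2.0, 3.5],
    horizon=1,
    description="Placeholder description",
    historical_context="Placeholder context",
    future_scenario="Placeholder scenario",
)
history_only_prompt = build_prompt([1.0, 2.0, 3.5], horizon=1)
print(full_prompt)
```

Comparing model outputs for `full_prompt` against prompts with one component removed isolates how much each text field contributes.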

4. Causal Intervention and Semantic “What-If” Algorithms

Going beyond scenario input conditioning, recent work investigates direct causal manipulation within foundation models (Sanyal et al., 6 Sep 2025). The “activation transplantation” methodology formalizes semantic intervention at the level of hidden states. Given a pretrained TSFM (Chronos or Toto), the intervention algorithm proceeds:

  1. Compute layer-$\ell$ activation moments for a style event $X_s$:

$$\mu_s = \frac{1}{T} \sum_{t=1}^T h_t^{(s)}, \quad \Sigma_s = \frac{1}{T} \sum_{t=1}^T (h_t^{(s)} - \mu_s)(h_t^{(s)} - \mu_s)^\top$$

  2. Extract the same moments from the target context $X_{\text{tgt}}$:

$$\mu_{\text{tgt}}, \quad \Sigma_{\text{tgt}}$$

  3. Standardize the target activations and transplant the style moments:

$$z_t = \Sigma_{\text{tgt}}^{-1/2} (h_t^{(\text{tgt})} - \mu_{\text{tgt}}), \quad h'_t = \Sigma_s^{1/2} z_t + \mu_s$$

The modified activation sequence $h'_t$ is then injected at layer $\ell$, determining the subsequent forecast trajectory. This technique deterministically induces regime shifts (e.g., from calm to crash), and the Euclidean norm of the transplanted latent vector modulates event severity, with observed correlation $r \approx 0.95$ between norm and forecast decline (Sanyal et al., 6 Sep 2025).
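The moment-matching step can be sketched in NumPy as follows. This covers only the whitening and re-coloring of activations; reading activations from a real TSFM and injecting them back at layer $\ell$ is model-specific and is omitted. The small ridge term `eps` and the synthetic activations are assumptions added so the example is stable and self-contained.

```python
import numpy as np


def moments(h, eps=1e-6):
    """Mean and (ridge-regularized) covariance of activations h with shape (T, d)."""
    mu = h.mean(axis=0)
    centered = h - mu
    sigma = centered.T @ centered / h.shape[0]
    # eps * I is an assumption that keeps sigma invertible.
    return mu, sigma + eps * np.eye(h.shape[1])


def sym_matrix_power(sigma, power):
    """Power of a symmetric positive-definite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(sigma)
    return (vecs * vals**power) @ vecs.T


def transplant(h_tgt, h_style):
    """Whiten target activations, then re-color them with the style moments."""
    mu_s, sigma_s = moments(h_style)
    mu_tgt, sigma_tgt = moments(h_tgt)
    # Row-vector form of z_t = Sigma_tgt^{-1/2} (h_t - mu_tgt); the matrix
    # square roots are symmetric, so right-multiplication is equivalent.
    z = (h_tgt - mu_tgt) @ sym_matrix_power(sigma_tgt, -0.5)
    return z @ sym_matrix_power(sigma_s, 0.5) + mu_s


rng = np.random.default_rng(0)
h_tgt = rng.normal(0.0, 1.0, size=(200, 4))     # synthetic "calm" activations
h_style = rng.normal(-2.0, 3.0, size=(150, 4))  # synthetic "crash" activations

h_new = transplant(h_tgt, h_style)
mu_new, sigma_new = moments(h_new)
mu_style, sigma_style = moments(h_style)
print(np.abs(mu_new - mu_style).max(), np.abs(sigma_new - sigma_style).max())
```

After the transform, the first and second moments of the target activations match those of the style event while the standardized temporal pattern $z_t$ of the target is preserved.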

5. Evaluation Metrics, Experimental Outcomes, and Scenario Sensitivity

WIT experiments employ quantitative metrics: Mean Squared Error (MSE), 3-way directional accuracy (“rise,” “fall,” “unchanged”), and quantile loss for probabilistic assessment. Key findings include:

  • Unimodal TSFMs (Chronos): Politics ST, MSE ≈ 18, accuracy ≈ 53%
  • LLMs with full scenario context (S+H+F): Politics ST, MSE = 13.5, accuracy = 91.9%, counterfactual accuracy up to 98.3% (Jang et al., 13 Jan 2026)
  • Scenario ablation: future scenario context ($F$) is essential, while historical context ($H$) alone offers limited uplift
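The three metric families named at the start of this section can be sketched in plain Python. The labelling rule in `direction` (comparing a future value with the last observed value under a tolerance `tol`) is an assumption for illustration; the source does not state how the benchmark derives its three-way labels.

```python
from typing import Sequence


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def direction(last_observed: float, future_value: float, tol: float = 0.0) -> str:
    """Three-way label; the tolerance band for 'unchanged' is an assumption."""
    delta = future_value - last_observed
    if delta > tol:
        return "rise"
    if delta < -tol:
        return "fall"
    return "unchanged"


def directional_accuracy(true_labels: Sequence[str], pred_labels: Sequence[str]) -> float:
    """Share of samples whose predicted label matches the true label."""
    hits = sum(t == p for t, p in zip(true_labels, pred_labels))
    return hits / len(true_labels)


def quantile_loss(y_true: Sequence[float], y_pred_q: Sequence[float], q: float) -> float:
    """Pinball loss for quantile level q in (0, 1)."""
    total = 0.0
    for y, y_hat in zip(y_true, y_pred_q):
        diff = y - y_hat
        total += max(q * diff, (q - 1.0) * diff)
    return total / len(y_true)


true_labels = [direction(10.0, v, tol=0.5) for v in (12.0, 9.0, 10.2)]
pred_labels = [direction(10.0, v, tol=0.5) for v in (11.0, 10.1, 10.0)]
print(true_labels, pred_labels, directional_accuracy(true_labels, pred_labels))
```

Directional accuracy ignores magnitude entirely, which is why it is reported alongside MSE rather than in place of it.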

Activation transplantation validates conditional control within TSFMs:

  • Inducing crash semantics produces forecast drops (ΔForecast₉₀ = −0.12)
  • Injecting calm restores stability (+18% forecast swing, variance reduction = 20%)
  • Severity control is nearly linear (dose-response $r = 0.99$), and uncertainty scales with intervention strength (Sanyal et al., 6 Sep 2025)

6. Mechanisms for Path-Dependent Event Probability and Forecast Form

Time-series foundation models support diverse forecast forms; path-dependent event probabilities (e.g., threshold crossings) are identifiable only from joint trajectory ensembles, not marginal distributions. Sklar’s theorem underpins the non-identifiability: infinitely many joint laws share marginals but diverge on path-level events (Perez-Diaz et al., 22 Oct 2025). Valid scenario simulation and risk analysis require forecast forms that preserve temporal dependence; evaluation must align with operational requirements using joint metrics (Energy Score, Variogram Score, Integrated Brier Score).
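A small simulation illustrates the non-identifiability argument. Both ensembles below have identical standard normal marginals at every step, but one draws steps independently and the other repeats a single draw across the horizon (a comonotone dependence structure). The threshold, horizon, and sample size are arbitrary choices for this example.

```python
import math
import random


def normal_cdf(x: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def sample_paths(n_paths: int, horizon: int, comonotone: bool, seed: int = 0):
    """Draw trajectories whose per-step marginals are all N(0, 1)."""
    rng = random.Random(seed)
    paths = []
    for _ in range(n_paths):
        if comonotone:
            z = rng.gauss(0.0, 1.0)
            paths.append([z] * horizon)          # perfectly dependent steps
        else:
            paths.append([rng.gauss(0.0, 1.0) for _ in range(horizon)])
    return paths


def crossing_probability(paths, threshold: float) -> float:
    """Estimate P(max_t y_t > threshold) from a joint trajectory ensemble."""
    hits = sum(1 for path in paths if max(path) > threshold)
    return hits / len(paths)


threshold, horizon = 1.5, 20
p_independent = crossing_probability(sample_paths(20000, horizon, False), threshold)
p_comonotone = crossing_probability(sample_paths(20000, horizon, True), threshold)

# Closed forms for the two dependence structures.
exact_independent = 1.0 - normal_cdf(threshold) ** horizon
exact_comonotone = 1.0 - normal_cdf(threshold)
print(p_independent, p_comonotone)
```

The crossing probability is roughly 0.75 under independence and below 0.1 under comonotone dependence, even though a forecaster reporting only per-step marginals would output the same distributions in both cases.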

7. Interpretability, Limitations, and Future Extensions

WIT marks a transition from post-hoc attribution (e.g., saliency maps) to proactive, semantic scenario simulation, directly probing model causal mechanisms (Sanyal et al., 6 Sep 2025). Limitations include:

  • Focus on directional/trend labels, not fully probabilistic or quantile forecasts (Jang et al., 13 Jan 2026)
  • High curation cost for high-fidelity scenario text limits scalability
  • LLMs may violate semantic range (e.g., producing out-of-bounds values)
  • TraffNet and related models require real-time or accurate estimators of causal features (e.g., OD demands) for traffic assignment (Xu et al., 2023)

Proposed extensions target broader modalities (images, structured metadata), scalable scenario generation pipelines (end-to-end LLMs), and enhanced integration of causal intervention techniques. The domain is converging on semantic, context-aware, risk-sensitive forecasting frameworks capable of emulating expert human scenario reasoning.


In summary, “What-If TSF” defines a task, benchmark, and methodology for scenario-guided, conditional prediction in time series models, operationalized through curated multimodal benchmarks, causal intervention mechanisms, and evaluation protocols aligned with operational needs. The paradigm sets a standard for semantic, proactive forecasting in high-stakes domains.
