Multimodal Time Series Forecasting
- Multimodal Time Series Forecasting is the task of predicting future numerical values by fusing time series data with auxiliary modalities like text and images.
- Methods span parallel encoder fusion, controller modulation, and joint generative modeling, with the aim of improving accuracy and uncertainty quantification.
- Empirical studies demonstrate significant gains in forecast performance when contextual data is effectively integrated, although careful fusion design is essential.
Multimodal time series forecasting is the task of predicting future values of a time-dependent process by integrating both quantitative time series data and one or more additional modalities—most commonly text, but also structured metadata, images, or symbolic information. The goal is to leverage context (such as event reports, news, expert commentary, or physics equations) that augments or disambiguates the partial signal present in past numerical observations. This paradigm emerges from the recognition that critical drivers of system behavior (macroeconomic shifts, medical events, control policies) are sometimes only partially or indirectly encoded in the observed series, but may be captured in other forms. Modern multimodal forecasting research investigates architectures, objectives, and theoretical principles for fusing such heterogeneous data sources in order to improve accuracy, robustness, interpretability, and uncertainty quantification.
1. Problem Formulation and Motivation
Multimodal time series forecasting can be formalized as learning a mapping

$$f: (\mathbf{y}_{1:T}, \mathcal{M}) \mapsto \hat{\mathbf{y}}_{T+1:T+H},$$

where $\mathbf{y}_{1:T}$ is the observed historical numerical time series and $\mathcal{M}$ denotes additional modalities that may include text, exogenous time series, symbolic equations, static features, or images. The core motivation is that relevant causal factors or transition events may be most salient or exclusively visible in these auxiliary modalities, especially in domains such as finance, energy, or healthcare (Qin, 21 May 2025, Nguyen et al., 2 Feb 2026, Seo et al., 11 Dec 2025).
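The mapping can be sketched as a minimal NumPy toy; all dimensions, the additive-fusion choice, and the randomly initialized weights below are illustrative stand-ins for a trained model, not any specific published architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

T, H, D_TEXT, D_HID = 24, 6, 8, 16  # history length, horizon, text dim, hidden dim

# Hypothetical learned parameters (random here, stand-ins for trained weights).
W_ts = rng.normal(size=(T, D_HID)) * 0.1        # encodes the numerical history
W_txt = rng.normal(size=(D_TEXT, D_HID)) * 0.1  # encodes the auxiliary text embedding
W_out = rng.normal(size=(D_HID, H)) * 0.1       # decodes the fused state to a forecast

def forecast(y_hist: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Map (y_{1:T}, m) to y_hat_{T+1:T+H} by additively fusing the encodings."""
    z = y_hist @ W_ts + text_emb @ W_txt  # fuse the two modality encodings
    return z @ W_out                      # decode to an H-step forecast

y_hat = forecast(rng.normal(size=T), rng.normal(size=D_TEXT))
```

The point of the sketch is the signature: the forecaster consumes both the numerical history and an auxiliary-modality representation, and any of the fusion strategies surveyed below can replace the additive step.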
Two core scenarios have driven research:
- Standard (history-rich) forecasting: Sufficient history is available, and modalities act as context cues augmenting the time series.
- Cold-start or sparse-data forecasting: Limited or no numerical history is available, so the modalities provide the primary predictive signal (Zhou et al., 21 May 2025).
Formalisms now explicitly recognize the need for causal soundness of text, i.e., the inclusion of only exogenous, non-leaking information within the auxiliary modalities (Xu et al., 29 Sep 2025).
2. Architectural Paradigms for Multimodal Forecasting
Architectures in this field typically fall into several major families:
A. Alignment-Based Early/Intermediate Fusion
- Parallel encoders process numerical and auxiliary modalities (e.g., BERT for text, MLP/Transformer for time series), then fuse latent representations (via addition, concatenation, cross-attention) at a mid or late stage (Qin, 21 May 2025, Seo et al., 11 Dec 2025, Zhou et al., 30 Aug 2025, Cho et al., 23 Oct 2025, Wu et al., 2 May 2025).
- Contrastive or InfoNCE objectives may be employed to align representations across modalities (Zhou et al., 30 Aug 2025, Wu et al., 2 May 2025).
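A symmetric InfoNCE alignment term of the kind used by these methods can be written compactly in NumPy; the batch size, embedding dimension, and temperature below are illustrative choices, not values from any cited paper:

```python
import numpy as np

def info_nce(z_ts: np.ndarray, z_txt: np.ndarray, tau: float = 0.1) -> float:
    """Symmetric InfoNCE over a batch of paired (time-series, text) embeddings."""
    # L2-normalize so dot products become cosine similarities.
    z_ts = z_ts / np.linalg.norm(z_ts, axis=1, keepdims=True)
    z_txt = z_txt / np.linalg.norm(z_txt, axis=1, keepdims=True)
    logits = z_ts @ z_txt.T / tau        # (B, B) pairwise similarity matrix
    labels = np.arange(len(logits))      # matched pairs sit on the diagonal

    def ce(l):
        # Stable cross-entropy pushing each row's diagonal entry to dominate.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    return 0.5 * (ce(logits) + ce(logits.T))  # symmetrize over both directions

rng = np.random.default_rng(1)
z = rng.normal(size=(4, 8))
loss_aligned = info_nce(z, z)                       # identical pairs: low loss
loss_random = info_nce(z, rng.normal(size=(4, 8)))  # unrelated pairs: higher loss
```

Minimizing this term pulls each time-series embedding toward its paired text embedding while pushing it away from the other texts in the batch, which is the alignment effect the cited objectives rely on.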
B. Modality as Controller / Modulator
- Approaches such as Adaptive Information Routing (AIR) and Expert Modulation MoE condition the flow of information through the time series network on text embeddings, using text to generate gating or weighting vectors that dynamically influence hidden pathways or expert activations (Seo et al., 11 Dec 2025, Zhang et al., 29 Jan 2026).
- This structure allows text to act as a controller over the time series model, as opposed to being treated as a simple additive source.
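The controller idea reduces to text-conditioned gating of hidden features. A minimal sketch, with a hypothetical learned projection `W_gate` standing in for whatever gating network a given model uses:

```python
import numpy as np

rng = np.random.default_rng(2)
D_TEXT, D_HID = 8, 16  # illustrative text-embedding and hidden dimensions

# Hypothetical learned projection from text embedding to per-feature gates.
W_gate = rng.normal(size=(D_TEXT, D_HID))

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def modulate(hidden: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Text produces a gate in (0, 1) that rescales hidden features elementwise."""
    gate = sigmoid(text_emb @ W_gate)
    return gate * hidden  # text controls which pathways carry signal

hidden = rng.normal(size=D_HID)
out = modulate(hidden, rng.normal(size=D_TEXT))
```

Because the gate lies in (0, 1), the text can only attenuate or pass through existing pathways rather than inject values directly, which is what distinguishes modulation from additive fusion.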
C. Joint Generative Modeling and Latent State-Space Methods
- Recent work integrates LLMs with probabilistic state-space models (SSMs), jointly generating both numerical and textual observations from a common latent state, offering flexible windowing, formal uncertainty, and text-conditioned posterior inference (Cho et al., 23 Oct 2025).
- Generative diffusion models and flow-matching architectures extend this to support full predictive distributions, with multimodal tokens guiding or modulating the denoising/generation process (Zhang et al., 8 Dec 2025, Zhang et al., 6 Feb 2026, Wu et al., 26 Sep 2025).
D. Benchmarking and Late Fusion Baselines
- Simple but effective approaches freeze a pretrained LLM/text encoder, freeze a time-series encoder, and fuse the outputs via a learned scalar (late fusion), demonstrating surprisingly strong performance across diverse benchmarks (Liu et al., 2024, Xu et al., 29 Sep 2025).
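The late-fusion baseline is almost trivial to express: with both backbones frozen, the only trained parameter is a mixing scalar. The toy values below are invented for illustration:

```python
import numpy as np

def late_fuse(y_ts: np.ndarray, y_txt: np.ndarray, alpha: float) -> np.ndarray:
    """Convex combination of frozen-backbone forecasts; alpha is the one learned scalar."""
    return alpha * y_ts + (1.0 - alpha) * y_txt

# Toy check: choose alpha by grid search against a known target.
target = np.array([1.0, 2.0, 3.0])
y_ts = np.array([0.8, 2.2, 2.9])   # forecast from the frozen time-series branch
y_txt = np.array([1.5, 1.5, 3.5])  # forecast from the frozen text branch
alphas = np.linspace(0.0, 1.0, 101)
errs = [np.mean((late_fuse(y_ts, y_txt, a) - target) ** 2) for a in alphas]
best_alpha = alphas[int(np.argmin(errs))]
```

Since the grid includes the endpoints alpha = 0 and alpha = 1, the fused forecast can never be worse than the better unimodal branch on the fitting data, which helps explain why this baseline is surprisingly strong.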
Table: Select Architectural Strategies
| Strategy | Example Models / Papers | Fusion Location |
|---|---|---|
| Parallel encoders + addition | BALM-TSF (Zhou et al., 30 Aug 2025) | Late |
| Cross-attention fusion | Dual-Forecaster (Wu et al., 2 May 2025) | Mid/Late |
| Gated/AIR routing | AIR (Seo et al., 11 Dec 2025) | All layers (modulatory) |
| MoE modulation | MoME (Zhang et al., 29 Jan 2026) | Expert function-level |
| Latent SSM + LLM | LBS (Cho et al., 23 Oct 2025) | Posterior/generative |
| Frequency-domain fusion | SpecTF (Nguyen et al., 2 Feb 2026) | Spectral, global |
| Diffusion + multi-modal parallelization | UniDiff (Zhang et al., 8 Dec 2025); Aurora (Wu et al., 26 Sep 2025) | Unified, generative |
3. Fusion and Cross-Modal Alignment Methodologies
Fusion of heterogeneous modalities is non-trivial due to mismatched temporal indexing, scales, and semantics. Primary methods include:
- Additive or Concatenation Fusion: Combining modality embeddings via arithmetic operations. Requires distribution balancing (such as horizon-aware scaling in BALM-TSF (Zhou et al., 30 Aug 2025)).
- Attention-Based and Cross-Attention Fusion: Enables selective weighting of one modality’s tokens conditioned on another, critical in scenarios with long or sparse inputs (Wu et al., 2 May 2025, Seo et al., 11 Dec 2025, Nguyen et al., 2 Feb 2026).
- Contrastive and Distributional Objectives: InfoNCE or symmetric contrastive terms align embeddings semantically, improving stability and robustness (Zhou et al., 30 Aug 2025, Wu et al., 2 May 2025).
- Frequency-Domain Fusion: Projects both time series and textual embeddings into spectral (frequency) space, allowing multi-scale textual modulation (as in SpecTF (Nguyen et al., 2 Feb 2026)).
- Controller/Expert Modulation: Instead of direct fusion, text modulates network components or expert outputs, enhancing adaptability and interpretability (Seo et al., 11 Dec 2025, Zhang et al., 29 Jan 2026).
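Frequency-domain fusion, the least familiar item above, can be sketched in a few lines: the series is transformed to the spectral domain, each frequency bin is reweighted by a text-derived factor, and the result is transformed back. The projection `W_spec` and the `1 + tanh` weighting are illustrative assumptions, not the SpecTF formulation:

```python
import numpy as np

rng = np.random.default_rng(3)
T, D_TEXT = 64, 8
n_bins = T // 2 + 1  # number of rfft frequency bins for a length-T series

# Hypothetical learned map from the text embedding to one weight per frequency bin.
W_spec = rng.normal(size=(D_TEXT, n_bins)) * 0.1

def spectral_fuse(y: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Reweight each frequency band of the series by a text-derived factor."""
    spec = np.fft.rfft(y)                       # to the frequency domain
    weights = 1.0 + np.tanh(text_emb @ W_spec)  # per-bin weights in (0, 2)
    return np.fft.irfft(spec * weights, n=len(y))  # back to the time domain

y = np.sin(np.linspace(0, 4 * np.pi, T)) + 0.1 * rng.normal(size=T)
y_mod = spectral_fuse(y, rng.normal(size=D_TEXT))
```

A zero text embedding yields unit weights and recovers the input exactly, so the text acts purely as a multi-scale modulator of existing spectral content.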
Fusion design is tightly linked to empirical success: there is strong evidence that improper or imbalanced fusion leads to over-reliance on one modality, degraded robustness, and loss of interpretability (Zhou et al., 30 Aug 2025, Qin, 21 May 2025).
4. Benchmarks, Data Considerations, and Empirical Gains
Recent work emphasizes the need for high-fidelity, causally-grounded benchmarks. Key requirements include:
- Data Sourcing Integrity: Ensuring that numerical and textual modalities are contemporaneous and not subject to pretraining contamination (Xu et al., 29 Sep 2025).
- Strict Causal Soundness: Text input must not leak future outcomes or confound as a description of the target (see do-calculus formalization) (Xu et al., 29 Sep 2025).
- Structural Clarity: Datasets should precisely index subjects and channels, enabling rigorous hold-out and transfer experiments (Xu et al., 29 Sep 2025, Zhou et al., 21 May 2025, Liu et al., 2024).
Notable benchmarks include Time-MMD (9 domains), Fidel-TS (with causally clean text), MoTime (supporting cold-start and common scenarios), and domain-specific datasets as in D-Sine / D-Tumor (Qin, 21 May 2025, Liu et al., 2024, Xu et al., 29 Sep 2025, Zhou et al., 21 May 2025).
Empirical improvement: Multimodal models consistently achieve substantial reductions in MSE and MAE relative to unimodal baselines—but these gains are tightly correlated with text density and causal utility. For example, FIATS reduces MSE by 15–40% vs. the best unimodal model on causally relevant datasets; SpecTF yields 3.8% lower MSE over multimodal time-domain baselines, and gains reach up to 40% where text is dense and aligned (Xu et al., 29 Sep 2025, Nguyen et al., 2 Feb 2026, Liu et al., 2024). However, irrelevant or causally ambiguous textual data provides little or negative benefit (Xu et al., 29 Sep 2025, Zhang et al., 20 Jun 2025).
5. Interpretability, Robustness, and Theoretical Insights
A defining challenge of multimodal approaches is preserving model transparency and robustness. Advances along these lines include:
- Trajectory Decomposition: Encoding exogenous series into interpretable trends and transition properties enables attribution and counterfactual sensitivity analysis, as in extensions to TIMEVIEW (Qin, 21 May 2025).
- Expert/Controller Modulation: AIR and MoME architectures allow tracing which "latent pathway" or expert is dynamically activated by textual cues, supporting structured analysis of model reaction to exogenous events (Seo et al., 11 Dec 2025, Zhang et al., 29 Jan 2026).
- Spectral and Token Sensitivity: SpecTF enables visualization of which frequency bands are reweighted in response to specific text, linking global or local text cues back to time series dynamics (Nguyen et al., 2 Feb 2026).
- Robustness to Noise: Decomposing and encoding modalities as trends/properties, or through spectral filtering, provides robustness to local perturbations and data noise (Qin, 21 May 2025, Nguyen et al., 2 Feb 2026).
- Theoretical Error Bounds: MoME provides analytic bounds on error from sparse expert selection, connecting multimodal expert pruning to truncated PCA (Zhang et al., 29 Jan 2026).
6. Limitations, Open Questions, and Future Directions
Despite the rapid evolution of methods, several limitations and questions remain open:
- Selective or Noisy Modality Impact: Causally irrelevant or noisy text may degrade forecasts, and many models lack mechanisms to discount such information robustly (Xu et al., 29 Sep 2025, Zhang et al., 20 Jun 2025).
- Cold-Start and Transfer Learning: Modality utility is especially pronounced in cold-start or data-scarce settings, yet dataset diversity and task framing (static vs. dynamic modalities) require deeper investigation (Zhou et al., 21 May 2025, Zhang et al., 20 Jun 2025, Liu et al., 2024).
- Scalability and Efficiency: Generative and diffusion-based models introduce greater sampling overhead compared to point-forecasting models, with an ongoing need for efficiency improvements (Zhang et al., 8 Dec 2025, Zhang et al., 6 Feb 2026, Wu et al., 26 Sep 2025).
- Uncertainty Quantification: Only a subset of models—especially those with probabilistic or generative backbones—support formal predictive intervals or uncertainty estimation (Cho et al., 23 Oct 2025, 2505.10774, Zhang et al., 8 Dec 2025).
- Interpretability at Scale: While localized attribution is available in some architectures, automatic explanations for cross-modal and temporal interactions are still a research frontier (Qin, 21 May 2025, Nguyen et al., 2 Feb 2026, Cho et al., 23 Oct 2025).
- Benchmarks and Generalization: There is no universal winner; benchmarks reveal that no single model generalizes best across all domains, horizons, and modalities (Xu et al., 29 Sep 2025, Zhang et al., 20 Jun 2025).
Future work is expected to further investigate adaptive fusion, domain adaptation, online and streaming scenarios, and the integration of additional modalities such as vision and symbolic knowledge (Zhou et al., 21 May 2025, Wu et al., 26 Sep 2025, Jollie et al., 2024).
7. Practical Guidelines and Conditions for Multimodality Gains
Comprehensive analyses indicate that the benefit of multimodality is highly condition-dependent (Zhang et al., 20 Jun 2025, Xu et al., 29 Sep 2025). Key practical takeaways include:
- Multimodal integration is most beneficial when:
- Training data is sufficiently abundant for both modalities.
- The text (or other modalities) offers genuinely novel, complementary signal not present in the numerical series.
- The time series model is relatively weak or the domain is subject to abrupt regime shifts not easily modeled from history alone.
- Fusion architecture enables balanced or controlled integration, avoiding modality dominance or information dilution.
- Alignment objectives and contrastive regularization are crucial to avoid overfitting or modality neglect.
- Empirical assessment should include explicit measurement of each modality's marginal benefit, ablations for modality removal, and causal tests for leaky or endogenous text (Zhang et al., 20 Jun 2025, Xu et al., 29 Sep 2025).
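The recommended marginal-benefit measurement amounts to comparing the full model against a text-ablated variant on held-out data. A minimal sketch with synthetic stand-in predictions (the noise levels are invented for illustration, not empirical results):

```python
import numpy as np

def mse(y_hat: np.ndarray, y: np.ndarray) -> float:
    return float(np.mean((y_hat - y) ** 2))

# Hypothetical test-set forecasts from the full model and a text-ablated variant.
rng = np.random.default_rng(4)
y_true = rng.normal(size=100)
y_full = y_true + 0.3 * rng.normal(size=100)    # multimodal model's predictions
y_ablate = y_true + 0.5 * rng.normal(size=100)  # same model with text removed

marginal_gain = mse(y_ablate, y_true) - mse(y_full, y_true)
rel_gain = marginal_gain / mse(y_ablate, y_true)  # fraction of MSE attributable to text
```

Reporting `rel_gain` per dataset makes it immediately visible where text carries genuine signal and where it contributes nothing, which is the diagnostic the cited analyses call for.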
In summary, multimodal time series forecasting delivers measurable improvements in both accuracy and practical robustness, as long as methodological rigor is maintained in modality integration, data handling, and evaluation. The field continues to co-evolve with advances in foundation models, dynamic fusion architectures, and high-quality, causally grounded benchmarks.