Time-Series Question Answering (TSQA)

Updated 8 October 2025

Time-Series Question Answering (TSQA) is a research field that processes natural language queries on time-stamped data, integrating temporal reasoning, forecasting, and multimodal analysis.
It employs methods such as temporal expression extraction, multi-hop inference, and data fusion from texts, graphs, and sensor signals to address dynamic information needs.
Applications span finance, healthcare, and logistics, where TSQA systems enhance decision support by ensuring timely, accurate retrieval of temporal facts.

Time-Series Question Answering (TSQA) is a research area concerned with designing, evaluating, and deploying models and systems that answer natural language questions involving temporal information, time-evolving facts, or reasoning directly over time-stamped data sources. Its methodologies range from text-centric temporal QA and knowledge-graph-based temporal QA to multimodal systems that integrate numerical time series with contextual language, and to explicit time-series signal analysis via domain-specific agents. The field addresses tasks such as forecasting, factual temporal knowledge retrieval, scenario-driven planning, and robust multi-hop temporal reasoning.

1. Task Formalization and Subproblem Taxonomy

TSQA tasks are defined by their reliance on time-evolving inputs and require models to restrict retrieval or reasoning to evidence aligned with temporal constraints. This includes:

Forecasting-oriented QA: Given a corpus of news articles or historical records with timestamps, answer questions about events that occur after the latest available information, using only past data (e.g., ForecastQA (Jin et al., 2020)).
Temporal KGQA: Answering questions over temporal knowledge graphs (KGs), inferring time intervals, entity roles, or factual transitions via timestamp estimation and temporal order modeling (e.g., Shang et al., 2022).
Streaming and Continually Evolving QA: Models must adapt knowledge as new sources become available over time, balancing adaptation with retention (e.g., StreamingQA (Liška et al., 2022), CLTSQA (Yang et al., 2024)).
Temporal Reasoning over Text and Multimodal Inputs: Handling cross-modal QA involving numerical time series with associated natural language (e.g., Chat-TS (Quinlan et al., 13 Mar 2025), MTBench (Chen et al., 21 Mar 2025), ITFormer (Wang et al., 25 Jun 2025)).
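
The "using only past data" constraint in forecasting-oriented QA can be illustrated with a minimal sketch. The `Article` type, the `evidence_before` helper, and the toy corpus are hypothetical names and data introduced here for illustration; they are not taken from ForecastQA.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Article:
    """A time-stamped piece of evidence (hypothetical structure)."""
    published: date
    text: str


def evidence_before(corpus, cutoff):
    """Keep only articles published strictly before the cutoff date."""
    return [a for a in corpus if a.published < cutoff]


# Toy corpus, invented for this example.
corpus = [
    Article(date(2019, 1, 10), "Talks between the two parties begin."),
    Article(date(2019, 3, 2), "Negotiations stall over tariffs."),
    Article(date(2019, 6, 15), "An agreement is signed."),
]

# A question asked on 1 April 2019 may only see earlier articles.
question_cutoff = date(2019, 4, 1)
visible = evidence_before(corpus, question_cutoff)
print([a.text for a in visible])
```

Under this assumption, a retriever or reader would be run only on `visible`, so the article reporting the outcome never reaches the model.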

Critical subproblems include:

Temporal expression extraction and normalization;
Reasoning over explicit and implicit temporal constraints;
Multi-hop temporal inference;
Temporal rationale faithfulness in answers;
Handling diachronic and multimodal corpora.
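
As a rough illustration of the first subproblem, the sketch below extracts two simple kinds of temporal expression (full dates such as "8 October 2019" and bare years) and normalizes each to a closed date interval. The patterns and the `normalize` function are assumptions made for this example; practical systems typically rely on dedicated temporal taggers and cover far more expression types.

```python
import re
from datetime import date

MONTHS = {name: i for i, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

# Hypothetical patterns: "8 October 2019" and bare years from 1800 to 2099.
FULL_DATE = re.compile(r"\b(\d{1,2}) ([A-Za-z]+) (\d{4})\b")
YEAR_ONLY = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")


def normalize(text):
    """Return sorted (start, end) date intervals for the expressions found."""
    intervals, consumed = [], []
    for m in FULL_DATE.finditer(text):
        month = MONTHS.get(m.group(2).lower())
        if month is None:
            continue  # the middle word is not a month name
        try:
            day = date(int(m.group(3)), month, int(m.group(1)))
        except ValueError:
            continue  # impossible calendar date, e.g. 31 February
        intervals.append((day, day))
        consumed.append(m.span())
    for m in YEAR_ONLY.finditer(text):
        # Skip years that already belong to a full date.
        if any(start <= m.start() < end for start, end in consumed):
            continue
        year = int(m.group(1))
        intervals.append((date(year, 1, 1), date(year, 12, 31)))
    return sorted(intervals)


question = "Who led the team in 2015, before the merger on 8 October 2019?"
for start, end in normalize(question):
    print(start.isoformat(), "to", end.isoformat())
```

Representing every expression as an interval, rather than a single point, lets later reasoning steps compare a year against an exact day with the same overlap and ordering tests.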

2. Datasets and Benchmarks

A rich variety of datasets drive TSQA research, reflecting diversity in data modality, question complexity, and temporal reasoning depth:

Dataset/Benchmark	Data Types	Key Focus	Scale / Coverage
ForecastQA (Jin et al., 2020)	News Text (time-stamped)	Event Forecasting	10,392 Q-A pairs, 5 years
StreamingQA (Liška et al., 2022)	News Text, Timelines	Adaptation, Drift	14 years, quarterly splits
ComplexTempQA (Gruber et al., 2024)	Wikipedia/Wikidata	Multi-hop, Large	100M+ pairs, 36 years
EngineMT-QA (Wang et al., 25 Jun 2025)	Sensor TS + Text	Multimodal QA	110K Q-A pairs, real-world
MTBench (Chen et al., 21 Mar 2025)	Financial/Weather TS + Text	Cross-modal QA	Multi-domain, labeled tasks
TDBench (Kim et al., 4 Aug 2025)	Temporal DB	Factual QA, Eval	6K+ pairs, 13 operators
CLTSQA-Data (Yang et al., 2024)	WikiData/Text	Continual Learning	50K Qs, ∼5K contexts, staged
UnSeenTimeQA (Uddin et al., 2024)	Synthetic Scenarios	Reasoning-only	Unlimited, no web leakage

Significant advances in dataset construction include:

Systematic use of temporal SQL, temporal functional dependencies, and temporal joins for scalable QA generation (e.g., TDBench (Kim et al., 4 Aug 2025));
Synthetic, contamination-free settings to stress pure temporal reasoning (e.g., UnSeenTimeQA (Uddin et al., 2024));
Massive coverage both in terms of modalities and reasoning depth (e.g., ComplexTempQA (Gruber et al., 2024), MTBench (Chen et al., 21 Mar 2025)).
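
The idea of generating QA pairs from a temporal database can be sketched as follows. This is only an illustration of the general approach under simple assumptions, not the TDBench pipeline: the `holds_office` table, its rows, and the `make_qa` helper are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE holds_office ("
    "person TEXT, office TEXT, start_year INTEGER, end_year INTEGER)"
)
# Invented rows; each fact is valid over a closed interval of years.
conn.executemany(
    "INSERT INTO holds_office VALUES (?, ?, ?, ?)",
    [
        ("Person A", "Mayor of Exampleville", 2001, 2008),
        ("Person B", "Mayor of Exampleville", 2009, 2016),
        ("Person C", "Mayor of Exampleville", 2017, 2024),
    ],
)


def make_qa(office, year):
    """Build one question with its gold answer and gold validity interval."""
    row = conn.execute(
        "SELECT person, start_year, end_year FROM holds_office "
        "WHERE office = ? AND start_year <= ? AND end_year >= ?",
        (office, year, year),
    ).fetchone()
    if row is None:
        return None  # no fact is valid in that year
    person, start, end = row
    return {
        "question": f"Who was {office} in {year}?",
        "answer": person,
        "valid_from": start,
        "valid_to": end,
    }


qa = make_qa("Mayor of Exampleville", 2012)
print(qa)
```

Because the validity interval is stored with each fact, the generated gold label carries both the answer and the time references that justify it, which is what allows answers and temporal rationales to be scored separately.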

3. Methodologies and Model Architectures

Approaches in TSQA span several paradigms reflecting both linguistic and numerical aspects:

Temporal Text and Knowledge Graph QA

Temporal Cutoff Enforcement: Strictly limiting accessible evidence to pre-specified time points to simulate real-world forecasting (e.g., ForecastQA (Jin et al., 2020)).
Timestamp Estimation and Temporal Embeddings: Inferring latent timestamps from questions, employing multi-linear interactions and sinusoidal positional encodings (e.g., the TComplEx score $S(s, r, o, t) = \operatorname{Re}(\langle e_s, e_r, \bar{e}_o, e_t \rangle)$, where $\bar{e}_o$ is the complex conjugate of the object embedding (Shang et al., 2022)).
Contrastive and Auxiliary Losses: Enforcing temporal order and contrastive learning over question pairs differing only in time expressions (Shang et al., 2022, Son et al., 2023, Yang et al., 2024).
Temporal Graph Extraction and Fusion: Construction of event–time–relation graphs (via CAEVO, SUTime), with fusion by explicit edge representation or GNN modules in transformers (e.g., ERR fusion, RelGraphConv update) (Su et al., 2023).
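
To make the scoring function and its use for timestamp estimation concrete, the sketch below computes a TComplEx-style score as the real part of a four-way complex product summed over embedding dimensions, with the object embedding conjugated as is conventional in ComplEx-style models. The embeddings are random and untrained, so the selected year is meaningless; the helper names and the candidate year range are assumptions for illustration.

```python
import random


def tcomplex_score(e_s, e_r, e_o, e_t):
    """Re(<e_s, e_r, conj(e_o), e_t>), summed over embedding dimensions."""
    return sum(
        (s * r * o.conjugate() * t).real
        for s, r, o, t in zip(e_s, e_r, e_o, e_t)
    )


def random_embedding(rng, dim):
    """Untrained complex embedding, used only to show how the score is computed."""
    return [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(dim)]


rng = random.Random(0)
dim = 8
subject, relation, obj = (random_embedding(rng, dim) for _ in range(3))
# Hypothetical candidate timestamps, one embedding per year.
timestamps = {year: random_embedding(rng, dim) for year in range(2000, 2005)}

# Timestamp estimation as an argmax over candidate time embeddings.
scores = {
    year: tcomplex_score(subject, relation, obj, e_t)
    for year, e_t in timestamps.items()
}
best_year = max(scores, key=scores.get)
print(best_year, round(scores[best_year], 3))
```

With trained embeddings, the same argmax would return the timestamp that best completes the (subject, relation, object) fact.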

Multimodal and Time-Series Integration

Time-Series Encoders Coupled to LLMs: Models like ITFormer (Wang et al., 25 Jun 2025) employ hierarchical position encoding (temporal, channel, segment), learnable instruction tokens, and instruct time attention to align/fuse time-series representations with frozen LLMs.
Discrete Time-Series Tokenization: Methods such as Chat-TS (Quinlan et al., 13 Mar 2025) convert numerical series to discrete tokens, extending LLM vocabulary for direct joint reasoning.
Program-Aided Decomposition: Domain agents such as TS-Reasoner (Ye et al., 2024) translate natural language into structured workflows, execute precise numeric/statistical computations, and incorporate domain knowledge, with adaptive self-refinement.
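
The discrete tokenization idea can be illustrated with uniform binning, one of the simplest ways to map real values onto a finite vocabulary. This is an assumed scheme chosen for clarity and may differ from the quantizer used in Chat-TS; the token format `<ts_k>` and both function names are hypothetical.

```python
def tokenize_series(values, n_bins=16):
    """Map each value to a token such as '<ts_7>' by uniform binning."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero on a constant series
    tokens = []
    for v in values:
        # The maximum value would land in bin n_bins, so clamp it.
        idx = min(int((v - lo) / span * n_bins), n_bins - 1)
        tokens.append(f"<ts_{idx}>")
    return tokens


def detokenize(tokens, lo, hi, n_bins=16):
    """Recover approximate values as the centers of the token bins."""
    width = (hi - lo) / n_bins
    return [lo + (int(t[4:-1]) + 0.5) * width for t in tokens]


series = [0.0, 0.5, 1.0, 2.0, 4.0]
tokens = tokenize_series(series, n_bins=8)
approx = detokenize(tokens, min(series), max(series), n_bins=8)
print(tokens)
print(approx)
```

Each `<ts_k>` string would be added to the language model's vocabulary as a new token, so text and series can share one input sequence. The reconstruction error is bounded by half a bin width, which makes the trade-off between vocabulary size and numeric precision explicit.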

Learning with Noisy or Pseudo-Labels

Pseudo-Labeling via VLMs: Large-scale TSQA models can be trained effectively with labels produced by vision-language models (VLMs, e.g., GPT-4o), exploiting the noise robustness of deep neural networks to reach accuracy higher than that of the pseudo-label generator itself (Fujimura et al., 30 Sep 2025).

4. Evaluation Methodologies and Metrics

Multiple tailored metrics and evaluation protocols have been introduced for TSQA:

Traditional QA Metrics: Exact Match (EM), F1, set-level accuracy for multi-answer cases (Tan et al., 2023, Kong et al., 26 Feb 2025).
Time Accuracy (T) and Answer-Time Accuracy (AT): Evaluating not only the returned answer but also the correctness of its temporal justification, with partial credit when only some of the required dates are correct (Kim et al., 4 Aug 2025). With $f(q)$ taken to denote the set of time references required for question $q$:

$T(q) = \frac{|\{t \in f(q)\, \text{correctly predicted}\}|}{|f(q)|} \times 100\%$

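Brier Score: Calibration of probabilistic predictions over $N$ questions and $C$ answer choices, where $p_{ic}$ is the predicted probability of choice $c$ for question $i$ and $y_{ic} \in \{0, 1\}$ indicates the correct choice (Jin et al., 2020):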

$\text{Brier Score} = \frac{1}{N} \sum_{i=1}^N \sum_{c=1}^C \left(p_{ic} - y_{ic}\right)^2$

Domain-Specific Success Metrics: E.g., Absolute Average Profit, Relative Average Profit, and MAPE for time-series inference tasks (Ye et al., 2024).
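
The two formulas above can be computed directly, as in the following sketch. It assumes that the required time references of a question are available as a set and that gold labels are one-hot vectors; the function names and the toy inputs are illustrative only.

```python
def time_accuracy(required, predicted):
    """Percentage of the required time references that were predicted."""
    required = set(required)
    if not required:
        raise ValueError("question has no required time references")
    return 100.0 * len(required & set(predicted)) / len(required)


def brier_score(probs, labels):
    """Mean over examples of the squared error summed over classes."""
    total = 0.0
    for p_row, y_row in zip(probs, labels):
        total += sum((p - y) ** 2 for p, y in zip(p_row, y_row))
    return total / len(probs)


# One of the two required years is recovered, so partial credit applies.
t = time_accuracy({"2009", "2016"}, {"2009", "2017"})

# Two binary questions, both with the first choice correct.
b = brier_score([[0.8, 0.2], [0.4, 0.6]], [[1, 0], [1, 0]])

print(t)            # 50.0
print(round(b, 3))  # 0.4
```

In the second example the confident correct prediction contributes 0.08 and the hesitant wrong one contributes 0.72, which shows how the Brier score penalizes miscalibration and not only incorrect top choices.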

5. Current Results, Robustness, and Open Challenges

Analysis across established benchmarks demonstrates:

Even the best-performing BERT-based models on event forecasting trail human judgment by roughly 11–19 percentage points of accuracy (60.1% vs. 71.2–79.4% in ForecastQA (Jin et al., 2020)).
Systems that explicitly model temporal order (e.g., via GRU-based aggregation) and apply contrastive learning over time expressions show marked improvements, e.g., a 32% absolute error reduction in temporal KGQA (Shang et al., 2022).
Multimodal models like ITFormer outperform adapted vision–language approaches and general LLMs, while using fewer than 1% additional trainable parameters (Wang et al., 25 Jun 2025).
Robustness analysis (e.g., UnSeenTimeQA (Uddin et al., 2024)) reveals that LLMs excel at shallow or memorization tasks but degrade significantly for multi-step event dependencies and parallel events (up to 45% performance drop on hard splits).
In factual, database-driven QA, significant time hallucination persists: performance drops by 21.7% on average when the temporal references must be correct in addition to the answer content (Kim et al., 4 Aug 2025).

6. Future Directions and Open Research Problems

Open challenges and future directions, as outlined across recent work, include:

Automated Adaptation and Continual Learning: Frameworks combining temporal memory replay and contrastive learning (as in CLTSQA (Yang et al., 2024)) are necessary to cope with knowledge drift and catastrophic forgetting in dynamic environments.
Faithfulness in Temporal Justification: Methods enforcing and evaluating the temporal consistency of answer rationales (e.g., the Faith framework (Jia et al., 2024), TDBench (Kim et al., 4 Aug 2025)) are critical for high-stakes domains.
Fine-Grained Temporal and Multi-Hop Reasoning: Dataset design, augmentation strategies (e.g., pseudo-instruction tuning, temporal shifting (Tan et al., 2023)), and complex temporally stratified benchmarks (e.g., ComplexTempQA (Gruber et al., 2024)) are central for progress.
Scalability, Efficiency, and Domain Adaptation: Lightweight modules connecting structured TS encoders to LLMs, parameter-efficient fine-tuning, and domain-specific module generation have been shown to be effective (e.g., ITFormer (Wang et al., 25 Jun 2025), TS-Reasoner (Ye et al., 2024)).
Evaluation Beyond Memorization: Synthetic, contamination-free settings (e.g., UnSeenTimeQA (Uddin et al., 2024)) and robust pseudo-labeling techniques (e.g., Fujimura et al., 30 Sep 2025) allow for stringent evaluation of true reasoning versus retrieval or memorization.

7. Practical Applications and Impacts

TSQA methods are foundational in:

Policy and civil unrest forecasting from news streams (Jin et al., 2020)
Fact-checking temporal claims from structured/unstructured sources (Jia et al., 2024, Kim et al., 4 Aug 2025)
Healthcare and patient monitoring, finance, and IoT analysis via multimodal TSQA (Quinlan et al., 13 Mar 2025, Kong et al., 26 Feb 2025, Wang et al., 25 Jun 2025)
Automated scenario planning and resource allocation in logistics; industrial monitoring (e.g., aeronautical engines, manufacturing processes) (Wang et al., 25 Jun 2025)
Personalized assistants and decision support that combine narrative context and time series prediction (Ye et al., 2024).

The field’s continued innovation in scalable benchmarks, robust reasoning modules, cross-modal architectures, and faithfulness evaluation is steadily bridging the gap between machine and human capabilities in temporal reasoning and time-sensitive decision making.