TSRBench: Time Series Reasoning Benchmark

Updated 28 January 2026
  • TSRBench is a comprehensive benchmarking suite that evaluates time series reasoning through 4,125 problems across 14 diverse domains.
  • It categorizes tasks into four dimensions—Perception, Reasoning, Prediction, and Decision-Making—using textual, visual, or combined modalities.
  • Rigorous evaluation protocols and ablation studies reveal insights into model scaling, modality complementarity, and the need for specialized quantitative training.

TSRBench is a comprehensive, multi-modal, and multi-task benchmarking suite developed to systematically evaluate the time series reasoning capabilities of generalist LLMs, vision-LLMs (VLMs), and time-series-specialized LLMs (TSLLMs). Motivated by the ubiquity of temporal data in critical real-world applications—such as energy management, finance, healthcare, and traffic—and the absence of such reasoning challenges in generalist model benchmarks, TSRBench introduces 4,125 problems spanning 14 domains. The benchmark categorizes tasks into four principal dimensions: Perception, Reasoning, Prediction, and Decision-Making. With a formalized structure and rigorous evaluation metrics, TSRBench provides researchers with a standardized framework to assess model performance across diverse modalities and reasoning levels (Yu et al., 26 Jan 2026).

1. Motivation, Scope, and Design Principles

TSRBench addresses a significant gap in generalist model evaluation: existing benchmarks often neglect temporal reasoning or reduce time series to isolated numeric tasks, failing to capture the rich semantics and causal structures essential for real-world decision-making. TSRBench’s design emphasizes:

  • Comprehensiveness: The dataset spans 4,125 problem instances from 14 distinct domains, including laboratory, industrial, geoscientific, biomedical, economic, and social time series.
  • Multi-dimensionality: The benchmark encompasses 15 tasks distributed along four core reasoning dimensions—Perception, Reasoning, Prediction, and Decision-Making.
  • Multi-modality: All tasks are available in textual, visual (plot), and combined textual-plus-visual formats, ensuring compatibility with LLMs, VLMs, and TSLLMs.
  • High-quality ground truth: The dataset comprises both real-world data (carefully aligned and anonymized) and synthetic data (programmatically generated with exact answers).

The suite is structured to probe the full spectrum of pattern recognition, semantic inference, quantitative forecasting, and applied decision-making required for time series intelligence (Yu et al., 26 Jan 2026).

2. Dataset Structure and Task Taxonomy

The TSRBench dataset consists of 4,125 question instances systematically distributed across 14 diverse application domains such as finance, meteorology, industrial control, epidemiology, and healthcare.

Task Dimensions and Types

Each task falls into one of four core reasoning dimensions:

| Dimension | Tasks (Count) | Example Capabilities |
|---|---|---|
| Perception | PA, NU, AD, SA (4) | Trend detection, noise estimation, anomaly localization, similarity comparison |
| Reasoning | ER, CD, AR, TR, NR, DR, IR (7) | Causal inference, event sequencing, numeric reasoning, inductive/deductive logic |
| Prediction | TSF, EP (2) | Numeric trajectory forecasting, discrete event prediction |
| Decision-Making | QualDM, QuantDM (2) | Treatment selection, policy backtesting |
Perception (4 tasks):

  • Pattern Analysis (PA): Identify dominant series components (trend, seasonality, noise).
  • Noise Understanding (NU): Estimate magnitude of noise.
  • Anomaly Detection (AD): Locate and classify anomalies.
  • Similarity Analysis (SA): Compare series for similarity in trend/distribution.

Reasoning (7 tasks):

  • Etiological Reasoning (ER): Infer latent causes.
  • Causal Discovery (CD): Select adjacency matrices depicting causal relations.
  • Abductive Reasoning (AR): Bridge observed events with plausible hypotheses.
  • Temporal Relation Reasoning (TR): Order events correctly over time.
  • Numerical/Deductive/Inductive Reasoning (NR, DR, IR): Perform computations, apply known rules, infer underlying generative mechanisms.

Prediction (2 tasks):

  • Time Series Forecasting (TSF): Multiple-choice future trajectory prediction.
  • Event Prediction (EP): Discrete event forecasting (binary/multi-class).

Decision-Making (2 tasks):

  • Qualitative Decision-Making (QualDM): Domain-guided policy selection.
  • Quantitative Decision-Making (QuantDM): Optimize metrics such as maximum drawdown under backtest simulation.

All tasks are presented in textual (T), visual (V), or combined (T+V) modes. TSLLMs operate on fixed-length embeddings learned from the series.
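To make the taxonomy concrete, a multiple-choice TSF instance in the textual (T) modality can be sketched as follows. This is a hypothetical illustration: the field names and option format are assumptions, not TSRBench's actual schema.

```python
# Hypothetical sketch of a textual-modality TSF instance; field names
# and the option format are illustrative, not the benchmark's schema.
example_instance = {
    "task": "TSF",            # Time Series Forecasting (Prediction dimension)
    "domain": "finance",
    "modality": "T",          # textual: the series is given as numeric tokens
    "series": [101.2, 101.8, 102.5, 103.1, 102.9, 103.6],
    "question": "Which option best continues the series for the next 3 steps?",
    "options": {
        "A": [104.1, 104.7, 105.2],   # continued upward trend
        "B": [99.8, 98.5, 97.1],      # reversal
        "C": [103.0, 103.1, 102.8],   # plateau
        "D": [110.4, 115.9, 121.3],   # implausible jump
    },
    "answer": "A",
}

def is_correct(instance, predicted_option):
    """Score a single multiple-choice response against the ground truth."""
    return predicted_option == instance["answer"]
```

Multiple-choice framing lets forecasting be scored with the same accuracy metric as classification tasks, alongside the numeric error metrics described next.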

3. Evaluation Protocols and Metrics

Evaluation in TSRBench is structured to rigorously quantify model performance across heterogeneous tasks using standardized metrics:

  • Classification/Multiple-Choice Tasks: Accuracy, defined as
    $\mathrm{Accuracy} = \frac{\#\text{correct answers}}{\#\text{total questions}}$
  • Forecasting Tasks:
    • Mean Squared Error (MSE): $\mathrm{MSE} = \frac{1}{T}\sum_{t=1}^{T}(y_t - \hat{y}_t)^2$
    • Root Mean Squared Error (RMSE): $\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$
    • Mean Absolute Error (MAE): $\mathrm{MAE} = \frac{1}{T}\sum_{t=1}^{T}|y_t - \hat{y}_t|$
  • Quantitative Decision-Making:
    • Maximum Drawdown (MDD): $\mathrm{MDD} = \max_{t \in [0, T]} \frac{\mathrm{peak}_t - \mathrm{trough}_t}{\mathrm{peak}_t}$
  • Correlational Analysis:
    • Spearman's rank correlation: $\rho = 1 - \frac{6 \sum_i d_i^2}{n (n^2 - 1)}$, where $d_i$ is the rank difference for item $i$ (e.g., between model size and performance).
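The metrics above can be sketched in a few lines of Python. This is a minimal reference implementation for orientation, not the benchmark's official scoring code.

```python
import math

def accuracy(correct: int, total: int) -> float:
    """Accuracy = #correct answers / #total questions."""
    return correct / total

def mse(y, y_hat):
    """Mean squared error over a horizon of T steps."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    """Root mean squared error: sqrt(MSE)."""
    return math.sqrt(mse(y, y_hat))

def mae(y, y_hat):
    """Mean absolute error over a horizon of T steps."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def max_drawdown(equity):
    """MDD = max over t of (peak_t - trough_t) / peak_t,
    where peak_t is the running maximum of the equity curve up to t."""
    peak, mdd = equity[0], 0.0
    for v in equity:
        peak = max(peak, v)
        mdd = max(mdd, (peak - v) / peak)
    return mdd

def spearman_rho(d):
    """rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)),
    given the rank differences d_i for n paired observations."""
    n = len(d)
    return 1 - 6 * sum(x * x for x in d) / (n * (n * n - 1))
```

Note that `spearman_rho` takes precomputed rank differences; in practice one would rank the two variables first (e.g., via `scipy.stats.spearmanr`), which also handles ties.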

All models are evaluated in zero-shot mode with chain-of-thought prompting enabled (“enable reasoning”). Task-specific fine-tuning is deliberately excluded to ensure standardized assessment (Yu et al., 26 Jan 2026).

4. Experimental Evaluation: Models, Modalities, and Ablations

TSRBench evaluates over 30 models—including proprietary, open-source LLMs/VLMs, and TSLLMs—under a uniform experimental protocol.

Model Classes and Example Architectures

| Model Type | Examples | Input Modality |
|---|---|---|
| Proprietary LLMs | GPT-5, Claude-4.5-Haiku, DeepSeek-V3.2 | T |
| Open-Source LLMs | Qwen2.5/Qwen3, Gemma3, InternLM3, GPT-OSS | T |
| Open-Source VLMs | Qwen2.5-VL, Llama-4-Scout, InternVL3.5 | V |
| Multimodal Models | Qwen3-VL, MiMo-VL, MiniCPM-V | T+V |
| TSLLMs | ChatTS-14B, TS-Reasoner-7B | Embeddings |

  • LLMs receive series encoded as textual tokens.
  • VLMs process standardized plot images (100 PPI).
  • Multimodal (T+V) approaches combine both prompt types.
  • TSLLMs use projected embeddings.
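The textual (T) encoding path can be sketched as below. The serialization format and prompt template are assumptions for illustration; TSRBench's actual prompts may differ.

```python
# Hypothetical sketch of the textual (T) modality: a numeric series is
# serialized into tokens and wrapped in a multiple-choice prompt.
# The template is illustrative, not TSRBench's actual prompt.
def serialize_series(values, decimals=2):
    """Render a numeric series as comma-separated fixed-precision tokens."""
    return ", ".join(f"{v:.{decimals}f}" for v in values)

def build_text_prompt(series, question, options):
    """Assemble a zero-shot multiple-choice prompt for an LLM."""
    option_lines = "\n".join(f"{k}) {v}" for k, v in sorted(options.items()))
    return (
        "You are given a time series:\n"
        f"{serialize_series(series)}\n\n"
        f"{question}\n"
        f"{option_lines}\n"
        "Answer with a single letter."
    )
```

The visual (V) path would instead render the same series as a standardized plot image, and the T+V path would supply both representations in one prompt.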

Ablation Studies:

  • Tool augmentation (e.g., providing statistical or trend features) yields marginal gains (+0.5 to +1.2 pts), indicating that feature injection alone is insufficient for significant improvements.
  • Visual resolution sweeps and inference-time reasoning ablations uncover dependencies between input fidelity and reasoning performance.
  • Non-reasoning inference modes sharply degrade reasoning, prediction, and decision-making, but leave perception robust, highlighting the importance of chain-of-thought steps for complex queries (Yu et al., 26 Jan 2026).
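The tool-augmentation ablation injects summary features into the prompt. A minimal sketch of such a feature extractor follows; the specific feature set is an assumption for illustration, not the paper's exact tool output.

```python
import statistics

def summary_features(series):
    """Compute simple statistical/trend features that could be prepended
    to a prompt. The feature choice here is illustrative, not the
    benchmark's exact tool set."""
    n = len(series)
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    # Least-squares slope against t = 0..n-1 as a crude trend estimate.
    t_mean = (n - 1) / 2
    denom = sum((t - t_mean) ** 2 for t in range(n))
    slope = sum((t - t_mean) * (v - mean) for t, v in enumerate(series)) / denom
    return {
        "mean": mean,
        "std": std,
        "trend_slope": slope,
        "min": min(series),
        "max": max(series),
    }
```

That such features yield only marginal gains suggests the bottleneck lies in the models' quantitative reasoning rather than in access to descriptive statistics.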

5. Principal Findings and Analytical Insights

TSRBench yields several robust empirical findings:

  • Scaling laws: Model accuracy on Perception, Reasoning, and Decision-Making tasks scales logarithmically with model size (overall Spearman's ρ for LLMs: 0.9248). The scaling law breaks for Prediction, with weak or negative correlations for both LLMs (ρ = −0.2415) and VLMs (ρ = −0.2612), indicating a divergence between model capacity and forecasting ability.
  • Semantic/numeric decoupling: Strong semantic reasoning does not guarantee accurate numerical prediction. Prediction tasks exhibit weak or negative inter-dimensional correlations with Perception and Reasoning, suggesting that semantic understanding and numeric extrapolation are distinct competencies.
  • Modality complementarity: Text and visual modalities achieve similar overall accuracy but excel on disjoint task subsets (low intersection, high union). However, combined modalities (T+V) do not provide reciprocal performance gains, as correct answers are largely overlapped with the better single-modality baseline, exposing the weak fusion ability of current multimodal models.
  • Difficulty regimes: Abductive Reasoning and Event Prediction are high-variance, moderate-accuracy tasks, suggesting suitability for knowledge distillation. Time Series Forecasting and Quantitative Decision-Making have uniformly low accuracy and variance, implicating the necessity for enriched quantitative pre-training.
  • Tool augmentation: Marginal benefits indicate the potential value, but limited sufficiency, of augmenting inputs with engineered features.
  • Computation allocation: Deliberative reasoning remains critical for complex tasks; rapid inference significantly harms outcomes except for perception-oriented queries.

6. Future Directions and Benchmark Extensions

TSRBench highlights multiple critical open research directions:

  • Multi-modal fusion: Advancements in alignment and cross-attention for integrating high-resolution visual trends with semantic context may unlock true reciprocal performance gains.
  • Time series foundation models: Large-scale pre-training on heterogeneous real and synthetic time series is needed, particularly to address the observed numeric forecasting deficits.
  • Modular/multi-agent reasoning frameworks: Delegating subtasks—e.g., perception, logical/deductive reasoning, domain retrieval—to specialized sub-agents may enhance capability compositionality.
  • Adaptive reasoning strategies: Structured chain-of-thought planning, self-verification, and inference-time dynamic compute allocation warrant investigation for reasoning-intensive applications.
  • Benchmark evolution: Future expansions may incorporate continuous forecasting regimes, sequence-to-sequence prediction, real-time/streaming analyses, and adversarial or stress-test scenarios to further stress generalist model limits.

7. Significance and Impact

TSRBench provides the first unified, rigorous evaluation platform for generalist models' time series reasoning, filling a major deficit in current benchmarking ecosystems. Its task diversity, multi-domain reach, multi-modal support, and analytic depth enable nuanced comparisons of model architectures and capabilities across perception, reasoning, forecasting, and decision-making. A plausible implication is that the insights enabled by TSRBench can direct advancements in both architectural innovation and training paradigm design for next-generation generalist models and domain-specialized AI systems (Yu et al., 26 Jan 2026).
