MTBench101: Multimodal Temporal Benchmark
- MTBench101 is a comprehensive multimodal benchmark that integrates high-frequency time series with aligned narratives for joint temporal reasoning.
- It challenges models with tasks including forecasting, semantic trend analysis, technical indicator prediction, and news-driven question answering.
- The benchmark rigorously assesses models’ ability to disentangle causal signals from noise using real-world financial and weather data.
MTBench101 (Multimodal Time-series Benchmark 101) is a large-scale, multi-task testbed specifically designed to evaluate and benchmark the capabilities of LLMs in joint temporal reasoning and narrative understanding, focusing on real-world financial and weather domains. Unlike prior unimodal benchmarks or those limited to isolated tasks, MTBench101 probes LLMs for their proficiency in complex temporal forecasting, semantic interpretation of trends, technical indicator prediction, and narrative-driven question answering, using tightly aligned numerical time series and free-form text pairs. By curating both “helpful” and “misleading” narrative–series instances, the benchmark provides a rigorous and granular assessment of a model’s ability to leverage, disambiguate, and reason over multimodal evidence (Chen et al., 21 Mar 2025).
1. Motivation and Scope
Existing multimodal benchmarks generally fall short in cross-modal temporal reasoning, typically evaluating text or time series in isolation and failing to capture their complex interactions. MTBench101 directly targets this gap by integrating structured, high-frequency time series with semantically annotated narratives, challenging models not only to forecast but to explain, classify, and interrogate the evolving relationships between events and quantitative behavior across two critical domains:
- Finance: High-frequency (minute/hourly) historical stock prices paired with contemporaneous professional news articles (~20,000 pairs).
- Weather: Hourly temperature records from 50 U.S. airports paired with event-centric meteorological reports or synthesized storm narratives (~2,000 pairs).
MTBench101 encompasses four foundational tasks: time-series forecasting, semantic trend analysis, technical indicator prediction, and news-driven question answering. Narrative–series pairs are intentionally curated to stress causal reasoning, filter misleading information, and encourage context-dependent interpretation. This framework directly addresses the challenge of modeling temporal dependencies under narrative modulation and forms a unified testbed for temporal AI in a multimodal context (Chen et al., 21 Mar 2025).
2. Dataset Composition and Organization
Each dataset entry consists of:
- Numerical Time Series: $X \in \mathbb{R}^{T \times C}$, where $T$ is the sequence length and $C$ is the channel count (typically $C = 1$ for price or temperature). Finance settings include short-term (7 days sampled at 5-minute intervals, 288 points per day) and long-term (30 days at hourly resolution, 24 points per day), while weather settings cover 7- or 14-day hourly sequences.
- Aligned Narratives: Financial news articles are selected from a 200,000-article corpus (MarketWatch, SeekingAlpha, etc.) and annotated for content category, temporal effect (“forward-looking”, etc.), and sentiment (bullish, neutral, bearish) using GPT-4o. Weather narratives derive from severe event databases or are LLM-generated when ground truth is unavailable; each is anchored to a specific timeline for precise series extraction.
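To make this organization concrete, one dataset entry can be sketched as a small record type; the field names below are illustrative, not the benchmark's actual schema:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MTBenchEntry:
    """Illustrative container for one narrative-series pair (field names hypothetical)."""
    series: List[float]              # numerical time series of length T
    freq: str                        # sampling frequency, e.g. "5min" or "1h"
    narrative: Optional[str]         # aligned news article or weather report, if any
    category: Optional[str] = None   # annotated content category
    sentiment: Optional[str] = None  # "bullish" / "neutral" / "bearish"

entry = MTBenchEntry(
    series=[101.2, 101.5, 100.9],
    freq="5min",
    narrative="Company X secures $100M order.",
    sentiment="bullish",
)
```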
Data statistics:
| Domain | Pairs | Series Lengths | Narrative Source / Annotation |
|---|---|---|---|
| Finance | 20,000 | 7d@5m, 30d@1h | MarketWatch, SeekingAlpha / Sentiment, category, time |
| Weather | 2,000 | 7d@1h, 14d@1h | NOAA, NWS, synthesized / Event type, locality |
A standard 80/10/10 train/validation/test split is recommended, facilitating reproducible cross-model and cross-task benchmarking (Chen et al., 21 Mar 2025).
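A reproducible 80/10/10 split along those lines might look like the following sketch (the seed and helper name are illustrative):

```python
import random

def split_indices(n, seed=0, ratios=(0.8, 0.1, 0.1)):
    """Shuffle entry indices and cut them into train/val/test at the given ratios."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# Finance domain: 20,000 pairs -> 16,000 / 2,000 / 2,000
train, val, test = split_indices(20_000)
```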
3. Formal Task Definitions and Metrics
MTBench101 covers four integrated tasks:
a) Time-Series Forecasting:
Given a historical sequence $X_{1:T}$ and an optional narrative $N$, predict the future series $\hat{X}_{T+1:T+H}$. Example: 7 days of five-minute price data plus news, with the target being the next day's full price trajectory. Metrics: MAE and MSE over the forecast horizon,
$$\mathrm{MAE} = \frac{1}{H}\sum_{h=1}^{H} |\hat{x}_{T+h} - x_{T+h}|, \qquad \mathrm{MSE} = \frac{1}{H}\sum_{h=1}^{H} (\hat{x}_{T+h} - x_{T+h})^2.$$
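For reference, the horizon-averaged forecasting errors can be computed with a minimal sketch:

```python
def mae(pred, true):
    """Mean absolute error over the forecast horizon."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def mse(pred, true):
    """Mean squared error over the forecast horizon."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)
```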
b) Semantic Trend Analysis:
Determine discrete trend categories from the series' relative change $\Delta = (x_T - x_1)/x_1$, with and without access to the narrative $N$:
- Labels: 3-way ({Negative, Neutral, Positive}) or 5-way ({Strong Bearish, Bearish, Neutral, Bullish, Strong Bullish})
- Metrics: Accuracy; macro-averaged F1 for imbalanced bins, with per-class $F_1 = \frac{2PR}{P+R}$ for precision $P$ and recall $R$
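Macro-averaged F1, which weights rare trend bins equally, can be sketched as:

```python
def accuracy(pred, true):
    """Fraction of exactly matching labels."""
    return sum(p == t for p, t in zip(pred, true)) / len(true)

def macro_f1(pred, true, labels):
    """Per-class F1 = 2PR/(P+R), averaged uniformly so rare bins count equally."""
    scores = []
    for c in labels:
        tp = sum(p == c and t == c for p, t in zip(pred, true))
        fp = sum(p == c and t != c for p, t in zip(pred, true))
        fn = sum(p != c and t == c for p, t in zip(pred, true))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)
```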
c) Technical Indicator Prediction:
Finance:
- MACD: $\mathrm{MACD}_t = \mathrm{EMA}_{12}(p)_t - \mathrm{EMA}_{26}(p)_t$
- Bollinger Bands: middle band $= \mathrm{SMA}_{20}$, upper/lower bands $= \mathrm{SMA}_{20} \pm 2\sigma_{20}$

Weather:
- Temperature extremes $T_\mathrm{max}$, $T_\mathrm{min}$, and range $\Delta T = T_\mathrm{max} - T_\mathrm{min}$
- Metrics: MAE, MSE
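Under the conventional parameterizations (12/26-period EMAs for MACD, a 20-period SMA with $\pm 2$ standard deviations for Bollinger bands), the finance indicators can be computed as:

```python
def ema(prices, span):
    """Exponential moving average with smoothing factor alpha = 2 / (span + 1)."""
    alpha = 2 / (span + 1)
    out = [prices[0]]
    for p in prices[1:]:
        out.append(alpha * p + (1 - alpha) * out[-1])
    return out

def macd(prices, fast=12, slow=26):
    """MACD line: fast EMA minus slow EMA of the closing prices."""
    return [f - s for f, s in zip(ema(prices, fast), ema(prices, slow))]

def bollinger(prices, window=20, k=2.0):
    """(lower, middle, upper) bands: SMA over the window, +/- k rolling std."""
    bands = []
    for i in range(window - 1, len(prices)):
        win = prices[i - window + 1 : i + 1]
        sma = sum(win) / window
        std = (sum((p - sma) ** 2 for p in win) / window) ** 0.5
        bands.append((sma - k * std, sma, sma + k * std))
    return bands
```

For a constant price series the MACD line is identically zero and all three Bollinger bands coincide, which is a handy sanity check.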
d) News-Driven Question Answering (QA):
- Correlation classification: Map sentiment-to-price correlation (5-way or 3-way)
- Multiple-choice QA: Select the correct causal or trend-based statement from the given options
- Metric: Accuracy
4. Representative Examples
- Finance/Forecasting:
Input: 7 days × 288 five-minute price series; “Company X secures $100M order.” Output: Next day’s 288 prices. Rationale: The news item provides a direct narrative signal for an overnight bullish movement.
- Trend Analysis (30d, 5-way):
Input: 30 days × 24 hourly prices, no text. Output: “Growth-Oriented (2%–4%)”, with computed $\Delta \approx 3.1\%$.
- Weather/Indicator Prediction:
Input: hourly temperature series. Output: $T_\mathrm{max} = 30.2\,^{\circ}$C, $T_\mathrm{min} = 18.5\,^{\circ}$C, $\Delta T = 11.7\,^{\circ}$C.
- News-driven QA:
Question: Which statement is correct? A: “Positive analysis → strong immediate gain” B: “Analysis sentiment and 7-day decline are positively correlated.” Output: B, justified by alignment of price series and news tone.
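The 5-way bucketing behind the trend-analysis example above can be sketched as a threshold rule on the relative change; the $\pm 2\%$/$\pm 4\%$ cut-points mirror the “2%–4%” bin in the example but are otherwise assumptions:

```python
def five_way_trend(series):
    """Map relative change delta = (last - first) / first to a 5-way trend label.
    The +/-2% and +/-4% cut-points are illustrative, not the benchmark's exact bins."""
    delta = (series[-1] - series[0]) / series[0]
    if delta <= -0.04:
        return "Strong Bearish"
    if delta <= -0.02:
        return "Bearish"
    if delta < 0.02:
        return "Neutral"
    if delta < 0.04:
        return "Bullish"
    return "Strong Bullish"
```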
5. Empirical Findings and Baseline Results
Models such as GPT-4o, Claude 3.5, Gemini 2.0, LLaMA 3.1-8B, DeepSeek-Chat, and OpenAI-o1 are evaluated on all tasks. Key observations include:
- Forecasting: Text narratives provide substantial gains, reducing MAE by 9.78% (finance) and 6.63% (weather).
- Trend Analysis: Access to narratives increases accuracy in 25/28 settings; rare cases exhibit modality interference (suggesting over-reliance on irrelevant narrative cues).
- Technical Indicator Prediction: Bollinger band forecasting benefits more from text than MACD, highlighting the impact of volatility-related news.
- QA Tasks: Correlation classification accuracy peaks at 60% for 30-day horizons, 50% for 7-day; MCQA accuracy ranges from 40–77%. Models display a confusion tendency toward moderate correlations, underscoring ongoing challenges in robust multimodal reasoning.
Principal difficulties include the consistent capture of long-term dependencies, distinguishing true causal from coincidental narrative–series associations, and adherence to strict output format in multimodal response settings (Chen et al., 21 Mar 2025).
6. Key Challenges and Interpretative Insights
Findings from MTBench101 highlight three persistent challenges:
- Capturing long-term and compounding temporal dependencies: Many models falter at integrating narrative events that manifest over extended horizons, occasionally resorting to superficial or default trend predictions.
- Disentangling causality from mere correlation: The careful construction of helpful and misleading narrative–series pairs reveals that alignment between narrative sentiment and price trends can be spurious, and LLMs routinely conflate the two.
- Consistent and robust multimodal fusion: Errors relating to output formatting, filtering irrelevant narrative noise, and over-interpolation of text cues demonstrate that current architectures do not yet fully exploit cross-modal synergies.
A plausible implication is that purpose-engineered multimodal transformers and novel cross-attention mechanisms may be necessary for substantial improvement (Chen et al., 21 Mar 2025).
7. Significance and Future Directions
MTBench101 establishes a unified, extensible platform for systematic evaluation of joint temporal and narrative reasoning—a critical frontier in AI for finance, meteorology, and beyond. Its curated, real-world-aligned dataset and multi-granular tasks directly foster advances in:
- Model architectures for temporal–textual fusion
- Causality-aware reasoning modules
- Benchmark-driven diagnosis of multimodal learning failure cases
Future updates will expand to additional modalities (e.g., audio, video), more granular event annotation, and harder distractor generation to further stress-test emerging models as multimodal and causal reasoning capabilities advance (Chen et al., 21 Mar 2025).