CSM-MTBench: Cross-Modal Temporal Benchmark
- CSM-MTBench is a comprehensive benchmark designed to evaluate LLMs' ability to integrate structured time series with natural-language narratives.
- The benchmark features rigorously aligned datasets in finance and weather, supporting tasks like forecasting, trend analysis, and cross-modal question answering.
- Evaluation shows that adding narrative context boosts short-term forecasting, yet challenges remain in long-horizon causal inference and modality fusion.
CSM-MTBench is a large-scale benchmark specifically designed to evaluate models on cross-modal temporal reasoning tasks involving both structured time-series data and unstructured textual narratives. It targets settings, such as finance and weather, where integrating quantitative temporal records with natural-language context is required for real-world decision-making. CSM-MTBench provides rigorously aligned datasets, multi-format evaluation tasks, and a unified framework for assessing deep temporal, causal, and semantic understanding in LLMs, marking a significant advance over modality-isolated benchmarks (Chen et al., 21 Mar 2025).
1. Motivation and Design Objectives
CSM-MTBench aims to fill critical gaps in multimodal evaluation for models that must reason over temporally indexed data and natural language. Existing benchmarks often treat text as auxiliary metadata or restrict themselves to unimodal prediction, failing to test whether models can jointly parse, align, and reason across modalities. CSM-MTBench addresses these limitations by supplying paired time-series and text samples, task protocols that stress cross-modal inference, and a challenging set of domains where causal relationships and reasoning about trend evolutions are essential. The design explicitly targets key difficulties such as causal inference, long-term dependency modeling, cross-modal alignment, and precise output format adherence (Chen et al., 21 Mar 2025).
2. Data Domains and Alignment Strategy
CSM-MTBench covers two principal domains: finance and weather. In finance, the dataset comprises over 200,000 news URLs scraped between May 2021 and September 2023 from financial media outlets. After extensive filtering, 20,000 news articles are retained, each annotated for content type, temporal effect range (backward, present, forward-looking), and sentiment polarity. Each article is paired with a time series of historical open prices for the relevant stock ticker, with two temporal alignments: short-term (7 days at 5-minute intervals as input, forecasting the next day) and long-term (30 days at 1-hour intervals, forecasting the next 7 days). Weather data consist of storm event narratives—grouped and LLM-generated as needed—matched to hourly temperature series near 50 U.S. airports, with strict preprocessing for time alignment and missing-data interpolation. For each domain, splits are constructed to control for sentiment–time-series (TS) consistency and to support both regression and classification tasks (Chen et al., 21 Mar 2025).
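The short-term alignment described above can be sketched in a few lines. This is a minimal illustration, assuming a simple timestamp-keyed price map; the function name and data layout are hypothetical, not taken from the benchmark's released code.

```python
from datetime import datetime, timedelta

def align_short_term(article_time: datetime, prices: dict):
    """Pair a news article with a 7-day input window of 5-minute open
    prices and a next-day forecast target (illustrative sketch).

    prices: maps datetime -> open price at 5-minute resolution.
    Returns (inputs, targets) as sorted lists of (time, price) pairs.
    """
    window_start = article_time - timedelta(days=7)
    target_end = article_time + timedelta(days=1)
    # Input window: the 7 days up to and including the article time.
    inputs = {t: p for t, p in prices.items()
              if window_start <= t <= article_time}
    # Forecast target: the day following the article.
    targets = {t: p for t, p in prices.items()
               if article_time < t <= target_end}
    return sorted(inputs.items()), sorted(targets.items())
```

The long-term alignment follows the same pattern with a 30-day hourly window and a 7-day target horizon.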
3. Benchmark Task Definitions
CSM-MTBench introduces a diverse array of tasks, all supported in TS-only and TS+Text input configurations:
- Time-Series Forecasting: Given a sequence of timestamped numeric values (with or without aligned text), models regress the future values. Performance is measured using mean squared error, MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²; mean absolute error, MAE = (1/n) Σᵢ |yᵢ − ŷᵢ|; and, for finance, mean absolute percentage error, MAPE = (100%/n) Σᵢ |yᵢ − ŷᵢ| / |yᵢ|.
- Semantic Trend Analysis: Models classify the overall direction or magnitude of change for a time series, using pre-defined bins based on percent change (finance) or daily mean slope difference (weather). Label sets are 3-way or 5-way with domain-specific thresholds. Accuracy is used for evaluation.
- Technical Indicator Prediction: In finance, this includes regression on indicators such as MACD (the difference between the 12- and 26-period exponential moving averages, EMA_12 − EMA_26) and the upper Bollinger Band (the 20-period simple moving average plus two standard deviations, SMA_20 + 2σ_20); in weather, on values like next-day temperature extrema and differential. Metrics: MSE and MAE.
- News-Driven Question Answering:
  - Correlation Prediction: Models receive a 30-day TS plus a news article and predict the sentiment-aligned direction or strength of correlation in both 3-way (positive/neutral/negative) and 5-way (strong+/mod+/none/mod–/strong–) categories. Accuracy is reported.
  - Multiple-Choice QA: Given TS, a narrative, and four statements, the task is to select the correct answer, probing deep cross-modal comprehension.
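The regression metrics above follow their standard definitions; a minimal sketch (assuming the paper's normalization matches the textbook forms):

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.mean((y - y_hat) ** 2))

def mae(y, y_hat):
    """Mean absolute error."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.mean(np.abs(y - y_hat)))

def mape(y, y_hat):
    """Mean absolute percentage error; assumes no zero targets,
    which holds for open prices."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(100.0 * np.mean(np.abs(y - y_hat) / np.abs(y)))
```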
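The technical indicators targeted by the prediction task can likewise be computed directly from the price series. The sketch below uses the conventional parameterizations (12/26-period EMAs for MACD; 20-period window and k = 2 for the Bollinger band); the benchmark's exact periods are an assumption here.

```python
import numpy as np

def ema(prices, span):
    """Exponential moving average with smoothing alpha = 2/(span+1)."""
    alpha = 2.0 / (span + 1.0)
    out = [float(prices[0])]
    for p in prices[1:]:
        out.append(alpha * float(p) + (1.0 - alpha) * out[-1])
    return np.array(out)

def macd(prices, fast=12, slow=26):
    """MACD line: fast EMA minus slow EMA."""
    return ema(prices, fast) - ema(prices, slow)

def upper_bollinger(prices, window=20, k=2.0):
    """Upper Bollinger Band over the trailing window: SMA + k*sigma."""
    prices = np.asarray(prices, float)
    tail = prices[-window:]
    return float(tail.mean() + k * tail.std())
```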
4. Evaluation Protocols and Dataset Statistics
The dataset is structured to enable model comparison under controlled, large-scale settings. Financial data include 20,000 articles with two aligned time-series samples per article, totaling 40,000 TS–text pairs. The weather dataset contains 2,000 TS–text pairs, derived from 50 stations with 40 aligned narrative–series episodes each. Standard splits of 70% train, 15% validation, and 15% test are used per domain and per task; no nested cross-validation is performed, and all experiments report on held-out test splits. Preprocessing includes stringent timezone adjustment, discarding of series with more than 70% of values missing, hourly averaging, and linear interpolation. Financial articles are annotated via GPT-4o, and weather narratives are LLM-generated when necessary (Chen et al., 21 Mar 2025).
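The 70/15/15 split protocol can be sketched as follows. The exact shuffling and any stratification used by the benchmark are not specified here, so a plain seeded shuffle is assumed for illustration.

```python
import random

def split_70_15_15(samples, seed=0):
    """Shuffle samples deterministically and partition 70/15/15
    into train/validation/test (illustrative sketch)."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    n_train = int(0.70 * len(samples))
    n_val = int(0.15 * len(samples))
    train = [samples[i] for i in idx[:n_train]]
    val = [samples[i] for i in idx[n_train:n_train + n_val]]
    test = [samples[i] for i in idx[n_train + n_val:]]
    return train, val, test
```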
The following table summarizes key dataset characteristics:
| Domain | # Text–TS Pairs | Label/Target Types |
|---|---|---|
| Finance | 40,000 | Regression, Classification |
| Weather | 2,000 | Regression, Classification |
5. Baseline Models and Results
Six baseline LLMs are evaluated: GPT-4o, Claude-Sonnet-3.5, Gemini-Flash, LLaMA 3 (8B), DeepSeek-Chat, and OpenAI-o1 (finance only). All models use zero-shot prompting with hyperparameters tuned per task. Representative results include:
- Forecasting (Finance, 7d→1d, MAE): OpenAI-o1 achieves the lowest MAE (0.982 for TS+text), followed by Claude (1.422). Adding text reduces MAE by approximately 9.8% in finance and 6.6% in weather, confirming a multimodal boost for LLMs that leverage narrative context.
- Semantic Trend Accuracy (Finance, 5-way classification): OpenAI-o1 reaches 54.4% accuracy (7-day TS+text), outperforming others.
- News-Driven QA (multiple-choice, finance, 7d): DeepSeek-Chat (77.6%) and Claude (75.6%) attain the highest accuracies.
Performance consistently degrades on longer-term forecasting (e.g., 30-day inputs), indicating that long-range temporal modeling remains a challenge. The impact of adding text is nuanced; for some retrospective trend tasks, it can degrade accuracy, suggesting that cross-modal fusion is not universally effective. For correlation prediction, models often default to moderate-positive labels, revealing difficulty distinguishing true negatives or strong correlations. Precise adherence to output format (e.g., correct output sequence lengths) is also problematic (Chen et al., 21 Mar 2025).
6. Key Findings and Methodological Implications
CSM-MTBench reveals substantive gaps in current LLMs' ability to perform robust, faithful cross-modal temporal reasoning:
- Incorporating narrative context yields clear performance gains in standard TS regression, especially on short horizons.
- Substantial challenges remain for causal inference, long-horizon temporal dependencies, and semantic-trend classification.
- LLMs exhibit modality fusion limitations, especially when required to parse nuanced cross-references or temporal shifts in textual data.
- Output format inconsistencies and correlation prediction bias are persistent.
A plausible implication is that future research should pursue more explicit multimodal architectures, e.g., cross-attention schemes between text and TS embeddings, rather than relying solely on prompting. Pretraining regimes tuned for temporal reasoning, as well as probabilistic forecasting and uncertainty quantification, are recommended as further work. The expansion of CSM-MTBench to other application domains—healthcare, energy, social sciences—and increased dataset diversity are suggested next steps to further stress-test and generalize temporal cross-modal modeling (Chen et al., 21 Mar 2025).
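The cross-attention fusion suggested above amounts to letting each time-step embedding attend over the text-token embeddings via scaled dot-product attention. A minimal NumPy sketch, with dimensions and the absence of learned projections as simplifying assumptions rather than any architecture from the paper:

```python
import numpy as np

def cross_attention(ts_emb, text_emb):
    """ts_emb: (T, d) time-step embeddings acting as queries;
    text_emb: (L, d) text-token embeddings acting as keys/values.
    Returns a (T, d) text context vector per time step."""
    d = ts_emb.shape[-1]
    scores = ts_emb @ text_emb.T / np.sqrt(d)      # (T, L) similarity
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over tokens
    return weights @ text_emb                      # weighted token mix
```

In a full model the queries, keys, and values would pass through learned linear projections, and the fused context would be combined with the TS representation before the forecasting head.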
7. Significance and Future Directions
CSM-MTBench establishes a comprehensive testbed for cross-modal, temporal reasoning in real-world contexts that require joint natural-language and quantitative TS understanding. Its granular task definition, strict alignment procedures, and multi-aspect evaluation metrics provide a robust platform for benchmarking advancements in multimodal LLMs and related architectures. The structure and results of CSM-MTBench have immediate relevance for the development of LLM-based systems in industries where decision-making depends on the interplay between textual information flow and numerical data—financial forecasting, risk analytics, climate modeling, and beyond. Continued refinement of both benchmark and model architectures is anticipated to drive progress in multimodal, temporally aware AI research (Chen et al., 21 Mar 2025).