LLM-as-a-Judge for Time Series Explanations

Published 2 Apr 2026 in cs.AI and cs.CL | (2604.02118v1)

Abstract: Evaluating the factual correctness of LLM-generated natural language explanations grounded in time series data remains an open challenge. Although modern models generate textual interpretations of numerical signals, existing evaluation methods are limited: reference-based similarity metrics and consistency-checking models require ground-truth explanations, while traditional time series methods operate purely on numerical values and cannot assess free-form textual reasoning. Thus, no general-purpose method exists to directly verify whether an explanation is faithful to the underlying time series data without predefined references or task-specific rules. We study LLMs as both generators and evaluators of time series explanations in a reference-free setting, where, given a time series, a question, and a candidate explanation, the evaluator assigns a ternary correctness label based on pattern identification, numeric accuracy, and answer faithfulness, enabling principled scoring and comparison. To support this, we construct a synthetic benchmark of 350 time series cases across seven query types, each paired with correct, partially correct, and incorrect explanations. We evaluate models across four tasks: explanation generation, relative ranking, independent scoring, and multi-anomaly detection. Results show a clear asymmetry: generation is highly pattern-dependent and exhibits systematic failures on certain query types, with accuracies ranging from 0.00 to 0.12 for Seasonal Drop and Volatility Shift up to 0.94 to 0.96 for Structural Break, while evaluation is more stable, with models correctly ranking and scoring explanations even when their own outputs are incorrect. These findings demonstrate the feasibility of data-grounded, LLM-based evaluation for time series explanations and highlight the potential of LLMs as reliable evaluators of data-grounded reasoning in the time series domain.


Authors (4)

Summary

The paper proposes a rubric-based LLM-as-a-Judge framework that assigns a ternary factuality label to time series explanations by verifying numeric evidence.
It details multiple evaluation tasks—including generation, ranking, scoring, and multi-anomaly detection—to benchmark model performance on structured queries.
The study finds that LLMs are more reliable as evaluators than as generators, underscoring the benefits of structured, data-grounded prompts for factual verification.

LLM-as-a-Judge for Reference-Free Evaluation of Time Series Explanations

Motivation and Problem Formulation

Factual evaluation of natural language explanations for time series data is a critical but underexplored problem. LLMs are increasingly deployed to interpret structured signals in fields such as forecasting and anomaly detection, yet assessing whether generated explanations are grounded in numerical evidence remains a core challenge. Traditional metrics—BLEU, ROUGE, embedding-based similarity—and natural language inference approaches rely on reference answers, neglecting direct verification against raw data. Moreover, numerical methods in time series analysis cannot assess arbitrary textual explanations. Consequently, there is a need for reference-free, data-grounded methods that directly evaluate whether an explanation faithfully reflects underlying numerical patterns.

This work proposes and systematizes the LLM-as-a-Judge (LLM-J) paradigm for the time series explanation setting. Given a time series, a natural language query, and an explanation, models must assign a ternary factuality label (incorrect, partially correct, correct) by reasoning over pattern identification, numeric consistency, and answer faithfulness. The evaluation framework is operationalized via synthetic benchmarks and tasks that expose both generative and evaluative behaviors of LLMs.
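The problem formulation above can be made concrete with a small data model. This is a minimal sketch, not the paper's code: the names `Label` and `JudgeInput` and the integer ordering of the labels are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Label(IntEnum):
    """Ternary factuality label; the integer order is an illustrative assumption."""
    INCORRECT = 0
    PARTIALLY_CORRECT = 1
    CORRECT = 2


@dataclass(frozen=True)
class JudgeInput:
    """One evaluation instance: everything the judge sees at inference time."""
    series: List[float]   # raw time series values
    question: str         # natural language query about the series
    explanation: str      # candidate explanation to be judged


# Hypothetical instance; the values are made up for illustration only.
example = JudgeInput(
    series=[10.0, 10.2, 9.9, 15.1, 15.0],
    question="Is there a shift in the mean level?",
    explanation="The level rises from about 10 to about 15 at index 3.",
)
```

Using an ordered enum means rankings can be compared directly, for example `Label.CORRECT > Label.PARTIALLY_CORRECT`.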

Evaluation Framework and Methodology

The proposed framework incorporates structured, rubric-guided prompting enabling direct consistency checks between explanations and time series data. The absence of reference explanations at evaluation time enforces a fully reference-free regime, pushing beyond text-to-text overlap metrics.
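One way such a rubric-guided, reference-free prompt might be assembled is sketched below. The rubric wording, the prompt layout, and the helper names `build_judge_prompt` and `parse_label` are all assumptions; the source does not reproduce the paper's actual prompt.

```python
import re
from typing import List, Optional

# Hypothetical rubric wording built from the three criteria named in the text.
RUBRIC = (
    "1. Pattern identification: does the explanation name the right pattern?\n"
    "2. Numeric accuracy: do the cited values and indices match the data?\n"
    "3. Answer faithfulness: does the explanation answer the question asked?"
)

# "partially correct" and "incorrect" are listed before "correct" so the
# shorter label never shadows the longer ones.
_LABEL_RE = re.compile(
    r"label:\s*(partially correct|incorrect|correct)", re.IGNORECASE
)


def build_judge_prompt(series: List[float], question: str, explanation: str) -> str:
    """Assemble a judge prompt containing only data, question, and candidate."""
    values = ", ".join(f"{v:g}" for v in series)
    return (
        "You are checking an explanation against raw time series data.\n"
        f"Time series (index 0 first): {values}\n"
        f"Question: {question}\n"
        f"Candidate explanation: {explanation}\n"
        "Rubric:\n"
        f"{RUBRIC}\n"
        "Finish with one line of the form 'Label: correct', "
        "'Label: partially correct', or 'Label: incorrect'."
    )


def parse_label(response: str) -> Optional[str]:
    """Return the last ternary label stated in the response, or None."""
    matches = _LABEL_RE.findall(response)
    return matches[-1].lower() if matches else None
```

No reference explanation appears anywhere in the prompt, which is what makes the setting reference-free.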

Model assessment spans four tasks:

Explanation Generation: The model generates explanations given the time series and query; correctness is assessed via the rubric.
Relative Ranking: The model ranks a set of plausible explanations for a given instance, each labeled correct, partially correct, or incorrect in the ground truth.
Independent Scoring: Single explanations are evaluated in isolation for factual consistency.
Multi-Anomaly Detection: The model identifies all anomalies in a series, reporting indices and percentage changes, without prespecified detection thresholds.

Ground truth is used only for experimental reporting; at inference time, the model receives only the time series, the question, and the candidate explanations.

To ensure comprehensive evaluation, a synthetic benchmark (TSQueryBench) comprising 350 time series spanning seven query types (including break, extremum, volatility, multi-metric, shift patterns) is introduced. Each instance is coupled with curated explanations at all three correctness levels, facilitating controlled analysis across tasks.
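To illustrate what a synthetic case with known ground truth can look like, here is a minimal sketch of a mean-shift generator. It is not the benchmark's actual construction procedure, which the source does not describe; the function name, parameters, and noise model are assumptions.

```python
import random
from typing import List, Tuple


def make_mean_shift_series(
    length: int = 200,
    shift_at: int = 120,
    base: float = 50.0,
    shift: float = 8.0,
    noise: float = 1.0,
    seed: int = 0,
) -> Tuple[List[float], int]:
    """Return a noisy series whose mean jumps by `shift` at index `shift_at`.

    The second return value is the ground-truth change index, which a
    benchmark could use to label explanations as correct or incorrect.
    """
    rng = random.Random(seed)  # seeded so the case is reproducible
    series = [
        (base if t < shift_at else base + shift) + rng.gauss(0.0, noise)
        for t in range(length)
    ]
    return series, shift_at


series, change_index = make_mean_shift_series()
```

Because the change index is known by construction, explanations that cite a different location or magnitude can be labeled without human annotation.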

Experimental Design and Implementation

Three open-weight LLMs (Qwen-3 8B, LLaMA-3 8B, Gemma-2 9B IT) are employed, tested in both generation and evaluation roles. Prompts are standardized, and no task-specific fine-tuning is performed (zero-shot configuration). Time series of varying lengths (100-500) are included to measure scalability.

Evaluation metrics per task are as follows:

Explanation Generation: Accuracy and fine-grained error categories (numeric, logical, unsupported claims).
Relative Ranking / Independent Scoring: Accuracy versus ground-truth factuality for candidate explanations.
Multi-Anomaly Detection: Count accuracy and F1 for anomaly localization.
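The multi-anomaly metrics can be sketched as follows. This assumes exact index matching between predicted and true anomalies; the source does not say whether the paper allows a tolerance window, and the function names are illustrative.

```python
from typing import Iterable, Sequence, Tuple


def localization_prf(
    predicted: Iterable[int], truth: Iterable[int]
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 over anomaly indices (exact-match assumption)."""
    pred, gold = set(predicted), set(truth)
    tp = len(pred & gold)  # indices that are both predicted and true
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def count_accuracy(pred_counts: Sequence[int], true_counts: Sequence[int]) -> float:
    """Fraction of series for which the predicted anomaly count is exactly right."""
    pairs = list(zip(pred_counts, true_counts))
    return sum(p == t for p, t in pairs) / len(pairs) if pairs else 0.0


# Over-detection: both true anomalies are found, plus two spurious ones.
p, r, f1 = localization_prf(predicted=[10, 40, 41, 90], truth=[10, 40])
```

In this made-up case recall is perfect but precision is only one half, so F1 lands at about two thirds while the count (4 versus 2) is wrong.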

All evaluations are automated, but generation outputs in the first experiment are also validated via human expert annotation to ensure robustness.

Results: Generation–Evaluation Asymmetry and Pattern Sensitivity

Explanation Generation

Performance is highly query-dependent. Qwen-3 8B achieves strong accuracy ($0.94$–$0.96$) on 'structural break', 'linear spike', and 'mean shift' queries, but all models fail completely on 'volatility shift'. For challenging types such as 'relative extremum' and 'seasonal drop', accuracies fall between $0.00$ and $0.82$, with most errors categorized as numeric inconsistencies or pattern misidentification.

Evaluation Tasks: Relative Ranking and Independent Scoring

Models are markedly more stable and reliable when assessing candidate explanations than when generating them. Qwen-3 8B, in particular, consistently surpasses $0.90$ accuracy in ranking and reaches $0.71$–$0.78$ in independent scoring, regardless of sequence length or query complexity. Notably, models distinguish and prefer the factually correct explanations even in cases (e.g., volatility shift) where their own generation fails.

Multi-Anomaly Detection

All models show low accuracy in estimating the true number of anomalies but moderate F1 for anomaly localization, with the best scores reaching $0.57$–$0.62$. This reveals a tendency toward over-detection (high recall, low precision). Because no explicit anomaly threshold is given, models must rely on internal heuristics, which often leads to spurious detections for borderline points.
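For contrast, the sketch below shows what an explicit threshold looks like when anomalies are defined by step-to-step percentage change. The rule and the default of 20% are assumptions chosen for illustration; the task in the paper deliberately provides no such threshold.

```python
from typing import List, Tuple


def pct_change_anomalies(
    series: List[float], threshold_pct: float = 20.0
) -> List[Tuple[int, float]]:
    """Return (index, percentage change) for steps exceeding the threshold."""
    hits = []
    for i in range(1, len(series)):
        prev = series[i - 1]
        if prev == 0:
            continue  # percentage change is undefined from a zero baseline
        change = (series[i] - prev) / abs(prev) * 100.0
        if abs(change) > threshold_pct:
            hits.append((i, change))
    return hits


data = [100.0, 102.0, 150.0, 151.0, 100.0]
strict = pct_change_anomalies(data, threshold_pct=20.0)  # flags indices 2 and 4
loose = pct_change_anomalies(data, threshold_pct=1.0)    # also flags index 1
```

Lowering the threshold adds the small 2% step at index 1, which mirrors how an implicit, overly sensitive heuristic inflates the anomaly count.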

Generation versus Evaluation

The observed asymmetry—LLMs being markedly better evaluators than generators—highlights a structural limitation in the current models' capacity for open-ended grounded reasoning. It also demonstrates the utility of rubric-based, structured prompt design for factual verification in numerically grounded domains.

Theoretical and Practical Implications

This study underscores that controllable, rubric-guided prompting enables LLMs to act as effective data-grounded evaluators for complex numerical data explanations, achieving reliability where generative capabilities remain brittle. The generation–evaluation gap suggests that evaluators can be decoupled from generators in LLM-based systems for trustworthy explanation pipelines, especially in high-stakes environments where factual grounding is paramount.

The synthetic TSQueryBench and the multi-task methodology provide a testbed for model-based factuality evaluation that does not depend on reference explanations. They also expose current LLMs' limitations in numeric reasoning stability and pattern generalization, particularly for higher-order structures such as volatility and distributional change.

Future Directions

Key directions include adapting the framework to real-world data featuring richer noise, confounding, and longer sequences, as well as integrating symbolic or retrieval-augmented verification methods for enhanced quantitative reliability. Exploration of hybrid evaluator systems—combining LLM and explicit statistical tests—may yield improved robustness. Moreover, targeted LLM instruction fine-tuning centered on numeric verification and calibration may close the observed generation–evaluation asymmetry.
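As one possible shape for such a hybrid evaluator, a numeric claim extracted from an explanation could be checked deterministically against the data before or alongside the LLM judgment. The sketch below is an assumption-laden illustration, not a method from the paper: `NumericClaim`, `verify_claim`, and the 2-point tolerance are hypothetical, and the step that extracts claims from text is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NumericClaim:
    index: int         # position the explanation refers to
    pct_change: float  # percentage change the explanation asserts there


def verify_claim(
    series: List[float], claim: NumericClaim, tol_pct: float = 2.0
) -> bool:
    """Check a claimed step change against the data within a tolerance."""
    i = claim.index
    if not 1 <= i < len(series) or series[i - 1] == 0:
        return False  # out of range, or change undefined from a zero baseline
    actual = (series[i] - series[i - 1]) / abs(series[i - 1]) * 100.0
    return abs(actual - claim.pct_change) <= tol_pct


data = [100.0, 102.0, 150.0]
supported = verify_claim(data, NumericClaim(index=2, pct_change=47.0))
unsupported = verify_claim(data, NumericClaim(index=2, pct_change=30.0))
```

A check of this kind covers only the numeric-accuracy criterion; pattern identification and answer faithfulness would still need the LLM judge.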

Conclusion

This work formalizes and systematizes a rubric-based, reference-free LLM-as-a-Judge paradigm for evaluating time series explanations, showing that current LLMs are more reliable as evaluators than generators in data-grounded settings (2604.02118). The results delineate the contours of current numeric reasoning capabilities and highlight structured prompting as essential for robust factual verification in LLM-driven analytics pipelines.