- The paper proposes a rubric-based LLM-as-a-Judge framework that assigns a ternary factuality label to time series explanations by verifying numeric evidence.
- It details multiple evaluation tasks—including generation, ranking, scoring, and multi-anomaly detection—to benchmark model performance on structured queries.
- The study finds that LLMs are more reliable as evaluators than as generators, which underscores the value of structured, data-grounded prompts for factual verification.
LLM-as-a-Judge for Reference-Free Evaluation of Time Series Explanations
Factual evaluation of natural language explanations for time series data is a critical but underexplored problem. LLMs are increasingly deployed to interpret structured signals in fields such as forecasting and anomaly detection, yet assessing whether generated explanations are grounded in numerical evidence remains a core challenge. Traditional metrics—BLEU, ROUGE, embedding-based similarity—and natural language inference approaches rely on reference answers, neglecting direct verification against raw data. Moreover, numerical methods in time series analysis cannot assess arbitrary textual explanations. Consequently, there is a need for reference-free, data-grounded methods that directly evaluate whether an explanation faithfully reflects underlying numerical patterns.
This work proposes and systematizes the LLM-as-a-Judge (LLM-J) paradigm for the time series explanation setting. Given a time series, a natural language query, and an explanation, models must assign a ternary factuality label (incorrect, partially correct, correct) by reasoning over pattern identification, numeric consistency, and answer faithfulness. The evaluation framework is operationalized via synthetic benchmarks and tasks that expose both generative and evaluative behaviors of LLMs.
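The judging setup described above can be sketched as a small data model. This is an illustrative sketch only: the names `Factuality`, `JudgeInput`, and `parse_label` are hypothetical and are not taken from the paper, and the keyword-based parsing of the judge's verdict is an assumption about how a free-text answer might be mapped to the ternary label.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Factuality(Enum):
    """Ternary factuality label (names are illustrative, not the paper's)."""
    INCORRECT = 0
    PARTIALLY_CORRECT = 1
    CORRECT = 2


@dataclass(frozen=True)
class JudgeInput:
    """Everything the judge sees: no reference explanation is included."""
    series: Sequence[float]
    query: str
    explanation: str


def parse_label(raw: str) -> Factuality:
    """Map a judge's free-text verdict onto the ternary label."""
    text = raw.strip().lower()
    # "partially correct" and "incorrect" both contain "correct",
    # so they must be tested before the plain "correct" case.
    if "partial" in text:
        return Factuality.PARTIALLY_CORRECT
    if "incorrect" in text:
        return Factuality.INCORRECT
    if "correct" in text:
        return Factuality.CORRECT
    raise ValueError(f"unrecognized verdict: {raw!r}")
```

Checking the more specific labels first avoids misreading "incorrect" as "correct", a common pitfall when parsing verdicts by substring.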
Evaluation Framework and Methodology
The proposed framework uses structured, rubric-guided prompting that enables direct consistency checks between an explanation and the time series it describes. Because no reference explanation is available at evaluation time, the regime is fully reference-free and goes beyond text-to-text overlap metrics.
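A rubric-guided judge prompt of this kind might be assembled as follows. This is a minimal sketch under stated assumptions: the three criteria come from the description above, but the prompt wording, the `index:value` serialization of the series, and the function name `build_judge_prompt` are hypothetical and do not reproduce the paper's actual prompts.

```python
from typing import Sequence

# The three rubric criteria named in the framework description.
RUBRIC = (
    "pattern identification",
    "numeric consistency",
    "answer faithfulness",
)


def build_judge_prompt(series: Sequence[float], query: str, explanation: str) -> str:
    """Build a reference-free judge prompt (wording is an assumption)."""
    # Serialize the raw data so the judge can check numbers directly.
    values = ", ".join(f"{i}:{v:.2f}" for i, v in enumerate(series))
    criteria = "\n".join(f"{n}. {c}" for n, c in enumerate(RUBRIC, start=1))
    return (
        "You are verifying an explanation against raw time series data.\n"
        f"Series (index:value): {values}\n"
        f"Query: {query}\n"
        f"Explanation: {explanation}\n"
        "Check each criterion against the data:\n"
        f"{criteria}\n"
        "Answer with exactly one label: incorrect, partially correct, or correct."
    )
```

Note that the prompt contains only the series, the query, and the candidate explanation, so the judge has nothing to compare against except the data itself.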
Model assessment spans four tasks:
- Explanation Generation: The model generates explanations given the time series and query; correctness is assessed via the rubric.
- Relative Ranking: The model ranks a set of plausible explanations for a given instance, one at each ground-truth level (correct, partially correct, incorrect).
- Independent Scoring: Single explanations are evaluated in isolation for factual consistency.
- Multi-Anomaly Detection: The model identifies all anomalies in a series, reporting indices and percentage changes, without prespecified detection thresholds.
Ground-truth labels are used only for scoring and reporting; during inference, the model sees only the time series, the query, and the candidate explanations.
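The separation between what the model sees and what is kept for scoring can be illustrated for the relative ranking task. This is a sketch, not the paper's pipeline: the function name `make_ranking_instance`, the dictionary layout, and the seeded shuffle are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Sequence, Tuple


def make_ranking_instance(
    series: Sequence[float],
    query: str,
    labeled_explanations: Dict[str, str],
    seed: int = 0,
) -> Tuple[dict, List[str]]:
    """Split an instance into the model's view and the hidden labels.

    labeled_explanations maps a ground-truth label to its explanation text.
    """
    rng = random.Random(seed)
    items = list(labeled_explanations.items())
    # Shuffle so candidate position carries no information about the label.
    rng.shuffle(items)
    candidates = [text for _, text in items]
    hidden_labels = [label for label, _ in items]
    # The model view deliberately omits the labels.
    model_view = {"series": list(series), "query": query, "candidates": candidates}
    return model_view, hidden_labels
```

The hidden labels stay aligned with the shuffled candidates, so a predicted ranking can be scored afterwards without ever exposing the labels in the prompt.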
To ensure comprehensive evaluation, a synthetic benchmark (TSQueryBench) comprising 350 time series spanning seven query types (including break, extremum, volatility, multi-metric, shift patterns) is introduced. Each instance is coupled with curated explanations at all three correctness levels, facilitating controlled analysis across tasks.
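To make the idea of an instance with explanations at three correctness levels concrete, the sketch below builds a toy mean-shift example. It is not the TSQueryBench generator, whose procedure is not described here: the function name, the Gaussian noise model, the default parameters, and the explanation templates are all assumptions.

```python
import random
from statistics import mean


def make_mean_shift_instance(n=100, shift_at=60, delta=5.0, noise=0.5, seed=0):
    """Toy synthetic instance: a level shift of size delta at index shift_at."""
    rng = random.Random(seed)
    series = [
        (delta if i >= shift_at else 0.0) + rng.gauss(0.0, noise)
        for i in range(n)
    ]
    before = mean(series[:shift_at])
    after = mean(series[shift_at:])
    explanations = {
        # Right location and right magnitudes.
        "correct": (
            f"The mean shifts at index {shift_at}, "
            f"from {before:.2f} to {after:.2f}."
        ),
        # Right location, but the post-shift level is overstated by delta.
        "partially correct": (
            f"The mean shifts at index {shift_at}, "
            f"from {before:.2f} to {after + delta:.2f}."
        ),
        # Wrong pattern altogether.
        "incorrect": "The series is stationary with no change in mean.",
    }
    return {
        "series": series,
        "query": "Where does the mean shift, and by how much?",
        "explanations": explanations,
    }
```

Because the generating parameters are known, each explanation's correctness level is fixed by construction, which is what makes controlled comparisons across tasks possible.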
Experimental Design and Implementation
Three open-weight LLMs (Qwen-3 8B, LLaMA-3 8B, Gemma-2 9B IT) are tested in both generation and evaluation roles. Prompts are standardized, and all models are run zero-shot, with no task-specific fine-tuning. Time series of varying lengths (100 to 500 points) are included to measure how performance scales with sequence length.
Evaluation metrics per task are as follows:
- Explanation Generation: Accuracy and fine-grained error categories (numeric, logical, unsupported claims).
- Relative Ranking / Independent Scoring: Accuracy versus ground-truth factuality for candidate explanations.
- Multi-Anomaly Detection: Count accuracy and F1 for anomaly localization.
All evaluations are automated; in addition, generation outputs in the first experiment are validated by human expert annotation as a check on the automated labels.
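The metrics listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions: anomaly localization is scored by exact index match, which may differ from the matching rule used in the paper, and the function names are hypothetical.

```python
from typing import Iterable, Sequence, Tuple


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predicted labels that equal the ground-truth labels."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def localization_f1(
    predicted_idx: Iterable[int], true_idx: Iterable[int]
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 over anomaly indices (exact match assumed)."""
    pred, true = set(predicted_idx), set(true_idx)
    tp = len(pred & true)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

For example, a model that reports indices 10, 20, 30, and 40 when only 10 and 20 are true anomalies gets full recall but a precision of one half, which is the over-detection profile that F1 penalizes.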
Results: Generation–Evaluation Asymmetry and Pattern Sensitivity
Explanation Generation
Performance is highly query-dependent. Qwen-3 8B achieves strong accuracy ($0.94$–$0.96$) on 'structural break', 'linear spike', and 'mean shift' queries, but all models fail completely on 'volatility shift'. For challenging types such as 'relative extremum' and 'seasonal drop', accuracies fall between $0.00$ and $0.82$, with most errors categorized as numeric inconsistencies or pattern misidentification.
Evaluation Tasks: Relative Ranking and Independent Scoring
Models are markedly more stable and reliable when assessing candidate explanations than when generating them. Qwen-3 8B in particular consistently exceeds $0.90$ accuracy in ranking and reaches $0.71$–$0.78$ in independent scoring, regardless of sequence length or query complexity. Notably, models distinguish and prefer the factually correct explanations even in cases (e.g., volatility shift) where their own generation fails.
Multi-Anomaly Detection
All models estimate the true number of anomalies poorly but reach moderate F1 for anomaly localization (best scores in the range $0.57$–$0.62$), revealing a tendency toward over-detection (high recall, low precision). Because no explicit anomaly threshold is given, models must rely on internal heuristics, which often leads them to flag borderline changes as anomalies.
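The sensitivity of the anomaly count to an implicit threshold can be shown with a simple percentage-change detector. This is an illustration of the threshold issue only, not the models' behavior or the paper's detection method: the detector, the function names, and the example series are assumptions.

```python
from typing import List, Sequence, Tuple


def pct_changes(series: Sequence[float]) -> List[Tuple[int, float]]:
    """Percentage change of each point relative to the previous one."""
    return [
        (i, 100.0 * (series[i] - series[i - 1]) / abs(series[i - 1]))
        for i in range(1, len(series))
        if series[i - 1] != 0  # skip undefined relative changes
    ]


def flag_anomalies(series: Sequence[float], threshold_pct: float) -> List[Tuple[int, float]]:
    """Indices whose absolute percentage change meets the threshold."""
    return [
        (i, round(change, 1))
        for i, change in pct_changes(series)
        if abs(change) >= threshold_pct
    ]


series = [100, 101, 100, 130, 100, 102, 110, 100]
strict = flag_anomalies(series, threshold_pct=20.0)  # flags indices 3 and 4
loose = flag_anomalies(series, threshold_pct=5.0)    # flags indices 3, 4, 6, 7
```

The same series yields two flagged points under the stricter cutoff and four under the looser one, so a model that silently adopts a low internal threshold will report extra anomalies and miss the true count.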
Generation versus Evaluation
The observed asymmetry—LLMs being markedly better evaluators than generators—highlights a structural limitation in the current models' capacity for open-ended grounded reasoning. It also demonstrates the utility of rubric-based, structured prompt design for factual verification in numerically grounded domains.
Theoretical and Practical Implications
This study underscores that controllable, rubric-guided prompting enables LLMs to act as effective data-grounded evaluators for complex numerical data explanations, achieving reliability where generative capabilities remain brittle. The generation–evaluation gap suggests that evaluators can be decoupled from generators in LLM-based systems for trustworthy explanation pipelines, especially in high-stakes environments where factual grounding is paramount.
The synthetic TSQueryBench and the multi-task methodology provide a testbed for model-based factuality evaluation that does not depend on reference explanations. They also expose the limits of current LLMs in numeric reasoning stability and pattern generalization, particularly for higher-order structures such as volatility and distributional change.
Future Directions
Key directions include adapting the framework to real-world data with richer noise, confounding, and longer sequences, and integrating symbolic or retrieval-augmented verification methods to improve quantitative reliability. Hybrid evaluator systems that combine LLM judges with explicit statistical tests may yield improved robustness. Moreover, targeted instruction fine-tuning centered on numeric verification and calibration may narrow the observed generation–evaluation asymmetry.
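One way such a hybrid evaluator could work is sketched below: an explicit numeric check on the data overrides an overly generous LLM verdict. This is a speculative illustration of the future direction, not something the paper implements. The function names, the regex-based number extraction, the restriction to maximum-value claims, and the downgrade rule are all assumptions.

```python
import re
from typing import List, Sequence


def claimed_numbers(explanation: str) -> List[float]:
    """Extract numeric literals mentioned in an explanation (naive regex)."""
    return [float(m) for m in re.findall(r"-?\d+(?:\.\d+)?", explanation)]


def verify_maximum_claim(series: Sequence[float], explanation: str, tol: float = 0.01) -> bool:
    """Check that the explanation cites the true argmax index and max value."""
    peak_index = max(range(len(series)), key=series.__getitem__)
    peak_value = series[peak_index]
    numbers = claimed_numbers(explanation)
    has_index = any(n == peak_index for n in numbers)
    has_value = any(abs(n - peak_value) <= tol for n in numbers)
    return has_index and has_value


def hybrid_label(llm_label: str, numeric_ok: bool) -> str:
    """Downgrade a 'correct' LLM verdict when the numeric check fails."""
    if llm_label == "correct" and not numeric_ok:
        return "partially correct"
    return llm_label
```

The design choice here is conservative: the statistical check can only lower a verdict, never raise it, so the LLM judge remains responsible for pattern and faithfulness judgments that a simple numeric test cannot make.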
Conclusion
This work formalizes and systematizes a rubric-based, reference-free LLM-as-a-Judge paradigm for evaluating time series explanations, showing that current LLMs are more reliable as evaluators than generators in data-grounded settings (2604.02118). The results delineate the contours of current numeric reasoning capabilities and highlight structured prompting as essential for robust factual verification in LLM-driven analytics pipelines.