
UniPredict Benchmark Evaluation

Updated 7 February 2026
  • UniPredict Benchmark is a family of standardized evaluation environments that assess model generalization across tabular, urban spatial–temporal, and long-term forecasting domains.
  • It consolidates 169 curated tabular datasets, interoperable urban data files, and diverse time-series collections, enabling rigorous cross-method comparisons using unified metrics.
  • The benchmark enhances reproducible research by standardizing task design, enforcing contamination audits, and providing clear metadata for consistent evaluation.

The UniPredict Benchmark encompasses a suite of standardized tasks, datasets, and evaluation protocols that serve as foundational platforms for rigorous empirical research across three domains: universal tabular classification using LLMs, unified urban spatial–temporal forecasting, and long-term multivariate time-series prediction. Each variant—originating from distinct research groups—anchors itself in a unified vision: the construction of evaluation environments where model generalization, benchmarking consistency, and cross-method comparisons are enabled at scale. The term “UniPredict Benchmark” thus refers not to a single artifact but to a family of influential resources, each of which has shaped standards and practices in its subfield.

1. Universal Tabular Classification Benchmark

The UniPredict tabular benchmark (Wang et al., 2023) operationalizes the concept of a universal tabular classifier by aggregating 169 curated datasets from Kaggle, spanning domains including finance, healthcare, retail, entertainment, education, and environment. Each dataset is strictly capped at 7,500 samples to prevent single-source dominance; collectively, the corpus comprises 366,786 samples with feature counts ranging from 3 to 40 (numerical, categorical, binary, textual).

Targets include:

  • Binary and multi-class classification: Traditional supervised settings.
  • Continuous regression (converted to quartile classification): Continuous outcomes are quantized by empirical quartiles, resulting in balanced four-class prediction tasks.
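As an illustrative sketch (not the benchmark's released code), the quartile conversion can be expressed with NumPy — empirical quartile boundaries are computed from the targets themselves, so each of the four classes receives roughly a quarter of the samples:

```python
import numpy as np

def quartile_labels(y: np.ndarray) -> np.ndarray:
    """Map continuous targets to classes 0-3 by empirical quartiles."""
    # Quartile boundaries computed from the data itself
    q1, q2, q3 = np.quantile(y, [0.25, 0.5, 0.75])
    # np.digitize assigns bin index: 0 if y < q1, ..., 3 if y >= q3
    return np.digitize(y, [q1, q2, q3])

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
labels = quartile_labels(y)
# Four balanced classes by construction
```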

For each dataset, metadata (dataset-level and column-level descriptions) generated by GPT-3.5 is reformatted and provided in the prompt. Table rows are serialized as semantically structured text (“Col₁ is v₁; Col₂ is v₂; ...”), with instructions prompting the model to “Predict the probability of each class by: class 0 means ...; class 1 means ...”. Ground-truth labels are replaced by isotonic-calibrated XGBoost outputs, yielding soft label distributions.
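A minimal sketch of this row serialization, using hypothetical column names:

```python
def serialize_row(row: dict) -> str:
    """Serialize a table row as 'Col1 is v1; Col2 is v2; ...'."""
    return "; ".join(f"{col} is {val}" for col, val in row.items())

# Hypothetical row; real prompts prepend the GPT-3.5-generated metadata
row = {"age": 42, "income": "high", "owns_home": True}
prompt_features = serialize_row(row)
# -> "age is 42; income is high; owns_home is True"
```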

Evaluation Protocols

  • Metrics: Accuracy ($N^{-1} \sum_{i=1}^{N} \mathbb{I}(\hat{y}_i = y_i)$), AUC (area under the ROC curve), and model-rank distributions.
  • Data splits: 90% train / 10% test, stratified by class.
  • Few-shot regime: 62 new tabular datasets (<100 samples each) with training share $r \in \{0.1, 0.3, 0.5, 0.7, 0.9\}$; UniPredict is fine-tuned for up to 30 epochs per few-shot setting.
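The accuracy metric and stratified split above can be sketched as follows (the split uses scikit-learn's `train_test_split`; the toy balanced dataset is illustrative, not from the benchmark):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def accuracy(y_true, y_pred):
    """N^{-1} * sum of I(y_hat_i == y_i)."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# 90/10 split, stratified by class label
X = np.arange(100).reshape(-1, 1)
y = np.repeat([0, 1], 50)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0
)
# Stratification preserves the class ratio in the 10% test set
```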

Baselines

  • XGBoost: Per-dataset, ordinal encoding, $n_{\mathrm{estimators}} = 100$, max depth 6, learning rate 0.3.
  • MLP: scikit-learn, one hidden layer, ReLU, Adam ($\mathrm{lr} = 10^{-3}$), max-iter 100.
  • TabNet, FT-Transformer, TabLLM: Each trained from scratch per dataset using official or published configurations.
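A hedged sketch of the two simplest baseline configurations. Scikit-learn's `GradientBoostingClassifier` stands in for XGBoost here, and the hidden-layer width (64) is an assumption, since the source specifies only a single hidden layer:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier

# XGBoost baseline hyperparameters as reported: n_estimators=100, max_depth=6, lr=0.3
# (GradientBoostingClassifier is an sklearn stand-in; the benchmark uses XGBoost itself)
gbt = GradientBoostingClassifier(n_estimators=100, max_depth=6, learning_rate=0.3)

# MLP baseline: one hidden layer (width 64 is an assumption), ReLU, Adam, lr=1e-3
mlp = MLPClassifier(hidden_layer_sizes=(64,), activation="relu",
                    solver="adam", learning_rate_init=1e-3, max_iter=100)
```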

UniPredict uniquely trains a single GPT-2 backbone across all datasets with prompt-based instructions and soft labels, unlike per-task discriminative fitting.

Quantitative Results

Setting             UniPredict-heavy   XGBoost   FT-Transformer
Universal (acc)     0.721              0.684     0.636
Median (acc)        0.810              0.788     —
Few-shot (r=10%)    0.525              0.240     0.312
Few-shot (r=90%)    0.543              0.620     —

In universal modeling, UniPredict-heavy outperforms XGBoost by +5.4% relative accuracy and FT-Transformer by +13.4%. In the few-shot regime at $r = 0.1$, UniPredict surpasses XGBoost by more than 100% relative gain.
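These relative gains follow directly from the table above; a quick arithmetic check:

```python
def rel_gain(a: float, b: float) -> float:
    """Relative accuracy gain of a over baseline b, in percent."""
    return 100 * (a - b) / b

# Universal setting, from the table above
vs_xgb = rel_gain(0.721, 0.684)    # ≈ +5.4% vs XGBoost
vs_ft = rel_gain(0.721, 0.636)     # ≈ +13.4% vs FT-Transformer

# Few-shot regime at r = 10%
few_shot = rel_gain(0.525, 0.240)  # > 100% relative gain vs XGBoost
```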

Practical Insights and Limitations

Scalability is demonstrated by a single LLM serving as a universal classifier across hundreds of tables. Adaptability is shown via strong few-shot learning behavior. Challenges remain: context window overflow for wide tables or verbose columns, degradation with ambiguous metadata, and prompt length limits imposed by GPT-2. Robustness also depends on meticulous data preprocessing, including column cleaning and clear descriptions (Wang et al., 2023).

2. Unified Benchmark for Urban Spatial–Temporal Prediction

UniPredict for urban spatial–temporal prediction (Jiang et al., 2023) addresses the complexity of accessing and comparing urban datasets (traffic, air quality, event signals) with differing granularities, formats, and schema.

Data Layer: Atomic Files

Urban datasets (N=40, 17 cities) are decomposed into interoperable atomic files:

  • .geo: Geographical units with geo_id, type, and coordinates.
  • .rel: Relations (adjacency or OD flows) between units.
  • .dyna/.grid/.od/.gridod: Dynamics of entities, grids, or flows.
  • .ext: External data (weather, holiday events).

Data tensors include graphs ($\mathbf{X} \in \mathbb{R}^{T \times N \times D}$) and grids, supporting time-series learning for both sensor and spatial grid scenarios.
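A minimal sketch of assembling such a tensor from a `.dyna`-style table; column names and values here are hypothetical, not the benchmark's actual schema:

```python
import numpy as np
import pandas as pd

# Hypothetical .dyna-style records: one row per (time, entity_id), with D=2 features
dyna = pd.DataFrame({
    "time": [0, 0, 1, 1],
    "entity_id": [10, 11, 10, 11],
    "flow": [5.0, 7.0, 6.0, 8.0],
    "speed": [40.0, 35.0, 42.0, 33.0],
})

# Pivot into the tensor layout X in R^{T x N x D}
T = dyna["time"].nunique()
N = dyna["entity_id"].nunique()
X = (dyna.sort_values(["time", "entity_id"])
         [["flow", "speed"]].to_numpy()
         .reshape(T, N, 2))
# X[t, n, d] = feature d of entity n at time step t
```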

Model Families

Eighteen canonical models organized as follows:

  • General time-series: FNN, Seq2Seq, AutoEncoder.
  • Sequential structure: Spatial-CNN + Temporal (e.g., ST-ResNet, DMVSTNet), Spatial-GCN + Temporal (STGCN, GWNET, MTGNN, TGCN), Spatial-Attention + Temporal (GMAN, ASTGCN, STTN).
  • Coupled structure: DCRNN, AGCRN.
  • Synchronous learning: STG2Seq, STSGCN, D2STGNN.

Metrics and Leaderboard

Standardized MAE, RMSE, and MAPE are reported. On overall average, the top five models are:

Model Rank
D2STGNN 1
MTGNN 2
GWNET 3
AGCRN 4
GMAN 5

D2STGNN outperforms alternatives by 3–7% MAE on critical benchmarks. WaveNet-based models benefit from superior receptive field properties, while attention-based models excel at modeling dynamic dependencies but at a computational cost. The comprehensive leaderboard and cross-model analysis support robust development and fair benchmarking within the urban prediction community (Jiang et al., 2023).

3. Long-Term Time-Series Forecasting Benchmark

The UniPredict long-term time-series forecasting benchmark (Cyranka et al., 2023) encapsulates the challenge of evaluating models across both real-world and synthetic sequential domains with trajectories up to 2,000 steps.

Datasets

  • Real-world: Electricity load, ETT, weather, METR-LA traffic (all regularly sampled).
  • Simulated/scientific: MuJoCo dynamics (Half-Cheetah, Hopper, Walker2D), Kuramoto–Sivashinsky PDE, Mackey–Glass.

All series are normalized (zero mean, unit variance) and split into fixed look-back ($L$) and horizon ($H$) segments, e.g., $H = 96, 192$ (short-term) and $H \approx 968$ (long-horizon).
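The normalization and look-back/horizon slicing can be sketched as follows (toy series, with $L$ and $H$ chosen small for illustration):

```python
import numpy as np

def make_windows(series: np.ndarray, L: int, H: int):
    """Slice a series into (look-back, horizon) training pairs."""
    X, Y = [], []
    for t in range(len(series) - L - H + 1):
        X.append(series[t:t + L])          # look-back of length L
        Y.append(series[t + L:t + L + H])  # forecast horizon of length H
    return np.stack(X), np.stack(Y)

s = np.arange(10, dtype=float)
s = (s - s.mean()) / s.std()  # normalize: zero mean, unit variance
X, Y = make_windows(s, L=4, H=2)
# yields len(s) - L - H + 1 sliding (look-back, horizon) pairs
```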

Models

  • LSTM: Classic recurrent baseline.
  • DeepAR: Autoregressive RNN with explicit likelihood.
  • NLinear, Latent NLinear: Learn direct multi-step forecasts; the latter embeds context to a latent space.
  • N-HiTS: Hierarchical residual network.
  • PatchTST: Patchized input, standard Transformer encoder, linear prediction head.
  • LatentODE: Neural ODE in a latent space.

Curriculum learning is introduced for DeepAR, improving long-horizon performance.

Metrics and Findings

  • MAE, MSE are computed over all predicted values.
  • PatchTST attains the best MSE/MAPE for short horizons on regular real-world data.
  • NLinear/Latent NLinear dominate in long-horizon simulation/PDE settings (5–10% advantage).
  • LatentODE excels with irregular/partially-observed dynamics.

Statistical significance is validated via paired $t$-tests at $p < 0.01$ (Cyranka et al., 2023).
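A paired $t$-test of this kind can be sketched with SciPy; the per-series error values below are synthetic, purely for illustration:

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
err_a = rng.normal(1.0, 0.10, size=50)                 # per-series MSE of model A (synthetic)
err_b = err_a + 0.2 + rng.normal(0.0, 0.05, size=50)   # model B: consistently worse

# Paired test: both error vectors are measured on the same series
t_stat, p_value = ttest_rel(err_a, err_b)
significant = p_value < 0.01
```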

4. Critical Re-Evaluation and Methodological Limitations

A comprehensive re-assessment (Gorla et al., 3 Feb 2026) highlights several pitfalls in tabular LLM evaluation using UniPredict-derived datasets:

  • Task-type artifacts: Almost all observed gains in Tabula-8B (a Llama-3-8B TLM) originate from quartile classification tasks. Median lift over majority baseline is near zero for both binary (+2.5 pp) and multiclass (–0.3 pp) but +32.9 pp for quartile tasks, which by construction enforce class balance and enable trivial exploitation of correlated features.
  • Dataset contamination: Systematic train-test overlap and task-level leakage (e.g., entity memorization, factual mapping recall) pervade the highest scoring datasets.
  • Instruction-tuning confound: General instruction-following ability (Alpaca baseline) recovers 92.2% of Tabula’s mean accuracy on standard classification, and quartile format familiarity accounts for 71.3% of the observed advantage—true tabular reasoning accounts for little of the aggregate performance.

Recommendations include always reporting trivial baselines, stratifying by task type, releasing code/predictions, performing deep contamination audits, and including instruction-tuned tabular-naïve baselines (Gorla et al., 3 Feb 2026).
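The "always report trivial baselines" recommendation amounts to comparing against, e.g., a majority-class predictor; a minimal sketch:

```python
from collections import Counter

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most frequent training class."""
    majority = Counter(y_train).most_common(1)[0][0]
    return sum(y == majority for y in y_test) / len(y_test)

# On a balanced four-class (quartile) task the trivial baseline sits near 25%,
# which is the reference point headline accuracies should be read against.
acc = majority_baseline_accuracy([0, 0, 1, 1, 2, 3], [0, 1, 2, 3])
```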

5. Connections to General LLM Predictive Benchmarking

Although not directly titled “UniPredict,” related frameworks such as PredictaBoard (Pacchiardi et al., 20 Feb 2025) address the broader methodological need to evaluate not only performance but also the predictability, reliability, and error regions of large foundation models. Through assessor models, instance-level metrics, and safety-centric protocols, they align in spirit with UniPredict's goal of raising empirical standards and transparency.

6. Significance and Evolving Research Directions

The UniPredict Benchmarks, in their various incarnations, have raised the bar for dataset curation, fair and transparent evaluation, and reproducible protocol specification within machine learning. However, emerging analyses indicate that task design, contamination, and artifact sensitivity must be prioritized as core concerns. Future directions highlighted include:

  • Enlarging scope to dynamic and cross-modality data, including multimodal architectures and online updates.
  • Enforcing strict anti-contamination procedures (entity-level deduplication, task-type validation).
  • Expanding few-shot and prompt-based evaluation regimes alongside robust statistical stratification.
  • Incorporating standard operating procedures for instruction and format control, ensuring that benchmarks genuinely reflect the reasoning capabilities of learning systems and not spurious correlations or data artifacts.

Collectively, the UniPredict Benchmarks both enable large-scale empirical research and provide a cautionary framework for methodological rigor, artifact avoidance, and open scientific scrutiny across the data-driven AI landscape.
