
iTransformer Ablation Study Insights

Updated 23 January 2026
  • The paper demonstrates that ablating key components like Series Attention and self-supervised pretext tasks significantly degrades forecasting accuracy (notably in MSE and MAE metrics).
  • It reveals that dual-branch attention and point-to-point data (ride-hailing, taxi) are essential for precise spatio-temporal parking availability prediction.
  • The study quantifies the impact of removing individual modules, highlighting the practical benefits of incorporating spatial neighbor histories in urban forecasting.

The iTransformer ablation study is a comprehensive empirical evaluation designed to quantify the contribution of core architectural modules and data sources in the context of spatio-temporal parking availability forecasting. Focusing on the Self-Supervised Spatio-Temporal iTransformer (SST-iTransformer), the study provides a systematic breakdown of model variants by selectively disabling or removing individual components (attention branches, pretext tasks, data modalities, and spatial context) and measuring the resultant degradation in forecasting accuracy. Central findings include the vital role of dual-branch attention, the primacy of point-to-point mobility data sources (ride-hailing, taxi), and the measurable impact of self-supervised pretraining and neighbor spatial histories. The analysis is grounded in one-day-ahead prediction experiments with real-world multi-source data from Chengdu, China, and uses mean squared error (MSE) and mean absolute error (MAE) as evaluation metrics (Huang et al., 4 Sep 2025).

1. Definitions of Baseline and Ablated Variants

The ablation suite encompasses a set of distinct model and data configurations, each representing a targeted removal of a specific capability or information channel. The definitions are summarized in the table below:

| Variant Name | Component/Data Removed | Description |
|---|---|---|
| Vanilla iTransformer | Patch embedding, self-supervised pretext tasks | Original inverted Transformer; no self-supervised learning or patching. |
| iTransformer-patch | Self-supervised pretext tasks | Dual-branch, patched; trained only via supervised loss, without masking-reconstruction tasks. |
| SST-iTransformer (full) | None | Full architecture: dual-branch attention, patch embedding, self-supervised learning, multi-source and spatial neighbor data. |
| No-SeriesAttention | Series Attention branch | Series Attention branch removed; only Channel Attention remains (trained with all data and pretext tasks). |
| No-ChannelAttention | Channel Attention branch | Channel Attention branch removed; only Series Attention remains (trained with all data and pretext tasks). |
| Exclude ride-hailing (F-R) | Ride-hailing demand features | Trained without ride-hailing mode features. |
| Exclude taxi (F-T) | Taxi demand features | Trained without taxi mode features. |
| Exclude bus (F-B) | Bus demand features | Trained without bus mode features. |
| Exclude metro (F-M) | Metro demand features | Trained without metro mode features. |
| No spatial neighbors | Neighbor-lot historical series within PCZ | Trained using only the target lot's history and direct demand data (TPL paradigm); omits spatial context from neighboring lots. |

Each ablated variant isolates and quantifies the impact of the respective mechanism, enabling precise attribution of forecasting improvements.

2. Quantitative Performance Results

The study employs the following evaluation metrics for forecasting accuracy over a one-day-ahead (144-step) horizon, with formulas given by:

  • $\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_{i} - \hat{y}_{i})^2$
  • $\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_{i} - \hat{y}_{i}|$
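These two metrics can be computed directly from the forecast and ground-truth series; a minimal NumPy sketch (the example arrays are illustrative, not from the paper's dataset):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error over all N forecast points."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))

def mae(y_true, y_pred):
    """Mean absolute error over all N forecast points."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Illustrative 144-step one-day-ahead forecast (10-minute intervals).
y = np.linspace(0.2, 0.8, 144)                 # hypothetical true availability
y_hat = y + 0.05 * np.sin(np.arange(144))      # hypothetical forecast
print(mse(y, y_hat), mae(y, y_hat))
```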

Core results for each variant are as follows:

| Variant | MSE | MAE |
|---|---|---|
| Vanilla iTransformer | 0.3455 | 0.3577 |
| iTransformer-patch | 0.3321 | 0.3546 |
| SST-iTransformer (full) | 0.3293 | 0.3523 |
| No-SeriesAttention | 0.3580‡ | 0.3665‡ |
| No-ChannelAttention | 0.3425‡ | 0.3601‡ |
| Exclude metro (F-M) | 0.3328 | 0.3570 |
| Exclude bus (F-B) | 0.3531 | 0.3521 |
| Exclude ride-hailing (F-R) | 0.3728 | 0.3806 |
| Exclude taxi (F-T) | 0.3610 | 0.3742 |
| No spatial neighbors | 0.3422 | 0.3527 |

‡ Relative to the full SST-iTransformer, the No-SeriesAttention and No-ChannelAttention results correspond to MSE degradations of approximately 8.7% and 4.0%, respectively.
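The relative degradations discussed in the component-wise analysis can be recomputed directly from this table; a small Python sketch using the MSE values above:

```python
# Relative MSE degradation of each ablated variant vs. the full model,
# recomputed from the ablation table (MSE values as reported above).
FULL_MSE = 0.3293

ablated_mse = {
    "No-SeriesAttention": 0.3580,
    "No-ChannelAttention": 0.3425,
    "Exclude metro (F-M)": 0.3328,
    "Exclude bus (F-B)": 0.3531,
    "Exclude ride-hailing (F-R)": 0.3728,
    "Exclude taxi (F-T)": 0.3610,
    "No spatial neighbors": 0.3422,
}

def rel_degradation(mse, baseline=FULL_MSE):
    """Percent increase in MSE relative to the full SST-iTransformer."""
    return 100.0 * (mse - baseline) / baseline

# Rank variants from most to least damaging ablation.
for name, v in sorted(ablated_mse.items(), key=lambda kv: -kv[1]):
    print(f"{name:28s} +{rel_degradation(v):.1f}% MSE")
```

Running this reproduces the figures cited in Section 3 (ride-hailing +13.2%, taxi +9.6%, Series Attention +8.7%, and so on).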

These outcomes demonstrate that every principal axis ablated in SST-iTransformer induces a nontrivial loss in performance, with particularly acute sensitivity to temporal modeling, ride-hailing data, and spatial neighbor inputs.

3. Component-Wise Impact Analysis

Self-Supervised Pretext Tasks

Transitioning from iTransformer-patch to SST-iTransformer, the inclusion of masking-reconstruction-based pretraining yields improved generalization (MSE decrease from 0.3321 to 0.3293, MAE from 0.3546 to 0.3523). This demonstrates that unsupervised context learning, even with modest absolute effect (~1% MSE reduction), facilitates the extraction of latent spatio-temporal structures attuned to downstream parking prediction.
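The masking-reconstruction idea can be illustrated with a toy sketch of the corruption step; the patch length, mask ratio, and function names here are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def mask_patches(series, patch_len=16, mask_ratio=0.25, seed=0):
    """Zero out a random subset of non-overlapping patches and return the
    corrupted series plus a boolean mask marking the hidden timesteps.
    The pretext objective is to reconstruct the masked values."""
    rng = np.random.default_rng(seed)
    n_patches = len(series) // patch_len
    n_masked = max(1, int(mask_ratio * n_patches))
    masked_ids = rng.choice(n_patches, size=n_masked, replace=False)
    mask = np.zeros(len(series), dtype=bool)
    for p in masked_ids:
        mask[p * patch_len:(p + 1) * patch_len] = True
    corrupted = np.where(mask, 0.0, series)
    return corrupted, mask

x = np.random.default_rng(1).random(144)   # one day of 10-minute occupancy
x_masked, mask = mask_patches(x)
# The reconstruction loss is evaluated only on the masked positions, e.g.:
# loss = mse(model(x_masked)[mask], x[mask])
```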

Dual-Branch Attention Architecture

Ablating Series Attention (retaining only Channel Attention) degrades MSE by ∼8.7% (0.3293 to 0.3580), while ablating Channel Attention (retaining only Series Attention) yields ∼4.0% deterioration (0.3293 to 0.3425). Corresponding MAE increases are 4.0% and 2.2%, respectively. Thus, Series Attention—implemented with patching to model long-range dependencies—is especially critical for capturing volatility in parking demand, whereas Channel Attention supports cross-variate interactions but is secondary in relative importance.
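The two branches can be sketched as plain single-head self-attention applied along different axes: the channel branch attends across variate tokens (the inverted-Transformer view), while the series branch attends across temporal patches. This is a simplified NumPy illustration with identity projections, not the paper's implementation; shapes and fusion are assumptions:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Single-head scaled dot-product self-attention over the rows of x,
    with identity Q/K/V projections for brevity."""
    d = x.shape[-1]
    scores = softmax(x @ x.T / np.sqrt(d), axis=-1)
    return scores @ x

T, V, P = 144, 4, 16                       # timesteps, variates, patch length
x = np.random.default_rng(0).random((T, V))

# Channel Attention branch: inverted view, one token per variate.
chan_tokens = x.T                          # (V, T)
chan_out = self_attention(chan_tokens)     # cross-variate mixing

# Series Attention branch: one token per temporal patch (shown for variate 0).
series_tokens = x[:, 0].reshape(T // P, P) # (9, 16) patch tokens
series_out = self_attention(series_tokens) # long-range temporal mixing
```

Removing either branch amounts to dropping one of these two mixing operations, which is exactly what the No-SeriesAttention and No-ChannelAttention variants do.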

Multi-Source Demand Data

Performance is most sensitive to the exclusion of ride-hailing demand (MSE ↑13.2%, MAE ↑8.0%) and taxi demand (MSE ↑9.6%, MAE ↑6.2%), indicating that these point-to-point travel modes tightly correlate with parking turnover. In contrast, removing bus features yields moderate degradation (MSE ↑7.2%, with MAE essentially unchanged) and removing metro features only marginal degradation (MSE ↑1.1%, MAE ↑1.3%), highlighting limited coupling between fixed-route transit and parking behaviors.

Spatial Neighbor Information

Omitting all historical data from neighboring lots within the same PCZ (i.e., “no spatial neighbors”) results in a clear degradation in squared error (MSE from 0.3293 to 0.3422, +3.9%), while MAE is nearly unchanged (0.3523 to 0.3527). This empirically confirms that spatial correlation modeling—explicitly leveraging adjacent lot histories—delivers complementary signals augmenting the predictive model.
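Operationally, the “no spatial neighbors” ablation corresponds to dropping the neighbor-history channels from the model input. A schematic sketch of the input assembly (PCZ membership, channel counts, and function names are illustrative assumptions):

```python
import numpy as np

def build_input(target_hist, neighbor_hists, demand_feats, use_neighbors=True):
    """Stack input channels for one parking lot: its own occupancy history,
    multi-source demand features, and (optionally) histories of neighboring
    lots in the same Parking Cluster Zone (PCZ)."""
    channels = [target_hist] + list(demand_feats)
    if use_neighbors:
        channels += list(neighbor_hists)
    return np.stack(channels, axis=-1)            # shape (T, n_channels)

T = 144
rng = np.random.default_rng(0)
target = rng.random(T)
neighbors = [rng.random(T) for _ in range(3)]     # 3 lots in the same PCZ
demand = [rng.random(T) for _ in range(4)]        # ride-hail, taxi, bus, metro

full = build_input(target, neighbors, demand)                     # with neighbors
ablated = build_input(target, neighbors, demand, use_neighbors=False)  # TPL only
print(full.shape, ablated.shape)   # channel counts differ by the neighbor count
```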

4. Broader Methodological Implications

The ablation study elucidates methodological guidelines for the design of spatio-temporal forecasting models:

  • Self-supervised pretraining, specifically masking-reconstruction, harnesses unlabeled time-series data to learn global contextual representations, enhancing generalization, especially in regimes with noise or label sparsity.
  • Dual-branch attention architectures with segregated Series and Channel modules support both long-range temporal dependency modeling and cross-variate interactions. The criticality of Series Attention for highly non-stationary targets such as parking demand is evident.
  • Selective multi-source data fusion is necessary—point-to-point demand modes (ride-hailing, taxi) are highly discriminative, whereas fixed-route modes offer limited incremental value for the parking prediction task. This finding can guide feature engineering in analogous urban mobility problems.
  • Explicit spatial neighbor modeling (via clustering approaches like PCZs) provides measurable benefit over approaches that omit such inter-entity correlations or attempt to encode all spatial relations via dense graph neural networks. Hybrid clustering-attention strategies may offer scalable solutions for other urban analytics challenges.

5. Significance for Spatio-Temporal Modeling

The iTransformer ablation study establishes that state-of-the-art performance in parking availability forecasting arises from the careful integration of all examined components. Each module—self-supervised pretext learning, multi-branch attention, high-salience data modalities, and spatial neighbor histories—contributes essential inductive bias and predictive power. These empirical findings suggest design priorities for future Transformer-based architectures in spatio-temporal learning: embracing multi-axis attention, utilizing targeted self-supervised tasks, prioritizing informative auxiliary signals, and explicitly modeling spatial adjacency. Such architectural principles have relevance for a broad class of forecasting and data fusion applications in urban informatics, transportation, and dynamic resource allocation domains (Huang et al., 4 Sep 2025).
