HIVE-COTE 2.0: Time Series Ensemble
- HIVE-COTE 2.0 is a heterogeneous meta-ensemble framework for time series classification that simultaneously enlists diverse classifiers capturing orthogonal representations.
- It advances state-of-the-art performance on benchmarks by integrating modules like TDE, DrCIF, The Arsenal, and STC to deliver enhanced accuracy and calibrated predictions.
- Its flexible architecture supports contracting and efficiency trade-offs while addressing challenges in scalability and computational resource management.
HIVE-COTE 2.0 is a heterogeneous meta-ensemble framework for time series classification, advancing the state-of-the-art through the simultaneous ensembling of classifiers that each capture orthogonal representations of time series data. This architecture supersedes the original HIVE-COTE design by integrating two new constituents—Temporal Dictionary Ensemble (TDE) and Diverse Representation Canonical Interval Forest (DrCIF)—and introduces The Arsenal, an ensemble of smaller ROCKET classifiers, thereby addressing key limitations of earlier meta-ensembles in both accuracy and probabilistic calibration. HIVE-COTE 2.0 achieves significant empirical gains on diverse benchmarks, notably the UCR and UEA time series archives, and establishes a refined interplay between dictionary-based, interval-based, convolutional, and shapelet-based representations (Middlehurst et al., 2021).
1. Architectural Overview
HIVE-COTE 2.0 (HC2) is composed of four independently trained modules:
- Shapelet Transform Classifier (STC): Deploys randomised search to extract phase-independent discrete subsequence features, leveraging Rotation Forests for induced vector representations.
- Temporal Dictionary Ensemble (TDE): Constructs an ensemble of histogram-based, bag-of-words classifiers on Symbolic Fourier Approximation using intensive Gaussian-process-guided hyperparameter search and spatial pyramids.
- Diverse Representation Canonical Interval Forest (DrCIF): Extends interval-based modeling to capture summary statistics over random intervals of the series, its first differences, and periodograms, using an extensive feature pool (including catch22 features).
- The Arsenal: Aggregates an ensemble of ROCKET classifiers, each trained with 2000 random convolutional kernels and ridge regression, designed specifically to ensure probabilistic calibration needed for weighted meta-ensembling.
Each module provides a probability distribution over class labels for a given test series $x$. Module outputs are combined using the CAWPE meta-learner, which weights each module's distribution $p_i(y \mid x)$ by its estimated accuracy $a_i$ raised to the fourth power:

$$\hat{p}(y \mid x) \propto \sum_{i=1}^{4} a_i^{4} \, p_i(y \mid x)$$

where $a_i$ is estimated from the train data for module $i$.
This scheme tilts the combined vote toward stronger modules in an interpretable way, while remaining empirically robust to overfitting (Middlehurst et al., 2021).
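A minimal sketch of this combination step is below. The `alpha=4` exponent matches the published CAWPE configuration; the toy probability vectors and accuracy estimates are invented for illustration:

```python
import numpy as np

def cawpe_combine(module_probas, module_accs, alpha=4):
    """Combine per-module class distributions using CAWPE-style
    exponentiated-accuracy weights (alpha=4 as in HC2)."""
    weights = np.asarray(module_accs, dtype=float) ** alpha
    combined = np.zeros_like(np.asarray(module_probas[0], dtype=float))
    for w, p in zip(weights, module_probas):
        combined += w * np.asarray(p, dtype=float)
    return combined / combined.sum()  # renormalise to a distribution

# Four modules' distributions over 3 classes for one test series (toy values)
probas = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.2, 0.7, 0.1], [0.55, 0.35, 0.1]]
accs = [0.90, 0.85, 0.60, 0.88]  # accuracies estimated on train data
dist = cawpe_combine(probas, accs)
pred = int(np.argmax(dist))
```

Note how the weakest module (accuracy 0.60) contributes roughly a fifth of the weight of the strongest once raised to the fourth power, so its dissenting vote does not flip the prediction.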
2. Temporal Dictionary Ensemble (TDE)
TDE is an adaptive homogeneous ensemble of 1-NN classifiers, operating on histograms derived from bag-of-words representations constructed using Symbolic Fourier Approximation (SFA). The pipeline comprises:
- Feature Extraction: Sliding windows of length $w$ are extracted; each window is optionally $z$-normalized and transformed via the DFT to obtain a truncated coefficient vector.
- Discretization: Coefficients are discretized into words using either Multiple Coefficient Binning (MCB, as in BOSS) or Information-Gain Binning (IGB, as in WEASEL). Alphabet size is fixed (typically 4).
- Spatial Pyramids and Bigrams: Multiple pyramid levels segment histograms to recover the locality information lost by a pure bag-of-words representation; bigram counts at the finest level increase discriminative capacity.
- Ensemble Construction: Candidate parameter tuples are subsampled and evaluated on 70% random subsets. Leave-one-out cross-validated accuracy on the subsample is used for model selection.
- Gaussian Process Parameter Search: The high-dimensional search over hyperparameters is accelerated by a sequential model-based optimization: the first 50 candidates are random; subsequent selections maximize an acquisition function (e.g., expected improvement) using a GP surrogate model fit to observed parameter-accuracy pairs.
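The extraction and discretization steps above can be sketched in a few lines. This is a deliberately simplified SFA transform (no bigrams, no spatial pyramid, no incremental DFT); the window length, coefficient count, and alphabet are arbitrary choices for illustration:

```python
import numpy as np
from collections import Counter

def sfa_words(series, w=16, n_coeffs=4, alphabet="abcd"):
    """Simplified SFA: slide windows, z-normalise, keep the lowest
    DFT coefficients, then bin each coefficient column at its
    empirical quantiles (MCB-style) to form symbolic words."""
    windows = np.lib.stride_tricks.sliding_window_view(series, w)
    norm = (windows - windows.mean(axis=1, keepdims=True)) / (
        windows.std(axis=1, keepdims=True) + 1e-8)
    dft = np.fft.rfft(norm, axis=1)
    half = n_coeffs // 2
    # real and imaginary parts of the lowest non-constant frequencies
    coeffs = np.column_stack([dft.real[:, 1:1 + half],
                              dft.imag[:, 1:1 + half]])
    # Multiple Coefficient Binning: per-column quantile breakpoints
    edges = np.quantile(coeffs, np.linspace(0, 1, len(alphabet) + 1)[1:-1],
                        axis=0)
    cols = [np.digitize(coeffs[:, j], edges[:, j])
            for j in range(coeffs.shape[1])]
    return ["".join(alphabet[s] for s in row)
            for row in np.stack(cols, axis=1)]

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 8 * np.pi, 64)) + 0.1 * rng.standard_normal(64)
bag = Counter(sfa_words(series))  # bag-of-words histogram for one series
```

Each series is thus reduced to a histogram of short symbolic words, which is the input representation for TDE's 1-NN base classifiers.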
Base classifiers are retained if their estimated accuracy exceeds the lowest in the current ensemble (the ensemble is capped at a fixed size, drawn from a larger pool of evaluated candidates). Aggregation during inference uses accuracy as classifier weights, employing histogram intersection for 1-NN voting: the predicted class is $\hat{y} = \arg\max_{c} \sum_i a_i \,\mathbb{1}[h_i(x) = c]$, where $h_i(x)$ is member $i$'s 1-NN prediction and $a_i$ its estimated accuracy (Middlehurst et al., 2021).
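The aggregation rule can be illustrated with a minimal sketch, assuming each ensemble member stores its train bags, labels, and estimated accuracy; the member structure and toy word counts here are hypothetical, not TDE's actual internals:

```python
from collections import Counter

def hist_intersection(a: Counter, b: Counter) -> int:
    """Similarity between two bag-of-words histograms:
    sum of per-word minimum counts."""
    return sum(min(count, b.get(word, 0)) for word, count in a.items())

def tde_predict(query: Counter, members):
    """Accuracy-weighted 1-NN voting: each member finds the train
    histogram most similar to the query and casts a vote for its
    label, weighted by the member's estimated accuracy."""
    votes = {}
    for bags, labels, acc in members:
        nn = max(range(len(bags)),
                 key=lambda i: hist_intersection(query, bags[i]))
        votes[labels[nn]] = votes.get(labels[nn], 0.0) + acc
    return max(votes, key=votes.get)

# toy ensemble: two members, each holding two train histograms
m1 = ([Counter(ab=3, ba=1), Counter(cc=4)], ["A", "B"], 0.9)
m2 = ([Counter(ab=2), Counter(cc=3, ba=1)], ["A", "B"], 0.7)
pred = tde_predict(Counter(ab=2, ba=1), [m1, m2])
```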
3. Additional Modules: DrCIF, The Arsenal, and STC
- DrCIF: Constructs a forest of interval trees using features from multiple representations (original, difference, periodogram). Each tree randomly selects intervals and a subset of 29 candidate features (statistical summaries plus catch22), building efficient splits via information gain. In the multivariate case, intervals are sampled across dimensions. The tree-based structure is highly parallelizable and readily extended.
- The Arsenal: Composed of smaller ROCKET classifiers, each with randomly parameterized 1-D convolutional kernels. Features comprise maximum values and positive proportion for each kernel. Probabilistic calibration is achieved via averaging outputs from independent ridge regression fits.
- STC: Utilizes a random contract-limited search for shapelets, then summarizes each time series as a vector of minimum Euclidean distances to retained shapelets; a Rotation Forest performs final classification.
This modularity enables rapid training and precise probabilistic ensemble combination. Hyperparameters are fixed to empirically determined ranges balancing computational constraints and representational capacity.
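The convolutional features underlying The Arsenal can be sketched as follows. This is a stripped-down, hypothetical version of the ROCKET transform (it omits dilation, padding, and ROCKET's bias-sampling scheme) showing the two features extracted per kernel: the maximum activation and the proportion of positive values (PPV):

```python
import numpy as np

def rocket_features(series, n_kernels=100, seed=0):
    """Minimal ROCKET-style transform: random 1-D kernels; for each
    kernel record the max activation and the proportion of positive
    values, giving 2 features per kernel."""
    rng = np.random.default_rng(seed)
    feats = []
    for _ in range(n_kernels):
        length = rng.choice([7, 9, 11])
        weights = rng.standard_normal(length)
        weights -= weights.mean()          # zero-centred kernel
        bias = rng.uniform(-1, 1)
        conv = np.convolve(series, weights, mode="valid") + bias
        feats.extend([conv.max(), (conv > 0).mean()])
    return np.array(feats)

x = np.sin(np.linspace(0, 4 * np.pi, 128))
f = rocket_features(x)  # 200 features: [max, ppv] per kernel
```

In The Arsenal, many such small transforms are paired with independent ridge regression classifiers, and their averaged outputs provide the calibrated probabilities that CAWPE weighting requires.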
4. Computational Complexity and Resource Footprint
The dominant computational costs arise from TDE and STC:
- TDE: dominated by the DFT transforms over all sliding windows and by the leave-one-out cross-validated histogram distance computations, which scale quadratically in the number of train series; the spatial pyramid further enlarges the histogram space.
- DrCIF: cost grows with the number of trees, intervals per tree, and features per interval; in practice it is close to linear in data size because interval and feature counts are bounded.
- The Arsenal: feature transformation is linear in series length times the number of kernels; the ridge regression fits are cheap relative to the transform.
- STC: randomized search caps shapelet-search cost at 1 hour; each naïve shapelet-to-series distance computation costs time proportional to shapelet length times series length.
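The STC cost term can be made concrete with the naïve sliding distance it refers to. A minimal sketch (the inner loop does $O(\ell)$ work at each of $m-\ell+1$ offsets, which is what makes the contracted search necessary):

```python
import numpy as np

def shapelet_distance(shapelet, series):
    """Naive minimum Euclidean distance of a shapelet to a series:
    slide the shapelet over every offset and keep the best match.
    Cost is O(len(shapelet) * (len(series) - len(shapelet)))."""
    l, m = len(shapelet), len(series)
    best = np.inf
    for start in range(m - l + 1):
        d = np.sum((series[start:start + l] - shapelet) ** 2)
        best = min(best, d)
    return np.sqrt(best)

series = np.array([0., 0., 1., 2., 1., 0., 0., 0.])
shapelet = np.array([1., 2., 1.])
d = shapelet_distance(shapelet, series)  # exact match -> distance 0
```

STC computes one such minimum distance per (shapelet, series) pair to build its transformed feature vectors, so the per-pair cost multiplies across the whole train set.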
Empirical runtime and memory profiles (single-threaded) on 112 UCR datasets:
- HC2 median train time per dataset: 182 min (vs. 229 min for HC1)
- Memory footprint: HC2 peaks at 6.6 GB (vs. 4.9 GB for HC1)
- TDE incurs higher memory load (histograms, spatial pyramids) but lower training-time variance than BOSS (Middlehurst et al., 2021).
5. Experimental Evaluation on Standard Benchmarks
HC2 demonstrates statistically significant improvement over all prior SOTA methods. On 112 univariate UCR datasets:
- Mean accuracy: HC2 exceeds HC1 by 1.06%, InceptionTime by 1.69%, TS-CHIEF by 1.36%, and ROCKET by 2.49%.
- Probability calibration: HC2 delivers lower NLL and the best AUROC.
- Per-module contribution: Ablation studies confirm each module's necessity, with full HC2 outperforming any subset.
On 26 multivariate UEA datasets:
- Accuracy gains: +2.25% over HC1, +2.52% over ROCKET, +1.71% over CIF, +8.22% over DTW-D.
- Efficiency: Contracting is supported; on the five largest problems, 4-hour and 12-hour contracts recover 98% and 99% of full accuracy, respectively.
Statistical tests (Wilcoxon, critical difference diagrams) and detailed runtime/memory tables substantiate these claims. Module outputs are stable under variation in train set size due to the bagging and weighting scheme in CAWPE.
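The pairwise comparisons cited above rest on the Wilcoxon signed-rank test over per-dataset accuracies. A self-contained sketch of the statistic (average ranks for tied differences, zeros dropped) with invented accuracy figures, not values from the paper:

```python
import numpy as np

def wilcoxon_w(a, b):
    """Wilcoxon signed-rank statistic: rank the absolute paired
    differences (average ranks for ties, zero differences dropped)
    and return the smaller of the positive/negative rank sums."""
    d = np.asarray(a, float) - np.asarray(b, float)
    d = d[d != 0]
    absd = np.abs(d)
    order = np.argsort(absd)
    ranks = np.empty(len(d))
    ranks[order] = np.arange(1, len(d) + 1, dtype=float)
    for v in np.unique(absd):          # average ranks within ties
        tie = absd == v
        ranks[tie] = ranks[tie].mean()
    w_plus = ranks[d > 0].sum()
    w_minus = ranks[d < 0].sum()
    return min(w_plus, w_minus)

# hypothetical per-dataset accuracies for two classifiers on 10 datasets
acc_a = np.array([0.91, 0.88, 0.95, 0.80, 0.77, 0.90, 0.85, 0.93, 0.70, 0.88])
acc_b = np.array([0.89, 0.86, 0.94, 0.78, 0.75, 0.87, 0.84, 0.91, 0.69, 0.85])
w = wilcoxon_w(acc_a, acc_b)  # small W -> strong evidence of a difference
```

When one classifier wins on every dataset, as in this toy example, the minority rank sum is zero and the null hypothesis of equal performance is rejected at any conventional level for this sample size.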
6. Theoretical and Empirical Innovations
Major advancements over prior art include:
- Replacement of BOSS with TDE, leveraging GP-based hyperparameter optimization, spatial pyramids, and bigram features for dictionary modeling.
- Adoption of DrCIF, integrating interval-based summaries with catch22 features and parallel time series representations.
- The Arsenal resolves ROCKET's calibration issue: ensembling many small ROCKET classifiers yields probabilistically meaningful class scores.
- Consistent bagging-based accuracy estimation across all modules for CAWPE weighting.
- Implementation of “contracting”: all modules can operate under fixed compute-time budgets without significant loss in performance.
- Empirical studies demonstrate that combining all four modules is critical; no two- or three-module subset matches the aggregate accuracy or calibration (Middlehurst et al., 2021).
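The "contracting" idea can be sketched generically: keep adding ensemble members while the remaining budget plausibly covers one more, estimated from the average cost so far. This is a hypothetical illustration of the concept, not HC2's per-module implementation:

```python
import time

def train_under_contract(build_member, budget_seconds, max_members=250):
    """Contract training: add ensemble members until building one
    more would likely exceed the time budget (predicted from the
    mean cost of the members built so far)."""
    start = time.monotonic()
    members = []
    while len(members) < max_members:
        elapsed = time.monotonic() - start
        avg_cost = elapsed / max(len(members), 1)
        if members and elapsed + avg_cost > budget_seconds:
            break
        members.append(build_member())
    return members

def build_member():
    """Toy member builder that just burns a little CPU time."""
    t0 = time.monotonic()
    while time.monotonic() - t0 < 0.01:
        pass
    return object()

ens = train_under_contract(build_member, budget_seconds=0.2)
```

The design choice is that accuracy degrades gracefully: a smaller ensemble is still a valid ensemble, which is why HC2 can honor fixed compute budgets without structural changes.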
7. Limitations and Prospective Enhancements
Identified limitations include:
- TDE memory/latency: Memory requirements are substantially higher than those of BOSS; prediction is slower, primarily due to histogram size and 1-NN search.
- Low-latency applications: Alternate classifiers such as WEASEL may be advantageous where inference speed is a paramount constraint.
- Scalability: Ongoing work targets sparse histogram representations, approximate nearest neighbor techniques (e.g., LSH), and further application of GP-guided optimization to other modules to improve efficiency.
- Calibration and meta-learning: Potential advances include joint optimization of module weights via a second-level GP, as well as module selection strategies beyond exponentiated accuracy weighting (Middlehurst et al., 2021).
HIVE-COTE 2.0, via these architectural and methodological innovations, establishes a new benchmark for time series classification, excelling across a wide diversity of datasets and evaluation criteria. Ongoing research is focused on further improvements in scalability and modular generalization.