
Hybrid H3+Topology Clustering Method

Updated 2 February 2026
  • The paper introduces a hybrid approach that integrates H3 spatial indexing with route-segment graph topology to mitigate the giant-cluster problem.
  • It implements a three-stage clustering pipeline (H3-based agglomeration, topology-aware subdivision with Louvain community detection, and size rebalancing) to improve cluster balance and coherence.
  • Quantitative evaluations show improved balance, with a cluster-size coefficient of variation (CV) of 0.608 and an imbalance ratio of 1.90×, in support of distributed delay prediction modeling.

The hybrid H3+topology clustering method is an algorithmic framework designed to partition large-scale urban bus networks into balanced, topologically coherent clusters that reflect both spatial contiguity and operational connectivity. Developed to address the "giant cluster" problem associated with naive H3-based partitioning—wherein dense urban regions dominate cluster assignments—this approach integrates Uber's H3 spatial indexing with route-segment graph topology, producing clusters optimized for distributed delay prediction modeling and scalable city-scale inference (Boudabbous et al., 26 Jan 2026).

1. H3 Spatial Indexing Overview

Uber’s H3 spatial indexing system provides a hierarchical, hexagonal tessellation of the globe across 16 discrete resolution levels (0–15); each step to a finer resolution yields cells with approximately 1/7 the area of the previous level. In the transit clustering context, each stop and segment midpoint in the network is mapped to an H3 cell using the geo_to_h3 API (renamed latlng_to_cell in h3-py 4.x), so that spatial observations fall into bins of roughly uniform size.

Empirical grid search over candidate H3 resolutions (6, 7, and 8) determined that resolution 7 (approximately 5 km² per hexagon) achieves the optimal compromise between granularity and per-cell data support for modeling tasks in city-scale deployments. This selection is informed by the need for sufficient spatial differentiation while avoiding excessive sparsity in data-rich environments (Boudabbous et al., 26 Jan 2026).
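The area figures above can be checked with a little arithmetic. The following is a minimal sketch, assuming the mean cell area is the Earth's surface area divided by the H3 cell count at a given resolution (2 + 120 · 7^r); the radius constant and the function names are illustrative choices, not taken from the paper.

```python
import math

# Authalic Earth radius in km (assumption: the value used for mean-area tables).
EARTH_RADIUS_KM = 6371.007180918475


def h3_cell_count(resolution: int) -> int:
    """Total number of H3 cells at a resolution: 2 + 120 * 7**r."""
    if not 0 <= resolution <= 15:
        raise ValueError("H3 resolutions range from 0 to 15")
    return 2 + 120 * 7 ** resolution


def h3_mean_cell_area_km2(resolution: int) -> float:
    """Mean cell area: sphere surface area divided by the cell count."""
    earth_area = 4.0 * math.pi * EARTH_RADIUS_KM ** 2
    return earth_area / h3_cell_count(resolution)


# The three candidate resolutions from the grid search.
areas = {r: h3_mean_cell_area_km2(r) for r in (6, 7, 8)}
for r, area in areas.items():
    print(f"resolution {r}: {area:.3f} km^2")
```

Under these assumptions resolution 7 comes out at roughly 5 km² per cell, with resolutions 6 and 8 about seven times larger and smaller respectively, which is the trade-off between granularity and per-cell data support described above.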

2. Topology-Based Clustering via Route-Segment Graphs

Network topology is formalized by constructing an undirected graph $G=(V,E)$, where each vertex corresponds to a transit route (GTFS route_id) and an edge joins two routes that share at least one physical segment, a segment being a directed pair of consecutive stops. The resulting graph structure quantifies operational connectivity and allows explicit measurement of route overlap via the Jaccard index:

$$J_{\mathrm{segment}}(i,j) = \frac{|\mathrm{Segments}_i \cap \mathrm{Segments}_j|}{|\mathrm{Segments}_i \cup \mathrm{Segments}_j|}.$$

This similarity captures the extent to which two routes serve common physical infrastructure, a crucial facet often missed by purely spatial clustering (Boudabbous et al., 26 Jan 2026).
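As an illustration of the segment Jaccard index, here is a minimal sketch that derives directed segments from stop sequences and keeps an edge for every pair of routes with non-zero overlap. The route identifiers, stop labels, and function names are hypothetical; the paper's actual data structures are not described in this summary.

```python
from itertools import combinations


def route_segments(stop_sequence):
    """Directed segments: ordered pairs of consecutive stops on a route."""
    return set(zip(stop_sequence, stop_sequence[1:]))


def jaccard(a, b):
    """Jaccard index of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_route_graph(routes):
    """Map each overlapping route pair (i, j), i < j, to J_segment(i, j)."""
    segments = {rid: route_segments(stops) for rid, stops in routes.items()}
    edges = {}
    for i, j in combinations(sorted(segments), 2):
        weight = jaccard(segments[i], segments[j])
        if weight > 0:
            edges[(i, j)] = weight
    return edges


# Toy network (hypothetical): route 12 runs the same corridor in reverse,
# so it shares stops with the others but no directed segment.
toy_routes = {
    "10": ["A", "B", "C", "D"],
    "11": ["B", "C", "D", "E"],
    "12": ["D", "C", "B"],
}
edges = build_route_graph(toy_routes)
print(edges)
```

Routes 10 and 11 share two of four distinct segments, giving a weight of 0.5, while route 12 stays unconnected because segments are treated as directed.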

3. Hybrid Similarity and Three-Stage Clustering Pipeline

The hybrid similarity metric combines the H3-based spatial Jaccard index ($J_{\mathrm{H3}}$) and the segment-graph Jaccard index ($J_{\mathrm{segment}}$), parameterized by a spatial weight $w_{\mathrm{spatial}} \in [0,1]$:

$$\mathrm{similarity}_{\mathrm{combined}}(i,j) = w_{\mathrm{spatial}} \cdot J_{\mathrm{H3}}(i,j) + (1-w_{\mathrm{spatial}}) \cdot J_{\mathrm{segment}}(i,j),$$

with the corresponding distance $d(i,j) = 1 - \mathrm{similarity}_{\mathrm{combined}}(i,j)$. The optimal configuration employs $w_{\mathrm{spatial}} = 0.5$ to balance geographic and topological signals.
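A short worked example may help fix the roles of the two terms. The sketch below evaluates the combined similarity and the distance for one hypothetical pair of routes; the input values and function names are assumptions chosen for illustration.

```python
def combined_similarity(j_h3, j_segment, w_spatial=0.5):
    """Weighted blend of the spatial and segment Jaccard indices."""
    if not 0.0 <= w_spatial <= 1.0:
        raise ValueError("w_spatial must lie in [0, 1]")
    return w_spatial * j_h3 + (1.0 - w_spatial) * j_segment


def hybrid_distance(j_h3, j_segment, w_spatial=0.5):
    """Distance d(i, j) = 1 - combined similarity."""
    return 1.0 - combined_similarity(j_h3, j_segment, w_spatial)


# Hypothetical pair: moderate H3 cell overlap, little shared infrastructure.
j_h3, j_segment = 0.4, 0.1
d_equal = hybrid_distance(j_h3, j_segment)            # w_spatial = 0.5
d_topology = hybrid_distance(j_h3, j_segment, 0.3)    # more weight on topology
print(round(d_equal, 2), round(d_topology, 2))
```

For this pair, shifting weight toward topology increases the distance, because the routes overlap more in space than in shared segments.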

The pipeline follows three sequential stages:

  1. H3-only agglomerative clustering: Assign each route’s observations to H3 cells (resolution 7); cluster route centroids using Ward’s linkage on the pairwise hybrid distance $d(i,j)$. Flag as "giant" any cluster that holds more than 40% of total network observations.
  2. Topology-aware subdivision: For giant clusters, extract the subgraphs $G_C$ induced by their member routes, and use Louvain community detection with $J_{\mathrm{segment}}$ edge weights to yield finer, topologically coherent clusters.
  3. Size rebalancing: Clusters smaller than a minimum threshold are merged with their nearest neighbor (in combined similarity space); clusters exceeding the maximum threshold are recursively subdivided.
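The hand-off between the first two stages can be sketched as follows: flag clusters above the 40% observation share, then restrict the weighted route graph to the members of a flagged cluster. This is a minimal sketch with hypothetical inputs and function names; the Louvain step itself is not shown and would typically be delegated to a graph library (for example networkx's louvain_communities) run on the induced edges.

```python
GIANT_SHARE = 0.40  # threshold stated in stage 1


def find_giant_clusters(labels, n_obs, share=GIANT_SHARE):
    """Return ids of clusters holding more than `share` of all observations.

    labels: route_id -> cluster id; n_obs: route_id -> observation count.
    """
    total = sum(n_obs.values())
    per_cluster = {}
    for route_id, cluster_id in labels.items():
        per_cluster[cluster_id] = per_cluster.get(cluster_id, 0) + n_obs[route_id]
    return sorted(c for c, n in per_cluster.items() if n > share * total)


def induced_edges(edges, members):
    """Keep only edges whose two endpoints both belong to `members`."""
    members = set(members)
    return {(i, j): w for (i, j), w in edges.items() if i in members and j in members}


# Hypothetical stage-1 output and J_segment edge weights.
labels = {"10": 0, "11": 0, "12": 0, "20": 1, "30": 2}
n_obs = {"10": 500, "11": 400, "12": 300, "20": 200, "30": 100}
edges = {("10", "11"): 0.5, ("10", "12"): 0.1, ("11", "20"): 0.2}

giants = find_giant_clusters(labels, n_obs)
subgraph = induced_edges(edges, [r for r, c in labels.items() if c in giants])
print(giants, subgraph)
```

Here cluster 0 holds 1,200 of 1,500 observations and is flagged; the edge to route 20 is dropped because that route lies outside the cluster.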

The paper provides high-level pseudocode for the pipeline to aid reproducibility, together with the parameter choices for H3 resolution, spatial weight, cluster count, and size thresholds (Boudabbous et al., 26 Jan 2026).
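The merge half of the size-rebalancing stage might look like the sketch below. It assumes "nearest neighbor" means the cluster with the highest mean pairwise combined similarity to the undersized cluster, which is one plausible reading rather than the paper's stated rule; the recursive subdivision of oversized clusters is not shown, and all names and toy values are hypothetical.

```python
def merge_small_clusters(clusters, similarity, min_size):
    """Merge undersized clusters into their most similar neighbor.

    clusters: cluster id -> non-empty set of route ids.
    similarity: callable (route_i, route_j) -> value in [0, 1].
    """
    merged = {cid: set(members) for cid, members in clusters.items()}
    while len(merged) > 1:
        undersized = [cid for cid, m in merged.items() if len(m) < min_size]
        if not undersized:
            break
        # Handle the smallest cluster first; ties broken by cluster id.
        src = min(undersized, key=lambda cid: (len(merged[cid]), cid))

        def mean_similarity(dst):
            pairs = [(i, j) for i in merged[src] for j in merged[dst]]
            return sum(similarity(i, j) for i, j in pairs) / len(pairs)

        candidates = sorted(cid for cid in merged if cid != src)
        dst = max(candidates, key=mean_similarity)
        members = merged.pop(src)
        merged[dst] |= members
    return merged


def toy_similarity(i, j):
    """Hypothetical similarity: route g resembles d, e, f more than a, b, c."""
    return 0.6 if {i, j} <= {"d", "e", "f", "g"} else 0.1


clusters = {0: {"a", "b", "c"}, 1: {"d", "e", "f"}, 2: {"g"}}
rebalanced = merge_small_clusters(clusters, toy_similarity, min_size=2)
print(rebalanced)
```

Each pass removes one cluster, so the loop terminates; the singleton is absorbed by cluster 1, its closest neighbor under the toy similarity.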

4. Parameter Selection and Optimization Procedure

Parameterization involves an initial grid search over discrete combinations of H3 resolution ($\{6,7,8\}$), number of clusters ($\{8,10,12,15,20\}$), spatial weight ($\{0.3,0.5,0.7\}$), and linkage strategy (Ward, Complete, Average). This is followed by Bayesian optimization (50 Optuna trials) of continuous parameters to maximize a composite objective that favors low coefficient of variation (CV) for cluster size, high silhouette score, and high topological coherence:

$$0.5 \cdot (-\mathrm{CV}) + 0.3 \cdot (\text{silhouette}) + 0.2 \cdot (\text{topological coherence}).$$
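To make the search space and the objective concrete, the following sketch enumerates the discrete grid and evaluates the composite score for given inputs. It assumes CV is the population standard deviation of cluster sizes divided by their mean, and it takes the silhouette and coherence values as precomputed numbers; the Optuna stage is not shown, and all names are illustrative.

```python
from itertools import product
from statistics import mean, pstdev

GRID = {
    "h3_resolution": (6, 7, 8),
    "n_clusters": (8, 10, 12, 15, 20),
    "w_spatial": (0.3, 0.5, 0.7),
    "linkage": ("ward", "complete", "average"),
}


def grid_configs(grid=GRID):
    """Enumerate every combination of the discrete grid as a dict."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*grid.values())]


def size_cv(sizes):
    """Coefficient of variation of cluster sizes (population std / mean)."""
    return pstdev(sizes) / mean(sizes)


def composite_objective(sizes, silhouette, coherence):
    """0.5 * (-CV) + 0.3 * silhouette + 0.2 * topological coherence."""
    return 0.5 * (-size_cv(sizes)) + 0.3 * silhouette + 0.2 * coherence


configs = grid_configs()
balanced = composite_objective([10, 10, 10], silhouette=0.5, coherence=0.5)
skewed = composite_objective([28, 1, 1], silhouette=0.5, coherence=0.5)
print(len(configs), balanced > skewed)
```

The grid contains 3 × 5 × 3 × 3 = 135 configurations. With silhouette and coherence held fixed, the balanced partition scores higher because the CV term carries the largest weight.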

Empirically, the best configuration for the Montréal STM network consists of 12 clusters, H3 resolution 7, and $w_{\mathrm{spatial}}=0.5$. Size thresholds keep clusters between approximately 9 and 31 routes, apart from the two singleton clusters reported in the evaluation (Boudabbous et al., 26 Jan 2026).

5. Quantitative Evaluation and Cluster Characteristics

The selected configuration demonstrates significant improvements over naive spatial clustering:

  • Cluster size CV: 0.608 versus >2.0 for H3-only (resolution 8).
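  • Imbalance ratio: 1.90× (versus >8× for baseline). The figure is consistent with largest-to-mean cluster size (31 routes against a mean of 16.3) rather than largest-to-smallest, which would be 31× given the two singleton clusters.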
  • Routes per cluster: [31, 29, 28, 23, 22, 17, 13, 12, 10, 9, 1, 1]. The two singleton clusters correspond to specialized express routes with minimal overlap.
  • Intra-cluster spatial overlap ($\langle J_{\mathrm{H3}} \rangle$): <0.15, demonstrating maintained geographic coherence.

These results validate that the hybrid method yields balanced, coherent, and operationally meaningful clusters that facilitate efficient and scalable model training for delay prediction tasks. Illustrative figures (e.g., Figure 1.2) and tabular summaries (e.g., Table 4.1) in the original work document both the partitioning progression and the quantitative benefits (Boudabbous et al., 26 Jan 2026).
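The reported balance figures can be reproduced from the routes-per-cluster counts listed above. The worked calculation below assumes that CV is the population coefficient of variation over the 12 route counts $n_k$ and that the imbalance ratio divides the largest cluster by the mean; both readings are inferred from the numbers rather than stated in this summary.

```latex
\begin{align*}
\mu &= \tfrac{1}{12}\,(31+29+28+23+22+17+13+12+10+9+1+1) = \tfrac{196}{12} \approx 16.33,\\
\sigma^2 &= \tfrac{1}{12}\sum_{k=1}^{12} n_k^2 - \mu^2 = \tfrac{4384}{12} - 266.78 \approx 98.56,\\
\mathrm{CV} &= \frac{\sigma}{\mu} \approx \frac{9.93}{16.33} \approx 0.608,\\
\frac{n_{\max}}{\mu} &= \frac{31}{16.33} \approx 1.90.
\end{align*}
```

Using the sample standard deviation instead would give a CV of about 0.635, so the population form is the one that matches the reported 0.608.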

6. Implementation Guidelines and Adaptability

Adapting the methodology to new bus networks requires re-execution of the two-phase parameter search, with key points of adjustment:

  • H3 resolution should reflect the network’s spatial scale and stop density.
  • Spatial weight ($w_{\mathrm{spatial}}$) should be tuned to accommodate network-specific geography and operational redundancies (more segment overlap may require increased topological weighting).
  • Cluster count should ensure sufficient per-cluster data for robust model training while preserving heterogeneity.
  • Validation metrics for any new deployment include cluster size CV (target <1), imbalance ratio (target <3), and intra-cluster spatial and topological coherence.

The modular three-stage pipeline allows simplification (e.g., omitting topology subdivision if "giant" clusters are absent) or further merging if fragmentation leads to undersized clusters (Boudabbous et al., 26 Jan 2026).
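When porting the method to another network, the two size-based targets can be checked with a small helper. This is a sketch under the assumption that CV uses the population standard deviation and that the imbalance ratio is the largest cluster divided by the mean; the function name and report keys are hypothetical, and the coherence metrics are left out.

```python
from statistics import mean, pstdev


def validate_partition(sizes, cv_target=1.0, imbalance_target=3.0):
    """Check cluster sizes against the CV (<1) and imbalance (<3) targets."""
    mu = mean(sizes)
    cv = pstdev(sizes) / mu
    imbalance = max(sizes) / mu  # assumption: largest cluster over the mean
    return {
        "cv": cv,
        "imbalance": imbalance,
        "cv_ok": cv < cv_target,
        "imbalance_ok": imbalance < imbalance_target,
    }


# Routes per cluster reported for the Montréal STM configuration.
stm_sizes = [31, 29, 28, 23, 22, 17, 13, 12, 10, 9, 1, 1]
report = validate_partition(stm_sizes)
print({k: round(v, 3) if isinstance(v, float) else v for k, v in report.items()})
```

Applied to the STM route counts, both targets are met under these definitions; a different definition of the imbalance ratio would need a different threshold.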

7. Role in City-Scale Delay Prediction Systems

Integration of the hybrid H3+topology clustering method within the broader delay prediction pipeline enables scalable, city-wide model training with improved data balance and representativeness. By preserving both spatial and topological structure, the approach mitigates cluster domination by dense urban cores and maintains operational relevance across heterogeneous route types. In its application at the Société de transport de Montréal, global LSTM-based models with cluster-aware features outperformed more complex architectures (e.g., transformers), with significant gains in efficiency and comparable or superior accuracy, which supports real-time deployment for urban transit operations (Boudabbous et al., 26 Jan 2026).

References (1)
