
HDBSCAN Hierarchical Clustering

Updated 19 January 2026
  • Hierarchical clustering via HDBSCAN is a density-based method that builds a hierarchy of clusters using mutual reachability distance and stability measures.
  • It employs a condensed cluster tree and stability-based extraction to obtain robust flat clusters, mitigating the chaining effect common in traditional methods.
  • Extensions such as kernelization, incremental updates, and hybrid selection mechanisms enhance its scalability and adaptability to large-scale or dynamic data.

Hierarchical Clustering via HDBSCAN

Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) is a density-based clustering algorithm that generalizes DBSCAN by producing a hierarchy of clusters from data with heterogeneous densities. Unlike traditional single-linkage hierarchical clustering, HDBSCAN leverages mutual reachability distance to mitigate chaining effects and utilizes a stability measure to extract robust, flat cluster assignments from a condensed cluster tree. Recent developments include hybrid selection mechanisms bridging HDBSCAN and DBSCAN*, kernelization for varying densities, efficient multi-hierarchy computations, incremental and scalable variants for dynamic or large-scale data, and specialized adaptations for structure discovery in biological, astronomical, and graph data.

1. Foundational Workflow and Mathematical Formulation

The HDBSCAN pipeline transforms a metric space $(X, d)$ into a density-aware topology, extracts a hierarchical cluster tree, and summarizes it into a flat clustering via stability selection:

  1. Core Distance: For each data point $x \in X$ and a user-supplied parameter $\text{minPts}$, compute the core distance:

$$d_{\mathrm{core}}(x) = \text{distance from } x \text{ to its } \text{minPts}^{\text{th}} \text{ nearest neighbor}.$$

  2. Mutual Reachability Distance: For points $x_p, x_q$, define:

$$d_{\mathrm{mreach}}(x_p, x_q) = \max\left\{ d_{\mathrm{core}}(x_p),\ d_{\mathrm{core}}(x_q),\ d(x_p, x_q) \right\}.$$

This metric regularizes the single-linkage construction, suppressing chaining through low-density bridges.

  3. Minimum Spanning Tree (MST) and Dendrogram: Construct a complete weighted graph with edge weights $d_{\mathrm{mreach}}(x_p, x_q)$, compute its MST, and derive the single-linkage hierarchy by progressively removing edges in order of decreasing weight.
  4. Condensed Cluster Tree: The hierarchical tree is pruned by $\text{minPts}$ to form candidate clusters. At each split:
    • Both children $\geq \text{minPts}$: keep the split.
    • Both children $< \text{minPts}$: prune both.
    • One child $< \text{minPts}$: mark it as noise and let the larger child continue. This yields a smaller tree of candidate clusters at various density levels (Malzer et al., 2019, Sante et al., 11 Sep 2025).
  5. Cluster Stability and Selection: For cluster $C_i$,

$$S(C_i) = \sum_{x_j \in C_i} \bigl( \lambda_{\max}(x_j, C_i) - \lambda_{\min}(C_i) \bigr),$$

where $\lambda = 1/\epsilon$, $\lambda_{\min}(C_i)$ is the density level at which $C_i$ splits off, and $\lambda_{\max}(x_j, C_i)$ is the density level at which $x_j$ leaves $C_i$.

Flat clusters are determined by maximizing the sum of stabilities over disjoint cluster choices, using linear-time tree traversal (the "excess-of-mass", or eom, criterion) (Malzer et al., 2019, Sante et al., 11 Sep 2025, Bot et al., 2023, DeWolfe, 2 Sep 2025).
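Steps 1–3 of the pipeline above can be sketched in a few dozen lines. The following is a minimal pure-Python illustration, assuming Euclidean distance on a toy dataset; it uses brute-force $O(n^2)$ distance loops and Prim's algorithm, whereas production implementations rely on spatial indexes.

```python
import math

def core_distances(points, min_pts):
    """Core distance: distance from each point to its min_pts-th nearest
    neighbor (index 0 of the sorted list is the point itself)."""
    core = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points)
        core.append(dists[min_pts])
    return core

def mutual_reachability(points, core):
    """Dense matrix of d_mreach(x_p, x_q) = max(core_p, core_q, d(x_p, x_q))."""
    n = len(points)
    mr = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = max(core[i], core[j], math.dist(points[i], points[j]))
            mr[i][j] = mr[j][i] = d
    return mr

def prim_mst(weights):
    """Prim's algorithm on the dense mutual-reachability graph;
    returns the MST as a list of (parent, child, weight) edges."""
    n = len(weights)
    in_tree = [False] * n
    best = [math.inf] * n
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        for v in range(n):
            if not in_tree[v] and weights[u][v] < best[v]:
                best[v] = weights[u][v]
                parent[v] = u
    return edges
```

Removing the returned MST edges in order of decreasing weight then reproduces the single-linkage dendrogram over the mutual reachability metric (step 3).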

2. Hierarchical, Flat, and Hybrid Cluster Selection Mechanisms

Hierarchical clustering in HDBSCAN enables both the exploration of density-based substructure and robust flat cluster assignment:

  • Stability-based Extraction (EOM): The eom flat clustering is selected as the set of disjoint clusters maximizing total stability, ensuring that each root-to-leaf path has exactly one cluster selected (Malzer et al., 2019, McInnes et al., 2017).
  • Cluster Leaf Extraction: Alternatively, selecting all leaf segments yields finer partitions, often resulting in more micro-clusters.
  • Hybrid Threshold Mechanism ($\text{HDBSCAN}(\hat\epsilon)$): To control over-partitioning in extremely dense regions (where a low $\text{minPts}$ produces numerous micro-clusters), a threshold $\hat\epsilon$ is specified. Splits in the hierarchy at $\epsilon \leq \hat\epsilon$ are forbidden, collapsing subtrees into DBSCAN* clusters at that level, whereas elsewhere HDBSCAN's stability selection prevails. The optimization replaces $S(C_i)$ with

$$ES(C_i) = \begin{cases} \lambda_{\min}(C_i), & \text{if } \epsilon_{\max}(C_i) > \hat\epsilon \\ 0, & \text{otherwise}. \end{cases}$$

This hybrid selection prevents micro-clustering in high-density areas while retaining sensitivity to sparse regions (Malzer et al., 2019).
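The excess-of-mass rule described above can be illustrated on a toy condensed tree. The sketch below is a minimal recursive version, assuming hand-built node ids and stability values (illustrative, not the library's internal representation): a cluster is kept when its own stability is at least the total stability of the best selection among its descendants.

```python
def eom_select(tree, stability, node):
    """Excess-of-mass selection over a condensed cluster tree.

    tree: dict mapping a node id to its list of child node ids;
    stability: dict mapping a node id to S(node).
    Returns (set of selected clusters, their total stability)."""
    children = tree.get(node, [])
    if not children:
        return {node}, stability[node]
    selected, total = set(), 0.0
    for child in children:
        sel, tot = eom_select(tree, stability, child)
        selected |= sel
        total += tot
    if stability[node] >= total:
        return {node}, stability[node]
    return selected, total

# Toy condensed tree: root 0 splits into clusters 1 and 2; 1 splits into 3 and 4.
tree = {0: [1, 2], 1: [3, 4]}
stability = {0: 1.0, 1: 0.5, 2: 0.75, 3: 0.5, 4: 0.25}
flat = set()
for child in tree[0]:        # the root itself is never selected in HDBSCAN
    sel, _ = eom_select(tree, stability, child)
    flat |= sel
# Cluster 1 (stability 0.5) loses to its children (0.5 + 0.25 = 0.75),
# so the flat clustering is {2, 3, 4}.
```

Because each node is either selected or replaced by its descendants' selection, every root-to-leaf path contributes exactly one selected cluster, matching the eom criterion.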

3. Adaptations and Extensions: Kernelization, Incremental, Multi-scale, and Dynamic Data

Recent innovations extend HDBSCAN to diverse data modalities and address key limitations:

  • Kernelization via Isolation Kernel: To combat failure modes in variable-density scenarios, the base metric $d$ can be replaced by a data-adaptive similarity $\mathcal{K}_I$, defined as the probability that $x$ and $y$ fall in the same cell of a random Voronoi partition. The kernel-induced distance $d_I(x, y) = 1/\mathcal{K}_I(x, y)$ and the associated core and mutual-reachability distances adapt to local densities, yielding superior dendrogram purity and flat F$_1$ scores compared to Euclidean or Gaussian kernels (Han et al., 2020). The full pipeline supports these substitutions in the MST and stability measures.
  • Efficient Multi-Hierarchy Computation: For parameter exploration, running HDBSCAN across a range of $\text{minPts}$ values is resource intensive. By building a single relative neighborhood graph (RNG) with respect to the largest $\text{minPts}$ and then reweighting it for all smaller values, one can extract the hierarchies for $M$ settings at roughly $2\times$ the cost of a single run, using subquadratic algorithms and well-separated pair decompositions (Neto et al., 2017).
  • Incremental and Scalable Variants: FISHDBC leverages an HNSW structure to incrementally maintain an approximate MST, enabling efficient updates and clustering of dynamic or non-metric datasets with complexity $O(n \log^2 n)$, avoiding $O(n^2)$ distance computations (Dell'Amico, 2019). Bubble-tree summarization provides $O(\log L)$ update cost per point for dynamic clustering, compressing the data to $L$ bubbles for efficient batch HDBSCAN passes (Abduaziz et al., 2024).
  • Persistent Multiscale Clustering: PLSCAN constructs the hierarchical clustering tree for all minimum cluster sizes $m \in [2, n]$, employing zero-dimensional persistent homology. Clusters correspond to persistent leaves in the size filtration, identified via the persistence trace $\sum_{\ell\,\text{leaf at}\, m} p_{\text{size}}(\ell)$, removing the need for manual selection of $m$ and producing multi-scale robustness (Bot et al., 18 Dec 2025).
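The isolation-kernel idea from the first bullet can be illustrated with a Monte-Carlo sketch (not the paper's implementation): estimate $\mathcal{K}_I(x, y)$ by sampling $t$ random Voronoi partitions, each induced by $\psi$ points drawn from the data, and counting how often $x$ and $y$ share a cell. The parameters `psi` and `t` and the tie-breaking rule are illustrative choices.

```python
import math
import random

def isolation_kernel(x, y, data, psi=4, t=200, rng=None):
    """Monte-Carlo estimate of K_I(x, y): the fraction of random Voronoi
    partitions (each induced by psi points sampled from data) in which
    x and y fall into the same cell. Values lie in [0, 1]."""
    rng = rng or random.Random(0)
    same = 0
    for _ in range(t):
        centers = rng.sample(data, psi)          # psi distinct cell centers
        cell_x = min(range(psi), key=lambda i: math.dist(x, centers[i]))
        cell_y = min(range(psi), key=lambda i: math.dist(y, centers[i]))
        same += (cell_x == cell_y)
    return same / t

def ik_distance(x, y, data, **kwargs):
    """Kernel-induced distance d_I(x, y) = 1 / K_I(x, y);
    infinite when the points never share a cell."""
    k = isolation_kernel(x, y, data, **kwargs)
    return math.inf if k == 0 else 1.0 / k
```

In sparse regions the sampled centers are spread out, so nearby points share cells often and $d_I$ stays small; in dense regions cells shrink, which is what makes the kernel density-adaptive.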

4. Parameter Optimization, Validation Metrics, and Empirical Performance

The selection of $\text{minPts}$, $\text{minClusterSize}$, and kernel/threshold parameters is crucial for cluster recovery:

  • Parameter Tuning: Bayesian optimization frameworks such as Optuna can tune hyperparameters (e.g., $\text{minSamples}$, $\text{minClusterSize}$, and hybrid $\epsilon$ thresholds) against external (e.g., V-measure) and internal (e.g., DBCV) criteria for labeled or unlabeled data (Sante et al., 11 Sep 2025).
  • Validation Metrics: Adjusted Rand Index (ARI), V-measure, DBCV, dendrogram purity, and flat-clustering F$_1$-score provide quantitative evaluations. Stability and persistence measures directly guide cluster selection (Bot et al., 18 Dec 2025, Han et al., 2020, Malzer et al., 2019).
  • Empirical Performance: The hybrid $\text{HDBSCAN}(\hat\epsilon)$ outperforms standard methods on data with variable densities and is less sensitive to $\hat\epsilon$ than DBSCAN is to its $\epsilon$. PLSCAN achieves higher ARI than HDBSCAN* with EOM selection and is less sensitive to mutual-reachability neighbor parameters. Kernelized HDBSCAN yields the highest purity and F$_1$ on diverse datasets. FISHDBC and bubble-tree methods allow clustering at scale under incremental or dynamic data workloads.
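As one concrete example of these metrics, the Adjusted Rand Index can be computed directly from the contingency table of two labelings. The sketch below is illustrative (libraries such as scikit-learn provide `adjusted_rand_score` for production use):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """ARI = (sum_ij C(n_ij, 2) - E) / (max_index - E), where n_ij counts
    points assigned to true cluster i and predicted cluster j, and E is
    the expected pair-count index under random labelings."""
    n = len(labels_true)
    contingency = Counter(zip(labels_true, labels_pred))
    sum_ij = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_true).values())   # row sums
    sum_b = sum(comb(c, 2) for c in Counter(labels_pred).values())   # col sums
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:   # degenerate: one cluster or all singletons
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```

The chance correction makes ARI 0 in expectation for random labelings and 1 for a perfect match (up to label permutation), which is why it is preferred over the raw Rand index for comparing clusterings with different numbers of clusters.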

5. Computational Complexity and Practical Implementation

The efficiency of HDBSCAN and its variants depends on data size, dimensionality, and index structure:

  • Static HDBSCAN: Accelerated variants use space-tree indexes (kd-tree, ball-tree) for $O(n \log n)$ average-case complexity in core-distance and MST computation; the worst case is $O(n^2)$ if spatial structure is absent (McInnes et al., 2017). Flat clustering extraction is $O(n)$.
  • Incremental Approaches: FISHDBC achieves $O(n \log^2 n)$ via HNSW neighbor hulls and MST updates over sparse candidate edges (Dell'Amico, 2019). Bubble-tree online summarization costs $O(\log L)$ per update, with offline $O(L \log L)$ for MST-based clustering (Abduaziz et al., 2024).
  • Parallel and Multi-hierarchy: Polylogarithmic depth and aggressive memory optimizations ($10\times$ savings) enable 11–56$\times$ speedups on large multicore machines via well-separated pair decomposition and memoized filtered-Kruskal MST extraction (Wang et al., 2021, Neto et al., 2017).
  • Parameter Sweeping and Multi-scale Selection: Efficient multi-hierarchy construction allows hundreds of cluster hierarchies in time nearly linear with data size, enabling thorough parameter exploration (Neto et al., 2017, Bot et al., 18 Dec 2025).

6. Specialized Applications: Biological, Astronomical, and Graph Data

HDBSCAN's hierarchical density-based framework is employed in:

  • Galaxy Merger Reconstruction: In chemodynamical space, optimized HDBSCAN with tuned $\text{minSamples}$, $\text{minClusterSize}$, and selection $\epsilon$ achieves high purity and recovers merger progenitors up to $z_{\rm acc} \sim 3$ (Sante et al., 11 Sep 2025).
  • Flare-Sensitive Clustering: FLASC post-processes HDBSCAN clusters to detect branching structure (flares) within clusters, assigning sub-cluster labels by single-linkage over centrality-weighted intra-cluster graphs. Flare detection enriches subpopulation identification in biomedical and cellular development datasets (Bot et al., 2023).
  • Community Detection in Graphs: HDBSCAN is adapted to similarity matrices from node or line graphs and projects edge clusters back to overlapping node communities, supporting flexible, outlier-aware community detection on synthetic and real-world graphs (DeWolfe, 2 Sep 2025).

7. Limitations, Trade-offs, and Practical Considerations

  • Parameter Sensitivity: In extremely high dimensions, selecting meaningful $\epsilon$ or kernel parameters can be nontrivial; hybrid variants inherit global-threshold weaknesses only where thresholding is forced (Malzer et al., 2019, Han et al., 2020).
  • Resolution Limits: Excessive pruning may mask real substructure; conversely, permissive thresholds can lead to micro-clustering.
  • Computational Constraints: Full pairwise computations remain expensive in non-metric spaces and high dimensionality; approximation and parallelization techniques mitigate this but may regularize or coarsen output.
  • Empirical Robustness: Persistent multi-scale clustering, kernelization, and dynamic summarization improve stability across datasets, facilitating exploratory analysis without extensive manual parameter tuning or repeated batch runs.

In summary, hierarchical clustering via HDBSCAN establishes a rigorous density-based cluster hierarchy, enables both global and local-stability driven flat selection, and supports diverse extensions—hybrid mechanisms, kernel adaptations, incremental and multi-scale computation, and application to specialized scientific domains—resulting in robust, scalable, and interpretable clusterings across heterogeneous and dynamic datasets (Malzer et al., 2019, Bot et al., 18 Dec 2025, Han et al., 2020, McInnes et al., 2017, Neto et al., 2017, Abduaziz et al., 2024, Sante et al., 11 Sep 2025, Bot et al., 2023, DeWolfe, 2 Sep 2025, Dell'Amico, 2019, Wang et al., 2021).
