Learning-Augmented Hierarchical Clustering
Abstract: Hierarchical clustering (HC) is an important data analysis technique in which the goal is to recursively partition a dataset into a tree-like structure while grouping together similar data points at each level of granularity. Unfortunately, for many of the proposed HC objectives, there exist strong barriers to approximation algorithms with the hardness of approximation. Thus, we consider the problem of hierarchical clustering given auxiliary information from natural oracles. Specifically, we focus on a splitting oracle which, when provided with a triplet of vertices $(u,v,w)$, answers (possibly erroneously) the pairs of vertices whose lowest common ancestor includes all three vertices in an optimal tree, i.e., identifying which vertex ``splits away'' from the others. Using such an oracle, we obtain the following results: - A polynomial-time algorithm that outputs a hierarchical clustering tree with $O(1)$-approximation to the Dasgupta objective (Dasgupta [STOC'16]). - A near-linear time algorithm that outputs a hierarchical clustering tree with $(1-o(1))$-approximation to the Moseley-Wang objective (Moseley and Wang [NeurIPS'17]). Under the plausible Small Set Expansion Hypothesis, no polynomial-time algorithm can achieve any constant approximation for Dasgupta's objective or $(1-C)$-approximation for the Moseley-Wang objective for some constant $C>0$. As such, our results demonstrate that the splitting oracle enables algorithms to outperform standard HC approaches and overcome hardness constraints. Furthermore, our approaches extend to sublinear settings, in which we show new streaming and PRAM algorithms for HC with improved guarantees.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.
Top Community Prompts
Collections
Sign up for free to add this paper to one or more collections.