Optimization of Inter-group Criteria for Clustering with Minimum Size Constraints

Published 13 Jan 2024 in cs.LG and cs.DS (arXiv:2401.07091v1)

Abstract: Internal measures that are used to assess the quality of a clustering usually take into account intra-group and/or inter-group criteria. There are many papers in the literature that propose algorithms with provable approximation guarantees for optimizing the former. However, the optimization of inter-group criteria is much less understood. Here, we contribute to the state-of-the-art of this literature by devising algorithms with provable guarantees for the maximization of two natural inter-group criteria, namely the minimum spacing and the minimum spanning tree spacing. The former is the minimum distance between points in different groups while the latter captures separability through the cost of the minimum spanning tree that connects all groups. We obtain results for both the unrestricted case, in which no constraint on the clusters is imposed, and for the constrained case where each group is required to have a minimum number of points. Our constraint is motivated by the fact that the popular Single Linkage, which optimizes both criteria in the unrestricted case, produces clusterings with many tiny groups. To complement our work, we present an empirical study with 10 real datasets, providing evidence that our methods work very well in practical settings.
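The minimum-spacing criterion described in the abstract — the smallest distance between any two points assigned to different groups — can be sketched in a few lines. This is an illustrative implementation, not the paper's algorithm; the function name and the brute-force pairwise scan are assumptions for clarity.

```python
import numpy as np

def minimum_spacing(points, labels):
    """Smallest Euclidean distance between points in different clusters.

    Brute-force O(n^2) scan over all pairs; illustrative only.
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    best = np.inf
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] != labels[j]:
                best = min(best, np.linalg.norm(points[i] - points[j]))
    return best

# Two well-separated clusters: the spacing is the gap between them.
pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
lab = [0, 0, 1, 1]
print(minimum_spacing(pts, lab))  # 5.0
```

Note that merging the two closest inter-cluster points (as Single Linkage does, greedily) can only increase this quantity, which is why Single Linkage optimizes it in the unrestricted case — but, as the abstract notes, it tends to leave many tiny groups, motivating the minimum-size constraint.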
