Evolution of $K$-means solution landscapes with the addition of dataset outliers and a robust clustering comparison measure for their analysis
Abstract: The $K$-means algorithm remains one of the most widely used clustering methods due to its simplicity and general utility. The performance of $K$-means depends upon locating minima that lie low in the cost function, amongst a potentially vast number of solutions. Here, we use the energy landscape approach to map the change in the $K$-means solution space as the number of dataset outliers increases, and show that the cost function surface becomes more funnelled. Kinetic analysis reveals that in all cases the overall funnel is composed of shallow, locally funnelled regions, which are separated by areas that do not support any clustering solutions. These shallow regions correspond to different types of clustering solution, and their increasing number with outliers leads to longer pathways within the funnel and a reduced correlation between accuracy and cost function. Finally, we propose that the rates obtained from kinetic analysis provide a novel measure of clustering similarity that incorporates information about the paths between clustering solutions. This measure is robust to outliers, and we illustrate its application to datasets containing multiple outliers.
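The claim that the $K$-means cost function supports many distinct minima can be illustrated with a minimal sketch. The code below is not the energy landscape method described in the abstract, which also characterises the pathways between minima. It only runs Lloyd's algorithm from several random starting points on toy data and collects the distinct final cost values. The function names (`kmeans_cost`, `lloyd`, `make_data`, `distinct_costs`), the three-blob dataset and the uniform outlier model are all assumptions made for illustration.

```python
import numpy as np


def kmeans_cost(X, centres, labels):
    """Sum of squared distances from each point to its assigned centre."""
    return float(np.sum((X - centres[labels]) ** 2))


def lloyd(X, K, rng, max_iter=100):
    """Run Lloyd's algorithm starting from K randomly chosen data points."""
    centres = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(max_iter):
        # Assignment step: index of the nearest centre for every point.
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: move each centre to the mean of its assigned points.
        new_centres = centres.copy()
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:  # an empty cluster keeps its old centre
                new_centres[k] = members.mean(axis=0)
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    # Recompute the assignment for the final centres before scoring.
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    return centres, labels, kmeans_cost(X, centres, labels)


def make_data(rng, n_outliers=0):
    """Toy data: three Gaussian blobs plus uniformly scattered outliers."""
    blobs = [rng.normal(loc=c, scale=0.5, size=(50, 2))
             for c in ([0, 0], [4, 0], [2, 3])]
    outliers = rng.uniform(-10, 14, size=(n_outliers, 2))
    return np.vstack(blobs + [outliers])


def distinct_costs(X, K=3, n_starts=50, seed=0):
    """Distinct converged cost values found over n_starts random restarts."""
    rng = np.random.default_rng(seed)
    costs = [lloyd(X, K, rng)[2] for _ in range(n_starts)]
    return sorted(set(np.round(costs, 6)))


if __name__ == "__main__":
    data_rng = np.random.default_rng(1)
    for n_out in (0, 5, 20):
        X = make_data(data_rng, n_outliers=n_out)
        found = distinct_costs(X)
        print(f"{n_out:2d} outliers: {len(found)} distinct converged costs")
```

Counting distinct converged costs is only a crude proxy for the number of minima, since it depends on the number of restarts and on the rounding tolerance. It says nothing about how the solutions are connected, which is the information the kinetic analysis in the abstract relies on.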