Profiling and Modeling of Power Characteristics of Leadership-Scale HPC System Workloads
Abstract: In the exascale era, where applications have large power and energy footprints, per-application, job-level awareness of those footprints is crucial for pursuing efficiency goals beyond performance, such as energy efficiency and sustainability. To that end, we have developed a novel low-latency machine learning pipeline for job power profiling that groups job-level power profiles by shape as jobs complete. The pipeline combines comprehensive feature extraction with clustering, powered by a generative adversarial network (GAN) model that handles the feature-rich time series of job-level power measurements. Its output is then used to train a classification model that predicts whether an incoming job power profile resembles a known group of profiles or is entirely new. Through extensive evaluations, we demonstrate the effectiveness of each component of the pipeline. We also provide a preliminary analysis of the resulting clusters, which depict the power-profile landscape of the Summit supercomputer across more than 60K jobs sampled from 2021.
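The pipeline the abstract describes, reduced to its essentials, is: encode each variable-length job power profile into a fixed-length shape descriptor, cluster the descriptors, then classify incoming profiles against the discovered clusters while flagging previously unseen shapes (open-set recognition). The sketch below illustrates that flow under simplifying assumptions: plain summary statistics stand in for the paper's GAN-based encoder, DBSCAN for its clustering stage, and a nearest-centroid distance threshold for its open-set classifier; all profile data, feature choices, and thresholds are illustrative, not Summit telemetry or the authors' actual method.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def extract_features(profile):
    """Summarize a variable-length power profile (watts over time) into a
    fixed-length shape descriptor. A stand-in for a learned encoder."""
    p = np.asarray(profile, dtype=float)
    return np.array([p.mean(), p.std(), p.max(), p.min(), np.diff(p).mean()])

# Synthetic job power profiles of varying length: a "flat" family and a
# linearly "ramping" family (illustrative data only).
rng = np.random.default_rng(0)
flat = [200 + rng.normal(0, 2, int(rng.integers(50, 80))) for _ in range(10)]

def make_ramp():
    n = int(rng.integers(50, 80))
    return np.linspace(100, 300, n) + rng.normal(0, 2, n)

ramp = [make_ramp() for _ in range(10)]

# Cluster the fixed-length descriptors; eps chosen for this toy data.
X = np.stack([extract_features(p) for p in flat + ramp])
labels = DBSCAN(eps=30.0, min_samples=3).fit_predict(X)

# Per-cluster centroids in feature space (DBSCAN noise, label -1, excluded).
centroids = np.stack([X[labels == k].mean(axis=0)
                      for k in sorted(set(labels)) if k != -1])

def classify(profile, threshold=50.0):
    """Assign a new profile to the nearest known cluster, or return -1
    when its shape is too far from every cluster (open-set case)."""
    d = np.linalg.norm(centroids - extract_features(profile), axis=1)
    k = int(d.argmin())
    return k if d[k] <= threshold else -1
```

A new flat-shaped job would land in the flat cluster, while a shape never seen during clustering (say, a square-wave profile alternating between 100 W and 400 W) exceeds the distance threshold and is reported as new, mirroring the known-vs-novel decision the paper's classifier makes.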