A Framework for Dynamically Meeting Performance Objectives on a Service Mesh
Abstract: We present a framework for achieving end-to-end management objectives for multiple services that execute concurrently on a service mesh. We apply reinforcement learning (RL) techniques to train an agent that periodically performs control actions to reallocate resources. We develop and evaluate the framework on a laboratory testbed where we run information and computing services on a service mesh, supported by the Istio and Kubernetes platforms. We investigate management objectives that include end-to-end delay bounds on service requests, throughput objectives, cost-related objectives, and service differentiation. We compute the control policies on a simulator rather than on the testbed, which shortens training time by orders of magnitude for the scenarios we study. Our framework is novel in that it advocates a top-down approach whereby the management objectives are defined first and then mapped onto the available control actions. It also allows several types of control actions to be executed simultaneously. By first learning the system model and the operating region from testbed traces, we can train the agent for different management objectives in parallel.
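The idea of training a control policy on a simulator against a stated management objective can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the load levels, the replica counts, the M/M/1-style delay formula standing in for a system model learned from testbed traces, the reward shape, and the simple one-step value update, which is a deliberately simplified stand-in for the RL algorithm the authors use.

```python
import random

# All constants and names here are hypothetical, chosen only for illustration.
LOADS = [2, 4, 9]       # offered load levels (requests/s), the observed state
REPLICAS = [1, 2, 3]    # control action: number of service replicas
SERVICE_RATE = 5.0      # requests/s that one replica can serve
DELAY_BOUND = 0.5       # management objective: mean delay <= 0.5 s
REPLICA_COST = 0.1      # cost charged per replica per control step


def simulated_delay(load, replicas):
    """M/M/1-style delay estimate; stands in for a model learned from traces."""
    capacity = SERVICE_RATE * replicas
    return 1.0 / (capacity - load) if capacity > load else 10.0


def reward(load, replicas):
    """1 if the delay bound is met, 0 otherwise, minus the replica cost."""
    met = simulated_delay(load, replicas) <= DELAY_BOUND
    return (1.0 if met else 0.0) - REPLICA_COST * replicas


def train(episodes=5000, alpha=0.1, epsilon=0.2, seed=0):
    """Epsilon-greedy value learning on the simulator; returns load -> replicas."""
    rng = random.Random(seed)
    q = {(l, r): 0.0 for l in LOADS for r in REPLICAS}
    for _ in range(episodes):
        load = rng.choice(LOADS)  # load is exogenous in this toy model
        if rng.random() < epsilon:
            action = rng.choice(REPLICAS)  # explore
        else:
            action = max(REPLICAS, key=lambda r: q[(load, r)])  # exploit
        # One-step update toward the observed reward (no bootstrapping term,
        # because the action does not influence the next load in this sketch).
        q[(load, action)] += alpha * (reward(load, action) - q[(load, action)])
    return {l: max(REPLICAS, key=lambda r: q[(l, r)]) for l in LOADS}


if __name__ == "__main__":
    print(train())
```

Under these assumed constants, the learned policy selects the smallest replica count that still meets the delay bound at each load level. Changing only `reward` expresses a different management objective over the same simulator, which mirrors the top-down ordering described in the abstract: objective first, control actions second.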